v0.1 · deterministic · pure python

bpetite

a byte-level BPE tokenizer, written from scratch, built to be read.

Pure Python with one runtime dependency. Byte-perfect round-trips on every input: ASCII, Unicode, whitespace, empty strings, the reserved special token. Deterministic training. A single-file versioned artifact you can save, ship, and reload without behavioural drift.

01

what it is

bpetite is a local Python library and CLI that implements a deterministic byte-level BPE tokenizer from scratch. It trains on UTF-8 text with a GPT-2-style pre-tokenizer, encodes and decodes losslessly, and persists a versioned single-file artifact that reloads with byte-for-byte fidelity.
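
As a minimal sketch of why a byte-level tokenizer can encode and decode losslessly: before any merges are learned, every UTF-8 byte is its own token, so the base vocabulary covers any input. The function names below are illustrative assumptions, not bpetite's actual API.

```python
def encode_bytes(text: str) -> list[int]:
    """Map text to base token IDs: one ID (0..255) per UTF-8 byte."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Rebuild text from base token IDs.

    Strict decoding raises UnicodeDecodeError on an invalid byte sequence
    instead of silently substituting replacement characters.
    """
    return bytes(ids).decode("utf-8", errors="strict")


if __name__ == "__main__":
    # ASCII, whitespace, the empty string, and multi-byte characters
    # all survive the round trip unchanged.
    for sample in ["", "hello", "  spaced\tout\n", "naïve café", "🙂"]:
        assert decode_bytes(encode_bytes(sample)) == sample
```

Learned merges only replace adjacent pairs of IDs with a new ID, so the original bytes stay recoverable by expanding each merged ID back into its pair.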

Every decision, including how pairs are counted, how ties break, how merges apply, and how the encoder walks a chunk, lives in readable Python. No C extensions, no Rust, no external tokenizer library. One runtime dependency beyond the standard library: regex, used exclusively for the GPT-2 pre-tokenizer pattern.
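
The decisions listed above (pair counting, tie-breaking, merge application) can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not bpetite's source: the function names are hypothetical, and the tie-break rule shown (highest count first, then the numerically smallest pair) is an assumed rule chosen only because it is deterministic.

```python
from collections import Counter

Pair = tuple[int, int]


def count_pairs(chunks: list[list[int]]) -> Counter[Pair]:
    """Count adjacent ID pairs inside each chunk; pairs never span chunks."""
    counts: Counter[Pair] = Counter()
    for chunk in chunks:
        for pair in zip(chunk, chunk[1:]):
            counts[pair] += 1
    return counts


def pick_pair(counts: Counter[Pair]) -> Pair:
    """Assumed tie-break: highest count wins, then the smallest pair."""
    return min(counts, key=lambda pair: (-counts[pair], pair))


def apply_merge(chunk: list[int], pair: Pair, new_id: int) -> list[int]:
    """Replace each left-to-right, non-overlapping occurrence of pair."""
    out: list[int] = []
    i = 0
    while i < len(chunk):
        if i + 1 < len(chunk) and (chunk[i], chunk[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(chunk[i])
            i += 1
    return out


def train(chunks: list[list[int]], num_merges: int) -> list[Pair]:
    """Learn up to num_merges merges; stop early when no pairs remain."""
    merges: list[Pair] = []
    next_id = 256  # IDs 0..255 are reserved for the raw bytes
    for _ in range(num_merges):
        counts = count_pairs(chunks)
        if not counts:
            break
        pair = pick_pair(counts)
        merges.append(pair)
        chunks = [apply_merge(chunk, pair, next_id) for chunk in chunks]
        next_id += 1
    return merges
```

Because the selection key is a total order over (count, pair), the same chunks always produce the same merge list, independent of dictionary iteration order.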

It is explicitly educational and local-only. Not a production tokenizer service. Not aiming for GPT-2 token-ID parity. The goal is a codebase that a senior reviewer can read end-to-end and come away seeing real understanding, not a toy script.

Language: Pure Python 3.12
Runtime dep: regex (pre-tokenizer only)
Determinism: Same corpus, same artifact
Round-trip: decode(encode(text)) == text, always
Artifact: Single versioned JSON file
API surface: One export: Tokenizer
Quality gate: pytest + ruff + mypy --strict
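
One way "same corpus, same artifact" and a single versioned JSON file can fit together is sketched below: serialize with sorted keys and fixed separators so identical state yields identical bytes, write to a temporary file, then rename it into place. The schema fields (`schema_version`, `merges`) and the function names are assumptions for illustration; the real artifact schema is described in the persistence docs.

```python
import json
import os
import tempfile
from pathlib import Path

Pair = tuple[int, int]


def serialize(merges: list[Pair], version: int = 1) -> str:
    """Produce byte-identical JSON for identical input (hypothetical schema)."""
    payload = {"schema_version": version, "merges": [list(p) for p in merges]}
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def atomic_save(path: Path, text: str) -> None:
    """Write to a temp file in the target directory, then rename over path."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, path)  # readers never observe a partial file
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load(path: Path) -> list[Pair]:
    """Reload merges, rejecting unknown versions and malformed pairs."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if data.get("schema_version") != 1:
        raise ValueError("unsupported schema_version")
    merges: list[Pair] = []
    for item in data["merges"]:
        if len(item) != 2 or not all(isinstance(x, int) for x in item):
            raise ValueError(f"malformed merge entry: {item!r}")
        merges.append((item[0], item[1]))
    return merges


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        target = Path(directory) / "tokenizer.json"
        atomic_save(target, serialize([(97, 98), (256, 99)]))
        assert load(target) == [(97, 98), (256, 99)]
```

Validating on load, rather than trusting the file, is what lets a reload fail loudly instead of drifting silently.
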
02

why I built it

Most engineers use tokenizers as opaque dependencies. tiktoken, sentencepiece, whatever ships with the model: you import it, you pass text in, you get integers out, and you never look inside. That is fine for building applications. It is a weak footing for foundational ML engineering.

I wanted the tokenization layer to stop being a black box for me, so I rebuilt it. Not a port or a wrapper, but an implementation written from the byte level up, small enough to hold in my head and strict enough to prove it works. Deterministic training. Byte-perfect round-trips on every input: ASCII, whitespace, empty strings, UTF-8 emoji, the reserved special token. A versioned on-disk artifact that reloads without behavioural drift. mypy --strict clean. Every merge and every tie-break covered by tests.

The result is a codebase I can reason about completely, and that a reviewer can read end-to-end in an afternoon. That is the whole point.

03

documentation

auto-discovered from docs/ at build time. more will land as the codebase grows.

  01
    Product Requirements Document
    The v1 scope, goals, non-goals, and functional requirements for bpetite.
    Reference · 15 min read · updated 2026-04-15
  02
    Benchmarks
    Baseline training and encode-latency measurements for the bpetite v1 release, captured on the reference benchmark machine.
    Reference · 5 min read · updated 2026-04-15
  03
    Phase 2: Core Algorithm, Fixtures, and Persistence
    Reading guide and vocabulary reference for the bpetite Phase 2 implementation.
    Phase 2 · 5 min read · updated 2026-04-15
  04
    Core Algorithm
    Pre-tokenizer, trainer, tie-breaking, merge application, early stop, and special-token reservation for bpetite.
    Phase 2 · 10 min read · updated 2026-04-15
  05
    Persistence and Artifact Schema v1
    Atomic save, deterministic serialization, and full loader validation for the bpetite tokenizer artifact.
    Phase 2 · 10 min read · updated 2026-04-15
  06
    Test Fixtures
    Purpose, byte invariants, whitespace-preservation rule, and conftest fixture surface for the bpetite test suite.
    Phase 2 · 6 min read · updated 2026-04-15
  07
    Phase 3: Encode, Decode, and Public API
    Reading guide and vocabulary reference for the bpetite Phase 3 implementation.
    Phase 3 · 6 min read · updated 2026-04-15
  08
    Encode and Decode
    Special-token extraction, per-rank merge application, and strict UTF-8 reconstruction for the bpetite encoder and decoder.
    Phase 3 · 9 min read · updated 2026-04-15
  09
    Public Tokenizer API
    The five-method contract, private instance state, and delegation-only implementation of bpetite.Tokenizer.
    Phase 3 · 6 min read · updated 2026-04-15
  10
    Roundtrip Suite
    Parametrized cases, shared fixtures, and save/load parity design for the public-API roundtrip tests.
    Phase 3 · 7 min read · updated 2026-04-15
  11
    Phase 4: CLI, Presentation, Tests, and Benchmark Harness
    Reading guide and vocabulary reference for the bpetite Phase 4 implementation.
    Phase 4 · 8 min read · updated 2026-04-15
  12
    CLI Contract
    Channel discipline, exit codes, JSON output shapes, argparse patterns, and progress-callback wiring for the bpetite CLI.
    Phase 4 · 13 min read · updated 2026-04-15
  13
    Rich Presentation Layer
    Shared stderr console, themed palette, panel helpers, interactive gating, and the plain-progress-line design decision for the bpetite CLI.
    Phase 4 · 12 min read · updated 2026-04-15
  14
    Benchmark Harness
    Encode-latency harness design, nearest-rank percentile math, defensive sentence-length check, and the elapsed_ms trainer span versus command wall clock distinction.
    Phase 4 · 13 min read · updated 2026-04-15
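
The nearest-rank percentile math named in the Benchmark Harness entry is small enough to sketch here. This shows the standard nearest-rank definition (the value at rank ceil(p/100 × n) in the sorted samples); the function name is hypothetical and the harness's own implementation may differ in details such as input validation.

```python
import math


def nearest_rank(samples: list[float], pct: float) -> float:
    """Return the nearest-rank percentile of samples for 0 < pct <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so integer percentiles avoid float drift
    # (0.7 * 10 is slightly above 7 in floating point; 70 * 10 / 100 is 7.0).
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]  # ranks are 1-based


if __name__ == "__main__":
    latencies_ms = [15.0, 20.0, 35.0, 40.0, 50.0]
    print(nearest_rank(latencies_ms, 50))  # rank ceil(2.5) = 3 -> 35.0
    print(nearest_rank(latencies_ms, 95))  # rank ceil(4.75) = 5 -> 50.0
```

Nearest-rank always returns an observed sample rather than an interpolated value, which keeps reported latencies tied to real measurements.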