bpetite
a byte-level BPE tokenizer, written from scratch, built to be read.
Pure Python with one runtime dependency. Byte-perfect round-trips on every input: ASCII, Unicode, whitespace, empty strings, the reserved special token. Deterministic training. A single-file versioned artifact you can save, ship, and reload without behavioural drift.
what it is
bpetite is a local Python library and CLI that implements a deterministic byte-level BPE tokenizer from scratch. It trains on UTF-8 text with a GPT-2-style pre-tokenizer, encodes and decodes losslessly, and persists a versioned single-file artifact that reloads with byte-for-byte fidelity.
Every decision, including how pairs are counted, how ties break, how merges apply, and how the encoder walks a chunk, lives in readable Python. No C extensions, no Rust, no external tokenizer library. One runtime dependency beyond the standard library: regex, used exclusively for the GPT-2 pre-tokenizer pattern.
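The steps named above (pair counting, tie-breaking, merge application) can be sketched in a few lines. This is a minimal illustration and not bpetite's actual code. The function names are hypothetical. The tie-break rule (the lexicographically smallest pair among the most frequent) and the early-stop condition (no pairs left) are assumptions. Chunks are assumed to be already pre-tokenized lists of byte values.

```python
from collections import Counter

Pair = tuple[int, int]


def count_pairs(chunks: list[list[int]]) -> Counter[Pair]:
    """Count adjacent token-ID pairs inside each chunk, never across chunks."""
    counts: Counter[Pair] = Counter()
    for chunk in chunks:
        counts.update(zip(chunk, chunk[1:]))
    return counts


def pick_pair(counts: Counter[Pair]) -> Pair:
    """Highest count wins. ASSUMED tie-break: smallest pair in tuple order."""
    return min(counts, key=lambda pair: (-counts[pair], pair))


def apply_merge(chunk: list[int], pair: Pair, new_id: int) -> list[int]:
    """Replace each non-overlapping occurrence of `pair`, scanning left to right."""
    out: list[int] = []
    i = 0
    while i < len(chunk):
        if i + 1 < len(chunk) and (chunk[i], chunk[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(chunk[i])
            i += 1
    return out


def train(chunks: list[list[int]], num_merges: int) -> list[Pair]:
    """Learn up to `num_merges` merges; new IDs start at 256, after the raw bytes."""
    merges: list[Pair] = []
    for step in range(num_merges):
        counts = count_pairs(chunks)
        if not counts:
            break  # ASSUMED early stop: no adjacent pairs remain
        pair = pick_pair(counts)
        merges.append(pair)
        chunks = [apply_merge(chunk, pair, 256 + step) for chunk in chunks]
    return merges
```

Because the selection key is a total order over `(count, pair)`, the same chunks always produce the same merge list, which is the property that deterministic training depends on. For example, `train([list(b"abab"), list(b"ab")], 5)` stops after two merges because no adjacent pairs remain.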
It is explicitly educational and local-only. Not a production tokenizer service. Not aiming for GPT-2 token-ID parity. The goal is a codebase that a senior reviewer can read end-to-end and that shows real understanding rather than reading like a toy script.
- Language: Pure Python 3.12
- Runtime dep: regex (pre-tokenizer only)
- Determinism: Same corpus, same artifact
- Round-trip: decode(encode(text)) == text, always
- Artifact: Single versioned JSON file
- API surface: One export, Tokenizer
- Quality gate: pytest + ruff + mypy --strict
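The round-trip and artifact properties listed above can be illustrated with a small standalone sketch. It is not the bpetite API. The function names and the JSON field names (`schema_version`, `merges`) are hypothetical. Pre-tokenization and special-token handling are left out, so the whole text is treated as one chunk, and merges are applied in rank order as an assumed strategy.

```python
import json

Pair = tuple[int, int]


def encode(text: str, merges: list[Pair]) -> list[int]:
    """Start from raw UTF-8 bytes, then apply each merge in rank order."""
    ids = list(text.encode("utf-8"))
    for rank, pair in enumerate(merges):
        out: list[int] = []
        i = 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(256 + rank)  # merged token IDs follow the 256 byte IDs
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def decode(ids: list[int], merges: list[Pair]) -> str:
    """Expand every ID back to its bytes, then decode as strict UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for rank, (left, right) in enumerate(merges):
        vocab[256 + rank] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


def dump_artifact(merges: list[Pair]) -> str:
    """Serialize deterministically: sorted keys and fixed separators."""
    payload = {"schema_version": 1, "merges": [list(pair) for pair in merges]}
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def load_artifact(text: str) -> list[Pair]:
    """Reject unknown schema versions before trusting the payload."""
    data = json.loads(text)
    if data.get("schema_version") != 1:
        raise ValueError("unsupported schema version")
    return [(int(a), int(b)) for a, b in data["merges"]]
```

With `merges = [(104, 101), (256, 108)]`, `encode("hello", merges)` yields `[257, 108, 111]`, and decoding returns the original string. Dumping, loading, and dumping again produces an identical string, which is the kind of parity a versioned artifact needs in order to reload without drift.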
why I built it
Most engineers use tokenizers as opaque dependencies. tiktoken, sentencepiece, whatever ships with the model: you import it, you pass text in, you get integers out, and you never look inside. That is fine for building applications. It is a weak basis for foundational ML engineering.
I wanted the tokenization layer to stop being a black box for me, so I rebuilt it. Not a port or a wrapper, but an implementation written from the byte level up, small enough to hold in my head and strict enough to prove it works. Deterministic training. Byte-perfect round-trips on every input: ASCII, whitespace, empty strings, UTF-8 emoji, the reserved special token. A versioned on-disk artifact that reloads without behavioural drift. mypy --strict clean. Every merge and every tie-break covered by tests.
The result is a codebase I can reason about completely, and that a reviewer can read end-to-end in an afternoon. That is the whole point.
documentation
auto-discovered from docs/ at build time. more will land as the codebase grows.
- 01 Product Requirements Document
  The v1 scope, goals, non-goals, and functional requirements for bpetite.
- 02 Benchmarks
  Baseline training and encode-latency measurements for the bpetite v1 release, captured on the reference benchmark machine.
- 03 Phase 2: Core Algorithm, Fixtures, and Persistence
  Reading guide and vocabulary reference for the bpetite Phase 2 implementation.
- 04 Core Algorithm
  Pre-tokenizer, trainer, tie-breaking, merge application, early stop, and special-token reservation for bpetite.
- 05 Persistence and Artifact Schema v1
  Atomic save, deterministic serialization, and full loader validation for the bpetite tokenizer artifact.
- 06 Test Fixtures
  Purpose, byte invariants, whitespace-preservation rule, and conftest fixture surface for the bpetite test suite.
- 07 Phase 3: Encode, Decode, and Public API
  Reading guide and vocabulary reference for the bpetite Phase 3 implementation.
- 08 Encode and Decode
  Special-token extraction, per-rank merge application, and strict UTF-8 reconstruction for the bpetite encoder and decoder.
- 09 Public Tokenizer API
  The five-method contract, private instance state, and delegation-only implementation of bpetite.Tokenizer.
- 10 Roundtrip Suite
  Parametrized cases, shared fixtures, and save/load parity design for the public-API roundtrip tests.
- 11 Phase 4: CLI, Presentation, Tests, and Benchmark Harness
  Reading guide and vocabulary reference for the bpetite Phase 4 implementation.
- 12 CLI Contract
  Channel discipline, exit codes, JSON output shapes, argparse patterns, and progress-callback wiring for the bpetite CLI.
- 13 Rich Presentation Layer
  Shared stderr console, themed palette, panel helpers, interactive gating, and the plain-progress-line design decision for the bpetite CLI.
- 14 Benchmark Harness
  Encode-latency harness design, nearest-rank percentile math, defensive sentence-length check, and the elapsed_ms trainer span versus command wall clock distinction.