# Public Tokenizer API: five-method contract, delegation-only implementation

6 min read · updated 2026-04-15 · 1324 words

The five-method contract, private instance state, and delegation-only implementation of `bpetite.Tokenizer`.
## TL;DR

- `Tokenizer` is the single public name exported from `bpetite`; the full surface is five methods: `train`, `encode`, `decode`, `save`, `load`, matching PRD lines 254–269 exactly.
- Instance state is private (`_vocab`, `_merges`, `_special_tokens`); `train` and `load` are classmethod factories, and `encode` always reads its `text` parameter directly, so a tokenizer never holds stored input across calls.
- Every public method delegates to a private module, so the class body contains no algorithmic logic and the published contract is stable under internal refactors.
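The contract above can be sketched as a minimal skeleton. The module-level functions here are simplified stand-ins for bpetite's private modules (`_trainer`, `_encoder`, `_decoder`, `_persistence`); every name, signature, and serialization choice below is an illustrative assumption, not the real source.

```python
# Minimal sketch of the five-method contract with delegation-only methods.
# Stand-in internals only -- not bpetite's actual implementation.
from __future__ import annotations

import json


def _train_bpe(text: str, vocab_size: int):
    # stand-in trainer: byte-level identity vocab, no merges learned
    return {i: bytes([i]) for i in range(256)}, []


def _encode(text, vocab, merges, special_tokens):
    return list(text.encode("utf-8"))  # stand-in encoder


def _decode(token_ids, vocab):
    return b"".join(vocab[i] for i in token_ids).decode("utf-8")


class Tokenizer:
    def __init__(self, vocab, merges, special_tokens):
        # private, mutable state -- never read back by encode()
        self._vocab = dict(vocab)
        self._merges = list(merges)
        self._special_tokens = dict(special_tokens)

    @classmethod
    def train(cls, text: str, vocab_size: int) -> "Tokenizer":
        vocab, merges = _train_bpe(text, vocab_size)
        return cls(vocab, merges, {})

    def encode(self, text: str) -> list[int]:
        # forwards the method argument; no stored input is consulted
        return _encode(text, self._vocab, self._merges, self._special_tokens)

    def decode(self, token_ids: list[int]) -> str:
        return _decode(token_ids, self._vocab)

    def save(self, path: str) -> None:
        state = {i: list(b) for i, b in self._vocab.items()}
        with open(path, "w") as f:
            json.dump({"vocab": state, "merges": self._merges}, f)

    @classmethod
    def load(cls, path: str) -> "Tokenizer":
        with open(path) as f:
            data = json.load(f)
        vocab = {int(i): bytes(b) for i, b in data["vocab"].items()}
        return cls(vocab, data["merges"], {})
```

Note how `train` and `load` are the only entry points that construct instances, and how no method body does more than normalize and forward.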
## What lives here

| File | Purpose |
| --- | --- |
| `src/bpetite/_tokenizer.py` | `Tokenizer` class; constructor stores normalized state, each public method is a one-line wrapper over the internal function |
| `src/bpetite/__init__.py` | Exports exactly one name, `Tokenizer`, via `__all__ = ["Tokenizer"]`; no convenience re-exports |
| `src/bpetite/_trainer.py` | `train_bpe` and `TrainerResult`; `Tokenizer.train` calls it and normalizes the result into mutable dict/list state |
| `src/bpetite/_encoder.py` | `encode`; `Tokenizer.encode` forwards the method argument `text` directly |
| `src/bpetite/_decoder.py` | `decode`; `Tokenizer.decode` forwards the method argument `token_ids` directly |
| `src/bpetite/_persistence.py` | `save` and `load`; `Tokenizer.save` and `Tokenizer.load` are delegation wrappers with identical exception semantics |
## Key invariants

| Reference | Invariant | Consequence if violated |
| --- | --- | --- |
| PRD §Public API Contract (lines 254–269) | The class exposes exactly five public methods: `train`, `encode`, `decode`, `save`, `load`. No additional public names. | Downstream code (including the Task 4-1 CLI) couples to nonexistent or renamed methods; the shipped API drifts from the PRD. |
| Task 3-3 AC2 | `from bpetite import Tokenizer` resolves, and `Tokenizer` is the only public name on the `bpetite` module. | `from bpetite import *` leaks internal helpers; the public surface becomes unclear; internal renames break unwitting callers. |
| Task 3-3 implementation note | `Tokenizer.encode` passes its method argument `text` to the encoder; no stored attribute is read. | A cached `self._text` would make instances stateful; `tok.encode("a"); tok.encode("b")` would return the first call's IDs for the second. |
| PRD §Public API Contract | `train` and `load` are `@classmethod` factories returning `"Tokenizer"`. | Calling on the class (not an instance) fails, loses type information, or returns the raw internal triple instead of a wrapped `Tokenizer`. |
| Phase 3 design | Every public method delegates to a private module; no algorithmic logic lives in the class body. | Internal refactors of `_encoder.py` or `_persistence.py` force `Tokenizer` edits; the class becomes a duplicate implementation surface. |
| FR-27 / FR-28 (via delegation) | `save` atomically writes through a same-directory temp file and raises `FileExistsError` when `overwrite=False`. | A crashed `save` leaves a partial artifact; a second training run silently overwrites a committed file. |
| FR-29 (via delegation) | `load` validates schema version, required keys, shapes, byte ranges, and the special-token invariants before returning. | A corrupt or hand-edited artifact loads without error and produces a broken tokenizer whose `encode` output no longer matches the training state. |
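The FR-27/FR-28 atomic-write pattern can be sketched with the standard library. `save_atomic`, its signature, and the JSON payload are illustrative assumptions, not bpetite's actual API:

```python
# Sketch of atomic save: write to a same-directory temp file, then rename
# onto the final path. A crash mid-write leaves the target untouched.
import json
import os
import tempfile


def save_atomic(state: dict, path: str, overwrite: bool = False) -> None:
    if not overwrite and os.path.exists(path):
        raise FileExistsError(path)
    # temp file must live in the same directory so the rename cannot
    # cross a filesystem boundary (os.replace is atomic within one)
    target_dir = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)  # atomic rename onto the final path
    except BaseException:
        os.unlink(tmp)  # never leave a partial artifact behind
        raise
```

The same-directory requirement is the load-bearing detail: a temp file in `/tmp` could land on a different filesystem, turning the rename into a non-atomic copy.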
Only `Tokenizer` is public. Every other name on the `bpetite` module is an internal submodule (`_encoder`, `_decoder`, `_tokenizer`, `_persistence`, `_pretokenizer`, `_trainer`, `_constants`) loaded as a side effect of the import chain. They are underscore-prefixed and carry no backward-compatibility guarantee.

`train` and `load` are classmethods, so their rendered signatures start directly with the parameters after `cls`; the `cls` parameter is absorbed by the descriptor.
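This descriptor behavior is easy to confirm with the standard library; `Demo` below is a throwaway stand-in class, not part of bpetite:

```python
import inspect


class Demo:
    # throwaway stand-in: any classmethod shows the same binding behavior
    @classmethod
    def train(cls, text: str, vocab_size: int):
        return cls()


# Accessing the classmethod on the class already binds cls, so the
# rendered signature begins at the first parameter after cls.
sig = str(inspect.signature(Demo.train))
```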
## End-to-end session

```python
from pathlib import Path

from bpetite import Tokenizer

corpus = "the quick brown fox jumps over the lazy dog\n" * 5
tok = Tokenizer.train(corpus, vocab_size=300)

ids = tok.encode("the quick brown fox")
text = tok.decode(ids)
assert text == "the quick brown fox"

artifact = Path("/tmp/bpetite-demo.json")
artifact.unlink(missing_ok=True)
tok.save(str(artifact))

reloaded = Tokenizer.load(str(artifact))
assert reloaded.encode("the quick brown fox") == ids
assert reloaded.decode(ids) == text
```
Three observable properties fall out of this session:

1. `Tokenizer.train` returns a `Tokenizer` instance even though the underlying `train_bpe` function returns a `TrainerResult` dataclass. The classmethod normalizes `TrainerResult.vocab` (typed `Mapping[int, bytes]`) into a mutable `dict[int, bytes]` and `TrainerResult.merges` (typed `tuple[tuple[int, int], ...]`) into a mutable `list[tuple[int, int]]` before handing off to `__init__`. The persistence layer accepts the normalized shapes directly, so no further conversion is needed at save time.
2. `tok.encode("the quick brown fox")` forwards the method argument `text` unchanged to `bpetite._encoder.encode`. There is no stored-input cache: calling `tok.encode("another string")` immediately afterward sees the new input and produces the IDs for `"another string"`, never those of the previous call.
3. `reloaded.encode(...)` and `reloaded.decode(...)` return values identical to the pre-save tokenizer. The roundtrip invariant (FR-25) holds through the save/load boundary; see Roundtrip Suite for the full 55-case proof.
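The normalization in property 1 can be sketched with a frozen stand-in dataclass. The real `TrainerResult` lives in `bpetite._trainer`; the field names and types here follow the description above, but the class itself is a local stand-in:

```python
# Sketch: a frozen TrainerResult stand-in is converted into the mutable
# dict/list state that __init__ stores -- the step Tokenizer.train performs.
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TrainerResult:
    vocab: Mapping[int, bytes]
    merges: tuple


result = TrainerResult(vocab={0: b"a", 1: b"b"}, merges=((0, 1),))

vocab = dict(result.vocab)    # Mapping[int, bytes] -> mutable dict[int, bytes]
merges = list(result.merges)  # tuple of pairs -> mutable list[tuple[int, int]]
```

Copying into fresh containers also means mutating the instance later can never alias back into the trainer's (conceptually immutable) result.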
## Delegation, not reimplementation

Every public method is a one-line wrapper over its internal counterpart. For reference, consider the encode path at src/bpetite/_tokenizer.py:78.
`_encode` is `bpetite._encoder.encode`, imported at module load time with an underscore-prefixed alias so the class method does not shadow the private function name inside the module. The same pattern applies to `decode`, `save`, and `load`. The class body holds no algorithmic logic; if the encoder's merge-application strategy changes, `Tokenizer.encode` does not need to change at all.
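The alias pattern can be illustrated with a self-contained stand-in; `_impl_encode` stands in for the private module's function, and the assignment mimics `from ._encoder import encode as _encode` (which would not run here, so this is an assumption-laden sketch, not bpetite's source):

```python
# Illustration of the underscore-alias delegation pattern described above.
def _impl_encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))  # stand-in for the real encoder


_encode = _impl_encode  # stand-in for: from ._encoder import encode as _encode


class Tokenizer:
    def encode(self, text: str) -> list[int]:
        # one-line wrapper: the class body carries no algorithmic logic,
        # and the aliased name cannot collide with the method name
        return _encode(text)
```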
| Failure mode | Raised exception | Contract | Test coverage |
| --- | --- | --- | --- |
| `Tokenizer.load` of a corrupt artifact (missing key, bad shape, bad bytes) | `KeyError` or `ValueError` | FR-29 (via `_persistence.load`) | `tests/test_persistence.py::test_load_rejects_*` (the full rejection suite, 12 cases) |
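A minimal sketch of FR-29-style validation on load: reject missing keys, an unknown schema version, and out-of-range byte values before constructing anything. The key names, the version constant, and `validate_artifact` itself are assumptions for illustration, not bpetite's actual checks:

```python
# Reject a corrupt or hand-edited artifact up front, so a broken file can
# never produce a silently broken tokenizer.
def validate_artifact(data: dict) -> None:
    for key in ("version", "vocab", "merges"):
        if key not in data:
            raise KeyError(f"missing required key: {key!r}")
    if data["version"] != 1:
        raise ValueError(f"unsupported schema version: {data['version']!r}")
    for token_id, token_bytes in data["vocab"].items():
        # vocab values are assumed serialized as lists of ints in [0, 255]
        if not all(isinstance(b, int) and 0 <= b <= 255 for b in token_bytes):
            raise ValueError(f"byte out of range for token {token_id}")
```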
## Silent failure specific to Tokenizer.encode

One silent failure has no automated test because it is a design-time invariant enforced by the Task 3-3 implementation note:

> `Tokenizer.encode` must call the encoder with the method argument `text`, not any stored text attribute.
An implementation that caches the last encoded text on `self._text` and reads from it inside `encode` can still return the correct IDs on any single call, because the cache is seeded from the method argument. But two consecutive calls with different inputs silently return the first call's IDs for the second. The invariant is enforced by the tight wrapper at `src/bpetite/_tokenizer.py:78`; any change to that wrapper that reads `self` state for the input text must be rejected on sight during review.
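The failure mode is easy to reproduce with a deliberately buggy stand-in. `BuggyTokenizer` is a hypothetical counterexample constructed for this note, not bpetite code:

```python
# Hypothetical buggy encode: the cache is seeded from the method argument
# on the first call, then read back on every later call -- so each single
# call looks correct, but the second distinct input is silently ignored.
class BuggyTokenizer:
    def __init__(self):
        self._text = None

    def encode(self, text: str) -> list[int]:
        if self._text is None:
            self._text = text                    # cache seeded once
        return list(self._text.encode("utf-8"))  # stale on later calls
```

No single-call test catches this; only a two-call sequence with different inputs exposes it, which is why the review-time rule exists.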
## Related reading

- Encode and Decode: how `Tokenizer.encode` and `Tokenizer.decode` actually produce and consume IDs, with a worked example traced end-to-end through special-token extraction, pre-tokenization, and per-rank merge application.
- Roundtrip Suite: the 55-test proof of FR-25 against the public API only, including the save/load parity coverage.
- Phase 2 Persistence: the save/load contract that `Tokenizer.save` and `Tokenizer.load` delegate to.
- Phase 2 Core Algorithm: the `train_bpe` contract that `Tokenizer.train` delegates to and normalizes into mutable instance state.
- `docs/bpetite-prd-v2.md`: FR-9, FR-16, FR-17, FR-20, FR-21, FR-25 through FR-29; §Public API Contract, lines 254–269.