Public Tokenizer API

The five-method contract, private instance state, and delegation-only implementation of bpetite.Tokenizer.

Updated 2026-04-15

TL;DR

  • Tokenizer is the single public name exported from bpetite. Its full surface is five methods (train, encode, decode, save, load), matching PRD lines 254–269 exactly.
  • Instance state is private (_vocab, _merges, _special_tokens); train and load are classmethod factories, and encode always reads its text parameter directly so a tokenizer never holds stored input across calls.
  • Every public method delegates to a private module, so the class body contains no algorithmic logic and the published contract is stable under internal refactors.

What lives here

| File | Purpose |
| --- | --- |
| src/bpetite/_tokenizer.py | Tokenizer class; constructor stores normalized state, each public method is a one-line wrapper over the internal function |
| src/bpetite/__init__.py | Exports exactly one name, Tokenizer, via __all__ = ["Tokenizer"]; no convenience re-exports |
| src/bpetite/_trainer.py | train_bpe and TrainerResult; Tokenizer.train calls it and normalizes the result into mutable dict/list state |
| src/bpetite/_encoder.py | encode; Tokenizer.encode forwards the method argument text directly |
| src/bpetite/_decoder.py | decode; Tokenizer.decode forwards the method argument token_ids directly |
| src/bpetite/_persistence.py | save and load; Tokenizer.save and Tokenizer.load are delegation wrappers with identical exception semantics |

Key invariants

| Reference | Invariant | Consequence if violated |
| --- | --- | --- |
| PRD §Public API Contract (lines 254–269) | The class exposes exactly five public methods: train, encode, decode, save, load. No additional public names. | Downstream code (including the Task 4-1 CLI) couples to nonexistent or renamed methods; the shipped API drifts from the PRD. |
| Task 3-3 AC2 | from bpetite import Tokenizer resolves, and Tokenizer is the only public name on the bpetite module. | from bpetite import * leaks internal helpers; the public surface becomes unclear; internal renames break unwitting callers. |
| Task 3-3 implementation note | Tokenizer.encode passes its method argument text to the encoder; no stored attribute is read. | A cached self._text would make instances stateful; tok.encode("a"); tok.encode("b") would return the first call's IDs for the second. |
| PRD §Public API Contract | train and load are @classmethod factories returning "Tokenizer". | Calling on the class (not an instance) fails, loses type information, or returns the raw internal triple instead of a wrapped Tokenizer. |
| Phase 3 design | Every public method delegates to a private module; no algorithmic logic lives in the class body. | Internal refactors of _encoder.py or _persistence.py force Tokenizer edits; the class becomes a duplicate implementation surface. |
| FR-27 / FR-28 (via delegation) | save writes atomically through a same-directory temp file and raises FileExistsError when the target already exists and overwrite=False. | A crashed save leaves a partial artifact; a second training run silently overwrites a committed file. |
| FR-29 (via delegation) | load validates schema version, required keys, shapes, byte ranges, and the special-token invariants before returning. | A corrupt or hand-edited artifact loads without error and produces a broken tokenizer whose encode output no longer matches the training state. |
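
The FR-27 / FR-28 row describes a write-to-temp-then-rename pattern. The following is a minimal sketch of that pattern, not the code in _persistence.py: save_atomic, its payload argument, and the JSON encoding are assumptions chosen for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def save_atomic(path: str, payload: dict, overwrite: bool = False) -> None:
    """Hypothetical helper: write JSON to `path` via a same-directory temp file."""
    target = Path(path)
    # Refuse to clobber an existing artifact unless the caller opts in.
    if target.exists() and not overwrite:
        raise FileExistsError(f"{target} already exists; pass overwrite=True")
    # Creating the temp file next to the target keeps os.replace on one
    # filesystem, and raises FileNotFoundError if that directory is missing.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, target)  # rename over the final path in one step
    except BaseException:
        # Remove the temp file so a failed save leaves no partial artifact.
        Path(tmp_name).unlink(missing_ok=True)
        raise
```

Because the final path only ever receives a fully written file, a crash mid-write leaves at most a stray temp file that the except branch tries to remove. The existence check and the rename are separate steps in this sketch, so it does not guard against two writers racing for the same path.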

Walkthrough

The exact public surface

python
>>> import bpetite
>>> sorted(name for name in dir(bpetite) if not name.startswith("_"))
['Tokenizer']
>>> bpetite.__all__
['Tokenizer']

Only Tokenizer is public. Every other name on the bpetite module is an internal submodule (_encoder, _decoder, _tokenizer, _persistence, _pretokenizer, _trainer, _constants) loaded as a side effect of the import chain. They are underscore-prefixed and carry no backward-compatibility guarantee.
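
To see why a one-name __all__ keeps internal names out of star imports, here is a small self-contained sketch. It builds a throwaway module named fakepkg (a hypothetical stand-in, not bpetite itself) so that it runs without the package installed.

```python
import sys
import types

# Hypothetical stand-in for a package laid out like bpetite: one public class,
# one private submodule attribute, one stray helper, and a one-name __all__.
pkg = types.ModuleType("fakepkg")
pkg.Tokenizer = type("Tokenizer", (), {})
pkg._encoder = types.ModuleType("fakepkg._encoder")
pkg.helper = lambda: None
pkg.__all__ = ["Tokenizer"]
sys.modules["fakepkg"] = pkg

namespace: dict[str, object] = {}
exec("from fakepkg import *", namespace)

# exec adds __builtins__ to the namespace, so filter dunder names out.
star_imported = sorted(name for name in namespace if not name.startswith("__"))
print(star_imported)  # ['Tokenizer']
```

Note that helper has no underscore prefix, yet it is still excluded: when __all__ is defined, it alone decides what a star import binds.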

The five-method signature block

python
>>> import inspect
>>> from bpetite import Tokenizer
>>> for name in ("train", "encode", "decode", "save", "load"):
...     print(f"{name:>7}: {inspect.signature(getattr(Tokenizer, name))}")
  train: (corpus: str, vocab_size: int) -> 'Tokenizer'
 encode: (self, text: str) -> list[int]
 decode: (self, token_ids: collections.abc.Sequence[int]) -> str
   save: (self, path: str, overwrite: bool = False) -> None
   load: (path: str) -> 'Tokenizer'

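train and load are classmethods. Accessing them on the class yields a method that is already bound to cls, so inspect.signature omits cls and the rendered signatures start with the first real argument. encode, decode, and save are plain functions when accessed on the class, so self still appears.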
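
The difference between the two kinds of signature can be reproduced without bpetite. The Demo class below is a hypothetical stand-in with one classmethod factory and one instance method.

```python
import inspect


class Demo:
    """Hypothetical stand-in with one classmethod factory and one instance method."""

    @classmethod
    def load(cls, path: str) -> "Demo":
        return cls()

    def encode(self, text: str) -> list[int]:
        return []


# Accessed on the class, a classmethod is already bound to cls, so cls is omitted.
load_params = list(inspect.signature(Demo.load).parameters)
# A plain function accessed on the class is unbound, so self is still listed.
encode_params = list(inspect.signature(Demo.encode).parameters)

print(load_params)    # ['path']
print(encode_params)  # ['self', 'text']
```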

End-to-end session

python
from pathlib import Path
from bpetite import Tokenizer

corpus = "the quick brown fox jumps over the lazy dog\n" * 5
tok = Tokenizer.train(corpus, vocab_size=300)

ids = tok.encode("the quick brown fox")
text = tok.decode(ids)
assert text == "the quick brown fox"

artifact = Path("/tmp/bpetite-demo.json")
artifact.unlink(missing_ok=True)
tok.save(str(artifact))

reloaded = Tokenizer.load(str(artifact))
assert reloaded.encode("the quick brown fox") == ids
assert reloaded.decode(ids) == text

Three observable properties fall out of this session:

  1. Tokenizer.train returns a Tokenizer instance even though the underlying train_bpe function returns a TrainerResult dataclass. The classmethod normalizes TrainerResult.vocab (typed Mapping[int, bytes]) into a mutable dict[int, bytes] and TrainerResult.merges (typed tuple[tuple[int, int], ...]) into a mutable list[tuple[int, int]] before handing off to __init__. The persistence layer accepts the normalized shapes directly, so no further conversion is needed at save time.
  2. tok.encode("the quick brown fox") forwards the method argument text unchanged to bpetite._encoder.encode. There is no stored-input cache: calling tok.encode("another string") immediately afterward sees the new input and produces the IDs for "another string", never those of the previous call.
  3. reloaded.encode(...) and reloaded.decode(...) return values identical to the pre-save tokenizer. The roundtrip invariant (FR-25) holds through the save/load boundary; see Roundtrip Suite for the full 55-case proof.
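
The normalization in point 1 can be sketched as follows. Only the vocab and merges field names and their types come from the description above; the frozen dataclass, the MappingProxyType wrapper, and the normalize helper are assumptions made for illustration.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class TrainerResult:
    """Assumed shape: only the field names and types are taken from the text."""

    vocab: Mapping[int, bytes]
    merges: tuple[tuple[int, int], ...]


def normalize(result: TrainerResult) -> tuple[dict[int, bytes], list[tuple[int, int]]]:
    """Hypothetical helper: copy read-only trainer output into mutable containers."""
    return dict(result.vocab), list(result.merges)


base = {i: bytes([i]) for i in range(256)}
base[256] = b"th"
result = TrainerResult(vocab=MappingProxyType(base), merges=((116, 104),))

vocab, merges = normalize(result)
vocab[257] = b"the"        # allowed: the copy is a plain dict
merges.append((256, 101))  # allowed: the copy is a plain list
```

The copies are independent of the trainer output, so mutating them leaves result.vocab and result.merges unchanged.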

Delegation, not reimplementation

Every public method is a one-line wrapper over its internal counterpart. For reference, the encode path (src/bpetite/_tokenizer.py:78):

python
def encode(self, text: str) -> list[int]:
    return _encode(text, self._merges, self._special_tokens)

_encode is bpetite._encoder.encode, imported at module load time with an underscore-prefixed alias so the class method does not shadow the private function name inside the module. The same pattern applies to decode, save, and load. The class body holds no algorithmic logic; if the encoder's merge-application strategy changes, Tokenizer.encode does not need to change at all.

Failure modes

| Failure | Exception type | Reference | Caught by |
| --- | --- | --- | --- |
| vocab_size < 256 passed to Tokenizer.train | ValueError | FR-9 (via train_bpe) | tests/test_roundtrip.py::test_train_below_base_vocab_raises_value_error[-1\|0\|1\|100\|255] |
| Unknown token ID passed to Tokenizer.decode | KeyError | FR-22 (via _decoder.decode) | tests/test_roundtrip.py::test_decode_unknown_id_raises_key_error |
| Invalid concatenated UTF-8 bytes in Tokenizer.decode | UnicodeDecodeError | FR-23 (via _decoder.decode) | tests/test_roundtrip.py::test_decode_invalid_utf8_raises |
| Tokenizer.save to an existing path without overwrite=True | FileExistsError | FR-27 (via _persistence.save) | tests/test_persistence.py::test_save_refuses_to_overwrite_existing_file_by_default |
| Tokenizer.save to a path whose parent directory does not exist | FileNotFoundError | FR-28 (via _persistence.save) | tests/test_persistence.py::test_save_raises_file_not_found_when_parent_directory_missing |
| Tokenizer.load of a corrupt artifact (missing key, bad shape, bad bytes) | KeyError or ValueError | FR-29 (via _persistence.load) | tests/test_persistence.py::test_load_rejects_* (the full rejection suite, 12 cases) |
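
The two decode rows can be illustrated with a toy decoder. This is a sketch under assumptions, not the code in _decoder.py: the function signature, the explicit vocab argument, and the outcome helper are all hypothetical.

```python
from collections.abc import Sequence


def decode(token_ids: Sequence[int], vocab: dict[int, bytes]) -> str:
    """Toy decoder: concatenate each ID's bytes, then decode strictly as UTF-8."""
    # vocab[token_id] raises KeyError for an ID that is not in the vocabulary.
    data = b"".join(vocab[token_id] for token_id in token_ids)
    # errors="strict" raises UnicodeDecodeError on an invalid byte sequence.
    return data.decode("utf-8", errors="strict")


vocab = {i: bytes([i]) for i in range(256)}  # base byte vocabulary only


def outcome(token_ids: Sequence[int]) -> str:
    """Return the decoded text, or the name of the exception that was raised."""
    try:
        return decode(token_ids, vocab)
    except (KeyError, UnicodeDecodeError) as exc:
        return type(exc).__name__


print(outcome([104, 105]))  # hi
print(outcome([9999]))      # KeyError
print(outcome([0xC3]))      # UnicodeDecodeError (truncated two-byte sequence)
```

Each token's bytes need not be valid UTF-8 on their own; only the concatenated result is decoded, which is why the failure is reported for the whole sequence rather than for a single ID.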

Silent failure specific to Tokenizer.encode

One silent failure has no automated test because it is a design-time invariant enforced by the Task 3-3 implementation note:

Tokenizer.encode must call the encoder with the method argument text, not any stored text attribute.

An implementation that caches the encoded text on self._text and reads from it inside encode can still return the correct IDs on the first call, because the cache is seeded from the method argument. If the cache is not refreshed on later calls, a second call with a different input returns the first call's IDs. No exception is raised, which is what makes the failure silent. The invariant is enforced by the tight wrapper at src/bpetite/_tokenizer.py:78; any change to that wrapper that reads the input text from instance state must be rejected during review.
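
The failure is easy to reproduce in isolation. In the sketch below, toy_encode, StatelessTokenizer, CachingTokenizer, and encodes_fresh_input are all hypothetical names; the toy encoder simply returns UTF-8 byte values and does not reflect the real merge logic.

```python
def toy_encode(text: str) -> list[int]:
    """Stand-in encoder: raw UTF-8 bytes as IDs (no merges, no special tokens)."""
    return list(text.encode("utf-8"))


class StatelessTokenizer:
    def encode(self, text: str) -> list[int]:
        # Correct: forward the method argument on every call.
        return toy_encode(text)


class CachingTokenizer:
    def __init__(self) -> None:
        self._text = None

    def encode(self, text: str) -> list[int]:
        # Bug: the cache is seeded once and never refreshed.
        if self._text is None:
            self._text = text
        return toy_encode(self._text)


def encodes_fresh_input(tok) -> bool:
    """Two-call check: the second result must reflect the second input."""
    tok.encode("a")
    return tok.encode("b") == toy_encode("b")


print(encodes_fresh_input(StatelessTokenizer()))  # True
print(encodes_fresh_input(CachingTokenizer()))    # False
```

A two-call check of this shape is one way such a regression could be detected mechanically; a single-call assertion cannot distinguish the two classes.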

Related

  • Encode and Decode: how Tokenizer.encode and Tokenizer.decode actually produce and consume IDs, with a worked example traced end-to-end through special-token extraction, pre-tokenization, and per-rank merge application.
  • Roundtrip Suite: the 55-test proof of FR-25 against the public API only, including the save/load parity coverage.
  • Phase 2 Persistence: the save/load contract that Tokenizer.save and Tokenizer.load delegate to.
  • Phase 2 Core Algorithm: the train_bpe contract that Tokenizer.train delegates to and normalizes into mutable instance state.
  • docs/bpetite-prd-v2.md: FR-9, FR-16, FR-17, FR-20, FR-21, FR-25 through FR-29; §Public API Contract, lines 254–269.
  • src/bpetite/_tokenizer.py: the full Tokenizer class; ~130 lines total, no algorithmic logic.
  • src/bpetite/__init__.py: the export line that locks the public surface to Tokenizer only.