Phase 3: Encode, Decode, and Public API
Reading guide and vocabulary reference for the bpetite Phase 3 implementation.
Phase 3 delivers the public surface of bpetite: the encoder that turns text
into token ids, the decoder that turns token ids back into text, the public
Tokenizer class that wires all four Phase 2 and Phase 3 modules behind a
five-method contract, and the roundtrip suite that proves
decode(encode(text)) == text through the public API only.
TL;DR
- The encoder extracts reserved special tokens by exact literal match, pre-tokenizes the remaining segments, and applies learned merges in rank order with single non-overlapping LTR passes (FR-16 through FR-19).
- The decoder is one expression: concatenate token bytes from the vocab, then decode once with UTF-8 strict mode, propagating `KeyError` and `UnicodeDecodeError` unchanged (FR-20 through FR-24).
- Three area docs (`encode-decode`, `public-api`, `roundtrip-suite`) cover separate areas; this index is the entry point and the single place where new Phase 3 vocabulary terms are defined.
What lives here
Phase 3 documentation
| Doc | Slug | What you learn |
|---|---|---|
| Encode and Decode | `phase-3-encode-decode` | Special-token extraction, per-rank non-overlapping merge application, strict UTF-8 decode; worked example tracing `encode("ab<\|endoftext\|>ab")` end-to-end and back |
| Public Tokenizer API | `phase-3-public-api` | The five-method contract, private instance state, classmethod factories, delegation-only implementation, and the `__init__.py` single-export rule |
| Roundtrip Suite | `phase-3-roundtrip-suite` | The 55-test suite that proves FR-17–FR-26 against the public API, the two shared fixtures, and the `decode([0x80])` `UnicodeDecodeError` construction |
Phase 3 source modules
| Module | Task | Role |
|---|---|---|
| `src/bpetite/_encoder.py` | 3-1 | `encode`: special-token extraction, pre-tokenization, per-rank non-overlapping merge application |
| `src/bpetite/_decoder.py` | 3-2 | `decode`: byte concatenation then strict UTF-8 decode; one expression, no branching |
| `src/bpetite/_tokenizer.py` | 3-3 | `Tokenizer` class: delegation-only facade wrapping trainer, encoder, decoder, and persistence behind five public methods |
| `src/bpetite/__init__.py` | 3-3 | `from bpetite._tokenizer import Tokenizer; __all__ = ["Tokenizer"]`: the single public export |
| `tests/test_roundtrip.py` | 3-4 | 55-test public-API suite covering FR-17 through FR-26 and the save/load parity run |
| `tests/conftest.py` | 3-4 | Session-scoped `trained_tokenizer` fixture shared across the roundtrip suite |
Key invariants
These invariants cut across all three Phase 3 areas. Each area doc covers its own FR-keyed invariant table in full detail.
| FR | Invariant | Area |
|---|---|---|
| FR-16 | The encoder extracts exact literal special tokens first, pre-tokenizes remaining segments, then applies merges in rank order with non-overlapping LTR passes. | Encode and Decode |
| FR-18 | Partial special strings such as `<\|endoftext` and `endoftext\|>` are treated as ordinary text and must roundtrip unchanged. | Encode and Decode |
| FR-20 | The decoder concatenates token bytes then decodes once with UTF-8 strict mode; `KeyError` and `UnicodeDecodeError` propagate unchanged. | Encode and Decode |
| FR-24 | Decoding the reserved special-token id returns the literal string `<\|endoftext\|>` via a vocab entry storing the literal's UTF-8 bytes. | Encode and Decode |
| FR-25 | `decode(encode(text)) == text` for every supported input class, through the live tokenizer and through the save/load boundary. | Roundtrip Suite + Phase 2 Persistence |
| PRD §Public API Contract | The public surface is exactly five methods on `Tokenizer`: `train`, `encode`, `decode`, `save`, `load`; `Tokenizer` is the only public name on `bpetite`. | Public API |
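
The decode invariants (FR-20, FR-24) can be illustrated with a minimal sketch. This is not the contents of `src/bpetite/_decoder.py`: the standalone `decode` signature, the `VOCAB` dictionary, and the ids 256 and 257 are assumptions made up for the example.

```python
def decode(ids: list[int], vocab: dict[int, bytes]) -> str:
    # One expression: look up each id, concatenate the bytes, decode once.
    # An unknown id raises KeyError; invalid UTF-8 raises UnicodeDecodeError.
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="strict")


# Hypothetical vocab: 256 single-byte tokens, one merged token, one special token.
VOCAB: dict[int, bytes] = {i: bytes([i]) for i in range(256)}
VOCAB[256] = b"ab"                            # illustrative merged token
VOCAB[257] = "<|endoftext|>".encode("utf-8")  # illustrative special-token id

text = decode([256, 257, 256], VOCAB)

# A multi-byte character split across three single-byte tokens still decodes,
# because the bytes are joined before the single decode call.
euro = decode([0xE2, 0x82, 0xAC], VOCAB)
```

Decoding per token instead of once over the concatenated bytes would fail on the second call, since no single byte of the euro sign is valid UTF-8 on its own.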
Walkthrough
Recommended reading order
If you're new to Phase 3, read in this order:
- Encode and Decode: Read the encode pipeline diagram and the worked example (`train_bpe("ab ab ab", 258)` then `encode("ab<|endoftext|>ab")`). This is the algorithmic heart of Phase 3. Budget 5–6 minutes.
- Public Tokenizer API: Read the signature block, the end-to-end session, and the delegation-not-reimplementation section. Skim the failure modes table on a first pass. Budget 3–4 minutes.
- Roundtrip Suite: Skim the table of 15 parametrized cases; the two-fixture design and the `[0x80]` construction are the only non-obvious points. Budget 2 minutes.
A future contributor adding a feature or fixing a bug in Phase 3 should read the relevant area doc in full, including the failure modes table, before touching any Phase 3 source file.
Phase 3 in one paragraph
The encoder extracts reserved special tokens from the input by exact
literal match, pre-tokenizes each remaining text segment into UTF-8 byte
chunks via the canonical regex, and applies the learned merges in rank
order. At each rank, a single non-overlapping left-to-right pass replaces
every adjacent match in the chunk. The decoder is a single expression:
look each token id up in the vocab, concatenate the bytes, decode once
with UTF-8 strict mode. The public Tokenizer class wraps the trainer,
encoder, decoder, and persistence layer behind five methods whose
signatures match PRD lines 254–269 exactly; every method is a one-line
delegation to a private module, so the class body holds no algorithmic
logic. The roundtrip suite proves that
decode(encode(text)) == text for fifteen input classes (empty,
whitespace, ASCII, emoji, CJK, Arabic, mixed punctuation, the literal
<|endoftext|> in three configurations, three partial-special variants),
and the same fifteen cases are re-run against a saved-and-reloaded
tokenizer to prove the save/load boundary preserves encode output
byte-for-byte (FR-26).
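
The per-rank merge step described above can be sketched as follows. This is an illustration under stated assumptions, not the project's encoder: the function names, the shape of the merge table (a rank-ordered list of `((left, right), new_id)` entries), and the `MERGES` values are all hypothetical.

```python
def merge_pass(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """One non-overlapping left-to-right pass replacing every adjacent `pair`."""
    out: list[int] = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2  # consume both members so matches never overlap
        else:
            out.append(ids[i])
            i += 1
    return out


def apply_merges(chunk: bytes, merges: list[tuple[tuple[int, int], int]]) -> list[int]:
    ids = list(chunk)  # start from raw byte values
    for pair, new_id in merges:  # list order is rank order
        ids = merge_pass(ids, pair, new_id)
    return ids


# Hypothetical merge table: rank 0 merges (a, a) -> 256, rank 1 merges (256, 256) -> 257.
MERGES = [((97, 97), 256), ((256, 256), 257)]

three = apply_merges(b"aaa", MERGES)   # [256, 97]: the pass is left-to-right
four = apply_merges(b"aaaa", MERGES)   # [257]: both rank-0 matches merge, then rank 1
```

The `b"aaaa"` case has two matches at rank 0, so it depends on the pass replacing every match rather than only the first.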
Vocabulary reference
Phase 2 locked the project-wide vocabulary in
docs/phase-2/index.md. Phase 3
introduces three new locked terms used identically across all Phase 3
docs. Using a synonym for a locked term is a bug, not a style choice.
| Term | Definition |
|---|---|
| segment | A span of input text between reserved-special-token occurrences, or the whole input when no specials exist. Segments are pre-tokenized into chunks; they are not themselves chunks. |
| special-token extraction | The exact-literal matching pass the encoder performs before pre-tokenization, isolating reserved special tokens from ordinary text. Partial strings are never extracted; the match must equal the full literal. |
| roundtrip invariant | The FR-25 guarantee that `decode(encode(text)) == text` for every supported input class, through both the live `Tokenizer` instance and the save/load boundary. |
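
To make the first two terms concrete, here is a minimal sketch of special-token extraction producing segments. The function name, the `(kind, text)` tuple shape, and the choice to drop empty segments are assumptions for illustration, not the behavior documented for `src/bpetite/_encoder.py`.

```python
SPECIAL = "<|endoftext|>"


def split_on_special(text: str, special: str = SPECIAL) -> list[tuple[str, str]]:
    """Split `text` into ("segment", ...) and ("special", ...) parts by exact literal match."""
    parts: list[tuple[str, str]] = []
    start = 0
    while True:
        hit = text.find(special, start)  # exact literal search, no regex, no prefix matching
        if hit == -1:
            break
        if hit > start:
            parts.append(("segment", text[start:hit]))
        parts.append(("special", special))
        start = hit + len(special)
    if start < len(text):
        parts.append(("segment", text[start:]))  # trailing ordinary text
    return parts


full = split_on_special("ab<|endoftext|>ab")
partial = split_on_special("<|endoftext")  # partial literal stays ordinary text
```

Because the search uses the full literal, the partial string comes back as a single segment and would then go through ordinary pre-tokenization.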
Failure modes
Phase 3 docs call out four silent failure modes by name. Each fails quietly under most tests and is caught only by the specific assertion listed.
| Failure | Silent because | Caught by | Doc |
|---|---|---|---|
| Single-match-per-rank encoder bug | Inputs with one match per ranked pair still roundtrip; only multi-match inputs expose divergence from training-time semantics | `tests/test_roundtrip.py::test_roundtrip[<label>-...]` on any multi-match case | Encode and Decode |
| Partial-special false match | Inputs without partial specials roundtrip fine; only the three partial-special test cases expose it | `tests/test_roundtrip.py::test_roundtrip[partial special prefix\|suffix\|short-...]` | Encode and Decode |
| Decoder uses `errors="replace"` instead of `errors="strict"` | Valid inputs still roundtrip; only the lone continuation byte test exposes it | `tests/test_roundtrip.py::test_decode_invalid_utf8_raises` | Encode and Decode |
| `Tokenizer.encode` reads a stored `self._text` instead of the method argument | One call still returns the correct ids; only two consecutive calls with different inputs expose it | No dedicated automated test; code-level constraint at `src/bpetite/_tokenizer.py:78` | Public Tokenizer API |
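
The `errors="replace"` row is the easiest of these to reproduce with the standard library alone. The snippet below is a sketch of the difference the lone-continuation-byte test relies on; it calls `bytes.decode` directly rather than going through bpetite, and the variable names are illustrative.

```python
lone_continuation = bytes([0x80])  # a continuation byte with no lead byte: invalid UTF-8

# Lenient mode silently substitutes U+FFFD, so no exception reaches the caller.
lenient = lone_continuation.decode("utf-8", errors="replace")

# Strict mode raises, which is the behavior the decoder is required to propagate.
try:
    lone_continuation.decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:
    strict_error = exc
else:
    strict_error = None
```

The lenient result also cannot be re-encoded to the original byte, so the substitution loses information as well as hiding the error.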
Related reading
- Phase 2 Index: the core algorithm, persistence layer, and fixture set that Phase 3 delegates to; the project-wide vocabulary lock defined there is in force throughout Phase 3.
- Phase 2 Core Algorithm: the training contract that `Tokenizer.train` wraps; the tie-breaking and non-overlapping merge semantics the encoder mirrors at inference time.
- Phase 2 Persistence: the atomic save and strict validating load that `Tokenizer.save` and `Tokenizer.load` delegate to, and the save/load parity the roundtrip suite proves through the public API.
- `docs/bpetite-prd-v2.md`: FR-16 through FR-26 (encode, decode, roundtrip, save/load parity); §Public API Contract, lines 254–269.
- `src/bpetite/`: Phase 3 source modules.
- `tests/test_roundtrip.py`: the Phase 3 test suite.