The encoder extracts exact literal occurrences of <|endoftext|> first,
pre-tokenizes each remaining segment into UTF-8 byte chunks, and applies
the learned merges in rank order with a single non-overlapping
left-to-right pass per rank.
The decoder is a single expression: look each token id up in the vocab,
concatenate the bytes, decode once with UTF-8 strict mode. KeyError
and UnicodeDecodeError propagate unchanged.
The roundtrip invariant decode(encode(text)) == text holds by
construction: the encoder's total byte output for ordinary segments
equals text.encode("utf-8") for that segment, reserved special tokens
map to vocab entries containing the literal's UTF-8 bytes, and the
decoder joins and strict-decodes.
What lives here
File
Purpose
src/bpetite/_encoder.py
encode, _encode_ordinary, _apply_merge; all merge-application logic for the encode path
src/bpetite/_decoder.py
decode; one expression, no branching
src/bpetite/_pretokenizer.py
pretokenize; segments are split into UTF-8 byte chunks via the canonical regex before merge application
src/bpetite/_constants.py
END_OF_TEXT_TOKEN = "<|endoftext|>" and PRETOKENIZER_PATTERN; both are imported by the encoder transitively via the pre-tokenizer
tests/test_roundtrip.py
FR-17, FR-21, FR-22, FR-23, FR-25 coverage through the public API: empty-string edges, exception semantics, and the full roundtrip matrix
Key invariants
FR
Invariant
Consequence if violated
FR-16
The encoder first extracts exact literal occurrences of <|endoftext|>, then pre-tokenizes non-special segments into chunks, then applies learned merges by merge rank with non-overlapping LTR semantics.
Partial special strings get mis-extracted, or chunk bytes are double-merged, or rank order is violated; deterministic fixtures drift silently.
FR-17
encode("") returns [].
Callers must special-case empty input; downstream length assertions break.
FR-18
Partial special strings such as <|endoftext and endoftext|> are treated as ordinary text.
A false match consumes characters belonging to adjacent text; the roundtrip invariant breaks for any input containing a partial literal.
FR-19
Consecutive special tokens encode as consecutive special-token ids, one per occurrence.
Two <|endoftext|> in a row collapse into one id, or the second occurrence falls through into ordinary-text encoding.
FR-20
The decoder maps each token id to its bytes, concatenates, and decodes once with UTF-8 strict mode.
Multi-byte UTF-8 characters split across token boundaries decode incorrectly, or invalid bytes silently substitute replacement characters.
FR-21
decode([]) returns "".
Callers must special-case empty id sequences; downstream string assertions break.
FR-22
Unknown token ids raise KeyError.
Corrupt or out-of-range ids silently map to empty strings or substitution characters; the decoder hides data loss.
Invalid-byte sequences decode to U+FFFD substitutions under errors="replace"; the decoder hides corruption.
FR-24
Decoding a reserved special token id returns the literal string <|endoftext|>.
The reserved id decodes to garbled bytes or an empty string; round-trip breaks for any text containing the literal.
Walkthrough
The encode pipeline
text
input text (str)
|
v
split on exact literal <|endoftext|> <-- exact-literal match only; partial
| prefixes/suffixes flow through
v as ordinary text
[segment][special id][segment][special id]...
| |
v v
pretokenize() emit directly
| |
v |
list[bytes] chunks |
| |
v |
for each chunk: |
tokens = list(bytes) | <-- base byte ids 0..255, one per byte
for rank in 0..K-1: |
tokens = apply_merge(tokens, merges[rank], 256 + rank)
(non-overlapping left-to-right, single full pass per rank)
| |
v v
emit tokens into the final id sequence
Two design choices matter here. First,
special-token extraction happens before pre-tokenization, not inside it.
Keeping extraction outside the pre-tokenizer means the canonical GPT-2
regex never sees <|endoftext|> at all, so the literal cannot be
accidentally split by the regex's punctuation run or letter-run rules.
Second, merges are applied rank by rank, with a full non-overlapping
left-to-right pass at each rank, rather than by searching for the
currently-best pair on every iteration. Both strategies are equivalent for
correctness on the standard byte-level BPE merge list; the per-rank pass
matches the training-time replacement semantics from FR-10 verbatim and is
the style the task schema specifies.
Worked example
Train a minimal tokenizer on the two-word corpus "ab ab ab" at
vocab_size=258 (256 base bytes + up to 2 merges). This setup is small
enough to trace on paper and runnable as-is against the current repo.
python
frombpetiteimportTokenizertok=Tokenizer.train("ab ab ab",vocab_size=258)
The trainer pre-tokenizes the corpus into three chunks, counts pairs with
multiplicity, and learns exactly two merges before early-stopping:
chunks from pretokenize("ab ab ab") → [b"ab", b" ab", b" ab"]
rank 0: pair (97, 98) (b"a" + b"b") → new token id 256 with
bytes b"ab"
rank 1: pair (32, 256) (b" " + b"ab") → new token id 257 with
bytes b" ab"
Special-token extraction. The encoder scans the input text
left to right. At position 0 it checks text.startswith("<|endoftext|>", 0);
the input starts with "ab", so no match. It advances to position 2 and
finds the literal, emits id 258 for it, and records that segment
[0:2] = "ab" came before. It advances past the literal to position 15
and treats the remainder "ab" as the next segment.
Segment pre-tokenization. For the first segment "ab",
pretokenize("ab") returns [b"ab"], one chunk. Same for the trailing
"ab" segment.
Merge application on the first chunk. Start from the base-byte
id list [97, 98]. Rank 0 is (97, 98) → id 256; the full pass
finds one match at position 0, emits id 256, and advances. The
resulting token list for this chunk is [256]. Rank 1 is
(32, 256) → id 257; no match (no 32 adjacent to the 256). The
chunk is done: contributes [256].
Emit the special id in source order.[256] from the first
segment, then 258 for the special, then [256] from the trailing
segment.
Look each id up in the vocab. vocab[256] = b"ab",
vocab[258] = b"<|endoftext|>".encode("utf-8") (the literal's UTF-8 bytes,
stored at training time; see FR-24), vocab[256] = b"ab".
At position 0, "<|endoftext".startswith("<|endoftext|>") is False. The
input is strictly shorter than the literal. Special-token extraction makes
no match at any position in the 11-character input. The whole string flows
into _encode_ordinary, which pre-tokenizes it via the canonical regex
into two chunks (b"<|" and b"endoftext"), converts each chunk to base
byte ids, and applies the two ranked merges. Neither merge matches: the
chunks contain no (97, 98) pair and no (32, 256) pair. The final
token list is simply the original bytes, and the decoder joins them back
into the original string.
FR-18 is enforced exactly this way for all three partial variants
covered by the roundtrip suite: <|endoftext, endoftext|>, and <|endo.
The [a, b, a, b] invariant
A single merge rank scans the token list once and replaces every
non-overlapping match left to right. Concretely, for token list
[97, 98, 97, 98] (bytes for "abab") with merge (97, 98) → 256, one
pass emits [256, 256]. The cursor advances by two after each match so
overlapping candidates are never revisited. This is the same
non-overlapping LTR rule the trainer uses at merge time (FR-10), which is
why the encoder's output matches what the trainer would have produced on
the same bytes.
For three identical adjacent tokens [x, x, x] with merge (x, x), the
result is [xx, x], not [x, xx]: the first pair is consumed, the
cursor advances past the consumed pair, and the trailing x is emitted
unchanged. See
tests/test_trainer.py::test_train_non_overlapping_merge_on_triple_emits_merged_then_trailing
for the training-time pin on the same invariant.
No branching, no pre-loops, no special-case for the empty sequence
(b"".join(iter([])) is already b"", and b"".decode("utf-8") == "").
KeyError propagates out of the generator expression when an id is
missing from vocab. UnicodeDecodeError propagates out of the final
decode call when the concatenated bytes are not valid UTF-8. The strict
mode is explicit, not the Python default, because the PRD requires it
and a silent fallback to errors="replace" is the most common silent
failure for this kind of code.
Three encode/decode bugs are easy to miss: they pass the
obvious tests and surface only against specific constructions. The
roundtrip suite pins each one with a dedicated case.
Single-match-per-rank encoder bug. An implementation that scans for the
ranked pair once, applies one replacement, and advances to the next rank
still produces correct output when every pair has at most one match per
chunk. Inputs where the same pair appears multiple times in a single
chunk expose it. Any multi-match roundtrip case catches it in the suite;
the minimal reproducer is a chunk like [97, 98, 97, 98] with merge
(97, 98), which must emit [256, 256] in a single pass, not [256, 97, 98].
Partial-special false match. An implementation that uses str.find
or a regex for special extraction without bounding the match to the exact
literal can match <|endoftext|> starting inside a partial string like
"...<|endoftext|>..." but also silently against a prefix like
"<|endoftext". The three partial-special roundtrip cases, prefix,
suffix, and the shorter <|endo, expose this immediately because the
decoded output would drop the mismatched tail or consume too many
characters.
Decoder errors="replace" instead of errors="strict". Passing
errors="replace" to the final bytes.decode call silently substitutes
invalid byte sequences with U+FFFD instead of raising
UnicodeDecodeError. Every valid input still roundtrips, so this bug
never appears outside the invalid-UTF-8 test. The dedicated case uses
decode([0x80]), a lone continuation byte with no preceding start byte,
which is always invalid UTF-8 regardless of what merges were learned.
Related reading
Public Tokenizer API: how Tokenizer.encode and
Tokenizer.decode wrap _encoder.encode and _decoder.decode as
one-line delegations.
Roundtrip Suite: the full 55-test proof of
FR-17 through FR-25 via the public Tokenizer API only.
Phase 2 Core Algorithm: pair counting,
tie-breaking, and the training-time merge semantics that the encoder
mirrors at inference time.
Phase 2 Persistence: the vocab and merge
list shapes the encoder and decoder consume.