Roundtrip Suite: 55 tests proving decode(encode(text)) == text through the public API
Parametrized cases, shared fixtures, and save/load parity design for the public-API roundtrip tests.
Updated 2026-04-15
TL;DR
Fifty-five tests across seven functions in tests/test_roundtrip.py verify
FR-17 through FR-26 using only the public Tokenizer API; no internal
module is imported, and no token id is asserted directly.
Fifteen parametrized cases cover every required input class: empty,
whitespace, ASCII, emoji, CJK, Arabic, mixed punctuation, the literal
<|endoftext|> in three configurations, and the three partial-special
variants. The same list drives both the live-tokenizer run and the
save/load parity run.
Two shared fixtures keep the suite under 0.05 seconds: a session-scoped
trained_tokenizer executes the training run exactly once per pytest
invocation, and a module-scoped saved_and_loaded_tokenizer writes and
reloads the artifact exactly once per test module.
What lives here
| File | Purpose |
|---|---|
| tests/test_roundtrip.py | 55-test public-API suite; seven test functions, four of them parametrized over the shared ROUNDTRIP_CASES list |
| tests/conftest.py | Session-scoped trained_tokenizer fixture: Tokenizer.train(tiny_corpus, vocab_size=260); constructed exactly once per pytest invocation |
| tests/fixtures/tiny.txt | 212-byte ASCII training corpus (two pangrams × five lines); see Phase 2 Fixtures for the byte invariants |
| src/bpetite/_tokenizer.py | The class under test; every assertion flows through its five public methods |
| src/bpetite/_persistence.py | The save/load layer exercised by the module-scoped saved_and_loaded_tokenizer fixture |
Key invariants
| FR | Invariant | Consequence if violated |
|---|---|---|
| FR-25 | decode(encode(text)) == text for every supported input class, asserted through the public Tokenizer API. | Any bug in the encoder, decoder, or special-token extraction corrupts inputs silently; the round-trip contract fails. |
| FR-26 | Save/load preserves encode output byte-for-byte; a reloaded tokenizer is interchangeable with its source. | A committed artifact encodes to different ids than the training-time tokenizer; deployments diverge from training. |
| Task 3-4 design | This file imports only `from bpetite import Tokenizer`. Internal imports belong to Phase 2 tests, not here. | Coupling the suite to internal APIs hides regressions in the public contract; breakages in _tokenizer.py pass silently. |
| Task 3-4 design | No test asserts specific token ids; only string identity after roundtrip. | Brittle tests break on merge-order changes that are semantically equivalent; the suite becomes a maintenance tax. |
| pytest config | tests/__init__.py does not exist; pytest runs in importlib import mode. | importlib import mode silently breaks, and relative-path tricks sneak into the suite. |
Walkthrough
The 15 parametrized cases
The ROUNDTRIP_CASES list in tests/test_roundtrip.py:20 is shared across
three parametrized test functions: test_roundtrip, test_post_load_encode_parity,
and test_post_load_roundtrip. Adding a new input class means adding one
entry to this list, not writing a new test function.
Fifteen parametrized cases across three parametrized functions, plus ten
scalar tests, give 55 tests. That matches
uv run pytest tests/test_roundtrip.py -v → 55 passed.
The two shared fixtures
The suite trains a tokenizer exactly once per pytest invocation and
writes/reloads the artifact exactly once per test module, regardless of
how many cases use them. The session-scoped fixture lives in the shared
conftest so future Phase 3 or Phase 4 suites can reuse it; the
module-scoped fixture lives inside test_roundtrip.py because it is
local to this file's save/load parity design.
vocab_size=260 yields up to 4 learned merges on top of the 256 base
bytes. That is enough to exercise the multi-rank encode path without
paying for a long training run; training on the tiny corpus early-stops
before any large merge table can form.
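A self-contained sketch of the two fixtures and their scopes. `FakeTokenizer` is a stand-in defined here only so the snippet runs anywhere; the real session fixture calls `Tokenizer.train(tiny_corpus, vocab_size=260)` from bpetite, and the real module fixture goes through the package's save/load layer:

```python
import pytest

class FakeTokenizer:
    """Stand-in for bpetite.Tokenizer; just enough surface to show scoping."""
    def __init__(self, payload: str = "merge-table"):
        self.payload = payload

    def save(self, path):
        path.write_text(self.payload)

    @classmethod
    def load(cls, path):
        return cls(path.read_text())

@pytest.fixture(scope="session")
def trained_tokenizer():
    # Session scope: the (expensive) training run happens exactly once
    # per pytest invocation, however many tests request it.
    return FakeTokenizer()

@pytest.fixture(scope="module")
def saved_and_loaded_tokenizer(trained_tokenizer, tmp_path_factory):
    # Module scope cannot use the function-scoped tmp_path fixture,
    # so the temporary directory comes from tmp_path_factory instead.
    artifact = tmp_path_factory.mktemp("roundtrip") / "tokenizer.json"
    trained_tokenizer.save(artifact)
    return FakeTokenizer.load(artifact)
```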
Two details carry weight in this fixture. First, tmp_path_factory is
used instead of tmp_path because the fixture is module-scoped; pytest's
function-scoped tmp_path is not available in a module-scoped fixture.
Second, the artifact is written and reloaded exactly once, then the
resulting instance is reused across all thirty test_post_load_* cases.
Every parity assertion is therefore against the same loaded tokenizer,
so a save/load bug that flapped nondeterministically would fail all
thirty cases consistently rather than a random subset.
The [0x80] UnicodeDecodeError construction
Testing FR-23 requires producing a sequence of token ids whose
concatenated bytes form an invalid UTF-8 sequence. The suite uses a
single base-byte id whose bytes are always invalid under strict UTF-8
decoding regardless of what merges were learned:
Token id 0x80 (decimal 128) resolves to the single byte b"\x80". In
UTF-8, 0x80 is a continuation byte; it is only valid when immediately
preceded by a multi-byte lead byte (0xC2–0xF4). A lone continuation
byte is malformed, so bytes.decode("utf-8", errors="strict") raises
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte.
The point of this construction is that it does not depend on any
particular training run. No merge learned from tiny.txt can mint a
token with these bytes, because the base vocab reserves id 128 for
b"\x80" and the training process never mints a new token id ≤ 255.
The assertion is stable across trainer implementations.
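The byte-level claim is easy to check in isolation; a minimal snippet reproducing the error with no tokenizer involved:

```python
# A lone UTF-8 continuation byte (0x80) is malformed under strict decoding:
# it may only appear after a multi-byte lead byte.
raw = bytes([0x80])  # the bytes behind base-vocab token id 128

try:
    raw.decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
```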
Save/load parity design
The parity run does not write a new artifact per case. One artifact is
written by the module-scoped fixture at module setup time, then reloaded
exactly once. The resulting saved_and_loaded_tokenizer is compared
against the session-scoped trained_tokenizer via two independent
assertions per case.
The encode_parity test is the strict one: it asserts that the two
tokenizers produce byte-identical id lists for the same input, not just
that both roundtrip. A save/load bug that perturbs merge order without
breaking the roundtrip invariant would pass test_post_load_roundtrip
but fail test_post_load_encode_parity.
Failure modes
| Failure | Exception type | FR | Caught by |
|---|---|---|---|
| Any roundtrip case does not return the original string | AssertionError | FR-25 | `test_roundtrip[<label>-...]` for the offending case |
| Save/load produces a tokenizer whose encode output differs from the source | AssertionError | FR-26 | `test_post_load_encode_parity[<label>-...]` |
| tests/__init__.py is accidentally created | ImportError at collect | n/a | No automated test; pytest will fail to collect the whole suite |
| A new roundtrip case is added without also covering the save/load parity run | Missing coverage | FR-26 | Code review: the three parametrized functions must share ROUNDTRIP_CASES |
| Internal module imported in test_roundtrip.py | n/a | n/a | Code review: this file is public-API-only by contract |
Related reading
- Public Tokenizer API: the five methods the suite exercises; the delegation-only contract this suite verifies end-to-end.
- Encode and Decode: the algorithm underneath every roundtrip assertion; the per-rank merge and strict-UTF-8 decode paths.
- Phase 2 Fixtures: the tiny.txt corpus consumed by the session-scoped trained_tokenizer fixture; byte invariants and the whitespace-preservation rule.
- Phase 2 Persistence: the atomic save and strict load that test_post_load_* exercises through the public API.