all documents
Phase 3

Roundtrip Suite

Parametrized cases, shared fixtures, and save/load parity design for the public-API roundtrip tests.

7 min read updated 2026-04-15 1574 words

Roundtrip Suite: 55 tests proving decode(encode(text)) == text through the public API

TL;DR

  • Fifty-five tests across seven functions in tests/test_roundtrip.py prove FR-17 through FR-26 through the public Tokenizer API only; no internal module is imported, and no token id is asserted directly.
  • Fifteen parametrized cases cover every required input class: empty, whitespace, ASCII, emoji, CJK, Arabic, mixed punctuation, the literal <|endoftext|> in three configurations, and the three partial-special variants. The same list drives both the live-tokenizer run and the save/load parity run.
  • Two shared fixtures keep the suite under 0.05 seconds: a session-scoped trained_tokenizer executes the training run exactly once per pytest invocation, and a module-scoped saved_and_loaded_tokenizer writes and reloads the artifact exactly once per test module.

What lives here

File Purpose
tests/test_roundtrip.py 55-test public-API suite; seven test functions, four of them parametrized over the shared ROUNDTRIP_CASES list
tests/conftest.py Session-scoped trained_tokenizer fixture: Tokenizer.train(tiny_corpus, vocab_size=260); constructed exactly once per pytest invocation
tests/fixtures/tiny.txt 212-byte ASCII training corpus (two pangrams × five lines); see Phase 2 Fixtures for the byte invariants
src/bpetite/_tokenizer.py The class under test; every assertion flows through its five public methods
src/bpetite/_persistence.py The save/load layer exercised by the module-scoped saved_and_loaded_tokenizer fixture

Key invariants

FR Invariant Consequence if violated
FR-25 decode(encode(text)) == text for every supported input class, asserted through the public Tokenizer API. Any bug in the encoder, decoder, or special-token extraction corrupts inputs silently; the round-trip contract fails.
FR-26 Save/load preserves encode output byte-for-byte; a reloaded tokenizer is interchangeable with its source. A committed artifact decodes to different ids than the training-time tokenizer; deployments diverge from training.
Task 3-4 design This file imports only from bpetite import Tokenizer. Internal imports belong to Phase 2 tests, not here. Coupling the suite to internal APIs hides regressions in the public contract; breakages in _tokenizer.py pass silently.
Task 3-4 design No test asserts specific token ids; only string identity after roundtrip. Brittle tests break on merge-order changes that are semantically equivalent; the suite becomes a maintenance tax.
pytest config tests/__init__.py does not exist; pytest runs in importlib import mode. importlib import mode silently breaks and relative-path tricks sneak into the suite.

Walkthrough

The 15 parametrized cases

The ROUNDTRIP_CASES list in tests/test_roundtrip.py:20 is shared across three parametrized test functions: test_roundtrip, test_post_load_encode_parity, and test_post_load_roundtrip. Adding a new input class means adding one entry to this list, not writing a new test function.

Case label Input Proves
empty string "" FR-17 / FR-21 roundtrip
single space " " whitespace preserved
whitespace only " \n\t " mixed whitespace preserved
ascii sentence "Hello, world!" ASCII + punctuation merges correctly
emoji "\U0001f389 party time \U0001f38a" 4-byte UTF-8 sequences preserved
cjk "\u4f60\u597d\u4e16\u754c" 3-byte CJK preserved
arabic "\u0645\u0631\u062d\u0628\u0627 \u0628\u0627\u0644\u0639\u0627\u0644\u0645" RTL, 2-byte Arabic preserved
mixed punctuation "...wait\u2014what?!" em dash plus ASCII punctuation runs
single endoftext "<|endoftext|>" FR-16 special extraction, FR-24 special decode
multiple non-consecutive specials "hello <|endoftext|> world <|endoftext|> again" multiple extraction points in one input
endoftext surrounded by unicode "\u524d <|endoftext|> \u5f8c" extraction around multi-byte segments
consecutive specials "<|endoftext|><|endoftext|>" FR-19 consecutive special ids in order
partial special prefix "<|endoftext" FR-18 prefix flows through
partial special suffix "endoftext|>" FR-18 suffix flows through
partial special short "<|endo" FR-18 short prefix flows through

The seven test functions and what they prove

Function Cases Asserts
test_roundtrip 15 trained_tokenizer.decode(trained_tokenizer.encode(text)) == text for every case
test_encode_empty_returns_empty_list 1 trained_tokenizer.encode("") == [] (FR-17 explicit)
test_decode_empty_returns_empty_string 1 trained_tokenizer.decode([]) == "" (FR-21 explicit)
test_decode_unknown_id_raises_key_error 1 trained_tokenizer.decode([999_999]) raises KeyError (FR-22)
test_decode_invalid_utf8_raises 1 trained_tokenizer.decode([0x80]) raises UnicodeDecodeError (FR-23)
test_train_below_base_vocab_raises_value_error 5 Tokenizer.train(tiny_corpus, bad) raises ValueError for bad in [-1, 0, 1, 100, 255] (FR-9)
test_post_load_encode_parity 15 saved_and_loaded_tokenizer.encode(text) == trained_tokenizer.encode(text) for every case (FR-26)
test_post_load_roundtrip 15 Full roundtrip invariant holds through the save/load boundary for every case (FR-25 + FR-26)
test_save_and_load_returns_tokenizer_instance 1 isinstance(Tokenizer.load(path), Tokenizer) sanity check

Fifteen parametrized cases across three parametrized functions, plus ten scalar tests, give 55 tests. That matches uv run pytest tests/test_roundtrip.py -v55 passed.

The two shared fixtures

The suite trains a tokenizer exactly once per pytest invocation and writes/reloads the artifact exactly once per test module, regardless of how many cases use them. The session-scoped fixture lives in the shared conftest so future Phase 3 or Phase 4 suites can reuse it; the module-scoped fixture lives inside test_roundtrip.py because it is local to this file's save/load parity design.

python
# tests/conftest.py
@pytest.fixture(scope="session")
def trained_tokenizer(tiny_corpus: str) -> Tokenizer:
    return Tokenizer.train(tiny_corpus, vocab_size=260)

vocab_size=260 yields up to 4 learned merges on top of the 256 base bytes. That is enough to exercise the multi-rank encode path without paying for a long training run; the tiny corpus early-stops before any large merge table can form.

python
# tests/test_roundtrip.py:48
@pytest.fixture(scope="module")
def saved_and_loaded_tokenizer(
    trained_tokenizer: Tokenizer,
    tmp_path_factory: pytest.TempPathFactory,
) -> Tokenizer:
    artifact_dir = tmp_path_factory.mktemp("roundtrip_artifact")
    artifact_path = artifact_dir / "tokenizer.json"
    trained_tokenizer.save(str(artifact_path))
    return Tokenizer.load(str(artifact_path))

Two details carry weight in this fixture. First, tmp_path_factory is used instead of tmp_path because the fixture is module-scoped; pytest's function-scoped tmp_path is not available in a module-scoped fixture. Second, the artifact is written and reloaded exactly once, then the resulting instance is reused across all thirty test_post_load_* cases. Every parity assertion is therefore against the same loaded tokenizer, so a save/load bug that flapped nondeterministically would fail all thirty cases consistently rather than a random subset.

The [0x80] UnicodeDecodeError construction

Testing FR-23 requires producing a sequence of token ids whose concatenated bytes form an invalid UTF-8 sequence. The suite uses a single base-byte id whose bytes are always invalid under strict UTF-8 decoding regardless of what merges were learned:

python
def test_decode_invalid_utf8_raises(trained_tokenizer: Tokenizer) -> None:
    with pytest.raises(UnicodeDecodeError):
        trained_tokenizer.decode([0x80])

Token id 0x80 (decimal 128) resolves to the single byte b"\x80". In UTF-8, 0x80 is a continuation byte; it is only valid when immediately preceded by a multi-byte lead byte (0xC20xF4). A lone continuation byte is malformed, so bytes.decode("utf-8", errors="strict") raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte.

The point of this construction is that it does not depend on any particular training run. No merge learned from tiny.txt can mint a token with these bytes, because the base vocab reserves id 128 for b"\x80" and the training process never mints a new token id ≤ 255. The assertion is stable across trainer implementations.

Save/load parity design

The parity run does not write a new artifact per case. One artifact is written by the module-scoped fixture at module setup time, then reloaded exactly once. The resulting saved_and_loaded_tokenizer is compared against the session-scoped trained_tokenizer via two independent assertions per case:

python
# tests/test_roundtrip.py:108
@pytest.mark.parametrize(("label", "text"), ROUNDTRIP_CASES)
def test_post_load_encode_parity(
    saved_and_loaded_tokenizer: Tokenizer,
    trained_tokenizer: Tokenizer,
    label: str,
    text: str,
) -> None:
    assert saved_and_loaded_tokenizer.encode(text) == trained_tokenizer.encode(text)

@pytest.mark.parametrize(("label", "text"), ROUNDTRIP_CASES)
def test_post_load_roundtrip(
    saved_and_loaded_tokenizer: Tokenizer, label: str, text: str
) -> None:
    assert (
        saved_and_loaded_tokenizer.decode(saved_and_loaded_tokenizer.encode(text))
        == text
    )

The encode_parity test is the strict one: it asserts that the two tokenizers produce byte-identical id lists for the same input, not just that both roundtrip. A save/load bug that perturbs merge order without breaking the roundtrip invariant would pass test_post_load_roundtrip but fail test_post_load_encode_parity.

Failure modes

Failure Exception type FR Caught by
Any roundtrip case does not return the original string AssertionError FR-25 test_roundtrip[<label>-...] for the offending case
Save/load produces a tokenizer whose encode output differs from the source AssertionError FR-26 test_post_load_encode_parity[<label>-...]
tests/__init__.py is accidentally created ImportError at collect n/a No automated test; pytest will fail to collect the whole suite
A new roundtrip case is added without also covering the save/load parity run Missing coverage FR-26 Code review: the three parametrized functions must share ROUNDTRIP_CASES
Internal module imported in test_roundtrip.py n/a n/a Code review: this file is public-API-only by contract
  • Public Tokenizer API: the five methods the suite exercises; the delegation-only contract this suite verifies end-to-end.
  • Encode and Decode: the algorithm underneath every roundtrip assertion; the per-rank merge and strict-UTF-8 decode paths.
  • Phase 2 Fixtures: the tiny.txt corpus consumed by the session-scoped trained_tokenizer fixture; byte invariants and the whitespace-preservation rule.
  • Phase 2 Persistence: the atomic save and strict load that test_post_load_* exercises through the public API.
  • docs/bpetite-prd-v2.md: FR-17, FR-21, FR-22, FR-23, FR-25, FR-26.
  • tests/test_roundtrip.py: the full 55-test suite.
  • tests/conftest.py: the session-scoped trained_tokenizer fixture and the existing fixture surface it joins.