Every machine-readable result (train JSON summary, encode compact JSON array,
decode raw text) is written via sys.stdout.write with no Rich involvement. Every
banner, panel, progress line, and error panel is rendered through a shared Rich
Console constructed with stderr=True. A single stray print() call on the wrong
channel fails the contract test suite immediately.
The train progress callback is threaded through internal train_bpe, not through
public Tokenizer.train. FR-30 pins the public method signature to
(corpus, vocab_size) exactly, so the CLI calls train_bpe directly with a callback
closure and wraps the returned TrainerResult in a Tokenizer via the existing
constructor.
Subprocess-level tests in tests/test_cli.py drive the installed console entry point
through Path(sys.executable).parent / "bpetite" rather than a second uv run layer,
preserving strict stdout/stderr separation and avoiding nested venv resolution.
What lives here
File
Purpose
src/bpetite/_cli.py
main, _build_parser, the three subcommand handlers (_cmd_train, _cmd_encode, _cmd_decode), the _train_with_progress callback wiring, every error router, every stdout write
src/bpetite/_trainer.py
train_bpe(corpus, vocab_size, *, progress) and the ProgressEvent dataclass the CLI imports; the internal entry point the CLI uses for the callback-enabled training path
src/bpetite/_ui.py
The shared stderr Console and the panel helpers the handlers render through (documented in full at Rich Presentation Layer)
tests/test_cli.py
13 subprocess-level contract tests covering every happy path, every runtime failure mode, and strict channel separation; two session-scoped fixtures (trained artifact and progress corpus)
tests/conftest.py
tiny_corpus_path session fixture shared with the CLI tests; returns the filesystem path to tests/fixtures/tiny.txt
Key invariants
FR
Invariant
Consequence if violated
FR-34
Machine-readable command results are written to stdout only. Written via sys.stdout.write, never through Rich, never interleaved with stderr content.
Downstream scrapers that consume bpetite encode output break when stdout contains any bytes other than the JSON array. The contract is byte-exact.
FR-33
CLI errors are written to stderr and return non-zero exit codes. Every runtime failure catches a specific exception type and routes through _fail to sys.exit(1).
Uncaught exceptions print a Python traceback to stderr. The channel is still correct, but stack frames leak internal implementation state.
FR-32
The CLI exposes the explicit subcommand set train, encode, decode. Subparsers are required=True; invoking bare bpetite exits with argparse's standard code 2.
A missing subcommand silently dispatches nothing and exits 0, hiding misconfiguration from CI and shell scripts.
FR-30
Tokenizer.train(corpus, vocab_size) is the pinned public signature. The progress callback must not be added to this method.
Adding a progress keyword argument to the public method rewrites the public API contract; every published artifact and every downstream user pins on the current signature.
(local)
Subprocess tests invoke Path(sys.executable).parent / "bpetite" directly, not uv run bpetite.
A nested uv run layer triggers a second venv resolution that can emit stderr warnings, deadlock under certain OS conditions, and slow the suite unnecessarily.
(local)
Every error message on stderr is grep-friendly: title, body message, and optional recovery hint, all as plain text through render_error rather than raw Python tracebacks.
Debuggers reading CI logs resort to pattern-matching against traceback lines, which are fragile across Python minor versions.
Walkthrough
The CLI in one diagram
text
argv
|
v
main() in _cli.py
|
v
_build_parser() -> ArgumentParser(prog="bpetite") with required subparsers
|
v
args = parser.parse_args()
|
v
dispatch on args.command:
train -> _cmd_train(args)
encode -> _cmd_encode(args)
decode -> _cmd_decode(args)
|
v
subcommand handler does its work and ends with:
sys.stdout.write(<machine-readable result>)
sys.stdout.flush()
Everywhere else:
_ui.render_banner(), _ui.render_kv_box(), _ui.render_error()
-> Console(stderr=True) -> stderr
Only the three sys.stdout.write sites at the end of each handler ever touch stdout.
Every other write, including configuration panels, progress lines, the completion panel,
and error panels, routes through _ui.console, which is a stderr-only Console.
train traced end-to-end
A single uv run bpetite train --input data/tinyshakespeare.txt --vocab-size 512 --output out.json invocation runs this sequence:
Argparse.main calls _build_parser().parse_args(). The train subparser is
required, so missing --input, --vocab-size, or --output triggers argparse's
default exit code 2 with a usage message on stderr. The CLI does not override this.
Dispatch.main dispatches on args.command == "train" and calls _cmd_train.
Banner + configuration panel (stderr)._cmd_train calls render_banner() and
then render_kv_box with the four configured values (input path, vocab size, output
path, force flag). Both render to the shared stderr Console. The banner only
appears when the terminal is fully interactive and at least 95 columns wide; see
Rich Presentation Layer for the gating rules.
Output path preflight._check_output_path fails fast if the destination exists
without --force, or if its parent directory is missing. Both failures route through
_fail and exit 1 before any training work starts.
Corpus read._read_corpus_or_exit(args.input) reads the file as bytes and
decodes as strict UTF-8. FileNotFoundError, OSError, and UnicodeDecodeError each
route to a distinct _fail message on stderr.
Trainer span starts.t0 = time.perf_counter() at _cli.py:112. This is the
boundary of elapsed_ms (see trainer span in the index vocabulary).
_train_with_progress with callback. The CLI calls _train_with_progress, which
invokes internal train_bpe with a local closure that renders each ProgressEvent
as a plain styled console.print line on stderr. See the progress-callback section
below for the full wiring rationale.
Trainer span ends.elapsed_ms = round((time.perf_counter() - t0) * 1000, 2) at
_cli.py:114. The timer stops immediately after training and before any artifact
save.
Tokenizer wrap. The TrainerResult is packed into a Tokenizer via the
existing constructor (Tokenizer(vocab=..., merges=..., special_tokens=...)). No
separate classmethod is needed because the constructor already accepts the raw
state; see Public Tokenizer API
for why the class ships a single wrapping path.
Atomic save._save_or_exit(tokenizer, args.output, overwrite=bool(args.force))
delegates to _persistence.save. FileExistsError, FileNotFoundError, and
generic OSError each route to a distinct _fail message.
Completion panel (stderr).render_kv_box with a green border renders the final
summary panel: corpus bytes, requested vocab size, actual mergeable vocab size,
special-token count, elapsed ms, saved path.
JSON summary (stdout). The one machine-readable write:
sys.stdout.write(json.dumps(summary) + "\n"); sys.stdout.flush(). Five keys,
default JSON separators, one trailing newline.
Every step between 1 and 12 that renders anything does so on stderr. The only stdout
write is step 12.
Machine-readable output shapes
The three subcommands each emit exactly one machine-readable payload. The exact shape
matters because downstream tooling pins against it.
corpus_bytes (int): UTF-8 byte length of the input corpus.
requested_vocab_size (int): the --vocab-size value passed on the command line.
actual_mergeable_vocab_size (int): the learned mergeable vocabulary size after
training. Equals 256 + len(merges). May be smaller than requested_vocab_size if
early-stop fired.
special_token_count (int): number of reserved special tokens. Always 1 in v1
(exactly <|endoftext|>).
elapsed_ms (float): trainer span in milliseconds, rounded to two decimal places.
Wraps only _train_with_progress. Not the command wall clock; see
Benchmark Harness.
The JSON uses default separators (, and : ). Default separators are intentional: a
five-field object is readable to a human scanning CI logs, and the outer contract is key
presence and types, not byte-exact compactness.
encode. A compact JSON array of token ids:
json
[72,408,111,44,263,270,312,33]
The compactness is byte-exact: json.dumps(ids, separators=(",", ":")). No spaces, no
trailing newline beyond the one the CLI appends. Downstream code calling
json.loads(result.stdout) parses this directly.
decode. Raw decoded text, written with sys.stdout.write(text):
text
Hello, world!
No JSON wrapper, no label, no trailing newline added beyond what the original decoded
text contained. If the decoded text ended with \n (because the token sequence encoded
one), that newline is preserved. Otherwise the output ends without a newline.
Progress-callback wiring
The train subcommand needs a way to render progress output without adding a callback
parameter to public Tokenizer.train. FR-30 pins the public method to the exact
signature (corpus, vocab_size), so the callback lives on internal train_bpe instead.
bpetite._trainer exposes two relevant symbols: the ProgressEvent frozen dataclass
and the train_bpe function with the keyword-only progress parameter.
The three event kinds fire at three specific points in _trainer.py:
"start" fires once before the merge loop begins. merges_planned equals
vocab_size - 256; merges_completed equals 0.
"merge" fires every time (step + 1) % _PROGRESS_EVERY == 0, where
_PROGRESS_EVERY = 100. Practically: after merges 100, 200, 300, and so on.
merges_completed equals step + 1 at the moment of emission.
"complete" fires once after the loop exits, including when the trainer early-stops
because pair_counts became empty. On early-stop,
merges_completed < merges_planned.
The CLI wires the callback inside _train_with_progress in _cli.py:
python
# src/bpetite/_cli.pydef_train_with_progress(corpus:str,vocab_size:int)->TrainerResult:def_on_event(event:ProgressEvent)->None:ifevent.kind=="start":console.print(f"[info]Training started: planned={event.merges_planned}[/info]")elifevent.kind=="merge":console.print(f"[info]Training merges: {event.merges_completed}"f" / {event.merges_planned}[/info]")else:# completeconsole.print(f"[success]Training complete: merges={event.merges_completed}[/success]")try:returntrain_bpe(corpus,vocab_size,progress=_on_event)exceptValueErrorasexc:_fail(title="Invalid vocab size",message=str(exc),hint="--vocab-size must be at least 256.",)
Three observations on this wiring:
Public Tokenizer.train is not called at all. The handler calls train_bpe
directly, catches the ValueError that invalid vocab sizes raise, and constructs a
Tokenizer from the returned TrainerResult with the existing constructor. Public
Tokenizer.train remains the user-facing entry point for library callers who do not
need a callback.
The callback is plain console.print, not a Rich Progress bar. An earlier
design used rich.progress.Progress with a lazy TaskID; it broke on zero-merge
runs, early-stop runs, and invalid-vocab runs. The full design decision is at
Rich Presentation Layer.
Each lifecycle line uses a uniquely anchored substring. The lifecycle line for
the complete event is "Training complete: merges={N}". The completion panel title
is also "Training complete". A test that asserts only on "Training complete"
would pass even if the lifecycle line were deleted, because the panel title still
matches. The contract test uses the longer "Training complete: merges=" substring
specifically to anchor on the lifecycle line and catch that regression class.
Exit code taxonomy
Exit code
Condition
Source
0
Subcommand completed successfully; machine-readable payload on stdout
Normal return from the subcommand handler
1
Runtime failure caught by an explicit except clause
_fail wrapper; every branch in _read_corpus_or_exit, _load_model_or_exit, _save_or_exit, _train_with_progress, _cmd_decode
The CLI never raises uncaught exceptions out of main. Every known failure type has a
named except clause that routes through _fail to render_error on stderr and
sys.exit(1). The error panel is always structured as a title, a body message, and an
optional recovery hint, so CI log readers can grep for any of the three fields.
Subprocess-level contract tests
tests/test_cli.py is the enforcement layer. It does not import _cli; it drives the
installed console entry point through subprocess.run and captures both streams. This
preserves the channel boundary that in-process imports would collapse.
The executable._cli_executable() resolves Path(sys.executable).parent / "bpetite"
(test_cli.py:37-45), which points directly at the venv's installed console script. Two
properties matter: it is a deterministic absolute path under uv run pytest on both
macOS and Linux, and it avoids a nested uv run wrapper that would trigger a second
venv resolution step inside the already-active venv.
Session fixtures. The test suite uses two session-scoped fixtures to keep expensive
operations from running per test:
cli_trained_artifact: trains a tokenizer at vocab_size=260 against
tests/fixtures/tiny.txt exactly once per session. Every encode and decode test
reuses this artifact.
progress_corpus_path: writes a deterministic synthetic ~15 KB corpus seeded with
random.Random(0xBADC0FFEE). 2500 random ASCII words with enough distinct bigrams
that training to vocab_size=480 (224 planned merges) completes fully and fires the
"merge" event at least twice. Ordinary fixtures cannot do this. tests/fixtures/ tiny.txt completes in far fewer than 100 merges and never trips the every-100-merges
callback branch.
Channel-separation assertion pattern. Every test asserts both streams. On success:
python
result=_run_cli("encode","--model",str(cli_trained_artifact),"--text","Hello")assertresult.returncode==0ids=json.loads(result.stdout)# stdout: valid JSON, nothing elseassertisinstance(ids,list)assertall(isinstance(i,int)foriinids)# stderr assertions: either empty, or does not contain JSON-key substrings
On failure, the inverse:
python
result=_run_cli("encode","--model","nonexistent.json","--text","x")assertresult.returncode!=0assertresult.stdout==""# nothing leaked to stdout on failureassertlen(result.stderr)>0# error message reached stderr
Some tests go further and assert that specific JSON key substrings ('"corpus_bytes":',
'"elapsed_ms":', and so on) are absent from stderr, guarding against a double-write
regression where the CLI accidentally prints its summary to both channels.
Failure modes
Failure
Exception type / exit code
FR
Caught by
Machine-readable JSON leaks onto stderr
AssertionError (test)
FR-34
test_train_success_writes_json_summary_on_stdout
Progress or error text leaks onto stdout
AssertionError (test)
FR-33
Every success test that asserts result.stderr == "" and every failure test that asserts result.stdout == ""
Error path tested indirectly via _load_model_or_exit branch coverage
Missing subcommand (bpetite alone)
exit 2
FR-32
Argparse default; covered in test_no_subcommand_exits_with_usage
Silent failure modes called out by name
Three bugs are easy to miss: they pass ordinary tests and surface only
against specific test shapes. Each is pinned by a dedicated case.
Every-100-merges branch never fires. Training at vocab_size=260 against the
standard tiny fixture completes 4 merges total. The every-100-merges branch in
_train_with_progress never runs. A test that uses only the tiny fixture would pass
even if the branch were deleted. progress_corpus_path exists specifically to force
=100 merges: it writes a deterministic synthetic corpus, the test runs with
--vocab-size 480 to plan 224 merges, and the assertion on "Training merges:"
substring in stderr proves the branch fired. Deterministic seed (random.Random(0xBADC0FFEE))
means the event count is reproducible across machines and runs.
Double-anchored substring matches two rendering surfaces. The lifecycle line for
the complete event reads Training complete: merges={N} through console.print.
The completion panel title is Training complete. A test asserting only on the bare
substring "Training complete" would pass even if the entire else: # complete branch
in _train_with_progress were deleted, because the panel title would still match.
The contract test asserts on "Training complete: merges=" (note the colon and
field label) specifically to anchor on the lifecycle line, not the panel title.
Double-write regression. An implementation that writes the train JSON summary to
both stdout (via sys.stdout.write) and stderr (accidentally, through a stray
console.print(json.dumps(...))) still satisfies a bare returncode == 0 plus
json.loads(result.stdout) assertion. The contract test additionally asserts that
every key substring ('"corpus_bytes":', '"requested_vocab_size":',
'"actual_mergeable_vocab_size":', '"special_token_count":', '"elapsed_ms":') is
absent from stderr, catching this class of regression.
Related reading
Rich Presentation Layer: the shared stderr Console, the
themed palette, and the full design decision behind plain console.print lifecycle
lines instead of a live rich.progress.Progress bar.
Benchmark Harness: the elapsed_ms trainer span
distinction, cross-linked from the train JSON summary discussion above.
Phase 3 Public Tokenizer API: the five-method contract
the CLI wraps; why the CLI bypasses Tokenizer.train for the callback-enabled path.
Phase 2 Core Algorithm: the trainer the CLI drives;
the early-stop behavior that produces actual_mergeable_vocab_size < requested_vocab_size.
README.md: user-facing CLI examples with captured real outputs.