Concept¶
Design document
This page describes Driftcut's design philosophy and planned architecture. Features marked with :white_check_mark: are implemented; others are under development. See the Roadmap for current status.
Core insight¶
Driftcut doesn't do a complete evaluation. It answers a simpler, earlier question:
"Should we continue this migration, or is it already proving to be a bad idea?"
The core of the project is early-stop decision support for model migration - not dashboards, experiment tracking, prompt management, or generic evaluation.
What Driftcut is¶
- A pre-evaluation filter
- A migration canary
- A budget-saving decision layer
- A tool to decide whether it's worth continuing
What Driftcut is not¶
- A general eval framework
- An experiment tracking platform
- A prompt optimization system
- A full LLM observability tool
- A replacement for a full evaluation
If your team already uses an eval framework for full evaluations, Driftcut sits before it - the filter that decides whether the full evaluation is worth running.
How it works¶
Instead of evaluating the entire corpus, Driftcut:
- Divides the corpus into categories. :white_check_mark:
- Samples small, representative batches - prioritizing high-criticality prompts. :white_check_mark:
- Compares baseline and candidate on latency and cost. :white_check_mark:
- Runs deterministic checks and classifies concrete failures such as schema breaks, missing content, and empty outputs. :white_check_mark:
- Judges only ambiguous prompts when semantic comparison is needed. :white_check_mark:
- Decides: stop the test, continue sampling, or declare the candidate ready for full evaluation. :white_check_mark:
The value is in finding out early, rather than too late, that the test is going badly.
The same decision pipeline can also be used in replay mode, where historical paired outputs are sampled and evaluated without re-calling the baseline or candidate models.
Three dimensions of comparison¶
Migration isn't just about output quality. Driftcut compares baseline and candidate across three dimensions:
Quality :white_check_mark:¶
Output quality relative to the baseline: format adherence, completeness, correctness, absence of obvious structural breaks.
The current alpha already combines deterministic checks with tiered judge-based comparison for ambiguous prompts, richer semantic archetypes, and per-category quality scoring.
Latency :white_check_mark:¶
Response time of the candidate relative to the baseline.
For many teams, latency is the primary driver of migration - or the reason it fails. Driftcut measures p50, p95, and variance per category, and flags significant latency regressions even when quality is stable.
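As an illustration of the per-category latency comparison described above, here is a minimal sketch using only the Python standard library. The function names and the `max_p95_ratio` cutoff are assumptions made for this example; they are not Driftcut's actual API or defaults.

```python
import statistics


def latency_summary(samples_ms):
    """Return p50, p95, and sample variance for latencies in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "variance": statistics.variance(samples_ms),
    }


def latency_regressed(baseline_ms, candidate_ms, max_p95_ratio=1.25):
    """Flag a regression when candidate p95 exceeds baseline p95 by the ratio.

    max_p95_ratio is a hypothetical threshold chosen for illustration.
    """
    base = latency_summary(baseline_ms)
    cand = latency_summary(candidate_ms)
    return cand["p95"] > base["p95"] * max_p95_ratio
```

In practice this would be called once per category, so that a regression confined to one category is not averaged away by the others.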
Cost :white_check_mark:¶
Per-prompt cost and total run cost.
Driftcut tracks progressive spend and the spend avoided by stopping an unpromising test early.
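One simple way to estimate the spend avoided is to extrapolate the mean per-prompt cost observed so far to the prompts that were never run. The sketch below assumes that approach; the function name and the extrapolation rule are illustrative and may differ from how Driftcut computes the figure.

```python
def spend_avoided(per_prompt_costs, corpus_size):
    """Estimate the cost of the prompts that were not run after an early stop.

    per_prompt_costs: costs (e.g. in USD) of the prompts sampled so far.
    corpus_size: total number of prompts in the corpus.
    """
    if not per_prompt_costs:
        return 0.0
    mean_cost = sum(per_prompt_costs) / len(per_prompt_costs)
    remaining = max(corpus_size - len(per_prompt_costs), 0)
    # Assumes the unsampled prompts would cost about as much as the sampled ones.
    return mean_cost * remaining
```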
Decision engine :white_check_mark:¶
The decision engine is heuristic-based and explicitly designed as decision support, not an infallible oracle.
Possible outcomes¶
After each batch, Driftcut produces one of three decisions:
| Decision | Meaning |
|---|---|
| Stop now | The candidate is failing critical categories. Abort. |
| Continue sampling | Signals are mixed. More data needed. |
| Proceed to full evaluation | The candidate looks promising across the board. |
Stopping logic¶
Stop now if:
- A high-criticality category exceeds the failure threshold (default: 20%)
- Schema breaks are repeated (default: 25% of batch)
- Divergence stays high across consecutive batches
Continue if:
- Signals are mixed or unstable
- No severe failures but high variance
Proceed if:
- Critical categories remain stable for at least min_batches
- Divergence stays below the risk threshold (default: 8%)
- No structural breaks
- Latency shows no significant regressions
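The rules above can be sketched as a small function. This is an illustration under stated assumptions, not the engine's real code: the `BatchSignals` fields, the decision labels, and the `high_divergence` cutoff are hypothetical, "stable" is read as "no recent batch exceeded the failure threshold", and the high-variance rule is left out for brevity. The 20%, 25%, and 8% defaults and the `min_batches` name come from the text.

```python
from dataclasses import dataclass


@dataclass
class BatchSignals:
    critical_failure_rate: float  # worst failure rate among high-criticality categories
    schema_break_rate: float      # share of the batch with schema breaks
    divergence: float             # overall divergence for the batch
    latency_regressed: bool       # True if a significant latency regression was flagged


def decide(history, min_batches=2, failure_threshold=0.20,
           schema_break_threshold=0.25, risk_threshold=0.08,
           high_divergence=0.20):
    """Return a decision label after the latest batch in `history`."""
    if not history:
        return "continue_sampling"
    latest = history[-1]
    recent = history[-min_batches:]
    enough = len(history) >= min_batches

    # Stop rules: any one of them is sufficient.
    if latest.critical_failure_rate > failure_threshold:
        return "stop_now"
    if latest.schema_break_rate >= schema_break_threshold:
        return "stop_now"
    if enough and all(b.divergence > high_divergence for b in recent):
        return "stop_now"

    # Proceed rules: all of them must hold across the recent window.
    ready = enough and all(
        b.critical_failure_rate <= failure_threshold
        and b.divergence < risk_threshold
        and b.schema_break_rate == 0.0
        and not b.latency_regressed
        for b in recent
    )
    return "proceed_to_full_evaluation" if ready else "continue_sampling"
```

Checking the stop rules first means a severe failure in the latest batch always wins, even if earlier batches looked clean.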
Calibration¶
Default thresholds are conservative - they favor false negatives (stopping a test that might have been fine) over false positives (approving a candidate that fails in production).
You can and should calibrate them via config. The report shows how close results are to each threshold boundary.
Failure archetypes :white_check_mark:¶
The report won't just say "quality drop." It classifies concrete failure modes that are already useful in migration triage:
| Archetype | Description |
|---|---|
| api_error | Model call still failed after Driftcut retried transient transport/provider errors |
| empty_output | Response is empty |
| json_invalid | Output is not valid JSON |
| missing_json_keys | Required keys are missing from parsed JSON |
| invalid_labels | Label output could not be parsed |
| missing_required_content | Required substring was not found |
| forbidden_content | Forbidden substring was found |
| overlong_output | Output exceeded max_output_chars |
| refusal_regression | Candidate refused or deflected a task the baseline completed |
| instruction_miss | Candidate missed the core task even though it remained syntactically valid |
| incomplete_answer | Candidate dropped useful detail or coverage relative to baseline |
| format_drift | Candidate drifted away from the expected presentation or structure |
| hallucination_risk | Judge flagged unsupported or fabricated content risk |
| semantic_regression | Candidate was materially worse in meaning, but no more specific semantic bucket applied |
This turns an abstract score into actionable information. Broader qualitative archetypes such as tone-specific regressions are still future work.
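To make the deterministic archetypes concrete, the following sketch maps a raw output string to one of the labels in the table, or to `None` when every check passes. The archetype names and `max_output_chars` come from the table; the function name, its other parameters, and the order of the checks are assumptions for illustration.

```python
import json


def classify_deterministic(output, expect_json=False, required_keys=(),
                           required_substrings=(), forbidden_substrings=(),
                           max_output_chars=None):
    """Return the first matching failure archetype, or None if all checks pass."""
    if not output.strip():
        return "empty_output"
    if max_output_chars is not None and len(output) > max_output_chars:
        return "overlong_output"
    if expect_json:
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return "json_invalid"
        # Required keys only make sense on a JSON object.
        if required_keys and (
            not isinstance(parsed, dict)
            or any(key not in parsed for key in required_keys)
        ):
            return "missing_json_keys"
    if any(text not in output for text in required_substrings):
        return "missing_required_content"
    if any(text in output for text in forbidden_substrings):
        return "forbidden_content"
    return None
```

Judge-based archetypes such as refusal_regression or hallucination_risk need a baseline comparison and are outside the scope of this sketch.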
The judge cost paradox¶
Driftcut promises to save budget. But if every comparison requires a judge model call, the judge cost can become significant.
Tiered strategy¶
Driftcut addresses this with progressive judge levels:
- Deterministic checks (zero cost) - Is the output valid JSON? Does it match the schema? Is it a refusal? These catch the most obvious failures without spending anything.
- Light judge - For prompts that pass deterministic checks, a small, cheap model (e.g. GPT-4.1-mini) handles general quality comparison.
- Heavy judge - With judge_strategy: tiered, Driftcut escalates automatically when the light judge confidence falls below tiered_escalation_threshold. You can also choose the heavy judge directly.
A typical canary run (a 120-prompt corpus sampled at 20%, i.e. 24 prompts) costs roughly $0.50-$2.00 in judge calls - a fraction of the cost of a full evaluation.
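The escalation step of the tiered strategy can be sketched as follows. The judges are modelled as plain callables that return a `(verdict, confidence)` pair, which is an assumption of this example rather than Driftcut's interface; the default of 0.7 is likewise a placeholder. Only the `tiered_escalation_threshold` name comes from the text.

```python
def tiered_verdict(prompt, light_judge, heavy_judge,
                   tiered_escalation_threshold=0.7):
    """Ask the light judge first; escalate only when its confidence is low.

    light_judge and heavy_judge are hypothetical callables that take a prompt
    and return (verdict, confidence), with confidence between 0 and 1.
    """
    verdict, confidence = light_judge(prompt)
    if confidence < tiered_escalation_threshold:
        # The cheap judge is unsure, so pay for the stronger model.
        verdict, confidence = heavy_judge(prompt)
        return verdict, confidence, "heavy"
    return verdict, confidence, "light"
```

Because the heavy judge only runs on low-confidence cases, judge spend scales with how ambiguous the sampled prompts are rather than with batch size alone.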
Statistical confidence¶
Current approach (v0.11.1)¶
Driftcut uses a pragmatic approach:
- Stratified sampling by category and criticality ensures batches are representative
- Conservative thresholds minimize the risk of false positives
- Transparent reporting shows sample size and corpus coverage so you can judge signal robustness yourself
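A minimal sketch of stratified sampling by category and criticality is shown below. It assumes each prompt is a dict with `category` and `criticality` fields and that criticality takes the values high, medium, or low; these field names, the level names, and the round-robin scheme are illustrative choices, not a description of Driftcut's sampler.

```python
import random
from collections import defaultdict

# Hypothetical criticality levels; lower rank is sampled first.
CRITICALITY_RANK = {"high": 0, "medium": 1, "low": 2}


def stratified_batch(prompts, batch_size, seed=0):
    """Pick a batch that covers categories evenly, high-criticality first."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for prompt in prompts:
        by_category[prompt["category"]].append(prompt)

    for items in by_category.values():
        rng.shuffle(items)
        # Stable sort: keeps the shuffled order within each criticality level.
        items.sort(key=lambda p: CRITICALITY_RANK[p["criticality"]])

    queues = [by_category[name] for name in sorted(by_category)]
    batch = []
    # Round-robin across categories until the batch is full or prompts run out.
    while len(batch) < batch_size and any(queues):
        for queue in queues:
            if queue and len(batch) < batch_size:
                batch.append(queue.pop(0))
    return batch
```

Fixing the seed makes a run reproducible, which helps when comparing reports from repeated canary runs over the same corpus.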
Future¶
Sequential hypothesis testing (SPRT or variants) will provide formal confidence estimates: "with this much data, the probability that the candidate is adequate is above/below threshold X."
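For readers unfamiliar with the idea, here is a sketch of the textbook Wald SPRT applied to per-prompt pass/fail outcomes. It is not Driftcut's planned implementation: the function name, the decision labels, and the values of `p0`, `p1`, `alpha`, and `beta` are all assumptions chosen for illustration.

```python
import math


def sprt(outcomes, p0=0.05, p1=0.20, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test on Bernoulli failures.

    outcomes: iterable of booleans, True meaning the prompt failed.
    p0: failure rate considered acceptable (null hypothesis).
    p1: failure rate considered unacceptable (alternative hypothesis).
    alpha, beta: target false-rejection and false-acceptance rates.
    Returns (decision, number_of_observations_used).
    """
    upper = math.log((1 - beta) / alpha)   # crossing it favors p1
    lower = math.log(beta / (1 - alpha))   # crossing it favors p0
    llr = 0.0
    seen = 0
    for failed in outcomes:
        seen += 1
        # Add this observation's log-likelihood ratio.
        llr += math.log(p1 / p0) if failed else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject_candidate", seen
        if llr <= lower:
            return "accept_candidate", seen
    return "undecided", seen
```

The appeal for a canary tool is that the test can end as soon as the evidence is strong enough in either direction, instead of waiting for a fixed sample size.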