Driftcut¶
Early-stop decision gating for LLM model migrations.
v0.11.1 alpha CLI for sampling migration candidates before you commit to a full evaluation.
What is Driftcut?¶
Driftcut is a CLI for the step before a full migration evaluation. Instead of immediately running your whole prompt corpus against a candidate model, Driftcut samples strategically, runs baseline and candidate models on representative batches, checks deterministic quality signals, sends ambiguous prompts to a judge model, and gives you an early migration decision.
The problem it solves¶
When a team wants to switch from one LLM to another for cost, latency, privacy, or quality, the evaluation process is wasteful:
- Pick a candidate model
- Run the entire test corpus
- Wait hours for results
- Discover at the end that the candidate breaks critical categories
- Start over with another candidate
Driftcut answers a simpler question first:
"Should we continue this migration, or are we already seeing enough risk to stop?"
Quick links¶
- Getting Started - Install and run your first migration test
- Configuration - Full YAML config reference
- CLI Reference - Commands, flags, and examples
- Concept - Design philosophy and planned architecture
- Roadmap - What's built and what's next
Current status¶
Pre-release
Driftcut is in active development. The current alpha ships driftcut init scaffolding, driftcut bootstrap corpus generation, driftcut diff run comparison, deterministic checks, tiered judge escalation for ambiguous prompts, historical replay, richer failure archetype summaries, per-category scorecards, and STOP / CONTINUE / PROCEED decisions.
What works today¶
- Config and corpus validation
- Stratified sampling by category and criticality
- Concurrent baseline/candidate execution via LiteLLM
- Transient retry handling for rate limits, timeouts, connection failures, and 5xxs
- Historical replay on canonical paired-output JSON
- Deterministic checks for format, JSON validity, required content, and output length limits
- Tiered judging with light-to-heavy escalation for ambiguous prompts
- Latency tracking plus baseline, candidate, and judge cost tracking
- Optional Redis memory for baseline caching and run-history persistence
STOP/CONTINUE/PROCEEDdecisions during the run- JSON results export
- HTML report generation
- Per-category quality scorecards in JSON and HTML output
- Richer semantic failure archetypes such as
refusal_regression,instruction_miss, andincomplete_answer driftcut initscaffolding for instant project setupdriftcut bootstrapto classify raw prompts into a structured corpus via LLMdriftcut diffto compare two runs and surface metric, category, and archetype changes- A real cost-cut demo under
examples/demo/in the app repo, with offline replay configs
What comes next¶
- More real-world case studies and packaging polish
- Sequential hypothesis testing for formal confidence (demand-driven)
See the output¶
The fastest way to understand Driftcut is to look at the artifacts it produces.
Terminal summary¶
$ driftcut run --config migration.yaml
GPT-4o to Claude Haiku migration gate
Mode: live
Baseline: openai/gpt-4o
Candidate: anthropic/claude-haiku
Batch 1: 12 prompts, 0 API errors, $0.1840 cumulative
Decision: CONTINUE (58% confidence)
Judge coverage: 3/3 ambiguous prompts
Risk is still borderline after the first sampled batch.
Batch 2: 12 prompts, 0 API errors, $0.3120 cumulative
Decision: PROCEED (82% confidence)
Judge coverage: 4/4 ambiguous prompts
Risk stayed below the configured proceed threshold.
Run complete
Prompts tested: 24/30
Batches tested: 2
Total cost: $0.3120
Judge cost: $0.0280
Latency p50: 910ms (baseline) -> 690ms (candidate)
Latency p95: 1480ms (baseline) -> 1100ms (candidate)
Decision: PROCEED (82% confidence)
Reason: Risk stayed below the configured proceed threshold.
Top category: extraction (11% risk | json_invalid x2)
results.json excerpt¶
{
"mode": "live",
"decision": {
"outcome": "PROCEED",
"confidence": 0.82,
"reason": "Risk stayed below the configured proceed threshold."
},
"decision_history": [
{
"outcome": "PROCEED",
"metrics": {
"category_scores": [
{
"category": "extraction",
"overall_risk": 0.11,
"archetypes": {
"json_invalid": 2
}
}
]
}
}
],
"cost": {
"baseline_usd": 0.184,
"candidate_usd": 0.1,
"judge_usd": 0.028,
"total_usd": 0.312
},
"batches": [
{
"batch_number": 1,
"results": [
{
"prompt_id": "cx-001",
"candidate": {
"latency_ms": 640.0,
"retry_count": 1,
"cost_usd": 0.009,
"error": null
}
}
]
}
]
}
You also get an HTML report with the same decision, threshold context, latency/cost summary, failure archetypes, and prompt examples.