Skip to content

Roadmap

Current status

Pre-release

Driftcut is in active development. Phases 1-11 are now materially in place in the alpha: validation, execution, deterministic checks, decision output, HTML reporting, tiered judging for ambiguous cases, historical replay, optional Redis-backed memory with local Docker setup, richer quality scoring with category scorecards, project scaffolding, corpus bootstrap, and run comparison.


What's built

Phase 1 - Config, Corpus & Sampling :white_check_mark:

  • YAML config loading and validation (Pydantic models)
  • Corpus loading from CSV and JSON with full validation
  • Stratified batch sampler (high-criticality prioritized in early batches)
  • driftcut validate CLI command
  • CI pipeline (ruff + mypy + pytest on Python 3.12 & 3.13)

Phase 2 - Migration Runner :white_check_mark:

  • Async model execution via LiteLLM (OpenAI, Anthropic, and any LiteLLM-compatible provider)
  • Concurrent execution: baseline and candidate run in parallel per prompt
  • Latency tracker (p50, p95 per category and overall)
  • Cost tracker (per-prompt and cumulative spend)
  • driftcut run command fully wired end-to-end with Rich progress bars
  • JSON results export

Phase 3 - Deterministic Checks :white_check_mark:

  • Format-aware deterministic checks
  • JSON validity and required-key checks
  • Required / forbidden content checks
  • Output length guardrails
  • Failure archetype summaries

Phase 4 - Decision Engine & Reports :white_check_mark:

  • Threshold-based STOP / CONTINUE / PROCEED decisions
  • min_batches as a real proceed guardrail
  • High-criticality weighting in overall risk
  • Latency thresholds as decision inputs
  • HTML report generation
  • Richer JSON export with decision history

Phase 5 - Judge Layer :white_check_mark:

  • Semantic comparison for ambiguous cases
  • Judge-aware confidence and cost tracking
  • Judge details in JSON and HTML output
  • judge_worse and judge_unavailable archetype surfacing

Phase 6 - Replay Mode :white_check_mark:

  • driftcut replay for historical paired-output backtesting
  • Canonical replay JSON contract with prompt metadata
  • Shared deterministic checks, judge flow, and decision engine between live and replay
  • Replay-aware JSON and HTML report labeling

Phase 7 - Memory Layer & Local Dev :white_check_mark:

  • Optional Redis-backed baseline response caching for repeated live runs
  • Searchable run-history persistence with the same canonical payload used for JSON exports
  • Cache-hit, miss, and saved-cost reporting in JSON and HTML outputs
  • Docker and Compose assets for reproducible local Redis-backed testing

Phase 8 - Quality Scoring, Polish & Launch :white_check_mark:

  • Better per-category quality scoring
  • Richer failure archetypes beyond deterministic checks and judge_worse
  • Category-aware decision reasoning in console, JSON, and HTML output
  • HTML reports now show category scorecards and richer semantic failure buckets
  • PyPI package publish

Phase 9 - Project Scaffolding :white_check_mark:

  • driftcut init command to generate a working migration.yaml and prompts.csv
  • --baseline and --candidate flags for custom model pre-fill
  • --dir flag for target directory and --force flag for overwrite
  • Generated files pass driftcut validate out of the box

Phase 10 - Corpus Bootstrap :white_check_mark:

  • driftcut bootstrap --input raw-prompts.txt command to classify raw prompts via LLM
  • Accepts plain text, CSV, and JSON input formats
  • Auto-generates IDs, categories, criticality, and expected output types
  • Normalizes invalid LLM responses to safe defaults

Phase 11 - Run Comparison :white_check_mark:

  • driftcut diff --before results-v1.json --after results-v2.json command
  • Decision change, metric deltas, per-category risk shifts, cost differences
  • Archetype additions and removals between runs
  • Color-coded Rich output: green for improvements, red for regressions

Phase 12 - Public Benchmark Demo :white_check_mark:

  • End-to-end cost-cut walkthrough under examples/demo/ in the app repo
  • Compares gpt-4o against gpt-4o-mini and claude-3.5-haiku, surfacing complementary failure profiles per category
  • Replay configs reproduce both STOP decisions deterministically without any API key

What's next

Nothing committed. The next items would only be built if real demand emerges.

  • Sequential hypothesis testing (SPRT) for more formal confidence estimates
  • CI/CD integration to run Driftcut as a migration gate in pipelines
  • Web dashboard for history, cross-run comparison, and collaboration
  • Scheduled checks for periodic canary runs against production models