Configuration Reference

Driftcut is configured through a single YAML file. Only name, models, and corpus are required; every other section has a sensible default.

Full example

name: "GPT-4o to Claude Haiku migration gate"
description: "Early-stop migration test for support and extraction workloads"

models:
  baseline:
    provider: openai
    model: gpt-4o
  candidate:
    provider: anthropic
    model: claude-haiku

corpus:
  file: prompts.csv

sampling:
  batch_size_per_category: 3
  max_batches: 5
  min_batches: 2

risk:
  high_criticality_weight: 2.0
  stop_on_schema_break_rate: 0.25
  stop_on_high_criticality_failure_rate: 0.20
  proceed_if_overall_risk_below: 0.08

evaluation:
  judge_strategy: tiered
  judge_model_light: openai/gpt-4.1-mini
  judge_model_heavy: openai/gpt-4.1
  detect_failure_archetypes: true

latency:
  track: true
  regression_threshold_p50: 1.5
  regression_threshold_p95: 2.0

output:
  save_json: true
  save_html: true
  save_examples: true
  show_thresholds: true
  show_confidence: true

Section reference

models

Required. Defines the two models to compare.

| Field | Type | Description |
| --- | --- | --- |
| baseline.provider | string | Provider name (e.g. openai, anthropic, openrouter) |
| baseline.model | string | Model identifier |
| baseline.api_key | string | Optional. Overrides the environment variable for this model |
| baseline.api_base | string | Optional. Custom API endpoint (for proxies, Azure, self-hosted) |
| candidate.provider | string | Provider name |
| candidate.model | string | Model identifier |
| candidate.api_key | string | Optional. Overrides the environment variable for this model |
| candidate.api_base | string | Optional. Custom API endpoint |

API keys are loaded from environment variables following each provider's convention (e.g. OPENAI_API_KEY, ANTHROPIC_API_KEY, OPENROUTER_API_KEY). You can override per-model with the api_key field.

Driftcut uses LiteLLM under the hood, so any LiteLLM-supported provider works.
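
As a sketch of the per-model overrides described above, the fragment below routes the baseline through a proxy and sets an explicit key for the candidate. The endpoint URL and the key value are placeholders invented for illustration; for real keys, environment variables keep secrets out of the config file.

```yaml
models:
  baseline:
    provider: openai
    model: gpt-4o
    # Hypothetical proxy endpoint; replace with your own.
    api_base: https://llm-proxy.example.com/v1
  candidate:
    provider: anthropic
    model: claude-haiku
    # Placeholder value; overrides ANTHROPIC_API_KEY for this model only.
    api_key: "REPLACE_WITH_CANDIDATE_KEY"
```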

corpus

Required. Points to the prompt corpus file.

| Field | Type | Description |
| --- | --- | --- |
| file | path | Path to CSV or JSON corpus (relative to config file) |
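
Because the path is resolved relative to the config file, a layout like the following should resolve the same way regardless of the shell's working directory. The directory and file names here are assumptions for illustration only.

```yaml
# Assumed layout (illustrative names):
#   project/
#     configs/migration.yaml   <- this config file
#     data/prompts.csv
corpus:
  file: ../data/prompts.csv    # resolved against configs/, the config file's directory
```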

sampling

Controls how prompts are sampled into batches.

| Field | Default | Description |
| --- | --- | --- |
| batch_size_per_category | 3 | Prompts drawn per category per batch |
| max_batches | 5 | Hard cap on number of batches |
| min_batches | 2 | Minimum batches before declaring "proceed" |
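
To get a feel for the sampling budget, here is the arithmetic for the defaults. It assumes a corpus with four categories (an illustrative number, not something the config sets) and that every category has enough prompts to fill each batch.

```yaml
sampling:
  batch_size_per_category: 3   # 3 prompts x 4 categories = 12 prompts per batch
  max_batches: 5               # at most 5 x 12 = 60 prompts sampled
  min_batches: 2               # at least 2 x 12 = 24 prompts before a "proceed"
```

Under these assumptions, raising batch_size_per_category grows the cost of every batch, while raising max_batches only raises the ceiling.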

risk (parsed, not yet active)

Thresholds that will drive the stop/continue/proceed decision. These values are validated at config load time but have no effect on run output until the decision engine is implemented. Defaults are conservative — they favor stopping too early over approving a bad candidate.

| Field | Default | Description |
| --- | --- | --- |
| high_criticality_weight | 2.0 | Weight multiplier for high-criticality categories |
| stop_on_schema_break_rate | 0.25 | Stop if schema breaks exceed this rate |
| stop_on_high_criticality_failure_rate | 0.20 | Stop if high-criticality failures exceed this rate |
| proceed_if_overall_risk_below | 0.08 | Proceed to full eval if risk stays below this |

Calibrating thresholds

Start with defaults. If you find Driftcut stops too aggressively, raise the thresholds. If it lets bad candidates through, lower them. The report shows how close results are to each threshold boundary.
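
For instance, a team that saw bad candidates slip through might tighten the gate along these lines. The numbers are illustrative, not recommendations, and as noted above they are currently validated but not yet applied to run output.

```yaml
risk:
  high_criticality_weight: 2.0
  stop_on_schema_break_rate: 0.15              # lowered from the 0.25 default
  stop_on_high_criticality_failure_rate: 0.10  # lowered from the 0.20 default
  proceed_if_overall_risk_below: 0.05          # lowered from the 0.08 default
```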

evaluation (parsed, not yet active)

Controls the judge strategy for semantic comparison. These values are validated at config load time but have no effect until the judge adapter is implemented.

| Field | Default | Description |
| --- | --- | --- |
| judge_strategy | tiered | One of: none, light, tiered, heavy |
| judge_model_light | openai/gpt-4.1-mini | Model for light judging |
| judge_model_heavy | openai/gpt-4.1 | Model for heavy judging (ambiguous cases) |
| detect_failure_archetypes | true | Classify failures into archetypes |

Judge strategies:

  • none — No judge calls. Only deterministic checks (schema, format, refusal). Zero extra cost.
  • light — Use the light model for all semantic comparisons.
  • tiered — Deterministic checks first, light judge for ambiguous cases, heavy judge only when still unclear. Best cost/accuracy balance.
  • heavy — Use the heavy model for all comparisons. Most accurate but most expensive.
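
As an illustration, a run that should incur no judge cost could select the none strategy. In this sketch the judge model fields are simply omitted so they fall back to their defaults; whether Driftcut ignores them entirely under none is an assumption.

```yaml
evaluation:
  judge_strategy: none   # deterministic checks only (schema, format, refusal)
```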

latency

Controls latency tracking and regression detection.

| Field | Default | Description |
| --- | --- | --- |
| track | true | Enable latency measurement |
| regression_threshold_p50 | 1.5 | Flag if candidate p50 > 1.5x baseline |
| regression_threshold_p95 | 2.0 | Flag if candidate p95 > 2.0x baseline |
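
A worked example of the ratio check may help. The latencies in the comments are made up for illustration, and the sketch assumes the comparison is a plain candidate-to-baseline ratio, as the descriptions above suggest.

```yaml
latency:
  track: true
  # Illustrative: baseline p50 = 800 ms, candidate p50 = 1300 ms
  #   1300 / 800 = 1.625 > 1.5   -> p50 regression flagged
  regression_threshold_p50: 1.5
  # Illustrative: baseline p95 = 2000 ms, candidate p95 = 3600 ms
  #   3600 / 2000 = 1.8 <= 2.0   -> no p95 flag
  regression_threshold_p95: 2.0
```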

output

Controls what gets saved after a run.

| Field | Default | Status | Description |
| --- | --- | --- | --- |
| save_json | true | ✅ | Export results as JSON |
| save_html | true | coming soon | Generate HTML report |
| save_examples | true | coming soon | Include failure examples in report |
| show_thresholds | true | coming soon | Show threshold values in report |
| show_confidence | true | coming soon | Show confidence indicator |
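
Since only JSON export is marked as available, a config that is explicit about what it expects today might look like the sketch below. Whether disabling the "coming soon" flags changes anything in the current release is not stated here, so treat the false values as an assumption.

```yaml
output:
  save_json: true        # the one output marked as available
  save_html: false       # marked "coming soon"
  save_examples: false   # marked "coming soon"
```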