Configuration Reference

Driftcut is configured through a single YAML file. Every section except name and models has sensible defaults. The corpus section is also required for live runs; replay mode reads prompt metadata from the replay input file instead.

Full example

name: "GPT-4o to Claude Haiku migration gate"
description: "Early-stop migration test for support and extraction workloads"

models:
  baseline:
    provider: openai
    model: gpt-4o
  candidate:
    provider: anthropic
    model: claude-haiku

corpus:
  file: prompts.csv

sampling:
  batch_size_per_category: 3
  max_batches: 5
  min_batches: 2

risk:
  high_criticality_weight: 2.0
  stop_on_schema_break_rate: 0.25
  stop_on_high_criticality_failure_rate: 0.20
  proceed_if_overall_risk_below: 0.08

evaluation:
  judge_strategy: tiered
  judge_model_light: openai/gpt-4.1-mini
  judge_model_heavy: openai/gpt-4.1
  tiered_escalation_threshold: 0.6
  detect_failure_archetypes: true

latency:
  track: true
  regression_threshold_p50: 1.5
  regression_threshold_p95: 2.0

output:
  save_json: true
  save_html: true
  save_examples: true
  show_thresholds: true
  show_confidence: true

Section reference

models

Required. Defines the two models to compare.

| Field | Type | Description |
| --- | --- | --- |
| baseline.provider | string | Provider name (e.g. openai, anthropic, openrouter) |
| baseline.model | string | Model identifier |
| baseline.api_key | string | Optional. Overrides the environment variable for this model |
| baseline.api_base | string | Optional. Custom API endpoint (for proxies, Azure, self-hosted) |
| candidate.provider | string | Provider name |
| candidate.model | string | Model identifier |
| candidate.api_key | string | Optional. Overrides the environment variable for this model |
| candidate.api_base | string | Optional. Custom API endpoint |

API keys are loaded from environment variables following each provider's convention, such as OPENAI_API_KEY, ANTHROPIC_API_KEY, and OPENROUTER_API_KEY. To override the key for a single model, set its api_key field.
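The precedence described above can be sketched in a few lines of Python. This is only an illustration, not Driftcut's actual code: the function name resolve_api_key is hypothetical, and the `<PROVIDER>_API_KEY` pattern is an assumption generalized from the three variable names listed above.

```python
import os


def resolve_api_key(model_cfg, env=None):
    """Return the API key for one model entry (baseline or candidate).

    Hypothetical helper: the name and the env-var pattern are assumptions.
    """
    env = os.environ if env is None else env
    # An explicit api_key in the config takes precedence over the environment.
    if model_cfg.get("api_key"):
        return model_cfg["api_key"]
    # Assumed convention: <PROVIDER>_API_KEY, e.g. OPENAI_API_KEY.
    return env.get(f"{model_cfg['provider'].upper()}_API_KEY")
```

Passing env explicitly keeps the helper easy to test without touching the real process environment.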

Driftcut uses LiteLLM under the hood, so any LiteLLM-supported provider works.

Live call reliability

In v0.11.1, live run calls automatically retry transient rate limits, timeouts, connection failures, and 5xx responses before Driftcut records an api_error. Saved JSON artifacts include retry_count for each baseline/candidate response.
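To make the retry_count field concrete, here is a minimal sketch of a retry loop with exponential backoff. The retry limit, the delays, the TransientError class, and the shape of the returned record are all assumptions for illustration; the text above does not specify how Driftcut schedules its retries.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for rate limits, timeouts, connection failures, and 5xx responses."""


def call_with_retries(call, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call `call()` and retry transient failures with exponential backoff.

    Returns a response-like dict; retry_count is the number of retries that
    were needed (0 means the first attempt succeeded).
    """
    retry_count = 0
    while True:
        try:
            return {"output": call(), "error": None, "retry_count": retry_count}
        except TransientError as exc:
            if retry_count >= max_retries:
                # Retries exhausted: record an api_error instead of raising.
                return {"output": None, "error": f"api_error: {exc}", "retry_count": retry_count}
            sleep(base_delay * (2 ** retry_count))  # 0.5s, 1s, 2s, ... (assumed schedule)
            retry_count += 1
```

A call that fails twice and then succeeds would come back with retry_count set to 2.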

corpus

Required. Points to the prompt corpus file.

| Field | Type | Description |
| --- | --- | --- |
| file | path | Path to CSV or JSON corpus (relative to config file) |

The current corpus format also supports optional deterministic expectation fields such as required_substrings, forbidden_substrings, json_required_keys, and max_output_chars.
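As a rough illustration of what these deterministic expectation fields could check, the sketch below applies all four to a single model output. The function name check_expectations and the failure messages are hypothetical, and the exact matching rules Driftcut uses (case sensitivity, nested keys, and so on) are not stated here, so treat this as one plausible reading.

```python
import json


def check_expectations(output, expectations):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []
    for needle in expectations.get("required_substrings", []):
        if needle not in output:
            failures.append(f"missing required substring: {needle!r}")
    for needle in expectations.get("forbidden_substrings", []):
        if needle in output:
            failures.append(f"contains forbidden substring: {needle!r}")
    required_keys = expectations.get("json_required_keys", [])
    if required_keys:
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            parsed = None
        if not isinstance(parsed, dict):
            failures.append("output is not a JSON object")
        else:
            for key in required_keys:
                if key not in parsed:  # assumed: top-level keys only
                    failures.append(f"missing JSON key: {key!r}")
    max_chars = expectations.get("max_output_chars")
    if max_chars is not None and len(output) > max_chars:
        failures.append(f"output exceeds {max_chars} characters")
    return failures
```

Because these checks need no model call, they can run before any judge is consulted.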

sampling

Controls how prompts are sampled into batches.

| Field | Default | Description |
| --- | --- | --- |
| batch_size_per_category | 3 | Prompts drawn per category per batch |
| max_batches | 5 | Hard cap on number of batches |
| min_batches | 2 | Minimum evidence before Driftcut can declare PROCEED |

Current alpha behavior

min_batches is active in v0.11.1: Driftcut will not declare PROCEED until at least this many batches have been evaluated.
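The way batch_size_per_category and max_batches interact can be sketched as follows. This is a simplified model under stated assumptions: prompts are shuffled within each category and drawn without replacement, and the function name plan_batches and the seed parameter are hypothetical. Driftcut's real sampler may differ.

```python
import random
from collections import defaultdict


def plan_batches(prompts, batch_size_per_category=3, max_batches=5, seed=0):
    """Build at most max_batches batches, drawing up to
    batch_size_per_category prompts from every category for each batch."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for prompt in prompts:
        by_category[prompt["category"]].append(prompt)
    for items in by_category.values():
        rng.shuffle(items)  # assumed: random order, no prompt drawn twice
    batches = []
    for index in range(max_batches):
        start = index * batch_size_per_category
        batch = []
        for category in sorted(by_category):
            batch.extend(by_category[category][start:start + batch_size_per_category])
        if not batch:
            break  # corpus exhausted before reaching max_batches
        batches.append(batch)
    return batches
```

With two categories of seven prompts each and the default settings, this yields batches of 6, 6, and 2 prompts, well under the cap of five batches.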

risk

Thresholds that drive the stop/continue/proceed decision. Defaults are conservative and favor stopping too early over approving a bad candidate.

| Field | Default | Description |
| --- | --- | --- |
| high_criticality_weight | 2.0 | Weight multiplier for high-criticality categories |
| stop_on_schema_break_rate | 0.25 | Stop if schema breaks exceed this rate |
| stop_on_high_criticality_failure_rate | 0.20 | Stop if high-crit failures exceed this rate |
| proceed_if_overall_risk_below | 0.08 | Proceed to full eval if risk stays below this |

Calibrating thresholds

Start with defaults. If Driftcut stops too aggressively, raise the thresholds. If it lets bad candidates through, lower them. The report shows how close results are to each threshold boundary.
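When calibrating, it can help to picture the thresholds as a small decision function. The sketch below is an assumption-laden reading of the table above, not Driftcut's implementation: the function name decide, the metric names, and the rule that stop checks can fire before min_batches are all hypothetical. How overall risk is computed, including the role of high_criticality_weight, is left out because it is not specified here.

```python
DEFAULT_RISK = {
    "stop_on_schema_break_rate": 0.25,
    "stop_on_high_criticality_failure_rate": 0.20,
    "proceed_if_overall_risk_below": 0.08,
}


def decide(metrics, risk=DEFAULT_RISK, batches_done=1, min_batches=2):
    """Return "STOP", "PROCEED", or "CONTINUE" for the evidence so far.

    metrics holds observed rates in [0, 1]; the key names are illustrative.
    """
    # Assumed: stop rules fire as soon as a rate exceeds its threshold.
    if metrics["schema_break_rate"] > risk["stop_on_schema_break_rate"]:
        return "STOP"
    if metrics["high_criticality_failure_rate"] > risk["stop_on_high_criticality_failure_rate"]:
        return "STOP"
    # PROCEED additionally requires at least min_batches of evidence.
    if batches_done >= min_batches and metrics["overall_risk"] < risk["proceed_if_overall_risk_below"]:
        return "PROCEED"
    return "CONTINUE"
```

Raising a stop threshold makes the first two branches harder to trigger, while raising proceed_if_overall_risk_below makes PROCEED easier to reach, which matches the calibration advice above.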

evaluation

Controls judge behavior for semantic comparison after deterministic checks.

| Field | Default | Description |
| --- | --- | --- |
| judge_strategy | light | One of: none, light, tiered, heavy |
| judge_model_light | openai/gpt-4.1-mini | Model for light judging |
| judge_model_heavy | openai/gpt-4.1 | Model for heavy judging |
| tiered_escalation_threshold | 0.6 | Escalate from light to heavy when confidence falls below this threshold |
| detect_failure_archetypes | true | Classify failures into archetypes |

Judge strategies:

  • none - No judge calls. Only deterministic checks. Zero extra cost.
  • light - Judge only ambiguous prompts with the light model.
  • tiered - Judge ambiguous prompts with the light model first, then escalate to the heavy model when confidence is below tiered_escalation_threshold.
  • heavy - Judge ambiguous prompts with the heavy model instead of the light model.
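The four strategies listed above can be sketched as a single dispatch function for one ambiguous prompt. This is a minimal illustration: judge_prompt, the two judge callables, and the verdict labels used in the usage example are hypothetical, and each judge is assumed to return a (verdict, confidence) pair.

```python
def judge_prompt(strategy, light_judge, heavy_judge, escalation_threshold=0.6):
    """Judge one ambiguous prompt and return (verdict, judges_used).

    light_judge and heavy_judge are hypothetical zero-argument callables
    that return a (verdict, confidence) tuple, with confidence in [0, 1].
    """
    if strategy == "none":
        return None, []  # deterministic checks only, no judge calls
    if strategy == "light":
        verdict, _ = light_judge()
        return verdict, ["light"]
    if strategy == "heavy":
        verdict, _ = heavy_judge()
        return verdict, ["heavy"]
    if strategy == "tiered":
        verdict, confidence = light_judge()
        if confidence >= escalation_threshold:
            return verdict, ["light"]
        # Light judge was not confident enough: escalate to the heavy model.
        verdict, _ = heavy_judge()
        return verdict, ["light", "heavy"]
    raise ValueError(f"unknown judge_strategy: {strategy!r}")
```

For example, with tiered and a light judge that reports confidence 0.5, the heavy judge is also called and its verdict wins; at confidence 0.9 the light verdict stands and the heavy model is never invoked.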

latency

Controls latency tracking and regression detection.

| Field | Default | Description |
| --- | --- | --- |
| track | true | Enable latency measurement |
| regression_threshold_p50 | 1.5 | Flag if candidate p50 is greater than 1.5x baseline |
| regression_threshold_p95 | 2.0 | Flag if candidate p95 is greater than 2.0x baseline |

Current alpha behavior

Latency is measured and reported today. The thresholds are active decision inputs in v0.11.1.
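The two thresholds are ratios of candidate latency to baseline latency at a given percentile. A small sketch of that comparison follows, assuming a nearest-rank percentile and non-zero baseline latencies; the function names are hypothetical and Driftcut may compute percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_regressions(baseline_ms, candidate_ms, threshold_p50=1.5, threshold_p95=2.0):
    """Return flags that are True where candidate/baseline exceeds the threshold."""
    flags = {}
    for name, pct, threshold in (("p50", 50, threshold_p50), ("p95", 95, threshold_p95)):
        ratio = percentile(candidate_ms, pct) / percentile(baseline_ms, pct)
        flags[name] = ratio > threshold
    return flags
```

A candidate whose median is only about 1.36x the baseline but whose slowest requests are 2.5x slower would be flagged on p95 and not on p50.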

Replay mode

Replay mode uses the same sampling, risk, evaluation, latency, and output sections, but it reads prompt metadata plus paired baseline/candidate outputs from a canonical replay JSON file:

driftcut replay --config replay.yaml --input replay.json

The replay input contract is versioned and intentionally narrow. Each record must include:

  • prompt metadata such as id, category, prompt, criticality, and expected_output_type
  • nested baseline and candidate objects
  • either output or error for each side
  • latency_ms when latency.track=true

Historical model cost is optional. Replay-time judge cost is tracked separately in the report when semantic judging is enabled.
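To make the record contract above easier to picture, here is a hedged sketch of a validator for one replay record. It is not the official schema: the function name is hypothetical, the metadata list is taken from the "such as" examples above and may be incomplete, and placing latency_ms inside each of the baseline and candidate objects is an assumption.

```python
REQUIRED_METADATA = ("id", "category", "prompt", "criticality", "expected_output_type")


def validate_replay_record(record, track_latency=True):
    """Return a list of problems found in one replay record (empty if none)."""
    problems = []
    for field in REQUIRED_METADATA:
        if field not in record:
            problems.append(f"missing metadata field: {field}")
    for side in ("baseline", "candidate"):
        result = record.get(side)
        if not isinstance(result, dict):
            problems.append(f"missing nested object: {side}")
            continue
        if "output" not in result and "error" not in result:
            problems.append(f"{side}: needs output or error")
        # Assumed: latency_ms lives on each side and is needed when latency.track is true.
        if track_latency and "latency_ms" not in result:
            problems.append(f"{side}: missing latency_ms")
    return problems
```

Running a check like this over an export before calling driftcut replay can surface malformed records early. The values used for criticality and expected_output_type in any test data are illustrative only.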

Decision outputs

In v0.11.1, Driftcut reports:

  • multiple failure_archetypes per prompt when a response breaks in more than one way
  • category_scores in the run-level metrics for JSON and HTML outputs
  • category-aware decision reasons that point to the highest-risk category instead of only quoting thresholds

memory

Optional. Enables the Redis-backed memory layer for baseline response caching and run-history persistence.

memory:
  backend: redis
  redis_url: redis://localhost:6379/0
  namespace: driftcut-dev
  response_cache:
    enabled: true
    ttl_seconds: 604800
  run_history:
    enabled: true
    ttl_seconds: 2592000

| Field | Default | Description |
| --- | --- | --- |
| backend | redis | Current memory backend |
| redis_url | none | Redis connection URL |
| namespace | driftcut | Prefix used for cache and run-history keys |
| response_cache.enabled | true | Reuse cached baseline responses in live runs |
| response_cache.ttl_seconds | 604800 | Cache TTL in seconds (7 days) |
| run_history.enabled | true | Persist completed run payloads to Redis |
| run_history.ttl_seconds | 2592000 | Run-history TTL in seconds (30 days) |

Baseline cache semantics

Cached baseline responses are intentionally excluded from live latency comparison. Driftcut reuses the output and records cache hits and saved baseline cost, but it does not treat cached latency as fresh live latency evidence.

Failure behavior

Redis is optional. If the memory layer is not enabled, Driftcut runs without response caching or run-history persistence. If Redis is configured but temporarily unavailable at runtime, Driftcut falls back to the normal live path instead of failing the migration gate.
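The cache semantics and fallback behavior described above can be sketched together. Everything named here is hypothetical: the cache object with get and set methods, the call_live callable, and the response shape. The built-in ConnectionError stands in for whatever error a real Redis client raises when the server is unreachable.

```python
def get_baseline_response(prompt_id, cache, call_live):
    """Return a baseline response dict, preferring the cache when reachable.

    cache: hypothetical object with get(key) and set(key, value), or None.
    call_live: hypothetical callable returning {"output": ..., "latency_ms": ...}.
    """
    try:
        cached = cache.get(prompt_id) if cache is not None else None
    except ConnectionError:
        cached = None  # cache unavailable: fall back to the live path
    if cached is not None:
        # Reuse the output, but do not report cached latency as live evidence.
        return {"output": cached["output"], "latency_ms": None, "cache_hit": True}
    response = dict(call_live(prompt_id), cache_hit=False)
    try:
        if cache is not None:
            cache.set(prompt_id, response)
    except ConnectionError:
        pass  # a failed cache write should not fail the run
    return response
```

Setting latency_ms to None on a cache hit is one simple way to keep cached responses out of the latency comparison while still counting the hit.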

output

Controls what gets saved after a run.

| Field | Default | Status | Description |
| --- | --- | --- | --- |
| save_json | true | ✅ | Export results as JSON |
| save_html | true | ✅ | Generate HTML report |
| save_examples | true | ✅ | Include failure examples in report output |
| show_thresholds | true | ✅ | Show threshold values in report output |
| show_confidence | true | ✅ | Show confidence indicator |