
expected-diff Evaluator

The expected-diff evaluator compares the agent’s output against a known-correct reference implementation and calculates a similarity score between 0 and 1.

Use expected-diff when you have:

  • A “golden” solution to compare against
  • A completed feature branch
  • Known-good test fixtures

The suite configuration must include reference settings:

# Reference configuration (required for expected-diff)
expected_source: branch # or: commit
expected: feature/completed # branch name or commit SHA
evaluators:
  - name: expected-diff
    config:
      threshold: 0.85 # Require 85% similarity to pass

Threshold ranges and typical uses:

| Threshold | Meaning | Use Case |
| --- | --- | --- |
| 1.0 (100%) | Exact match | Generated files (migrations) |
| 0.95-0.99 | Very similar, minor differences | Strict compliance |
| 0.85-0.94 | Mostly similar | Standard implementations |
| 0.70-0.84 | Moderate similarity | Creative solutions |
| < 0.70 | Significantly different | Lenient evaluation |
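
The threshold bands can also be written as a small lookup. This is only a sketch: `describeSimilarity` is a hypothetical helper, not part of the evaluator, and because the listed ranges leave small gaps (for example between 0.94 and 0.95) it assumes each band begins at its lower bound.

```typescript
// Hypothetical helper: maps a similarity score to the band names listed above.
// Assumption: each band is closed at its lower bound, which fills the small
// gaps the listed ranges leave (e.g. between 0.94 and 0.95).
type Band =
  | "Exact match"
  | "Very similar, minor differences"
  | "Mostly similar"
  | "Moderate similarity"
  | "Significantly different";

function describeSimilarity(score: number): Band {
  if (!Number.isFinite(score) || score < 0 || score > 1) {
    throw new RangeError(`similarity must be in [0, 1], got ${score}`);
  }
  if (score === 1) return "Exact match";
  if (score >= 0.95) return "Very similar, minor differences";
  if (score >= 0.85) return "Mostly similar";
  if (score >= 0.7) return "Moderate similarity";
  return "Significantly different";
}

console.log(describeSimilarity(0.92)); // prints "Mostly similar"
```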

Recommended thresholds by task type:

| Task Type | Recommended | Reasoning |
| --- | --- | --- |
| Database migrations | 0.95+ | Generated files should match |
| Configuration files | 0.90-0.95 | Minor variations acceptable |
| API implementations | 0.80-0.90 | Implementation details may vary |
| Refactoring | 0.75-0.85 | Multiple valid approaches |
| UI components | 0.70-0.80 | Style variations acceptable |

The evaluator reports the following metrics:

| Metric | Type | Description |
| --- | --- | --- |
| similarity_score | number | Overall similarity (0-1) |
| matching_files | number | Files that match the reference |
| differing_files | number | Files that differ |
| file_similarities | object | Per-file similarity scores |
| diff_summary | string | Human-readable diff description |

Example result:

{
  "name": "expected-diff",
  "status": "passed",
  "metrics": {
    "similarity_score": 0.92,
    "matching_files": 8,
    "differing_files": 2,
    "file_similarities": {
      "src/auth/login.ts": 0.95,
      "src/auth/middleware.ts": 0.88,
      "tests/auth.test.ts": 0.94
    }
  },
  "assertions": {
    "threshold": { "expected": 0.85, "actual": 0.92, "passed": true }
  }
}
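
For anyone consuming this result in a script, the following sketch types the JSON and recomputes the threshold assertion. The interface is inferred from the example result and is not an official type definition; `thresholdAssertion` is a hypothetical helper, and the inclusive comparison (`>=`) is an assumption, since the source does not say how a score exactly equal to the threshold is treated.

```typescript
// Shape inferred from the example result; not an official type definition.
interface ExpectedDiffResult {
  name: string;
  status: string; // the example shows "passed"; other values are not documented here
  metrics: {
    similarity_score: number;
    matching_files: number;
    differing_files: number;
    file_similarities: Record<string, number>;
    diff_summary?: string; // listed as a metric but absent from the example, so optional here
  };
  assertions: {
    threshold: { expected: number; actual: number; passed: boolean };
  };
}

// Hypothetical helper. Assumption: a run passes when actual >= expected.
function thresholdAssertion(actual: number, expected: number) {
  return { expected, actual, passed: actual >= expected };
}

const raw = `{
  "name": "expected-diff",
  "status": "passed",
  "metrics": {
    "similarity_score": 0.92,
    "matching_files": 8,
    "differing_files": 2,
    "file_similarities": { "src/auth/login.ts": 0.95 }
  },
  "assertions": {
    "threshold": { "expected": 0.85, "actual": 0.92, "passed": true }
  }
}`;

const result = JSON.parse(raw) as ExpectedDiffResult;
const recomputed = thresholdAssertion(
  result.metrics.similarity_score,
  result.assertions.threshold.expected,
);
console.log(recomputed.passed === result.assertions.threshold.passed); // prints true
```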

Example suite.yaml:

name: auth-feature-evaluation
description: Evaluate JWT authentication implementation
repo: https://github.com/example/api-server.git
branch: main

# Reference: the completed feature branch
expected_source: branch
expected: feature/auth-complete

agent:
  type: copilot-cli
  config:
    prompt_file: ./prompts/add-jwt-auth.md

evaluators:
  # Check similarity to reference
  - name: expected-diff
    config:
      threshold: 0.85
  # Also check scope
  - name: git-diff
    config:
      assertions:
        max_files_changed: 10
  # And quality
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        secure_implementation: "Auth is implemented securely. Score 0-1."

Compare against a specific commit instead of a branch:

expected_source: commit
expected: abc123def456789
evaluators:
  - name: expected-diff
    config:
      threshold: 0.90

When similarity is lower than expected:

  1. Review the diff - Check file_similarities to find the files that pull the score down
  2. Adjust the threshold - If the variations are acceptable
  3. Improve the prompt - Add more specific instructions
  4. Update the reference - If the agent’s solution is equally valid
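
As a starting point for the first step, a sketch like the following lists the files that score below a chosen cutoff, lowest first. `filesBelow` is a hypothetical helper, and the input is the `file_similarities` object from the example result.

```typescript
// Hypothetical helper: returns [path, score] pairs below the cutoff, lowest score first.
function filesBelow(
  fileSimilarities: Record<string, number>,
  cutoff: number,
): Array<[string, number]> {
  return Object.entries(fileSimilarities)
    .filter(([, score]) => score < cutoff)
    .sort(([, a], [, b]) => a - b);
}

// Values copied from the example result's file_similarities.
const fileSimilarities = {
  "src/auth/login.ts": 0.95,
  "src/auth/middleware.ts": 0.88,
  "tests/auth.test.ts": 0.94,
};

for (const [file, score] of filesBelow(fileSimilarities, 0.95)) {
  console.log(`${score.toFixed(2)}  ${file}`);
}
// prints:
// 0.88  src/auth/middleware.ts
// 0.94  tests/auth.test.ts
```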

The evaluator:

  1. Clones both the modified and expected states
  2. Generates normalized diffs for each file
  3. Calculates similarity using diff-match-patch
  4. Weights by file size/importance
  5. Produces an overall score
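
The scoring steps can be illustrated with a dependency-free sketch. It is not the evaluator's implementation: it swaps in a line-based longest-common-subsequence ratio in place of diff-match-patch, and it assumes that normalization means unifying line endings and trimming trailing whitespace, that files are weighted by line count, and that a file present on only one side scores 0. All function names are hypothetical.

```typescript
// Assumed normalization: unify line endings, drop trailing whitespace per line.
function normalize(text: string): string[] {
  return text.replace(/\r\n/g, "\n").split("\n").map((line) => line.replace(/\s+$/, ""));
}

// Length of the longest common subsequence of two line arrays (row-by-row DP).
function lcsLength(a: string[], b: string[]): number {
  let prev = new Array<number>(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const curr = new Array<number>(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      curr[j] = a[i - 1] === b[j - 1] ? prev[j - 1] + 1 : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Stand-in similarity: 2 * LCS / (total lines), a ratio in [0, 1].
function fileSimilarity(actual: string, expected: string): number {
  const a = normalize(actual);
  const b = normalize(expected);
  return (2 * lcsLength(a, b)) / (a.length + b.length);
}

// Weighted average over the union of paths; the weight is the larger line count.
function overallSimilarity(
  actual: Record<string, string>,
  expected: Record<string, string>,
): { similarity_score: number; file_similarities: Record<string, number> } {
  const file_similarities: Record<string, number> = {};
  let weighted = 0;
  let totalWeight = 0;
  for (const path of Object.keys({ ...actual, ...expected })) {
    const inBoth = path in actual && path in expected;
    const score = inBoth ? fileSimilarity(actual[path], expected[path]) : 0;
    const weight = Math.max(
      normalize(actual[path] ?? "").length,
      normalize(expected[path] ?? "").length,
    );
    file_similarities[path] = score;
    weighted += score * weight;
    totalWeight += weight;
  }
  const similarity_score = totalWeight === 0 ? 1 : weighted / totalWeight;
  return { similarity_score, file_similarities };
}

const expectedFiles = { "a.ts": "x\ny\nz", "b.ts": "one" };
const actualFiles = { "a.ts": "x\ny\nq", "b.ts": "one" };
console.log(overallSimilarity(actualFiles, expectedFiles).similarity_score.toFixed(2)); // prints "0.75"
```

In this example `a.ts` shares two of three lines (ratio 4/6) with weight 3, and `b.ts` matches exactly with weight 1, so the weighted average is (2 + 1) / 4 = 0.75.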

Best practices:

  1. Create reference branches carefully - They define “correct” behavior
  2. Use meaningful thresholds - Based on task type
  3. Combine with other evaluators - expected-diff alone may miss quality issues
  4. Document the reference - Explain what makes it the “golden” solution