expected-diff Evaluator

The expected-diff evaluator compares the agent’s output against a known-correct reference implementation and calculates a similarity score between 0 and 1. The evaluation passes when that score meets the configured threshold.

When to Use

Use expected-diff when you have:

A “golden” solution to compare against
A completed feature branch
Known-good test fixtures

Requirements

The suite configuration must include reference settings:

# Reference configuration (required for expected-diff)
expected_source: branch  # or: commit
expected: feature/completed  # branch name or commit SHA

evaluators:
  - name: expected-diff
    config:
      threshold: 0.85

Configuration

evaluators:
  - name: expected-diff
    config:
      threshold: 0.85  # Require 85% similarity to pass

Threshold Guidelines

Threshold	Meaning	Use Case
1.0 (100%)	Exact match	Generated files (migrations)
0.95-0.99	Very similar, minor differences	Strict compliance
0.85-0.94	Mostly similar	Standard implementations
0.70-0.84	Moderate similarity	Creative solutions
< 0.70	Significantly different	Lenient evaluation

Recommended Thresholds by Task Type

Task Type	Recommended	Reasoning
Database migrations	0.95+	Generated files should match
Configuration files	0.90-0.95	Minor variations acceptable
API implementations	0.80-0.90	Implementation details may vary
Refactoring	0.75-0.85	Multiple valid approaches
UI components	0.70-0.80	Style variations acceptable

Metrics Produced

Metric	Type	Description
`similarity_score`	number	Overall similarity (0-1)
`matching_files`	number	Files that match the reference
`differing_files`	number	Files that differ
`file_similarities`	object	Per-file similarity scores
`diff_summary`	string	Human-readable diff description

Example Output

{
  "name": "expected-diff",
  "status": "passed",
  "metrics": {
    "similarity_score": 0.92,
    "matching_files": 8,
    "differing_files": 2,
    "file_similarities": {
      "src/auth/login.ts": 0.95,
      "src/auth/middleware.ts": 0.88,
      "tests/auth.test.ts": 0.94
    }
  },
  "assertions": {
    "threshold": { "expected": 0.85, "actual": 0.92, "passed": true }
  }
}
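
To make the pass/fail decision in this output concrete, the Python sketch below recomputes the threshold assertion from the reported numbers. It assumes the result has the JSON shape shown above and that a run passes when the score is greater than or equal to the threshold; the inclusive comparison is an assumption, and `check_threshold` is a hypothetical helper name, not part of the evaluator.

```python
import json

# Trimmed copy of the example output above.
RESULT_JSON = """
{
  "name": "expected-diff",
  "status": "passed",
  "metrics": {"similarity_score": 0.92},
  "assertions": {
    "threshold": {"expected": 0.85, "actual": 0.92, "passed": true}
  }
}
"""


def check_threshold(result: dict) -> bool:
    """Recompute the threshold assertion from the reported numbers.

    Assumption: the evaluator passes when actual >= expected.
    """
    assertion = result["assertions"]["threshold"]
    return assertion["actual"] >= assertion["expected"]


result = json.loads(RESULT_JSON)
recomputed = check_threshold(result)
# Flag results whose stored verdict disagrees with the recomputed one.
consistent = recomputed == result["assertions"]["threshold"]["passed"]
print(f"recomputed={recomputed} consistent={consistent}")
```

A check like this can be useful in CI scripts that post-process evaluator results, since it reads only the `assertions` block and does not depend on how the score was computed.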

Complete Example

name: auth-feature-evaluation
description: Evaluate JWT authentication implementation

repo: https://github.com/example/api-server.git
branch: main

# Reference: the completed feature branch
expected_source: branch
expected: feature/auth-complete

agent:
  type: copilot-cli
  config:
    prompt_file: ./prompts/add-jwt-auth.md

evaluators:
  # Check similarity to reference
  - name: expected-diff
    config:
      threshold: 0.85

  # Also check scope
  - name: git-diff
    config:
      assertions:
        max_files_changed: 10

  # And quality
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        secure_implementation: "Auth is implemented securely. Score 0-1."

Using Commit References

Compare against a specific commit instead of a branch:

expected_source: commit
expected: abc123def456789

evaluators:
  - name: expected-diff
    config:
      threshold: 0.90

Handling Low Similarity

When similarity is lower than expected:

Review the diff - Check `file_similarities` for the files with the lowest scores
Adjust threshold - If the variations are acceptable
Improve prompt - Add more specific instructions
Update reference - If the agent’s solution is equally valid
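
As a starting point for the review step, the Python sketch below ranks files by their per-file score so the weakest matches surface first. The scores are copied from the Example Output section. Using a per-file cutoff is an assumption made for triage only (the evaluator's threshold applies to the overall score), and `files_below_cutoff` is a hypothetical helper name.

```python
def files_below_cutoff(file_similarities, cutoff):
    """Return (path, score) pairs scoring below the cutoff, worst first."""
    low = [(path, score) for path, score in file_similarities.items() if score < cutoff]
    return sorted(low, key=lambda item: item[1])


# Per-file scores taken from the Example Output section.
file_similarities = {
    "src/auth/login.ts": 0.95,
    "src/auth/middleware.ts": 0.88,
    "tests/auth.test.ts": 0.94,
}

# The 0.90 cutoff is an arbitrary triage value, not an evaluator setting.
for path, score in files_below_cutoff(file_similarities, cutoff=0.90):
    print(f"{score:.2f}  {path}")
```

With these numbers only `src/auth/middleware.ts` falls below the cutoff, which makes it the first file to inspect before deciding whether to adjust the threshold, the prompt, or the reference.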

Comparison Algorithm

The evaluator:

Clones both the modified and expected states
Generates normalized diffs for each file
Calculates similarity using diff-match-patch
Weights by file size/importance
Produces an overall score
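
The scoring steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the evaluator's implementation: the standard-library `difflib.SequenceMatcher` stands in for diff-match-patch, normalization is limited to line endings and trailing whitespace, the weight is the length of the expected file, and a file missing from the agent's output scores 0. All function names are hypothetical, and cloning is replaced by in-memory dictionaries of path to file content.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Normalize line endings and trailing whitespace before comparing."""
    lines = text.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip("\n")


def file_similarity(actual, expected):
    """Similarity in [0, 1]; difflib stands in for diff-match-patch here."""
    matcher = SequenceMatcher(None, normalize(actual), normalize(expected), autojunk=False)
    return matcher.ratio()


def overall_similarity(actual_files, expected_files):
    """Return (overall score, per-file scores) as a size-weighted mean.

    Assumptions: weight = length of the expected file (at least 1), and a
    file missing from the agent's output scores 0. Files that exist only
    in the agent's output are ignored in this sketch.
    """
    per_file = {}
    weighted_sum = 0.0
    total_weight = 0
    for path, expected in expected_files.items():
        if path in actual_files:
            score = file_similarity(actual_files[path], expected)
        else:
            score = 0.0
        weight = max(len(expected), 1)
        per_file[path] = score
        weighted_sum += weight * score
        total_weight += weight
    overall = weighted_sum / total_weight if total_weight else 1.0
    return overall, per_file


# Toy input: one file matches after normalization, one is missing.
expected_files = {"a.py": "x = 1\n", "b.py": "y = 2\n"}
actual_files = {"a.py": "x = 1  \r\n"}
overall, per_file = overall_similarity(actual_files, expected_files)
print(overall, per_file)
```

Weighting by file length keeps a small mismatched file from dominating the result. How the evaluator actually defines "importance" is not specified here, so a real weighting scheme may differ.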

Best Practices

Create reference branches carefully - They define “correct” behavior
Use meaningful thresholds - Based on task type
Combine with other evaluators - expected-diff alone may miss quality issues
Document the reference - Explain what makes it the “golden” solution