expected-diff Evaluator
The expected-diff evaluator compares the agent’s output against a known-correct reference implementation, calculating a similarity score.
When to Use
Use expected-diff when you have:
- A “golden” solution to compare against
- A completed feature branch
- Known-good test fixtures
Requirements
The suite configuration must include reference settings:

```yaml
# Reference configuration (required for expected-diff)
expected_source: branch       # or: commit
expected: feature/completed   # branch name or commit SHA

evaluators:
  - name: expected-diff
    config:
      threshold: 0.85
```

Configuration

```yaml
evaluators:
  - name: expected-diff
    config:
      threshold: 0.85  # Require 85% similarity to pass
```

Threshold Guidelines
| Threshold | Meaning | Use Case |
|---|---|---|
| 1.0 (100%) | Exact match | Generated files (migrations) |
| 0.95-0.99 | Very similar, minor differences | Strict compliance |
| 0.85-0.94 | Mostly similar | Standard implementations |
| 0.70-0.84 | Moderate similarity | Creative solutions |
| < 0.70 | Significantly different | Lenient evaluation |
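
To make the bands above concrete, here is a minimal Python sketch that labels a score and applies a threshold. The function names `describe_band` and `passes` are hypothetical, the gaps between table rows (for example 0.94 to 0.95) are treated as continuous, and counting a score equal to the threshold as a pass is an assumption rather than documented evaluator behavior.

```python
def describe_band(score: float) -> str:
    """Map a similarity score in [0, 1] to the band names from the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score == 1.0:
        return "Exact match"
    if score >= 0.95:
        return "Very similar, minor differences"
    if score >= 0.85:
        return "Mostly similar"
    if score >= 0.70:
        return "Moderate similarity"
    return "Significantly different"


def passes(score: float, threshold: float) -> bool:
    """Assumed pass rule: the score must meet or exceed the threshold."""
    return score >= threshold
```

With a threshold of 0.85, a score of 0.92 falls in the "Mostly similar" band and passes under this assumed rule.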
Recommended Thresholds by Task Type
| Task Type | Recommended | Reasoning |
|---|---|---|
| Database migrations | 0.95+ | Generated files should match |
| Configuration files | 0.90-0.95 | Minor variations acceptable |
| API implementations | 0.80-0.90 | Implementation details may vary |
| Refactoring | 0.75-0.85 | Multiple valid approaches |
| UI components | 0.70-0.80 | Style variations acceptable |
Metrics Produced
| Metric | Type | Description |
|---|---|---|
| `similarity_score` | number | Overall similarity (0-1) |
| `matching_files` | number | Files that match the reference |
| `differing_files` | number | Files that differ |
| `file_similarities` | object | Per-file similarity scores |
| `diff_summary` | string | Human-readable diff description |
Example Output
```json
{
  "name": "expected-diff",
  "status": "passed",
  "metrics": {
    "similarity_score": 0.92,
    "matching_files": 8,
    "differing_files": 2,
    "file_similarities": {
      "src/auth/login.ts": 0.95,
      "src/auth/middleware.ts": 0.88,
      "tests/auth.test.ts": 0.94
    }
  },
  "assertions": {
    "threshold": {
      "expected": 0.85,
      "actual": 0.92,
      "passed": true
    }
  }
}
```

Complete Example
```yaml
name: auth-feature-evaluation
description: Evaluate JWT authentication implementation

repo: https://github.com/example/api-server.git
branch: main

# Reference: the completed feature branch
expected_source: branch
expected: feature/auth-complete

agent:
  type: copilot-cli
  config:
    prompt_file: ./prompts/add-jwt-auth.md

evaluators:
  # Check similarity to reference
  - name: expected-diff
    config:
      threshold: 0.85

  # Also check scope
  - name: git-diff
    config:
      assertions:
        max_files_changed: 10

  # And quality
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        secure_implementation: "Auth is implemented securely. Score 0-1."
```

Using Commit References
Compare against a specific commit instead of a branch:

```yaml
expected_source: commit
expected: abc123def456789

evaluators:
  - name: expected-diff
    config:
      threshold: 0.90
```

Handling Low Similarity
When similarity is lower than expected:
- Review the diff - Check `file_similarities` for problem areas
- Adjust threshold - If variations are acceptable
- Improve prompt - Add more specific instructions
- Update reference - If agent’s solution is equally valid
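
As a starting point for the "review the diff" step, the following Python sketch lists the files whose per-file score falls below a chosen cutoff, lowest first. It assumes the evaluator result is available as a JSON string shaped like the Example Output; the helper name `weakest_files` and the way the result is obtained are hypothetical.

```python
import json


def weakest_files(result_json: str, cutoff: float) -> list[tuple[str, float]]:
    """Return (path, score) pairs scoring below the cutoff, lowest score first."""
    result = json.loads(result_json)
    scores = result["metrics"]["file_similarities"]
    below = [(path, score) for path, score in scores.items() if score < cutoff]
    return sorted(below, key=lambda item: item[1])


# Sample input reusing the per-file values from the Example Output.
sample = json.dumps({
    "name": "expected-diff",
    "metrics": {
        "file_similarities": {
            "src/auth/login.ts": 0.95,
            "src/auth/middleware.ts": 0.88,
            "tests/auth.test.ts": 0.94,
        }
    },
})

for path, score in weakest_files(sample, cutoff=0.90):
    print(f"{path}: {score:.2f}")
```

The files this surfaces are the ones to inspect first before deciding whether to adjust the threshold, the prompt, or the reference.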
Comparison Algorithm
The evaluator:
- Clones both the modified and expected states
- Generates normalized diffs for each file
- Calculates similarity using diff-match-patch
- Weights by file size/importance
- Produces overall score
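
The steps above can be roughly illustrated in Python. This is a sketch under stated assumptions, not the evaluator's implementation: the standard-library `difflib.SequenceMatcher` stands in for diff-match-patch, file contents are passed in as dictionaries instead of being cloned from a repository, the normalization rules (unified line endings, trailing whitespace removed) are guesses, and weighting by the larger file's character count is only one possible reading of "file size/importance". All function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Assumed normalization: unify line endings and drop trailing whitespace."""
    lines = text.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip("\n")


def file_similarity(actual: str, expected: str) -> float:
    """Similarity in [0, 1] between two file contents after normalization."""
    matcher = SequenceMatcher(None, normalize(actual), normalize(expected), autojunk=False)
    return matcher.ratio()


def overall_similarity(
    actual: dict[str, str], expected: dict[str, str]
) -> tuple[float, dict[str, float]]:
    """Size-weighted average of per-file scores over the union of paths.

    A file present on only one side is compared against an empty string.
    """
    per_file: dict[str, float] = {}
    weighted_sum = 0.0
    total_weight = 0
    for path in sorted(set(actual) | set(expected)):
        a = actual.get(path, "")
        e = expected.get(path, "")
        score = file_similarity(a, e)
        weight = max(len(a), len(e), 1)  # assumed weight: larger file's length
        per_file[path] = score
        weighted_sum += score * weight
        total_weight += weight
    overall = weighted_sum / total_weight if total_weight else 1.0
    return overall, per_file
```

Under these assumptions, a file that exists only in the reference scores 0 and pulls the overall score down in proportion to its size, which is one way a missing file could outweigh small differences elsewhere.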
Best Practices
- Create reference branches carefully - They define “correct” behavior
- Use meaningful thresholds - Based on task type
- Combine with other evaluators - `expected-diff` alone may miss quality issues
- Document the reference - Explain what makes it the “golden” solution