Evaluators analyze and score the changes made by AI agents. youBencha provides three built-in evaluators that can be combined for comprehensive assessment.
- **git-diff**: Analyze Git changes made by the agent with assertion-based pass/fail thresholds.
- **expected-diff**: Compare agent output against a known-correct reference implementation.
- **agentic-judge**: Use AI to evaluate code quality based on custom assertions.
| Evaluator | Use Case | Requires Reference? | Configuration |
|---|---|---|---|
| git-diff | Track scope, enforce limits | No | Assertions on metrics |
| expected-diff | Compare to golden solution | Yes | Threshold (0-1) |
| agentic-judge | Subjective quality assessment | No | Custom assertions |
Control the scope of changes allowed by the agent:
```yaml
evaluators:
  - name: git-diff
```

```yaml
evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 5
        max_lines_added: 100
        max_lines_removed: 50
```

```yaml
evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 10    # Limit file count
        max_lines_added: 200     # Limit additions
        max_lines_removed: 100   # Limit deletions
        max_total_changes: 300   # Limit total changes
        min_change_entropy: 0.3  # Enforce distributed changes
        max_change_entropy: 2.5  # Prevent overly scattered changes
```

Metrics Produced:
| Metric | Type | Description |
|---|---|---|
| files_changed | number | Count of modified files |
| lines_added | number | Total lines added |
| lines_removed | number | Total lines removed |
| total_changes | number | Sum of additions and deletions |
| change_entropy | number | Distribution of changes across files |
| changed_files | array | List of modified file paths |
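The docs above don't spell out how change_entropy is computed, but the 0.3-2.5 assertion range is consistent with Shannon entropy in bits over the share of changed lines per file. A minimal sketch of that reading, assuming base-2 Shannon entropy (the function name and formula are illustrative assumptions, not youBencha's documented implementation):

```python
import math

def change_entropy(lines_changed_per_file):
    """Shannon entropy (bits) of the distribution of changed lines
    across files. 0 means one file holds every change; higher values
    mean changes are spread more evenly across files."""
    total = sum(lines_changed_per_file)
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in lines_changed_per_file:
        if n:
            p = n / total
            entropy -= p * math.log2(p)
    return entropy

# All 120 changed lines in one file -> entropy 0.0
print(change_entropy([120]))             # 0.0
# Spread evenly across four files -> entropy 2.0 bits
print(change_entropy([25, 25, 25, 25]))  # 2.0
```

Under this reading, `min_change_entropy: 0.3` rejects runs that dump everything into one file, and `max_change_entropy: 2.5` rejects runs that scatter edits across many files.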
Compare against a known-correct implementation:
```yaml
# Reference is a branch
expected_source: branch
expected: feature/auth-complete

evaluators:
  - name: expected-diff
    config:
      threshold: 0.85
```

```yaml
# Reference is a specific commit
expected_source: commit
expected: abc123def456

evaluators:
  - name: expected-diff
    config:
      threshold: 0.90
```

Threshold Guidelines:
| Range | Use Case | Example |
|---|---|---|
| 0.95+ | Generated files | Migrations, configs |
| 0.80-0.90 | Implementation code | Business logic |
| 0.70-0.80 | Creative tasks | Multiple valid solutions |
| <0.70 | Very lenient | Exploratory changes |
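The threshold is compared against a similarity score in [0, 1] between the agent's diff and the reference diff. youBencha's exact scoring algorithm isn't described here; as an intuition for what a 0.85 threshold means, here is a sketch using Python's standard-library `difflib` (character-level sequence matching, an assumption for illustration only):

```python
import difflib

def diff_similarity(expected: str, actual: str) -> float:
    """Character-level similarity in [0, 1] between two diff texts."""
    return difflib.SequenceMatcher(None, expected, actual).ratio()

expected_diff = "+def add(a, b):\n+    return a + b\n"
actual_diff   = "+def add(x, y):\n+    return x + y\n"

score = diff_similarity(expected_diff, actual_diff)
# Pass/fail is then a simple comparison against the configured threshold.
passed = score >= 0.85
```

Renamed parameters still score high here (around 0.88), so an 0.85 threshold would accept them, while a 0.95 threshold for generated files would not.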
Use AI to evaluate based on custom criteria:
```yaml
evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "The task was completed. Score 1 if yes, 0 if no."
```

```yaml
evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      model: claude-sonnet-4.5
      assertions:
        code_quality: "Code follows best practices. Score 0-1."
        error_handling: "Proper error handling. Score 1 if complete, 0.5 if partial, 0 if missing."
        tests_added: "Appropriate tests were added. Score 0-1."
        documentation: "Code is documented. Score 0-1."
```

```yaml
evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      agent_name: agentic-judge
      prompt_file: ./prompts/strict-security-eval.txt
      assertions:
        security: "No security vulnerabilities. Score 1 if secure, 0 if any issues."
        input_validation: "All inputs are validated. Score 0-1."
```

Begin with git-diff and agentic-judge:
```yaml
evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 10
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "The task was completed. Score 1 if yes, 0 if no."
```

When you have a known-good solution, add expected-diff:
```yaml
expected_source: branch
expected: feature/completed

evaluators:
  - name: git-diff
  - name: expected-diff
    config:
      threshold: 0.85
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "The task was completed. Score 1 if yes, 0 if no."
```

Use all three for thorough assessment:
```yaml
evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 5
        max_lines_added: 100
  - name: expected-diff
    config:
      threshold: 0.80
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        code_quality: "Code follows best practices. Score 0-1."
        tests_added: "Appropriate tests were added. Score 0-1."
```

All evaluators run in parallel for performance. Results are aggregated after all complete.
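The run-in-parallel-then-aggregate flow can be sketched as follows. The evaluator functions and result shape here are hypothetical stand-ins, not youBencha's internal API; the point is only the fan-out/fan-in structure:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical evaluator callables standing in for the three built-ins.
def git_diff_eval():
    return {"name": "git-diff", "passed": True}

def expected_diff_eval():
    return {"name": "expected-diff", "passed": True}

def agentic_judge_eval():
    return {"name": "agentic-judge", "passed": False}

evaluators = [git_diff_eval, expected_diff_eval, agentic_judge_eval]

# Fan out: run every evaluator concurrently.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(e) for e in evaluators]
    results = [f.result() for f in futures]  # fan in: wait for all

# Aggregate: the run passes only if every evaluator passed.
overall_pass = all(r["passed"] for r in results)
print(overall_pass)  # False: one evaluator failed
```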
```
 Agent Execution Complete
             │
             ▼
        ┌──────────┐
        │ Parallel │
        └──────────┘
             │
     ┌───────┼───────┐
     ▼       ▼       ▼
 git-diff  expected-diff  agentic-judge
     │       │       │
     └───────┼───────┘
             │
             ▼
     Results Aggregation
             │
             ▼
   Pass/Fail Determination
```

Recommended combinations by use case:
| Use Case | Evaluators | Why |
|---|---|---|
| Quick smoke test | git-diff only | Fast, no AI cost |
| Standard evaluation | git-diff + agentic-judge | Balance of scope and quality |
| Regression testing | git-diff + expected-diff | Deterministic comparison |
| Full assessment | All three | Comprehensive but slower |