
# Evaluators

Evaluators analyze and score the changes made by AI agents. youBencha provides three built-in evaluators that can be combined for comprehensive assessment.

- **git-diff**: analyzes the Git changes made by the agent, with assertion-based pass/fail thresholds on change metrics.
- **expected-diff**: compares the agent's output against a known-correct reference implementation.
- **agentic-judge**: uses an AI model to evaluate code quality against custom assertions.

| Evaluator | Use Case | Requires Reference? | Configuration |
|---|---|---|---|
| git-diff | Track scope, enforce limits | No | Assertions on metrics |
| expected-diff | Compare to golden solution | Yes | Threshold (0-1) |
| agentic-judge | Subjective quality assessment | No | Custom assertions |

Use git-diff to track and limit the scope of the changes an agent makes:

```yaml
evaluators:
  - name: git-diff
```

Metrics Produced:

| Metric | Type | Description |
|---|---|---|
| files_changed | number | Count of modified files |
| lines_added | number | Total lines added |
| lines_removed | number | Total lines removed |
| total_changes | number | Sum of lines added and removed |
| change_entropy | number | Distribution of changes across files |
| changed_files | array | List of modified file paths |
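
These metrics can be gated with assertions. As a minimal sketch that uses only the assertion keys appearing in the combined examples later on this page (max_files_changed, max_lines_added); the limit values are illustrative:

```yaml
evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 10   # fail if more than 10 files are modified
        max_lines_added: 100    # fail if more than 100 lines are added
```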

Use expected-diff to compare the agent's changes against a known-correct implementation:

```yaml
# Reference is a branch
expected_source: branch
expected: feature/auth-complete
evaluators:
  - name: expected-diff
    config:
      threshold: 0.85
```

Threshold Guidelines:

| Range | Use Case | Example |
|---|---|---|
| 0.95+ | Generated files | Migrations, configs |
| 0.80-0.90 | Implementation code | Business logic |
| 0.70-0.80 | Creative tasks | Multiple valid solutions |
| <0.70 | Very lenient | Exploratory changes |
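
As an illustrative sketch of the strict end of this range, using the same keys shown in the example above (the branch name and threshold value are placeholders, not part of the reference documentation):

```yaml
# Strict comparison for generated files (migrations, configs)
expected_source: branch
expected: feature/migrations-complete   # illustrative branch name
evaluators:
  - name: expected-diff
    config:
      threshold: 0.95
```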

Use agentic-judge to have an AI model score the changes against custom assertions:

```yaml
evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "The task was completed. Score 1 if yes, 0 if no."
```

When starting out, combine git-diff and agentic-judge:

```yaml
evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 10
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "The task was completed. Score 1 if yes, 0 if no."
```

When you have a known-good solution, add expected-diff:

```yaml
expected_source: branch
expected: feature/completed
evaluators:
  - name: git-diff
  - name: expected-diff
    config:
      threshold: 0.85
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "The task was completed. Score 1 if yes, 0 if no."
```

Use all three evaluators for a thorough assessment:

```yaml
evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 5
        max_lines_added: 100
  - name: expected-diff
    config:
      threshold: 0.80
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        code_quality: "Code follows best practices. Score 0-1."
        tests_added: "Appropriate tests were added. Score 0-1."
```

All evaluators run in parallel for performance; results are aggregated once they have all completed.

```
     Agent Execution Complete
                 │
    ┌────────────┼───────────────┐
    ▼            ▼               ▼
git-diff   expected-diff   agentic-judge
    │            │               │
    └────────────┼───────────────┘
                 ▼
        Results Aggregation
                 │
                 ▼
      Pass/Fail Determination
```

Recommended evaluator combinations by use case:

| Use Case | Evaluators | Why |
|---|---|---|
| Quick smoke test | git-diff only | Fast, no AI cost |
| Standard evaluation | git-diff + agentic-judge | Balance of scope and quality |
| Regression testing | git-diff + expected-diff | Deterministic comparison |
| Full assessment | All three | Comprehensive but slower |
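
The regression-testing row pairs git-diff with expected-diff and skips the AI judge; it is the one combination not shown verbatim above. A sketch using only keys that appear elsewhere on this page (the branch name and threshold value are illustrative):

```yaml
# Regression testing: deterministic comparison, no AI cost
expected_source: branch
expected: feature/known-good   # illustrative branch name
evaluators:
  - name: git-diff
  - name: expected-diff
    config:
      threshold: 0.90
```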