
# Evaluators

Evaluators analyze and score the changes made by AI agents. youBencha provides three built-in evaluators that can be combined for comprehensive assessment.

- **git-diff**: analyzes the Git changes made by the agent, with assertion-based pass/fail thresholds on change metrics.
- **expected-diff**: compares the agent's output against a known-correct reference implementation.
- **agentic-judge**: uses an AI model to evaluate code quality against custom assertions.

| Evaluator | Use Case | Requires Reference? | Configuration |
|---|---|---|---|
| git-diff | Track scope, enforce limits | No | Assertions on metrics |
| expected-diff | Compare to golden solution | Yes | Threshold (0-1) |
| agentic-judge | Subjective quality assessment | No | Custom assertions |

Use git-diff to track and limit the scope of the changes an agent makes:

```yaml
evaluators:
  - name: git-diff
```

Metrics Produced:

| Metric | Type | Description |
|---|---|---|
| files_changed | number | Count of modified files |
| lines_added | number | Total lines added |
| lines_removed | number | Total lines removed |
| total_changes | number | Sum of lines added and removed |
| change_entropy | number | Distribution of changes across files |
| changed_files | array | List of modified file paths |
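
These metrics can be gated with assertions. As a minimal sketch that uses only the assertion keys appearing in the combined examples later on this page (max_files_changed, max_lines_added); the limit values are illustrative:

```yaml
evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 10   # fail if more than 10 files are modified
        max_lines_added: 100    # fail if more than 100 lines are added
```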

Use expected-diff to compare the agent's changes against a known-correct implementation:

```yaml
# Reference is a branch
expected_source: branch
expected: feature/auth-complete
evaluators:
  - name: expected-diff
    config:
      threshold: 0.85
```

Threshold Guidelines:

| Range | Use Case | Example |
|---|---|---|
| 0.95+ | Generated files | Migrations, configs |
| 0.80-0.90 | Implementation code | Business logic |
| 0.70-0.80 | Creative tasks | Multiple valid solutions |
| <0.70 | Very lenient | Exploratory changes |
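
As an illustrative sketch of the strict end of this range, using the same keys shown in the example above (the branch name and threshold value are placeholders, not part of the reference documentation):

```yaml
# Strict comparison for generated files (migrations, configs)
expected_source: branch
expected: feature/migrations-complete   # illustrative branch name
evaluators:
  - name: expected-diff
    config:
      threshold: 0.95
```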

Use agentic-judge to have an AI model score the changes against custom assertions:

```yaml
evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "The task was completed. Score 1 if yes, 0 if no."
```

When starting out, combine git-diff and agentic-judge:

```yaml
evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 10
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "The task was completed. Score 1 if yes, 0 if no."
```

When you have a known-good solution, add expected-diff:

```yaml
expected_source: branch
expected: feature/completed
evaluators:
  - name: git-diff
  - name: expected-diff
    config:
      threshold: 0.85
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "The task was completed. Score 1 if yes, 0 if no."
```

Use all three evaluators for a thorough assessment:

```yaml
evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 5
        max_lines_added: 100
  - name: expected-diff
    config:
      threshold: 0.80
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        code_quality: "Code follows best practices. Score 0-1."
        tests_added: "Appropriate tests were added. Score 0-1."
```

All evaluators run in parallel for performance; results are aggregated once they have all completed.

```
     Agent Execution Complete
                 │
    ┌────────────┼───────────────┐
    ▼            ▼               ▼
git-diff   expected-diff   agentic-judge
    │            │               │
    └────────────┼───────────────┘
                 ▼
        Results Aggregation
                 │
                 ▼
      Pass/Fail Determination
```

Recommended evaluator combinations by use case:

| Use Case | Evaluators | Why |
|---|---|---|
| Quick smoke test | git-diff only | Fast, no AI cost |
| Standard evaluation | git-diff + agentic-judge | Balance of scope and quality |
| Regression testing | git-diff + expected-diff | Deterministic comparison |
| Full assessment | All three | Comprehensive but slower |
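
The regression-testing row pairs git-diff with expected-diff and skips the AI judge; it is the one combination not shown verbatim above. A sketch using only keys that appear elsewhere on this page (the branch name and threshold value are illustrative):

```yaml
# Regression testing: deterministic comparison, no AI cost
expected_source: branch
expected: feature/known-good   # illustrative branch name
evaluators:
  - name: git-diff
  - name: expected-diff
    config:
      threshold: 0.90
```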