agentic-judge Evaluator

The agentic-judge evaluator uses an AI agent to evaluate code quality based on your custom assertions.

Use agentic-judge when you need:

  • Subjective quality assessment
  • Complex evaluation criteria
  • Domain-specific checks
  • Anything that’s hard to verify programmatically

Start with a minimal configuration:

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "The requested task was completed. Score 1 if yes, 0 if no."

All configuration options:

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli              # Agent adapter type (required)
      agent_name: agentic-judge      # Named agent (optional)
      model: claude-sonnet-4.5       # Model to use (optional)
      prompt_file: ./strict-eval.txt # Custom instructions (optional)
      assertions:
        assertion_key: "Description of what to evaluate. Score 0-1."

Include the scoring scale in your assertion text:

assertions:
  # Binary (0 or 1)
  has_tests: "Unit tests were added. Score 1 if yes, 0 if no."
  # Partial credit (0, 0.5, 1)
  error_handling: "Error handling is complete. Score 1 if comprehensive, 0.5 if partial, 0 if missing."
  # Granular (0-1 range)
  code_quality: "Code follows best practices. Score from 0 (poor) to 1 (excellent)."

Write assertions that are specific and measurable:

assertions:
  # Specific and measurable
  auth_middleware: "Auth middleware protects all /api/* routes. Score 1 if all routes protected, 0 if any exposed."
  # Clear criteria
  input_validation: "All user inputs are validated. Score 1 if validated with proper error messages, 0.5 if validated but missing messages, 0 if no validation."

Avoid vague criteria:

assertions:
  # ❌ Too vague
  good_code: "The code is good."
  # ✅ Better
  maintainable: "Functions are under 50 lines, use descriptive names, and have clear single responsibilities. Score 0-1."

Break complex evaluations into focused judges:

evaluators:
  # Judge 1: Security
  - name: agentic-judge-security
    config:
      type: copilot-cli
      assertions:
        input_sanitized: "User input is sanitized. Score 0-1."
        no_sql_injection: "No SQL injection vulnerabilities. Score 0-1."
        auth_secure: "Authentication is properly implemented. Score 0-1."

  # Judge 2: Testing
  - name: agentic-judge-testing
    config:
      type: copilot-cli
      assertions:
        unit_tests: "Unit tests cover new code. Score 0-1."
        edge_cases: "Edge cases are tested. Score 0-1."

  # Judge 3: Documentation
  - name: agentic-judge-docs
    config:
      type: copilot-cli
      assertions:
        jsdoc: "Functions have JSDoc comments. Score 0-1."
        readme: "README is updated if needed. Score 0-1."

Name each judge with a consistent pattern:

Pattern          | Example
Hyphen separator | agentic-judge-security
Colon separator  | agentic-judge:security
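
For example, a sketch of the security judge above named with the colon separator instead (only the name changes; the rest of the config is identical):

evaluators:
  - name: agentic-judge:security
    config:
      type: copilot-cli
      assertions:
        input_sanitized: "User input is sanitized. Score 0-1."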

Using multiple judges gives you:

  • Focused evaluation - Each judge evaluates 1-3 related assertions
  • Cleaner results - Easy to identify which area failed
  • Parallel execution - All judges run concurrently
  • Independent scoring - Pass/fail per category

Provide additional context with prompt_file:

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      prompt_file: ./prompts/security-eval.txt
      assertions:
        secure: "Implementation is secure. Score 0-1."

prompts/security-eval.txt:
You are a senior security engineer reviewing code changes.
Be strict about security best practices.
Check for:
- Input validation
- SQL injection prevention
- XSS prevention
- Proper authentication
- Authorization checks
Score 0 for any potential vulnerability.

Use a custom agent from .github/agents/:

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      agent_name: strict-reviewer
      assertions:
        compliant: "Code meets standards. Score 0-1."

Specify which AI model to use:

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      model: claude-sonnet-4.5
      assertions:
        quality: "High quality code. Score 0-1."

The result includes a score, pass/fail status, and reasoning for each assertion, plus aggregate metrics:

{
  "name": "agentic-judge",
  "status": "passed",
  "metrics": {
    "assertions": {
      "error_handling": { "score": 1.0, "passed": true, "reasoning": "All functions have try-catch blocks with proper error messages." },
      "documentation": { "score": 0.8, "passed": true, "reasoning": "Most functions have JSDoc, but 2 are missing @param descriptions." },
      "tests_added": { "score": 0.5, "passed": false, "reasoning": "Basic tests exist but edge cases are not covered." }
    },
    "average_score": 0.77,
    "passed_count": 2,
    "failed_count": 1
  }
}

Best practices:

  1. Be specific - Clear, measurable assertions get consistent scores
  2. Include the scale - Always specify the scoring range in the assertion text
  3. Use multiple judges - Break complex evaluations into focused areas
  4. Test assertions - Run evaluations to calibrate your criteria
  5. Iterate - Refine assertions based on results
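
Putting it together, a sketch that combines the options documented above (the model, prompt file path, and agent name are illustrative values, not defaults):

evaluators:
  # Security judge with a custom prompt and an explicit model
  - name: agentic-judge-security
    config:
      type: copilot-cli
      model: claude-sonnet-4.5
      prompt_file: ./prompts/security-eval.txt
      assertions:
        input_sanitized: "User input is sanitized. Score 0-1."
        auth_secure: "Authentication is properly implemented. Score 0-1."

  # Testing judge driven by a custom agent from .github/agents/
  - name: agentic-judge-testing
    config:
      type: copilot-cli
      agent_name: strict-reviewer
      assertions:
        unit_tests: "Unit tests cover new code. Score 0-1."
        edge_cases: "Edge cases are tested. Score 0-1."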