agentic-judge Evaluator

The agentic-judge evaluator uses an AI agent to evaluate code quality based on your custom assertions.

Use agentic-judge when you need:

  • Subjective quality assessment
  • Complex evaluation criteria
  • Domain-specific checks
  • Anything that’s hard to verify programmatically

Start with a minimal configuration:

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "The requested task was completed. Score 1 if yes, 0 if no."

All configuration options:

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli              # Agent adapter type (required)
      agent_name: agentic-judge      # Named agent (optional)
      model: claude-sonnet-4.5       # Model to use (optional)
      prompt_file: ./strict-eval.txt # Custom instructions (optional)
      assertions:
        assertion_key: "Description of what to evaluate. Score 0-1."

Include the scoring scale in your assertion text:

assertions:
  # Binary (0 or 1)
  has_tests: "Unit tests were added. Score 1 if yes, 0 if no."
  # Partial credit (0, 0.5, 1)
  error_handling: "Error handling is complete. Score 1 if comprehensive, 0.5 if partial, 0 if missing."
  # Granular (0-1 range)
  code_quality: "Code follows best practices. Score from 0 (poor) to 1 (excellent)."

Write assertions that are specific and measurable:

assertions:
  # Specific and measurable
  auth_middleware: "Auth middleware protects all /api/* routes. Score 1 if all routes protected, 0 if any exposed."
  # Clear criteria
  input_validation: "All user inputs are validated. Score 1 if validated with proper error messages, 0.5 if validated but missing messages, 0 if no validation."

Avoid vague criteria:

assertions:
  # ❌ Too vague
  good_code: "The code is good."
  # ✅ Better
  maintainable: "Functions are under 50 lines, use descriptive names, and have clear single responsibilities. Score 0-1."

Break complex evaluations into focused judges:

evaluators:
  # Judge 1: Security
  - name: agentic-judge-security
    config:
      type: copilot-cli
      assertions:
        input_sanitized: "User input is sanitized. Score 0-1."
        no_sql_injection: "No SQL injection vulnerabilities. Score 0-1."
        auth_secure: "Authentication is properly implemented. Score 0-1."

  # Judge 2: Testing
  - name: agentic-judge-testing
    config:
      type: copilot-cli
      assertions:
        unit_tests: "Unit tests cover new code. Score 0-1."
        edge_cases: "Edge cases are tested. Score 0-1."

  # Judge 3: Documentation
  - name: agentic-judge-docs
    config:
      type: copilot-cli
      assertions:
        jsdoc: "Functions have JSDoc comments. Score 0-1."
        readme: "README is updated if needed. Score 0-1."

Name each judge with a consistent pattern:

Pattern          | Example
Hyphen separator | agentic-judge-security
Colon separator  | agentic-judge:security
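
For example, a sketch of the security judge above named with the colon separator instead (only the name changes; the rest of the config is identical):

evaluators:
  - name: agentic-judge:security
    config:
      type: copilot-cli
      assertions:
        input_sanitized: "User input is sanitized. Score 0-1."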

Using multiple judges gives you:

  • Focused evaluation - Each judge evaluates 1-3 related assertions
  • Cleaner results - Easy to identify which area failed
  • Parallel execution - All judges run concurrently
  • Independent scoring - Pass/fail per category

Provide additional context with prompt_file:

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      prompt_file: ./prompts/security-eval.txt
      assertions:
        secure: "Implementation is secure. Score 0-1."

prompts/security-eval.txt:
You are a senior security engineer reviewing code changes.
Be strict about security best practices.
Check for:
- Input validation
- SQL injection prevention
- XSS prevention
- Proper authentication
- Authorization checks
Score 0 for any potential vulnerability.

Use a custom agent from .github/agents/:

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      agent_name: strict-reviewer
      assertions:
        compliant: "Code meets standards. Score 0-1."

Specify which AI model to use:

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      model: claude-sonnet-4.5
      assertions:
        quality: "High quality code. Score 0-1."

The result includes a score, pass/fail status, and reasoning for each assertion, plus aggregate metrics:

{
  "name": "agentic-judge",
  "status": "passed",
  "metrics": {
    "assertions": {
      "error_handling": { "score": 1.0, "passed": true, "reasoning": "All functions have try-catch blocks with proper error messages." },
      "documentation": { "score": 0.8, "passed": true, "reasoning": "Most functions have JSDoc, but 2 are missing @param descriptions." },
      "tests_added": { "score": 0.5, "passed": false, "reasoning": "Basic tests exist but edge cases are not covered." }
    },
    "average_score": 0.77,
    "passed_count": 2,
    "failed_count": 1
  }
}

Best practices:

  1. Be specific - Clear, measurable assertions get consistent scores
  2. Include the scale - Always specify the scoring range in the assertion text
  3. Use multiple judges - Break complex evaluations into focused areas
  4. Test assertions - Run evaluations to calibrate your criteria
  5. Iterate - Refine assertions based on results
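
Putting it together, a sketch that combines the options documented above (the model, prompt file path, and agent name are illustrative values, not defaults):

evaluators:
  # Security judge with a custom prompt and an explicit model
  - name: agentic-judge-security
    config:
      type: copilot-cli
      model: claude-sonnet-4.5
      prompt_file: ./prompts/security-eval.txt
      assertions:
        input_sanitized: "User input is sanitized. Score 0-1."
        auth_secure: "Authentication is properly implemented. Score 0-1."

  # Testing judge driven by a custom agent from .github/agents/
  - name: agentic-judge-testing
    config:
      type: copilot-cli
      agent_name: strict-reviewer
      assertions:
        unit_tests: "Unit tests cover new code. Score 0-1."
        edge_cases: "Edge cases are tested. Score 0-1."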