# agentic-judge Evaluator
The agentic-judge evaluator uses an AI agent to evaluate code quality based on your custom assertions.
## When to Use

Use agentic-judge when you need:
- Subjective quality assessment
- Complex evaluation criteria
- Domain-specific checks
- Anything that’s hard to verify programmatically
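For instance, a naming-quality check is hard to script but easy to phrase as an assertion. The configuration shape follows the examples below; the assertion key and wording here are illustrative, not prescribed:

```yaml
evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        # Illustrative subjective check; tune the wording to your codebase.
        readable_naming: "Identifiers convey intent without extra comments. Score from 0 (cryptic) to 1 (self-documenting)."
```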
## Basic Usage

```yaml
evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "The requested task was completed. Score 1 if yes, 0 if no."
```

## Configuration
```yaml
evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli               # Agent adapter type (required)
      agent_name: agentic-judge       # Named agent (optional)
      model: claude-sonnet-4.5        # Model to use (optional)
      prompt_file: ./strict-eval.txt  # Custom instructions (optional)
      assertions:
        assertion_key: "Description of what to evaluate. Score 0-1."
```

## Writing Assertions
### Scoring Guidelines

Include the scoring scale in your assertion text:
```yaml
assertions:
  # Binary (0 or 1)
  has_tests: "Unit tests were added. Score 1 if yes, 0 if no."

  # Partial credit (0, 0.5, 1)
  error_handling: "Error handling is complete. Score 1 if comprehensive, 0.5 if partial, 0 if missing."

  # Granular (0-1 range)
  code_quality: "Code follows best practices. Score from 0 (poor) to 1 (excellent)."
```

### Good Assertions
```yaml
assertions:
  # Specific and measurable
  auth_middleware: "Auth middleware protects all /api/* routes. Score 1 if all routes protected, 0 if any exposed."

  # Clear criteria
  input_validation: "All user inputs are validated. Score 1 if validated with proper error messages, 0.5 if validated but missing messages, 0 if no validation."
```

### Avoid Vague Assertions
```yaml
assertions:
  # ❌ Too vague
  good_code: "The code is good."

  # ✅ Better
  maintainable: "Functions are under 50 lines, use descriptive names, and have clear single responsibilities. Score 0-1."
```

## Multiple Judges
Break complex evaluations into focused judges:
```yaml
evaluators:
  # Judge 1: Security
  - name: agentic-judge-security
    config:
      type: copilot-cli
      assertions:
        input_sanitized: "User input is sanitized. Score 0-1."
        no_sql_injection: "No SQL injection vulnerabilities. Score 0-1."
        auth_secure: "Authentication is properly implemented. Score 0-1."

  # Judge 2: Testing
  - name: agentic-judge-testing
    config:
      type: copilot-cli
      assertions:
        unit_tests: "Unit tests cover new code. Score 0-1."
        edge_cases: "Edge cases are tested. Score 0-1."

  # Judge 3: Documentation
  - name: agentic-judge-docs
    config:
      type: copilot-cli
      assertions:
        jsdoc: "Functions have JSDoc comments. Score 0-1."
        readme: "README is updated if needed. Score 0-1."
```

### Naming Conventions
| Pattern | Example |
|---|---|
| Hyphen separator | `agentic-judge-security` |
| Colon separator | `agentic-judge:security` |
### Benefits of Multiple Judges

- **Focused evaluation** - Each judge evaluates 1-3 related assertions
- **Cleaner results** - Easy to identify which area failed
- **Parallel execution** - All judges run concurrently
- **Independent scoring** - Pass/fail per category, as sketched below
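For instance, with the three judges above, a failing run might surface like this. The summary shape is purely illustrative (the real per-evaluator format is shown under Example Output):

```yaml
# Hypothetical run summary; the judge names match the example above,
# but this shape is illustrative, not the evaluator's actual output.
agentic-judge-security: passed   # all three security assertions scored 1
agentic-judge-testing: failed    # edge_cases scored 0, failing this judge
agentic-judge-docs: passed       # jsdoc and readme both scored 1
```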
## Custom Instructions

Provide additional context with `prompt_file`:
```yaml
evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      prompt_file: ./prompts/security-eval.txt
      assertions:
        secure: "Implementation is secure. Score 0-1."
```

The referenced file holds the judge's instructions:

```text
You are a senior security engineer reviewing code changes.

Be strict about security best practices.

Check for:
- Input validation
- SQL injection prevention
- XSS prevention
- Proper authentication
- Authorization checks

Score 0 for any potential vulnerability.
```

## Using Named Agents
Use a custom agent from `.github/agents/`:
```yaml
evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      agent_name: strict-reviewer
      assertions:
        compliant: "Code meets standards. Score 0-1."
```
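If the agent doesn't exist yet, named agents are typically markdown files with frontmatter in `.github/agents/`. The file below (`strict-reviewer.md`), its frontmatter fields, and the instruction text are a hypothetical sketch; check your Copilot documentation for the exact schema:

```md
---
# Hypothetical .github/agents/strict-reviewer.md; verify the
# frontmatter schema against your Copilot setup before use.
name: strict-reviewer
description: Reviews changes against strict internal code standards.
---

You are a strict code reviewer. Score harshly: fail any change that
lacks tests, uses unclear names, or omits error handling.
```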
## Model Selection

Specify which AI model to use:
```yaml
evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      model: claude-sonnet-4.5
      assertions:
        quality: "High quality code. Score 0-1."
```

## Example Output
```json
{
  "name": "agentic-judge",
  "status": "passed",
  "metrics": {
    "assertions": {
      "error_handling": {
        "score": 1.0,
        "passed": true,
        "reasoning": "All functions have try-catch blocks with proper error messages."
      },
      "documentation": {
        "score": 0.8,
        "passed": true,
        "reasoning": "Most functions have JSDoc, but 2 are missing @param descriptions."
      },
      "tests_added": {
        "score": 0.5,
        "passed": false,
        "reasoning": "Basic tests exist but edge cases are not covered."
      }
    },
    "average_score": 0.77,
    "passed_count": 2,
    "failed_count": 1
  }
}
```

## Best Practices
- **Be specific** - Clear, measurable assertions get consistent scores
- **Include scale** - Always specify the scoring range
- **Use multiple judges** - Break complex evaluations into focused areas
- **Test assertions** - Run evaluations to calibrate your criteria
- **Iterate** - Refine assertions based on results, as in the sketch below
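To illustrate the last two points, a calibration pass might look like this; both the before and after wordings are hypothetical:

```yaml
assertions:
  # Before calibration: scores varied run to run because "handled" was undefined.
  # errors_handled: "Errors are handled. Score 0-1."

  # After calibration: explicit, measurable criteria.
  errors_handled: "Every async call logs failures and returns a user-facing error message. Score 1 if all calls are covered, 0.5 if some are, 0 if none."
```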