
Best Practices

Tips and recommendations for getting the most out of youBencha evaluations.

Begin with basic evaluators and expand:

```yaml
# Start here
evaluators:
  - name: git-diff
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "Task was completed. Score 1 if yes, 0 if no."
```

Name your suites and assertions clearly:

```yaml
name: auth-jwt-implementation
description: Evaluate JWT authentication for login endpoint
evaluators:
  - name: agentic-judge
    config:
      assertions:
        jwt_tokens: "JWT tokens are properly generated. Score 0-1."
        token_expiry: "Tokens have appropriate expiry times. Score 0-1."
```

Vague assertions produce inconsistent scores:

```yaml
# ❌ Vague
assertions:
  good: "Code is good."

# ✅ Specific
assertions:
  error_handling: |
    All async operations have try-catch blocks.
    Error messages include context for debugging.
    Score 1 if complete, 0.5 if partial, 0 if missing.
```

Track configurations in git:

```
youbencha/
├── suites/
│   ├── auth-feature.yaml
│   ├── api-endpoints.yaml
│   └── refactoring.yaml
├── evaluators/
│   ├── security.yaml
│   └── testing.yaml
├── prompts/
│   ├── add-auth.md
│   └── add-tests.md
└── history/
    └── .gitkeep
```

Instead of one judge with many assertions:

```yaml
# ✅ Better: Multiple focused judges
evaluators:
  - name: agentic-judge-security
    config:
      type: copilot-cli
      assertions:
        input_validation: "Inputs are validated. Score 0-1."
        no_injection: "No injection vulnerabilities. Score 0-1."
  - name: agentic-judge-testing
    config:
      type: copilot-cli
      assertions:
        unit_tests: "Unit tests added. Score 0-1."
        edge_cases: "Edge cases covered. Score 0-1."
```

Match thresholds to task type:

| Task Type | expected-diff Threshold |
| --- | --- |
| Generated files | 0.95+ |
| Standard features | 0.80-0.90 |
| Creative tasks | 0.70-0.80 |
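
For example, a suite that checks generated files might pin the strictest band from the table. A minimal sketch using the expected-diff evaluator's threshold option (the exact value is a judgment call for your task):

```yaml
evaluators:
  - name: expected-diff
    config:
      threshold: 0.95 # generated files should match the expected output almost exactly
```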

Use multiple evaluation dimensions:

```yaml
evaluators:
  # Scope: How much changed
  - name: git-diff
    config:
      assertions:
        max_files_changed: 10
  # Correctness: Does it match expected
  - name: expected-diff
    config:
      threshold: 0.85
  # Quality: Is it well-written
  - name: agentic-judge
    config:
      assertions:
        quality: "Code follows best practices. Score 0-1."
```

Start building history from day one:

```yaml
post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./history/evaluations.jsonl
      append: true
```

Keep test cases stable for trend analysis (see the sketch after this list):

- Same repository and branch
- Same prompts
- Same evaluators
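
In practice, the suite file itself should change as little as possible between runs. A minimal sketch of what stays fixed, assuming the target repository and branch are pinned wherever your setup checks out the code:

```yaml
# Held constant across runs so history points stay comparable
name: auth-jwt-implementation
agent:
  type: copilot-cli
  config:
    prompt_file: ./prompts/add-auth.md # same prompt every run
evaluators:                            # same evaluator set every run
  - name: git-diff
  - name: expected-diff
    config:
      threshold: 0.85
```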

Set up automated checks:

```bash
#!/bin/bash
# Compare the two most recent runs in the evaluation history
PREV=$(tail -n 2 history.jsonl | head -n 1 | jq '.summary.passed')
CURR=$(tail -n 1 history.jsonl | jq '.summary.passed')
if [ "$CURR" -lt "$PREV" ]; then
  echo "⚠️ REGRESSION: $PREV → $CURR passed"
  # Send alert
fi
```

Include context and requirements:

`prompts/add-auth.md`:

```md
# Add JWT Authentication

## Context
This is a Node.js Express API using TypeScript.
Authentication should use the jsonwebtoken package.

## Requirements
1. Add POST /api/auth/login endpoint
2. Generate JWT with 24h expiry
3. Add auth middleware for protected routes
4. Return proper HTTP status codes

## Constraints
- Use environment variable JWT_SECRET
- Do not modify existing endpoints
```

Keep prompts separate for clarity:

```yaml
agent:
  type: copilot-cli
  config:
    prompt_file: ./prompts/add-auth.md
```

Catch issues before merge:

```yaml
on:
  pull_request:
    branches: [main]
```

Gate merges on evaluation:

```yaml
- name: Check Results
  run: |
    STATUS=$(jq -r '.summary.overall_status' results.json)
    [ "$STATUS" = "passed" ] || exit 1
```

Keep results for debugging:

```yaml
- uses: actions/upload-artifact@v4
  with:
    name: evaluation-results
    path: .youbencha-workspace/
```

During development:

```yaml
agent:
  model: claude-haiku-4.5 # Fast and cheap
```

For final evaluation:

```yaml
agent:
  model: claude-sonnet-4.5 # Best quality
```

Don’t let workspaces accumulate:

```bash
# After successful runs
yb run -c suite.yaml --delete-workspace

# Or periodic cleanup
find .youbencha-workspace -maxdepth 1 -mtime +7 -exec rm -rf {} \;
```

Match timeouts to task complexity:

```yaml
timeout: 300000 # 5 minutes for simple tasks
timeout: 600000 # 10 minutes for complex features
```

Catch config errors early:

```bash
yb validate -c suite.yaml -v && yb run -c suite.yaml
```

When evaluations fail:

```bash
# View agent output
cat .youbencha-workspace/run-*/artifacts/youbencha.log.json

# Check the diff
cat .youbencha-workspace/run-*/artifacts/git-diff.patch

# Review modified code
ls -la .youbencha-workspace/run-*/src-modified/
```

Start with passing tests, then add complexity (see the sketch after this list):

  1. Simple task → verify agent works
  2. Add git-diff → verify scope limits
  3. Add agentic-judge → verify quality checks
  4. Add expected-diff → verify correctness
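
Once each stage passes, the finished suite combines all of them. A sketch of the end state, assembled from the evaluator options shown earlier in this guide:

```yaml
# Step 1: the agent runs the task; steps 2-4 layer on evaluators
evaluators:
  - name: git-diff        # step 2: scope limits
    config:
      assertions:
        max_files_changed: 10
  - name: agentic-judge   # step 3: quality checks
    config:
      type: copilot-cli
      assertions:
        quality: "Code follows best practices. Score 0-1."
  - name: expected-diff   # step 4: correctness against a known-good diff
    config:
      threshold: 0.85
```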

| Category | Key Practice |
| --- | --- |
| Configuration | Start simple, be specific, version control |
| Evaluators | Multiple focused judges, combine types |
| Results | Export early, consistent tests, alert on regression |
| Prompts | Be specific, use prompt files |
| CI/CD | Run on PRs, fail on regression, archive artifacts |
| Performance | Fast models for dev, cleanup workspaces |