# Best Practices
Tips and recommendations for getting the most out of youBencha evaluations.
## Configuration Best Practices
### Start Simple
Begin with basic evaluators and expand:
```yaml
# Start here
evaluators:
  - name: git-diff
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "Task was completed. Score 1 if yes, 0 if no."
```

### Use Descriptive Names
Name your suites and assertions clearly:
```yaml
name: auth-jwt-implementation
description: Evaluate JWT authentication for login endpoint

evaluators:
  - name: agentic-judge
    config:
      assertions:
        jwt_tokens: "JWT tokens are properly generated. Score 0-1."
        token_expiry: "Tokens have appropriate expiry times. Score 0-1."
```

### Make Assertions Specific
Vague assertions produce inconsistent scores:
```yaml
# ❌ Vague
assertions:
  good: "Code is good."
```
```yaml
# ✅ Specific
assertions:
  error_handling: |
    All async operations have try-catch blocks.
    Error messages include context for debugging.
    Score 1 if complete, 0.5 if partial, 0 if missing.
```

### Version Control Everything
Track configurations in git:
```
youbencha/
├── suites/
│   ├── auth-feature.yaml
│   ├── api-endpoints.yaml
│   └── refactoring.yaml
├── evaluators/
│   ├── security.yaml
│   └── testing.yaml
├── prompts/
│   ├── add-auth.md
│   └── add-tests.md
└── history/
    └── .gitkeep
```

## Evaluator Best Practices
### Use Multiple Focused Judges
Instead of one judge with many assertions:
```yaml
# ✅ Better: Multiple focused judges
evaluators:
  - name: agentic-judge-security
    config:
      type: copilot-cli
      assertions:
        input_validation: "Inputs are validated. Score 0-1."
        no_injection: "No injection vulnerabilities. Score 0-1."

  - name: agentic-judge-testing
    config:
      type: copilot-cli
      assertions:
        unit_tests: "Unit tests added. Score 0-1."
        edge_cases: "Edge cases covered. Score 0-1."
```

### Set Appropriate Thresholds
Match thresholds to task type:
| Task Type | expected-diff Threshold |
|---|---|
| Generated files | 0.95+ |
| Standard features | 0.80-0.90 |
| Creative tasks | 0.70-0.80 |
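For example, a suite for generated files would pin the threshold near the top of that range. A minimal sketch, reusing the expected-diff configuration shown under Combine Evaluators below:

```yaml
# Sketch: tighten or relax the threshold to match the task type
evaluators:
  - name: expected-diff
    config:
      threshold: 0.95  # generated files: near-exact match expected
```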
### Combine Evaluators
Use multiple evaluation dimensions:
```yaml
evaluators:
  # Scope: How much changed
  - name: git-diff
    config:
      assertions:
        max_files_changed: 10

  # Correctness: Does it match expected
  - name: expected-diff
    config:
      threshold: 0.85

  # Quality: Is it well-written
  - name: agentic-judge
    config:
      assertions:
        quality: "Code follows best practices. Score 0-1."
```

## Results Analysis Best Practices
### Export Early
Start building history from day one:
```yaml
post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./history/evaluations.jsonl
      append: true
```

### Consistent Test Cases
Keep test cases stable for trend analysis (see the sketch after this list):
- Same repository and branch
- Same prompts
- Same evaluators
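A pinned suite makes all three explicit. A minimal sketch, assuming hypothetical `repository` and `branch` keys for illustration; the agent and evaluator settings reuse configurations shown elsewhere in this guide:

```yaml
# Sketch of a stable test case; repository/branch keys are assumptions
name: auth-jwt-implementation
repository: https://github.com/example/api  # hypothetical: same repository every run
branch: main                                # hypothetical: same branch every run
agent:
  type: copilot-cli
  config:
    prompt_file: ./prompts/add-auth.md      # same prompt every run
evaluators:                                 # same evaluators every run
  - name: git-diff
  - name: expected-diff
    config:
      threshold: 0.85
```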
### Alert on Regressions
Set up automated checks:
```bash
#!/bin/bash
PREV=$(tail -n 2 history.jsonl | head -n 1 | jq '.summary.passed')
CURR=$(tail -n 1 history.jsonl | jq '.summary.passed')

if [ "$CURR" -lt "$PREV" ]; then
  echo "⚠️ REGRESSION: $PREV → $CURR passed"
  # Send alert
fi
```

## Prompt Engineering Best Practices
### Be Specific
Include context and requirements:
```markdown
# Add JWT Authentication

## Context
This is a Node.js Express API using TypeScript.
Authentication should use the jsonwebtoken package.

## Requirements
1. Add POST /api/auth/login endpoint
2. Generate JWT with 24h expiry
3. Add auth middleware for protected routes
4. Return proper HTTP status codes

## Constraints
- Use environment variable JWT_SECRET
- Do not modify existing endpoints
```

### Use Prompt Files
Keep prompts separate for clarity:
```yaml
agent:
  type: copilot-cli
  config:
    prompt_file: ./prompts/add-auth.md
```

## CI/CD Best Practices
### Run on PRs
Catch issues before merge:
```yaml
on:
  pull_request:
    branches: [main]
```
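A full job wires this trigger to a checkout and a suite run. A minimal sketch, assuming the CLI installs via npm under a hypothetical `youbencha` package name:

```yaml
# Sketch of a complete workflow; the install step is an assumption
name: youbencha-eval
on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g youbencha  # hypothetical package name
      - run: yb run -c suites/auth-feature.yaml
```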
### Fail on Regression

Gate merges on evaluation:
```yaml
- name: Check Results
  run: |
    STATUS=$(jq -r '.summary.overall_status' results.json)
    [ "$STATUS" = "passed" ] || exit 1
```

### Archive Artifacts
Keep results for debugging:
```yaml
- uses: actions/upload-artifact@v4
  with:
    name: evaluation-results
    path: .youbencha-workspace/
```

## Performance Best Practices
### Use Fast Models for Iteration
During development:
```yaml
agent:
  model: claude-haiku-4.5  # Fast and cheap
```

For final evaluation:
```yaml
agent:
  model: claude-sonnet-4.5  # Best quality
```

### Clean Up Workspaces
Don’t let workspaces accumulate:
```bash
# After successful runs
yb run -c suite.yaml --delete-workspace
```
```bash
# Or periodic cleanup
find .youbencha-workspace -maxdepth 1 -mtime +7 -exec rm -rf {} \;
```

### Set Appropriate Timeouts
Match timeouts to task complexity:
```yaml
timeout: 300000  # 5 minutes for simple tasks
timeout: 600000  # 10 minutes for complex features
```

## Debugging Best Practices
### Validate Before Running
Catch config errors early:
```bash
yb validate -c suite.yaml -v && yb run -c suite.yaml
```

### Inspect Workspaces
When evaluations fail:
```bash
# View agent output
cat .youbencha-workspace/run-*/artifacts/youbencha.log.json

# Check the diff
cat .youbencha-workspace/run-*/artifacts/git-diff.patch

# Review modified code
ls -la .youbencha-workspace/run-*/src-modified/
```

### Test Incrementally
Start with passing tests, then add complexity (the sketch after this list shows the final stage):
1. Simple task → verify agent works
2. Add git-diff → verify scope limits
3. Add agentic-judge → verify quality checks
4. Add expected-diff → verify correctness
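Each step layers one evaluator onto the last. A sketch of the suite at step 4, built entirely from configurations shown earlier in this guide:

```yaml
evaluators:
  # Step 2: scope limits
  - name: git-diff
    config:
      assertions:
        max_files_changed: 10

  # Step 3: quality checks
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        quality: "Code follows best practices. Score 0-1."

  # Step 4: correctness
  - name: expected-diff
    config:
      threshold: 0.85
```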
## Summary
| Category | Key Practice |
|---|---|
| Configuration | Start simple, be specific, version control |
| Evaluators | Multiple focused judges, combine types |
| Results | Export early, consistent tests, alert on regression |
| Prompts | Be specific, use prompt files |
| CI/CD | Run on PRs, fail on regression, archive artifacts |
| Performance | Fast models for dev, clean up workspaces |
| Debugging | Validate first, inspect workspaces, test incrementally |