
Best Practices

Tips and recommendations for getting the most out of youBencha evaluations.

Begin with basic evaluators and expand:

```yaml
# Start here
evaluators:
  - name: git-diff
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        task_completed: "Task was completed. Score 1 if yes, 0 if no."
```

Name your suites and assertions clearly:

```yaml
name: auth-jwt-implementation
description: Evaluate JWT authentication for login endpoint
evaluators:
  - name: agentic-judge
    config:
      assertions:
        jwt_tokens: "JWT tokens are properly generated. Score 0-1."
        token_expiry: "Tokens have appropriate expiry times. Score 0-1."
```

Vague assertions produce inconsistent scores:

```yaml
# ❌ Vague
assertions:
  good: "Code is good."

# ✅ Specific
assertions:
  error_handling: |
    All async operations have try-catch blocks.
    Error messages include context for debugging.
    Score 1 if complete, 0.5 if partial, 0 if missing.
```

Track configurations in git:

```
youbencha/
├── suites/
│   ├── auth-feature.yaml
│   ├── api-endpoints.yaml
│   └── refactoring.yaml
├── evaluators/
│   ├── security.yaml
│   └── testing.yaml
├── prompts/
│   ├── add-auth.md
│   └── add-tests.md
└── history/
    └── .gitkeep
```

Instead of one judge with many assertions:

```yaml
# ✅ Better: Multiple focused judges
evaluators:
  - name: agentic-judge-security
    config:
      type: copilot-cli
      assertions:
        input_validation: "Inputs are validated. Score 0-1."
        no_injection: "No injection vulnerabilities. Score 0-1."
  - name: agentic-judge-testing
    config:
      type: copilot-cli
      assertions:
        unit_tests: "Unit tests added. Score 0-1."
        edge_cases: "Edge cases covered. Score 0-1."
```

Match thresholds to task type:

| Task Type | expected-diff Threshold |
| --- | --- |
| Generated files | 0.95+ |
| Standard features | 0.80-0.90 |
| Creative tasks | 0.70-0.80 |
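
For example, a suite that checks generated files might pin the strictest band from the table. A minimal sketch using the expected-diff evaluator's threshold option (the exact value is a judgment call for your task):

```yaml
evaluators:
  - name: expected-diff
    config:
      threshold: 0.95 # generated files should match the expected output almost exactly
```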

Use multiple evaluation dimensions:

```yaml
evaluators:
  # Scope: How much changed
  - name: git-diff
    config:
      assertions:
        max_files_changed: 10
  # Correctness: Does it match expected
  - name: expected-diff
    config:
      threshold: 0.85
  # Quality: Is it well-written
  - name: agentic-judge
    config:
      assertions:
        quality: "Code follows best practices. Score 0-1."
```

Start building history from day one:

```yaml
post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./history/evaluations.jsonl
      append: true
```

Keep test cases stable for trend analysis (see the sketch after this list):

- Same repository and branch
- Same prompts
- Same evaluators
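
In practice, the suite file itself should change as little as possible between runs. A minimal sketch of what stays fixed, assuming the target repository and branch are pinned wherever your setup checks out the code:

```yaml
# Held constant across runs so history points stay comparable
name: auth-jwt-implementation
agent:
  type: copilot-cli
  config:
    prompt_file: ./prompts/add-auth.md # same prompt every run
evaluators:                            # same evaluator set every run
  - name: git-diff
  - name: expected-diff
    config:
      threshold: 0.85
```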

Set up automated checks:

```bash
#!/bin/bash
# Compare the two most recent runs in the evaluation history
PREV=$(tail -n 2 history.jsonl | head -n 1 | jq '.summary.passed')
CURR=$(tail -n 1 history.jsonl | jq '.summary.passed')
if [ "$CURR" -lt "$PREV" ]; then
  echo "⚠️ REGRESSION: $PREV → $CURR passed"
  # Send alert
fi
```

Include context and requirements:

`prompts/add-auth.md`:

```md
# Add JWT Authentication

## Context
This is a Node.js Express API using TypeScript.
Authentication should use the jsonwebtoken package.

## Requirements
1. Add POST /api/auth/login endpoint
2. Generate JWT with 24h expiry
3. Add auth middleware for protected routes
4. Return proper HTTP status codes

## Constraints
- Use environment variable JWT_SECRET
- Do not modify existing endpoints
```

Keep prompts separate for clarity:

```yaml
agent:
  type: copilot-cli
  config:
    prompt_file: ./prompts/add-auth.md
```

Catch issues before merge:

```yaml
on:
  pull_request:
    branches: [main]
```

Gate merges on evaluation:

```yaml
- name: Check Results
  run: |
    STATUS=$(jq -r '.summary.overall_status' results.json)
    [ "$STATUS" = "passed" ] || exit 1
```

Keep results for debugging:

```yaml
- uses: actions/upload-artifact@v4
  with:
    name: evaluation-results
    path: .youbencha-workspace/
```

During development:

```yaml
agent:
  model: claude-haiku-4.5 # Fast and cheap
```

For final evaluation:

```yaml
agent:
  model: claude-sonnet-4.5 # Best quality
```

Don’t let workspaces accumulate:

```bash
# After successful runs
yb run -c suite.yaml --delete-workspace

# Or periodic cleanup
find .youbencha-workspace -maxdepth 1 -mtime +7 -exec rm -rf {} \;
```

Match timeouts to task complexity:

```yaml
timeout: 300000 # 5 minutes for simple tasks
timeout: 600000 # 10 minutes for complex features
```

Catch config errors early:

```bash
yb validate -c suite.yaml -v && yb run -c suite.yaml
```

When evaluations fail:

```bash
# View agent output
cat .youbencha-workspace/run-*/artifacts/youbencha.log.json

# Check the diff
cat .youbencha-workspace/run-*/artifacts/git-diff.patch

# Review modified code
ls -la .youbencha-workspace/run-*/src-modified/
```

Start with passing tests, then add complexity (see the sketch after this list):

  1. Simple task → verify agent works
  2. Add git-diff → verify scope limits
  3. Add agentic-judge → verify quality checks
  4. Add expected-diff → verify correctness
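
Once each stage passes, the finished suite combines all of them. A sketch of the end state, assembled from the evaluator options shown earlier in this guide:

```yaml
# Step 1: the agent runs the task; steps 2-4 layer on evaluators
evaluators:
  - name: git-diff        # step 2: scope limits
    config:
      assertions:
        max_files_changed: 10
  - name: agentic-judge   # step 3: quality checks
    config:
      type: copilot-cli
      assertions:
        quality: "Code follows best practices. Score 0-1."
  - name: expected-diff   # step 4: correctness against a known-good diff
    config:
      threshold: 0.85
```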

| Category | Key Practice |
| --- | --- |
| Configuration | Start simple, be specific, version control |
| Evaluators | Multiple focused judges, combine types |
| Results | Export early, consistent tests, alert on regression |
| Prompts | Be specific, use prompt files |
| CI/CD | Run on PRs, fail on regression, archive artifacts |
| Performance | Fast models for dev, cleanup workspaces |