Advanced Features

Unlock the full power of youBencha with advanced configuration options, custom evaluators, and pipeline extensions.

Workspace & Execution Control

Workspace Isolation

All operations occur in isolated workspaces that never mutate your original repository. Each run creates a timestamped directory with complete execution context.

.youbencha-workspace/
└── run-{timestamp}-{hash}/
    ├── src-modified/              # Code after agent execution
    ├── src-expected/              # Reference code (if configured)
    ├── artifacts/
    │   ├── results.json           # Machine-readable results
    │   ├── report.md              # Human-readable report
    │   ├── youbencha.log.json     # Agent execution log
    │   └── git-diff.patch         # Git diff output
    └── .youbencha.lock            # Workspace metadata

Key Features

  • Isolated workspace per evaluation
  • Timestamped directories for easy tracking
  • Complete execution logs in youBencha Log format
  • Option to preserve or delete workspaces after completion
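
To inspect a finished run from the command line, a minimal sketch assuming the workspace layout shown above (and that jq is installed):

```shell
# Find the most recent run directory and print its summary
# (assumes the .youbencha-workspace/ layout shown above; requires jq)
latest=$(ls -dt .youbencha-workspace/run-*/ 2>/dev/null | head -n 1)
if [ -n "$latest" ]; then
  jq '.summary' "${latest}artifacts/results.json"
fi
```

Sorting by modification time (`ls -dt`) picks the newest run without parsing the timestamp out of the directory name.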

Model Selection

Specify which AI model to use for agent execution and evaluation. Compare performance across different models with identical configurations.

agent:
  type: copilot-cli
  model: claude-sonnet-4.5  # Or gpt-5, gemini-3-pro-preview

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      model: gpt-5.1  # Use different model for evaluation

Key Features

  • Supported models: Claude (Sonnet 4.5, 4, Haiku 4.5), GPT (5, 5.1, 5.1 Codex variants), Gemini (3 Pro Preview)
  • Independent model selection for agent and evaluators
  • Track token usage and costs per model
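
To compare runs, token usage can be read straight from the results bundle. A sketch: the `.execution.token_usage` path is used elsewhere in this guide, but the `.execution.model` field is an assumption to verify against your own results.json.

```shell
# Token usage for a run; the .execution.model field is an assumption,
# check your own results.json for the exact path.
if [ -f results.json ]; then
  jq '{model: .execution.model, tokens: .execution.token_usage}' results.json
fi
```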

Expected Reference Comparison

Compare agent output against a known-good reference implementation to measure similarity and correctness.

repo: https://github.com/example/repo.git
branch: main
expected_source: branch      # Or: commit
expected: feature/completed  # Reference branch/commit

evaluators:
  - name: expected-diff
    config:
      threshold: 0.85  # Require 85% similarity to pass

Key Features

  • Compare against branch or specific commit
  • Configurable similarity threshold (0-1 scale)
  • Recommended thresholds: 0.95+ for generated files, 0.80-0.90 for implementation, 0.70-0.80 for creative tasks
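
To check how close a run came to its threshold, the similarity score can be pulled from the results bundle; a sketch, assuming evaluator results appear under an `evaluators` array keyed by name (inspect your results.json to confirm the exact structure):

```shell
# Read the expected-diff similarity score from a results bundle
# (the .evaluators[] path is an assumption; confirm against your bundle)
if [ -f results.json ]; then
  jq '.evaluators[] | select(.name == "expected-diff") | .score' results.json
fi
```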

Custom Evaluators

Multiple Agentic Judges

Break evaluation into focused areas with specialized judges. Each judge evaluates one to three related assertions, and judges run in parallel for performance.

evaluators:
  # Judge 1: Error Handling
  - name: agentic-judge-error-handling
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        has_try_catch: "Includes try-catch blocks. Score 1 if present, 0 if absent."
        proper_logging: "Errors are logged appropriately. Score 1 if yes, 0 if no."
  
  # Judge 2: Documentation
  - name: agentic-judge-documentation
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        functions_documented: "Functions have JSDoc. Score 1 if all, 0 if none."
        readme_updated: "README.md reflects changes. Score 1 if yes, 0 if no."

Key Features

  • Parallel execution for fast results
  • Independent pass/fail status per judge
  • Cleaner results with focused evaluations
  • Naming conventions: 'agentic-judge-<focus-area>' or 'agentic-judge:<focus-area>'

Reusable Evaluator Definitions

Define evaluators in separate files for reuse across multiple test cases and team sharing.

evaluators/test-coverage.yaml
# evaluators/test-coverage.yaml
name: agentic-judge:test-coverage
description: "Ensures code changes include appropriate test coverage"

config:
  type: copilot-cli
  agent_name: agentic-judge
  assertions:
    tests_added: "New tests were added. Score 1 if yes, 0 if no."
    coverage_adequate: "Test coverage is adequate. Score 1 if ≥80%, 0.5 if 60-80%, 0 if <60%."

# Reference in test case:
# evaluators:
#   - file: ./evaluators/test-coverage.yaml

Key Features

  • DRY principle for evaluator configurations
  • Easy team sharing and version control
  • Mix inline and file-based definitions
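
For example, a test case can combine an inline evaluator with one loaded from a file:

```yaml
evaluators:
  # Inline definition
  - name: git-diff
    config:
      assertions:
        max_files_changed: 5
  # Reusable definition loaded from a file
  - file: ./evaluators/test-coverage.yaml
```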

git-diff Assertions

Configure assertion-based pass/fail thresholds for code change metrics including scope, size, and entropy.

evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 5        # Max files modified
        max_lines_added: 100        # Max lines added
        max_lines_removed: 50       # Max lines removed
        max_total_changes: 150      # Max total changes
        min_change_entropy: 0.5     # Min entropy (distributed)
        max_change_entropy: 2.0     # Max entropy (focused)

Key Features

  • Enforce coding standards automatically
  • Prevent overly broad or narrow changes
  • Track per-file change statistics
  • Entropy measures distribution of changes across files
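
Change entropy here is Shannon entropy over each file's share of the changed lines: low values mean changes concentrate in a few files, high values mean they spread evenly. A sketch of the calculation on `git diff --numstat` style input (the exact formula youBencha applies is an assumption):

```shell
# Shannon entropy (bits) of changes across files, from numstat-style
# "added removed filename" lines. Concentrated changes -> low entropy.
printf '80\t10\tsrc/auth.js\n5\t2\tREADME.md\n3\t0\tsrc/index.js\n' |
awk '{ changes[$3] = $1 + $2; total += $1 + $2 }
     END { for (f in changes) { p = changes[f] / total; h -= p * log(p) / log(2) }
           printf "%.3f\n", h }'
# -> 0.557 (most changes sit in src/auth.js, so entropy is low)
```

With changes spread evenly across the three files the same pipeline would report about 1.585 bits (log2 of 3), which is what `max_change_entropy` guards against.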

Pipeline Extensions

Pre-Execution Hooks

Run custom scripts after workspace setup but before agent execution. Perfect for environment setup, code generation, or data injection.

pre_execution:
  - name: script
    config:
      command: bash
      args:
        - "-c"
        - |
          mkdir -p ${WORKSPACE_DIR}/config
          cat > ${WORKSPACE_DIR}/config/auth.json << EOF
          {"jwtSecret": "test-key"}
          EOF
      timeout_ms: 30000

Key Features

  • Available env vars: WORKSPACE_DIR, REPO_DIR, ARTIFACTS_DIR, TEST_CASE_NAME, REPO_URL, BRANCH
  • Failed pre-execution stops entire evaluation
  • Use cases: environment variables, search/replace, mock data, file setup

Database Export

Export results to structured storage for time-series analysis and trend tracking.

post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./results-history.jsonl
      include_full_bundle: true
      append: true

Key Features

  • JSONL format for easy querying with jq or importing to analytics platforms
  • Full result bundle or summary only
  • Append mode for historical tracking
  • Future: PostgreSQL, MongoDB support
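
With append mode on, the history file holds one JSON object per line, which jq consumes directly; a sketch:

```shell
# Pass counts per run, one line each, from the JSONL history file
if [ -f results-history.jsonl ]; then
  jq '.summary.passed' results-history.jsonl
fi
```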

Webhook Notifications

POST evaluation results to HTTP endpoints for real-time team notifications and integrations.

post_evaluation:
  - name: webhook
    config:
      url: ${SLACK_WEBHOOK_URL}
      method: POST
      headers:
        Content-Type: "application/json"
      retry_on_failure: true

Key Features

  • Slack, Teams, Discord, or custom endpoints
  • Environment variable substitution for secrets
  • Automatic retry on failure
  • Never fails main evaluation

Custom Scripts

Execute custom scripts for advanced analysis, cleanup, or integration workflows.

post_evaluation:
  - name: script
    config:
      command: ./scripts/analyze.sh
      args: ["${RESULTS_PATH}"]
      env:
        SLACK_WEBHOOK: "${SLACK_WEBHOOK_URL}"
        ANALYSIS_MODE: "detailed"

Key Features

  • Run any custom analysis or integration
  • Access to results via RESULTS_PATH environment variable
  • Parallel execution with other post-evaluators
  • Read-only access to results (scripts cannot alter evaluation outcomes)
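
A hypothetical analyze.sh sketch to pair with the configuration above: it reads the bundle passed in as `$1` (wired via `args: ["${RESULTS_PATH}"]`) and notifies on failure. The field paths and the Slack payload shape are assumptions, not part of youBencha itself.

```shell
#!/usr/bin/env bash
# Hypothetical post-evaluation script; receives the results path as $1.
set -euo pipefail

results="${1:-${RESULTS_PATH:-results.json}}"
if [ -f "$results" ]; then
  status=$(jq -r '.summary.overall_status' "$results")
  echo "youBencha run finished: ${status}"
  # Notify only on failure, and only if a webhook is configured
  if [ "$status" != "passed" ] && [ -n "${SLACK_WEBHOOK:-}" ]; then
    curl -fsS -X POST -H 'Content-Type: application/json' \
      -d "{\"text\":\"youBencha run ${status}\"}" "$SLACK_WEBHOOK"
  fi
fi
```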

Advanced Configuration

External Prompt Files

Load prompts from external files for better organization and version control of long, complex instructions.

agent:
  type: copilot-cli
  config:
    prompt_file: ./prompts/add-authentication.md

evaluators:
  - name: agentic-judge
    config:
      prompt_file: ./prompts/strict-evaluation.txt
      assertions:
        # ... assertions here

Key Features

  • Separate concerns: configuration vs. instructions
  • Easy prompt version control and diffing
  • Cannot specify both 'prompt' and 'prompt_file'
  • Supports markdown or plain text files

Named Agents

Use custom agent configurations from .github/agents/ directory for specialized evaluation scenarios.

agent:
  type: copilot-cli
  agent_name: my-custom-agent  # Uses .github/agents/my-custom-agent/
  config:
    prompt: "Your task description"

Key Features

  • .github/agents/ directory automatically copied to workspace
  • Agent invoked with --agent <name> flag
  • Agent-specific instructions and configurations
  • Perfect for specialized evaluation scenarios

Timeout Control

Configure operation timeouts for long-running tasks or quick feedback loops.

# Test case level
timeout: 600000  # 10 minutes (default: 5 minutes)

# Pre-execution level
pre_execution:
  - name: script
    config:
      command: ./long-setup.sh
      timeout_ms: 120000  # 2 minutes

Key Features

  • Global timeout for entire evaluation
  • Per-hook timeout for pre-execution and post-evaluation
  • Prevents hanging on problematic tasks
  • Millisecond precision
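
The same `timeout_ms` key applies to post-evaluation hooks, for example:

```yaml
post_evaluation:
  - name: script
    config:
      command: ./scripts/analyze.sh
      timeout_ms: 60000  # 1 minute
```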

Results Analysis

Results Bundle Schema

Complete evaluation artifact with test case metadata, execution details, evaluator results, and summary.

# View results summary
jq '.summary' results.json

# Calculate pass rate across suite
jq -s 'map(.summary.overall_status == "passed") |
       (map(select(.)) | length) / length' results/*.json

# Track metrics over time
jq '.metrics' results.json

Key Features

  • Machine-readable JSON format
  • Includes all evaluator outputs and metrics
  • Summary with overall status and counts
  • Supports programmatic analysis and automation

Time-Series & Regression Detection

Track performance over time to detect regressions, optimize costs, and ensure consistent quality.

# Detect regression
PREV=$(tail -n 2 history.jsonl | head -n 1 | jq '.summary.passed')
CURR=$(tail -n 1 history.jsonl | jq '.summary.passed')
if [ "$CURR" -lt "$PREV" ]; then
    echo "⚠️ REGRESSION DETECTED"
    exit 1
fi

# Track cost trends
jq '.execution.token_usage' history.jsonl

Key Features

  • Export to JSONL for historical tracking
  • Compare metrics across runs
  • Alert on quality degradation
  • Budget forecasting based on trends

Multiple Report Formats

Generate reports in different formats for various audiences and use cases.

# Markdown report (default)
yb report --from results.json --format markdown

# JSON report for programmatic use
yb report --from results.json --format json --output report.json

Key Features

  • Markdown for human-readable documentation
  • JSON for integration with other tools
  • Custom output paths
  • Includes all metrics and evaluator results