Advanced Features

Unlock the full power of youBencha with advanced configuration options, custom evaluators, and pipeline extensions.

Workspace & Execution Control

Workspace Isolation

All operations occur in isolated workspaces that never mutate your original repository. Each run creates a timestamped directory with complete execution context.

.youbencha-workspace/
└── run-{timestamp}-{hash}/
    ├── src-modified/              # Code after agent execution
    ├── src-expected/              # Reference code (if configured)
    ├── artifacts/
    │   ├── results.json           # Machine-readable results
    │   ├── report.md              # Human-readable report
    │   ├── youbencha.log.json     # Agent execution log
    │   └── git-diff.patch         # Git diff output
    └── .youbencha.lock            # Workspace metadata

Key Features

  • Isolated workspace per evaluation
  • Timestamped directories for easy tracking
  • Complete execution logs in youBencha Log format
  • Option to preserve or delete workspaces after completion
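
To inspect a finished run from the command line, a minimal sketch assuming the workspace layout shown above (and that jq is installed):

```shell
# Find the most recent run directory and print its summary
# (assumes the .youbencha-workspace/ layout shown above; requires jq)
latest=$(ls -dt .youbencha-workspace/run-*/ 2>/dev/null | head -n 1)
if [ -n "$latest" ]; then
  jq '.summary' "${latest}artifacts/results.json"
fi
```

Sorting by modification time (`ls -dt`) picks the newest run without parsing the timestamp out of the directory name.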

Model Selection

Specify which AI model to use for agent execution and evaluation. Compare performance across different models with identical configurations.

agent:
  type: copilot-cli
  model: claude-sonnet-4.5  # Or gpt-5, gemini-3-pro-preview

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      model: gpt-5.1  # Use different model for evaluation

Key Features

  • Supported models: Claude (Sonnet 4.5, 4, Haiku 4.5), GPT (5, 5.1, 5.1 Codex variants), Gemini (3 Pro Preview)
  • Independent model selection for agent and evaluators
  • Track token usage and costs per model
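
To compare runs, token usage can be read straight from the results bundle. A sketch: the `.execution.token_usage` path is used elsewhere in this guide, but the `.execution.model` field is an assumption to verify against your own results.json.

```shell
# Token usage for a run; the .execution.model field is an assumption,
# check your own results.json for the exact path.
if [ -f results.json ]; then
  jq '{model: .execution.model, tokens: .execution.token_usage}' results.json
fi
```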

Expected Reference Comparison

Compare agent output against a known-good reference implementation to measure similarity and correctness.

repo: https://github.com/example/repo.git
branch: main
expected_source: branch      # Or: commit
expected: feature/completed  # Reference branch/commit

evaluators:
  - name: expected-diff
    config:
      threshold: 0.85  # Require 85% similarity to pass

Key Features

  • Compare against branch or specific commit
  • Configurable similarity threshold (0-1 scale)
  • Recommended thresholds: 0.95+ for generated files, 0.80-0.90 for implementation, 0.70-0.80 for creative tasks
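
To check how close a run came to its threshold, the similarity score can be pulled from the results bundle; a sketch, assuming evaluator results appear under an `evaluators` array keyed by name (inspect your results.json to confirm the exact structure):

```shell
# Read the expected-diff similarity score from a results bundle
# (the .evaluators[] path is an assumption; confirm against your bundle)
if [ -f results.json ]; then
  jq '.evaluators[] | select(.name == "expected-diff") | .score' results.json
fi
```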

Custom Evaluators

Multiple Agentic Judges

Break evaluation into focused areas with specialized judges. Each judge evaluates one to three related assertions, and judges run in parallel for performance.

evaluators:
  # Judge 1: Error Handling
  - name: agentic-judge-error-handling
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        has_try_catch: "Includes try-catch blocks. Score 1 if present, 0 if absent."
        proper_logging: "Errors are logged appropriately. Score 1 if yes, 0 if no."
  
  # Judge 2: Documentation
  - name: agentic-judge-documentation
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        functions_documented: "Functions have JSDoc. Score 1 if all, 0 if none."
        readme_updated: "README.md reflects changes. Score 1 if yes, 0 if no."

Key Features

  • Parallel execution for fast results
  • Independent pass/fail status per judge
  • Cleaner results with focused evaluations
  • Naming conventions: 'agentic-judge-<focus-area>' or 'agentic-judge:<focus-area>'

Reusable Evaluator Definitions

Define evaluators in separate files for reuse across multiple test cases and team sharing.

evaluators/test-coverage.yaml
# evaluators/test-coverage.yaml
name: agentic-judge:test-coverage
description: "Ensures code changes include appropriate test coverage"

config:
  type: copilot-cli
  agent_name: agentic-judge
  assertions:
    tests_added: "New tests were added. Score 1 if yes, 0 if no."
    coverage_adequate: "Test coverage is adequate. Score 1 if ≥80%, 0.5 if 60-80%, 0 if <60%."

# Reference in test case:
# evaluators:
#   - file: ./evaluators/test-coverage.yaml

Key Features

  • DRY principle for evaluator configurations
  • Easy team sharing and version control
  • Mix inline and file-based definitions
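
For example, a test case can combine an inline evaluator with one loaded from a file:

```yaml
evaluators:
  # Inline definition
  - name: git-diff
    config:
      assertions:
        max_files_changed: 5
  # Reusable definition loaded from a file
  - file: ./evaluators/test-coverage.yaml
```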

git-diff Assertions

Configure assertion-based pass/fail thresholds for code change metrics including scope, size, and entropy.

evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 5        # Max files modified
        max_lines_added: 100        # Max lines added
        max_lines_removed: 50       # Max lines removed
        max_total_changes: 150      # Max total changes
        min_change_entropy: 0.5     # Min entropy (distributed)
        max_change_entropy: 2.0     # Max entropy (focused)

Key Features

  • Enforce coding standards automatically
  • Prevent overly broad or narrow changes
  • Track per-file change statistics
  • Entropy measures distribution of changes across files
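
Change entropy here is Shannon entropy over each file's share of the changed lines: low values mean changes concentrate in a few files, high values mean they spread evenly. A sketch of the calculation on `git diff --numstat` style input (the exact formula youBencha applies is an assumption):

```shell
# Shannon entropy (bits) of changes across files, from numstat-style
# "added removed filename" lines. Concentrated changes -> low entropy.
printf '80\t10\tsrc/auth.js\n5\t2\tREADME.md\n3\t0\tsrc/index.js\n' |
awk '{ changes[$3] = $1 + $2; total += $1 + $2 }
     END { for (f in changes) { p = changes[f] / total; h -= p * log(p) / log(2) }
           printf "%.3f\n", h }'
# -> 0.557 (most changes sit in src/auth.js, so entropy is low)
```

With changes spread evenly across the three files the same pipeline would report about 1.585 bits (log2 of 3), which is what `max_change_entropy` guards against.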

Pipeline Extensions

Pre-Execution Hooks

Run custom scripts after workspace setup but before agent execution. Perfect for environment setup, code generation, or data injection.

pre_execution:
  - name: script
    config:
      command: bash
      args:
        - "-c"
        - |
          mkdir -p ${WORKSPACE_DIR}/config
          cat > ${WORKSPACE_DIR}/config/auth.json << EOF
          {"jwtSecret": "test-key"}
          EOF
      timeout_ms: 30000

Key Features

  • Available env vars: WORKSPACE_DIR, REPO_DIR, ARTIFACTS_DIR, TEST_CASE_NAME, REPO_URL, BRANCH
  • Failed pre-execution stops entire evaluation
  • Use cases: environment variables, search/replace, mock data, file setup

Database Export

Export results to structured storage for time-series analysis and trend tracking.

post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./results-history.jsonl
      include_full_bundle: true
      append: true

Key Features

  • JSONL format for easy querying with jq or importing to analytics platforms
  • Full result bundle or summary only
  • Append mode for historical tracking
  • Future: PostgreSQL, MongoDB support
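
With append mode on, the history file holds one JSON object per line, which jq consumes directly; a sketch:

```shell
# Pass counts per run, one line each, from the JSONL history file
if [ -f results-history.jsonl ]; then
  jq '.summary.passed' results-history.jsonl
fi
```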

Webhook Notifications

POST evaluation results to HTTP endpoints for real-time team notifications and integrations.

post_evaluation:
  - name: webhook
    config:
      url: ${SLACK_WEBHOOK_URL}
      method: POST
      headers:
        Content-Type: "application/json"
      retry_on_failure: true

Key Features

  • Slack, Teams, Discord, or custom endpoints
  • Environment variable substitution for secrets
  • Automatic retry on failure
  • Never fails main evaluation

Custom Scripts

Execute custom scripts for advanced analysis, cleanup, or integration workflows.

post_evaluation:
  - name: script
    config:
      command: ./scripts/analyze.sh
      args: ["${RESULTS_PATH}"]
      env:
        SLACK_WEBHOOK: "${SLACK_WEBHOOK_URL}"
        ANALYSIS_MODE: "detailed"

Key Features

  • Run any custom analysis or integration
  • Access to results via RESULTS_PATH environment variable
  • Parallel execution with other post-evaluators
  • Read-only access to results (scripts cannot alter evaluation outcomes)
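
A hypothetical analyze.sh sketch to pair with the configuration above: it reads the bundle passed in as `$1` (wired via `args: ["${RESULTS_PATH}"]`) and notifies on failure. The field paths and the Slack payload shape are assumptions, not part of youBencha itself.

```shell
#!/usr/bin/env bash
# Hypothetical post-evaluation script; receives the results path as $1.
set -euo pipefail

results="${1:-${RESULTS_PATH:-results.json}}"
if [ -f "$results" ]; then
  status=$(jq -r '.summary.overall_status' "$results")
  echo "youBencha run finished: ${status}"
  # Notify only on failure, and only if a webhook is configured
  if [ "$status" != "passed" ] && [ -n "${SLACK_WEBHOOK:-}" ]; then
    curl -fsS -X POST -H 'Content-Type: application/json' \
      -d "{\"text\":\"youBencha run ${status}\"}" "$SLACK_WEBHOOK"
  fi
fi
```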

Advanced Configuration

External Prompt Files

Load prompts from external files for better organization and version control of long, complex instructions.

agent:
  type: copilot-cli
  config:
    prompt_file: ./prompts/add-authentication.md

evaluators:
  - name: agentic-judge
    config:
      prompt_file: ./prompts/strict-evaluation.txt
      assertions:
        # ... assertions here

Key Features

  • Separate concerns: configuration vs. instructions
  • Easy prompt version control and diffing
  • Cannot specify both 'prompt' and 'prompt_file'
  • Supports markdown or plain text files

Named Agents

Use custom agent configurations from .github/agents/ directory for specialized evaluation scenarios.

agent:
  type: copilot-cli
  agent_name: my-custom-agent  # Uses .github/agents/my-custom-agent/
  config:
    prompt: "Your task description"

Key Features

  • .github/agents/ directory automatically copied to workspace
  • Agent invoked with --agent <name> flag
  • Agent-specific instructions and configurations
  • Perfect for specialized evaluation scenarios

Timeout Control

Configure operation timeouts for long-running tasks or quick feedback loops.

# Test case level
timeout: 600000  # 10 minutes (default: 5 minutes)

# Pre-execution level
pre_execution:
  - name: script
    config:
      command: ./long-setup.sh
      timeout_ms: 120000  # 2 minutes

Key Features

  • Global timeout for entire evaluation
  • Per-hook timeout for pre-execution and post-evaluation
  • Prevents hanging on problematic tasks
  • Millisecond precision
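
The same `timeout_ms` key applies to post-evaluation hooks, for example:

```yaml
post_evaluation:
  - name: script
    config:
      command: ./scripts/analyze.sh
      timeout_ms: 60000  # 1 minute
```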

Results Analysis

Results Bundle Schema

Complete evaluation artifact with test case metadata, execution details, evaluator results, and summary.

# View results summary
jq '.summary' results.json

# Calculate pass rate across suite
jq -s 'map(.summary.overall_status == "passed") |
       (map(select(.)) | length) / length' results/*.json

# Track metrics over time
jq '.metrics' results.json

Key Features

  • Machine-readable JSON format
  • Includes all evaluator outputs and metrics
  • Summary with overall status and counts
  • Supports programmatic analysis and automation

Time-Series & Regression Detection

Track performance over time to detect regressions, optimize costs, and ensure consistent quality.

# Detect regression
PREV=$(tail -n 2 history.jsonl | head -n 1 | jq '.summary.passed')
CURR=$(tail -n 1 history.jsonl | jq '.summary.passed')
if [ "$CURR" -lt "$PREV" ]; then
    echo "⚠️ REGRESSION DETECTED"
    exit 1
fi

# Track cost trends
jq '.execution.token_usage' history.jsonl

Key Features

  • Export to JSONL for historical tracking
  • Compare metrics across runs
  • Alert on quality degradation
  • Budget forecasting based on trends

Multiple Report Formats

Generate reports in different formats for various audiences and use cases.

# Markdown report (default)
yb report --from results.json --format markdown

# JSON report for programmatic use
yb report --from results.json --format json --output report.json

Key Features

  • Markdown for human-readable documentation
  • JSON for integration with other tools
  • Custom output paths
  • Includes all metrics and evaluator results