youBencha
Confident developer benchmarking AI coding agents

Evaluate AI Coding Agents with Confidence

A developer-first CLI framework for testing AI-powered coding tools. Run agents, measure their output, and get objective insights—all through a simple command-line interface.

Get Started in 3 Steps

From zero to first evaluation in under 5 minutes

1. Install youBencha

Install globally via npm, or use npx for one-time runs (sketched below).

npm install -g youbencha

# Verify installation
yb --version
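
If you would rather not install globally, the npx route mentioned above looks roughly like this sketch (it assumes the youbencha package exposes the yb binary):

# One-off invocation without a global install
npx --package=youbencha yb --version
npx --package=youbencha yb run -c suite.yaml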

2. Initialize Configuration

Create a starter configuration file with helpful comments, then customize it to your needs.

# Create suite.yaml with helpful comments
yb init
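
For orientation, a trimmed suite.yaml fragment might look like the sketch below. It only reuses the agent and post_evaluation blocks shown elsewhere on this page; the commented template generated by yb init remains the authoritative reference for the full schema (task and evaluator sections are omitted here).

# Illustrative fragment only; `yb init` generates the full commented template
agent:
  type: copilot-cli
  model: claude-sonnet-4.5   # or gpt-5, gemini-3-pro-preview
post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./results-history.jsonl
      append: true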

3. Run Evaluation

Execute the evaluation suite and generate a comprehensive report.

yb run -c suite.yaml
yb report --from .youbencha-workspace/run-*/artifacts/results.json
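
If a run fails right away, it can help to check the configuration on its own first, using the validate command from the quick reference below:

# Catch configuration errors before a full run
yb validate -c suite.yaml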

Requirements

  • Node.js 20+ (check with: node --version)
  • Git installed (check with: git --version)
  • An AI coding agent (for example GitHub Copilot CLI: npm install -g @githubnext/github-copilot-cli)
  • No additional authentication required; evaluations run in your local environment context
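
To confirm the prerequisites in one pass, the checks above can be run together:

# Quick prerequisite check
node --version   # expect v20 or newer
git --version
yb --version     # available after the install step above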

Quick Reference

Essential commands for working with youBencha

CLI Commands

yb init [--force]
  Create a starter configuration file with helpful comments.

yb run -c suite.yaml [--delete-workspace]
  Execute an evaluation suite against an AI agent.

yb report --from results.json [--format markdown|json]
  Generate a human-readable report from evaluation results.

yb validate -c suite.yaml [-v]
  Validate a suite configuration without running the evaluation.

yb list
  List the available built-in evaluators and their descriptions.

yb suggest-suite --agent copilot-cli --output-dir ./output
  Generate evaluation suite suggestions using AI.

Why youBencha?

Objective evaluation for AI coding agents

Agent-Agnostic Architecture

Pluggable adapter system works with any AI agent. Start with GitHub Copilot CLI today, switch to other agents tomorrow. No vendor lock-in.

Multi-Dimensional Evaluation

Evaluate beyond 'does it compile?' with git-diff for scope analysis, expected-diff for similarity scoring, and agentic-judge for AI-powered quality assessment.

Reproducible & Isolated

Safe, repeatable evaluations in isolated workspaces. Complete execution logs in youBencha Log format. Never modifies your working directory.
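
The result paths used throughout this page imply a per-run layout roughly like the following sketch; any name not appearing in those paths is an assumption:

.youbencha-workspace/
  run-<id>/              # one isolated workspace per evaluation run ("run-*" in the examples)
    artifacts/
      results.json       # input for `yb report --from ...`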

Developer-Friendly CLI

Simple YAML configuration, one-command execution (yb run), and comprehensive reporting. Built by developers, for developers.

Extensible Pipeline

Pre-execution hooks for environment setup, custom evaluators for domain-specific criteria, and post-evaluation hooks for database export and notifications.
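
As a rough sketch of where those hooks sit in a suite file, the fragment below reuses the post_evaluation export block shown further down this page; the pre_execution key and its setup hook are illustrative assumptions, not documented names.

# Hook placement sketch; only the post_evaluation/database block is taken from the examples below
pre_execution:           # assumed key name for the pre-execution hooks described above
  - name: setup          # hypothetical hook that prepares the environment
post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./results-history.jsonl
      append: true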

Open source under the MIT License, with a growing and active community.

Use Cases

Real-world scenarios where youBencha delivers value

Rapid Prompt Iteration

AI engineers get immediate feedback while iterating on prompts: debug agent failures with full context (logs, diffs, metrics), refine agent configurations quickly, and track token usage and costs per evaluation.

# Edit prompt, run evaluation, review results, iterate
yb run -c suite.yaml
yb report --from .youbencha-workspace/run-*/artifacts/results.json

Objective Agent Comparison

Development teams can compare different models (GPT-5, Claude Sonnet, Gemini) on identical tasks, identify which tasks are hardest for agents, aggregate metrics across test suites, and make data-driven tool decisions.

# Compare models on same task
agent:
  type: copilot-cli
  model: claude-sonnet-4.5  # Or gpt-5, gemini-3-pro-preview
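
One way to run such a comparison is to keep one suite file per model and evaluate them back to back (the file names here are hypothetical):

# Run the same task once per model, then compare the reports
for suite in suites/gpt-5.yaml suites/claude-sonnet-4.5.yaml suites/gemini-3-pro-preview.yaml; do
  yb run -c "$suite"
done
yb report --from .youbencha-workspace/run-*/artifacts/results.json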

Regression Detection & Trend Analysis

Organizations can track performance across model and prompt updates, detect quality degradation early, and keep costs and ROI in view. Accumulated results support long-term performance benchmarking and data-driven decisions about tooling.

# Export results for time-series analysis
post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./results-history.jsonl
      append: true
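
Because the exported history is JSON Lines (one object per run), standard tools can summarize it. Assuming each appended line mirrors results.json, jq can total failures across the recorded runs using the .summary.failed field from the CI example below:

# Total failures across all recorded runs in the exported history
jq -s 'map(.summary.failed) | add' ./results-history.jsonl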

Cross-Test Comparison

Standardize agent assessment across your team with shared benchmark configs: identify the hardest tasks, recognize common failure modes, and aggregate metrics (pass rate, similarity scores, costs) across runs.
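
Assuming the shared benchmark configs live together in one directory (the layout here is hypothetical), the whole set can be validated and run in order:

# Validate and run every shared benchmark suite
for suite in benchmarks/*.yaml; do
  yb validate -c "$suite" && yb run -c "$suite"
done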

CI/CD Integration

Run benchmarks in CI/CD pipelines to catch agent regressions before deployment, and automate quality checks for AI-generated code with webhook notifications to Slack, Teams, or custom integrations.

# GitHub Actions integration
- run: npm install -g youbencha
- run: yb run -c suite.yaml
- run: |
    FAILED=$(jq '.summary.failed' results.json)
    [ "$FAILED" -eq 0 ] || exit 1