youBencha

Benchmark AI Coding Agents with Confidence

Stop guessing which agent works best. Get reproducible, objective measurements for GitHub Copilot CLI (coming soon: Claude Code, Codex, and more).

Get Started in 3 Steps

From zero to benchmark in under 5 minutes

1. Install youBencha

Install globally via npm. Requires Node.js 20+ and Git.

npm install -g @youbencha/youbencha-cli
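
To confirm the install, you can ask the CLI for its help text. The yb binary name appears in step 3; the --help flag is an assumption based on common CLI convention, not something documented on this page.

yb --help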
2. Create Config

Define your benchmark in a YAML file. Specify the repo, agent, and evaluators.

copilot-suite-general-agent.yml
repo: https://github.com/octocat/Hello-World.git
branch: master

agent:
  type: copilot-cli
  config:
    prompt: "Add a comment to README explaining what this repository is about"

evaluators:
  - name: git-diff
  - name: agentic-judge
    config:
      criteria:
        readme_modified: "README.md was modified. Score 1 if true, 0 if false."
        helpful_comment_added: "A helpful comment was added to README.md. Score 1 if true, 0 if false."
        grammatically_correct: "The comment added to README.md is grammatically correct. Score 1 if true, 0 if false."
3. Run Benchmark

Execute the benchmark and review the results in your terminal. The --keep-workspace flag keeps the cloned workspace on disk so you can inspect the agent's changes after the run.

yb run -c copilot-suite-general-agent.yml --keep-workspace

Requirements

  • Node.js 20+
  • Git installed
  • AI coding agent (GitHub Copilot CLI, Claude Code, Codex, etc.)
  • No additional auth setup needed; the CLI runs in your local context

Why youBencha?

Objective evaluation for AI coding agents

Agent-Agnostic

Works with GitHub Copilot CLI (Claude Code and Codex coming soon) and with any agent via custom adapters. No vendor lock-in.
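
The adapter interface itself is not documented on this page, so the snippet below is only an illustrative sketch of where a custom agent would plug into the step-2 config: the custom type and the command key are hypothetical placeholders, not documented values.

agent:
  type: custom                   # hypothetical adapter name
  config:
    command: ./run-my-agent.sh   # hypothetical: however your agent is invoked
    prompt: "Add a comment to README explaining what this repository is about"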

Flexible Evaluation

Three built-in evaluators: git-diff for code changes, expected-diff for reference comparisons, and agentic-judge for nuanced assessment.
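
Evaluators are listed under the evaluators key shown in step 2 and can be combined in a single run. The git-diff and agentic-judge entries below mirror that example; the expected-diff entry is only a guess at its shape (the reference key and path are hypothetical), included to show where a reference comparison would slot in.

evaluators:
  - name: git-diff           # report what the agent actually changed
  - name: expected-diff      # compare against a reference change
    config:
      reference: fixtures/expected-readme.diff   # hypothetical key and path
  - name: agentic-judge      # LLM-based scoring against natural-language criteria
    config:
      criteria:
        readme_modified: "README.md was modified. Score 1 if true, 0 if false."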

Reproducible Results

Version-controlled YAML configs and deterministic benchmarks eliminate guesswork and enable team collaboration.

Developer-Friendly CLI

Simple YAML configuration, one-command execution (yb run), and clear terminal output. Built by developers, for developers.

Automated Setup

Automatic adapter detection and evaluator suggestions based on your repo. Get started in minutes, not hours.

Open source under the MIT License, with a growing and active community on GitHub.

Use Cases

Real-world scenarios where youBencha delivers value

Prompt Engineering

Test prompt variations against real code tasks to find what works best for your domain. Compare results objectively and iterate quickly.

yb run --prompt-variant="detailed" --compare
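
If you prefer to keep variants in version control instead of on the command line, a simple alternative (file names here are just examples) is two copies of the step-2 config that differ only in agent.config.prompt, run back to back:

yb run -c prompt-terse.yml
yb run -c prompt-detailed.yml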

Model Comparison

Evaluate different models (GPT-4, Claude, Gemini) on identical tasks to inform model selection and cost optimization.

Regression Testing

Detect when agent updates degrade performance on critical code generation tasks. Catch regressions before they impact your workflow.

Team Evaluation

Standardize agent assessment across your team with shared benchmark configs. Make data-driven tool decisions together.
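
In practice, sharing a benchmark is just committing the config next to the code it exercises, so every teammate runs the identical definition:

git add copilot-suite-general-agent.yml
git commit -m "Add shared agent benchmark config"
git push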

CI Integration

Run benchmarks in CI/CD pipelines to catch agent regressions before deployment. Automate quality checks for AI-generated code.

# .github/workflows/benchmark.yml
- name: Run youBencha
  run: yb run --ci
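
A fuller workflow might look like the sketch below. The install command and the --ci flag come from this page; the trigger, runner, and checkout steps are ordinary GitHub Actions boilerplate, and the sketch assumes the agent under test is installed and authenticated in the CI environment.

# .github/workflows/benchmark.yml (expanded sketch)
name: agent-benchmark
on:
  schedule:
    - cron: "0 6 * * 1"    # weekly run to catch agent regressions early
  workflow_dispatch:
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g @youbencha/youbencha-cli
      - name: Run youBencha
        run: yb run -c copilot-suite-general-agent.yml --ci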