Stop guessing which agent works best. Get reproducible, objective measurements for GitHub Copilot CLI (Claude Code, Codex, and more coming soon).
From zero to benchmark in under 5 minutes
Install globally via npm. Requires Node.js 20+ and Git.
npm install -g @youbencha/youbencha-cli
Define your benchmark in a YAML file. Specify the repo, agent, and evaluators.
repo: https://github.com/octocat/Hello-World.git
branch: master
agent:
  type: copilot-cli
  config:
    prompt: "Add a comment to README explaining what this repository is about"
evaluators:
  - name: git-diff
  - name: agentic-judge
    config:
      criteria:
        readme_modified: "README.md was modified. Score 1 if true, 0 if false."
        helpful_comment_added: "A helpful comment was added to README.md. Score 1 if true, 0 if false."
        grammatically_correct: "The comment added to README.md is grammatically correct. Score 1 if true, 0 if false."
Execute the benchmark and review results in your terminal.
yb run -c examples/copilot-suite-general-agent.yml --keep-workspace
Objective evaluation for AI coding agents
Works with GitHub Copilot CLI (Claude Code and Codex coming soon) and any agent via custom adapters. No vendor lock-in.
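The custom-adapter schema isn't documented on this page, so the snippet below is only a hypothetical sketch built on the agent block shown above; the type value and the command option are invented placeholders, not real fields.

agent:
  type: my-custom-adapter                  # hypothetical adapter name, for illustration only
  config:
    command: "./scripts/run-my-agent.sh"   # invented option; real options depend on your adapter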
Three built-in evaluators: git-diff for code changes, expected-diff for reference comparisons, and agentic-judge for nuanced assessment.
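As a rough sketch of combining all three in one config (the expected-diff placement and the example criterion are illustrative, not documented defaults):

evaluators:
  - name: git-diff            # reports what the agent actually changed
  - name: expected-diff       # compares the result against a reference diff; see the docs for its options
  - name: agentic-judge
    config:
      criteria:
        change_is_minimal: "Only the files needed for the task were modified. Score 1 if true, 0 if false."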
Version-controlled YAML configs and deterministic benchmarks eliminate guesswork and enable team collaboration.
Simple YAML configuration, one-command execution (yb run), and clear terminal output. Built by developers, for developers.
Automatic adapter detection and evaluator suggestions based on your repo. Get started in minutes, not hours.
Real-world scenarios where youBencha delivers value
Test prompt variations against real code tasks to find what works best for your domain. Compare results objectively and iterate quickly.
yb run --prompt-variant="detailed" --compare
Evaluate different models (GPT-4, Claude, Gemini) on identical tasks to inform model selection and cost optimization.
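Another way to run either comparison is to keep two configs that differ only in the agent block or prompt and run each with the yb run -c invocation shown earlier; the file names below are illustrative.

# configs that differ only in agent.config.prompt (or the agent type)
yb run -c benchmarks/concise-prompt.yml
yb run -c benchmarks/detailed-prompt.yml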
Detect when agent updates degrade performance on critical code generation tasks. Catch regressions before they impact your workflow.
Standardize agent assessment across your team with shared benchmark configs. Make data-driven tool decisions together.
Run benchmarks in CI/CD pipelines to catch agent regressions before deployment. Automate quality checks for AI-generated code.
# .github/workflows/benchmark.yml
- name: Run youBencha
run: yb run --ci
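A fuller workflow might look like the following sketch; the trigger, job layout, and config path comment are illustrative, and any agent-specific setup (such as Copilot CLI authentication) is assumed to be handled separately.

# .github/workflows/benchmark.yml (expanded sketch)
name: Agent benchmark
on:
  pull_request:
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20                  # youBencha requires Node.js 20+
      - name: Install youBencha
        run: npm install -g @youbencha/youbencha-cli
      - name: Run youBencha
        run: yb run --ci                    # add -c <config> to point at your benchmark file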