A developer-first CLI framework for testing AI-powered coding tools. Run agents, measure their output, and get objective insights—all through a simple command-line interface.
From zero to first evaluation in under 5 minutes
Install globally via npm or use npx for one-time runs.
npm install -g youbencha
# Verify installation
yb --version

Create a starter configuration file with helpful comments, then customize it for your needs.
# Create suite.yaml with helpful comments
yb init

Execute the evaluation suite and generate a comprehensive report.
yb run -c suite.yaml
yb report --from .youbencha-workspace/run-*/artifacts/results.json

Essential commands and links for working with youBencha
yb init
  Create a starter configuration file with helpful comments
  Usage: yb init [--force]

yb run
  Execute an evaluation suite against an AI agent
  Usage: yb run -c suite.yaml [--delete-workspace]

yb report
  Generate a human-readable report from evaluation results
  Usage: yb report --from results.json [--format markdown|json]

yb validate
  Validate a suite configuration without running an evaluation
  Usage: yb validate -c suite.yaml [-v]

yb list
  List available built-in evaluators and their descriptions
  Usage: yb list

yb suggest-suite
  Generate evaluation suite suggestions using AI
  Usage: yb suggest-suite --agent copilot-cli --output-dir ./output

Objective evaluation for AI coding agents
Pluggable adapter system works with any AI agent. Start with GitHub Copilot CLI today, switch to other agents tomorrow. No vendor lock-in.
Evaluate beyond 'does it compile?' with git-diff for scope analysis, expected-diff for similarity scoring, and agentic-judge for AI-powered quality assessment.
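For instance, a suite could combine several of these built-in evaluators. The snippet below is only a sketch: the evaluators key and per-entry layout are assumed to mirror the post_evaluation shape shown later on this page, and the evaluator names come from yb list; check the generated suite.yaml for the actual schema.

# Sketch only: key names assumed, evaluator names from `yb list`
evaluators:
  - name: git-diff        # scope analysis of the changes the agent made
  - name: expected-diff   # similarity score against a reference diff
  - name: agentic-judge   # AI-powered quality assessment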
Safe, repeatable evaluations in isolated workspaces. Complete execution logs in youBencha Log format. Never modifies your working directory.
Simple YAML configuration, one-command execution (yb run), and comprehensive reporting. Built by developers, for developers.
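In practice, a full cycle is three commands, using only the flags listed in the command reference above:

# Validate the suite, run it, then render a markdown report
yb validate -c suite.yaml
yb run -c suite.yaml
yb report --from .youbencha-workspace/run-*/artifacts/results.json --format markdown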
Pre-execution hooks for environment setup, custom evaluators for domain-specific criteria, and post-evaluation hooks for database export and notifications.
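As a rough sketch of how hooks could sit in suite.yaml: only the post_evaluation/database shape below is taken from the example later on this page; the pre_execution key and its hook name are illustrative assumptions, not a documented schema.

# pre_execution is an assumed key; the post_evaluation block matches the example below
pre_execution:
  - name: setup-environment   # hypothetical hook for environment setup
post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./results-history.jsonl
      append: true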
Real-world scenarios where youBencha delivers value
AI engineers get immediate feedback during prompt engineering. Debug agent failures with full context (logs, diffs, metrics). Iterate quickly on agent configurations and track token usage and costs per evaluation.
# Edit prompt, run evaluation, review results, iterate
yb run -c suite.yaml
yb report --from .youbencha-workspace/run-*/artifacts/results.json

Development teams can compare different models (GPT-5, Claude Sonnet, Gemini) on identical tasks. Identify which tasks are hardest for agents, aggregate metrics across test suites, and make data-driven tool decisions.
# Compare models on same task
agent:
  type: copilot-cli
  model: claude-sonnet-4.5  # Or gpt-5, gemini-3-pro-preview

Organizations track performance across model and prompt updates, detect quality degradation early, and optimize costs and ROI. Long-term benchmarking supports data-driven decisions about tooling.
# Export results for time-series analysis
post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./results-history.jsonl
      append: true

Standardize agent assessment across your team with shared benchmark configs. Identify the hardest tasks, recognize patterns in common failure modes, and aggregate metrics (pass rate, similarity scores, costs).
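One way to do this is to keep suite files in a shared repository and have everyone run the same configs; the benchmarks/ paths below are just an example layout, not a required convention.

# Shared team suites, run with the same command everyone uses locally and in CI
yb run -c benchmarks/refactoring-suite.yaml
yb run -c benchmarks/bugfix-suite.yaml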
Run benchmarks in CI/CD pipelines to catch agent regressions before deployment. Automate quality checks for AI-generated code with webhook notifications for Slack, Teams, or custom integrations.
# GitHub Actions integration
- run: npm install -g youbencha
- run: yb run -c suite.yaml
- run: |
    FAILED=$(jq '.summary.failed' results.json)
    [ "$FAILED" -eq 0 ] || exit 1