Bencha mascot youBencha
Confident developer benchmarking AI coding agents

Test-Driven Development for AI Coding Agents

Stop evaluating with vibes. youBencha brings TDD discipline to AI agent development—define expectations first, iterate to green, and prevent regressions with objective, reproducible metrics.

đźš§ Currently in Beta

Get Started in 5 Steps

Apply TDD principles to your AI agent development

1

Install youBencha

Install globally via npm to access the yb CLI anywhere.

npm install -g youbencha
2

Create Your Agent

Start simple. Create a named agent in .github/agents/ with specific instructions. Named agents let you version-control your agent's behavior and reuse it across test cases.

.github/agents/readme-commenter.md
# .github/agents/readme-commenter.md
---
name: readme-commenter
description: Agent that adds helpful comments to README files
---

You are a documentation specialist. When asked to modify a README:
1. Read the existing content carefully
2. Add clear, helpful comments that explain the purpose
3. Preserve existing formatting and structure
4. Keep changes minimal and focused
3

Identify your First Test Case/Repo and Create Your Test Case Configuration

Run yb init to generate a starter configuration, or create your own. Reference your named agent and define expectations before running—this forces you to clearly specify what success looks like, the core TDD principle.

suite.yaml
# Quick start: yb init creates suite.yaml with helpful comments
# Or define your own test case:

name: "Add README Comment"
description: "Agent should add a helpful comment to the README"

repo: https://github.com/youbencha/hello-world.git
branch: main

agent:
  type: copilot-cli
  agent_name: readme-commenter  # Uses .github/agents/readme-commenter.md
  config:
    prompt: "Add a comment at the top of README.md explaining this is a test repository"

evaluators:
  - name: git-diff
  - name: agentic-judge
    config:
      type: copilot-cli
      assertions:
        readme_modified: "README.md was modified. Score 1 if true, 0 if false."
        has_comment: "A comment was added. Score 1 if true, 0 if false."
4

Run and Observe

Execute your test case and review the results of the evaluators. Which assertions passed? Which failed?

yb run -c suite.yaml
5

Iterate to Green

If assertions fail, improve your approach: clarify prompts, add context, or constrain scope. Then run again. This is the TDD loop—iterate until your assertions pass.

# Refine prompt, run again
yb run -c suite.yaml

Requirements

  • âś“ Node.js 20+ (check with: node --version)
  • âś“ Git installed (check with: git --version)
  • âś“ AI coding agent (GitHub Copilot CLI: npm install -g @githubnext/github-copilot-cli)
  • âś— No additional authentication - runs in your local environment context

Quick Reference

Essential commands and links for working with youBencha

CLI Commands

yb init

Create a starter configuration file with helpful comments

yb init [--force]
yb run

Execute an evaluation suite against an AI agent

yb run -c suite.yaml [--delete-workspace]
yb report

Generate human-readable report from evaluation results

yb report --from results.json [--format markdown|json]
yb validate

Validate suite configuration without running evaluation

yb validate -c suite.yaml [-v]
yb list

List available built-in evaluators and their descriptions

yb list

Quick Links

Why youBencha?

TDD principles applied to AI agent evaluation

Red → Green → Improve

Apply the classic TDD cycle to agents: define assertions first, iterate until they pass, then refine context and expand coverage. Reproducible, objective, and scalable.

Assertion-Driven Evaluation

Move beyond 'vibes-based' evaluation. Define clear success criteria with binary, graded, or quantitative assertions that measure task completion, code quality, and instruction following.

Agent-Agnostic Architecture

Pluggable adapter system works with any AI agent. Start with GitHub Copilot CLI today, switch to other agents tomorrow. No vendor lock-in.

Reproducible & Isolated

Safe, repeatable evaluations in isolated workspaces. Pin to specific commits, branches, or repositories. Complete execution logs for debugging agent failures.

Regression Prevention

When you change prompts or configurations, your test suite catches regressions. Build baseline results, compare new runs, and detect quality degradation automatically.

—
GitHub Stars
Open Source
MIT License
Community
Growing Active

Use Cases

TDD for agents in practice

Rapid Prompt Iteration

Define assertions first, run your agent, analyze results. When assertions fail, clarify prompts, add context, or constrain scope. Iterate until green—just like TDD for code.

# The TDD iteration loop
yb run -c testcase.yaml    # Run agent
# Analyze results, adjust prompt
yb run -c testcase.yaml    # Run again until green

Flexible Assertion Patterns

Binary assertions for yes/no checks, graded assertions for partial credit, and quantitative assertions for measurable criteria. Cover task completion, code quality, best practices, and documentation.

assertions:
  task_completed: "Feature implemented. Score 1/0.5/0."
  documentation_quality: "Score 1.0 if full JSDoc, 0.7 if partial."
  tests_pass: "All tests pass. Score 1 if true, 0 if false."

Testing Across Contexts

True agent quality means performing well across varied scenarios. Test on different repositories, branches, and commits. Compare against 'gold standard' reference implementations with expected-diff.

# Pin to exact commit for reproducibility
repo: https://github.com/example/project.git
branch: main
commit: abc123def456

Regression Testing

Build a test suite that protects against regressions. When you change prompts or agent configurations, run your suite to ensure previous capabilities still work. Store baseline results and compare.

# Run regression suite after changes
for config in regression-suite/*.yaml; do
  yb run -c "$config"
done

CI/CD Integration

Integrate youBencha into GitHub Actions or your CI pipeline. Catch agent regressions automatically before deployment. Use post-evaluation hooks for Slack, Teams, or database export.

# GitHub Actions integration
- run: npm install -g youbencha
- run: yb run -c suite.yaml
- run: |
    FAILED=$(jq '.summary.failed' results.json)
    [ "$FAILED" -eq 0 ] || exit 1