Model Selection
youBencha supports multiple AI models for both agent execution and evaluation. Choose the right model for your use case.
Specifying Models
Agent Model
Set the model for the main agent:

    agent:
      type: copilot-cli
      model: claude-sonnet-4.5
      config:
        prompt: "Your task here"

Evaluator Model
Set the model for agentic-judge evaluators:

    evaluators:
      - name: agentic-judge
        config:
          type: copilot-cli
          model: gpt-5
          assertions:
            quality: "Code quality is high. Score 0-1."

Supported Models
Claude Models
| Model | Description | Best For |
|---|---|---|
| claude-sonnet-4.5 | Latest Claude | Complex coding tasks |
| claude-sonnet-4 | Previous generation | Balanced performance |
| claude-haiku-4.5 | Fast, lightweight | Quick evaluations |
GPT Models
| Model | Description | Best For |
|---|---|---|
| gpt-5 | Latest GPT | General purpose |
| gpt-5.1 | Enhanced GPT | Improved reasoning |
| gpt-5.1-codex-mini | Code-focused | Simple code tasks |
| gpt-5.1-codex | Full Codex | Complex code generation |
Gemini Models
| Model | Description | Best For |
|---|---|---|
| gemini-3-pro-preview | Preview model | Experimental use |
Model Selection Strategy
For Agents
| Use Case | Recommended Model |
|---|---|
| Complex features | claude-sonnet-4.5 |
| Simple changes | claude-haiku-4.5 |
| Code refactoring | gpt-5.1-codex |
| Fast iteration | claude-haiku-4.5 |
For Evaluators
| Use Case | Recommended Model |
|---|---|
| Quality assessment | claude-sonnet-4.5 |
| Security review | gpt-5 |
| Quick checks | claude-haiku-4.5 |
| Cost optimization | claude-haiku-4.5 |
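The two tables above can be captured as a small lookup so scripts do not hard-code model names in several places. This is a minimal sketch: the function name `recommend_model` and the use-case keys are invented for illustration and are not part of youBencha; only the model names come from the tables.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of youBencha): map a role and use case
# to the model recommended in the tables above.
recommend_model() {
  local role="$1" use_case="$2"
  case "$role:$use_case" in
    agent:complex-features)  echo "claude-sonnet-4.5" ;;
    agent:simple-changes)    echo "claude-haiku-4.5" ;;
    agent:refactoring)       echo "gpt-5.1-codex" ;;
    agent:fast-iteration)    echo "claude-haiku-4.5" ;;
    evaluator:quality)       echo "claude-sonnet-4.5" ;;
    evaluator:security)      echo "gpt-5" ;;
    evaluator:quick-checks)  echo "claude-haiku-4.5" ;;
    evaluator:cost)          echo "claude-haiku-4.5" ;;
    *)
      # Unknown combinations are an error rather than a silent default.
      echo "unknown use case: $role:$use_case" >&2
      return 1
      ;;
  esac
}
```

For example, `recommend_model agent refactoring` prints `gpt-5.1-codex`, and an unrecognized key returns a non-zero status so a calling script can stop early.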
Using Different Models
Same Model for All

    agent:
      type: copilot-cli
      model: claude-sonnet-4.5
      config:
        prompt: "Implement feature"

    evaluators:
      - name: agentic-judge
        config:
          type: copilot-cli
          model: claude-sonnet-4.5  # Same model
          assertions:
            quality: "Code quality. Score 0-1."

Different Models

    # Use powerful model for coding
    agent:
      type: copilot-cli
      model: claude-sonnet-4.5
      config:
        prompt: "Implement complex feature"

    evaluators:
      # Use faster model for evaluation
      - name: agentic-judge
        config:
          type: copilot-cli
          model: claude-haiku-4.5  # Faster, cheaper
          assertions:
            basic_quality: "Basic quality check. Score 0-1."

Environment-Based Selection
Use environment variables for flexibility:

    agent:
      type: copilot-cli
      model: ${AGENT_MODEL}
      config:
        prompt_file: ./prompts/task.md

Then set the variable when invoking yb:

    # Development: fast model
    AGENT_MODEL=claude-haiku-4.5 yb run -c suite.yaml

    # Production: best model
    AGENT_MODEL=claude-sonnet-4.5 yb run -c suite.yaml

Model Fallback
Currently, youBencha doesn’t support automatic fallback. If a model is unavailable, the evaluation fails.
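Until automatic fallback exists, a wrapper script can approximate it by retrying the run with the next model in a list. The sketch below rests on two assumptions that this page does not confirm: that `yb run` exits with a non-zero status when a run fails, and that the suite reads its model from `${AGENT_MODEL}`. The function name `run_with_fallback` and the `YB_CMD` override are hypothetical; `YB_CMD` exists only so the loop can be exercised with a stub.

```shell
#!/usr/bin/env bash
# Manual fallback sketch (not a youBencha feature).
# Assumption: `yb run` returns non-zero on failure.
# YB_CMD is a hypothetical hook for substituting a stub command in tests.
run_with_fallback() {
  local suite="$1"; shift
  local model
  for model in "$@"; do
    echo "Trying model: $model" >&2
    # AGENT_MODEL is set only for this one invocation.
    if AGENT_MODEL="$model" "${YB_CMD:-yb}" run -c "$suite"; then
      echo "Succeeded with model: $model" >&2
      return 0
    fi
  done
  echo "All models failed" >&2
  return 1
}
```

A call might look like `run_with_fallback suite.yaml claude-sonnet-4.5 claude-haiku-4.5`. Note that the loop retries on any failure, not only on an unavailable model, so a suite that fails for other reasons will be run once per listed model.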
Cost Considerations
| Model | Relative Cost | Speed |
|---|---|---|
| claude-haiku-4.5 | $ | Fast |
| claude-sonnet-4 | $$ | Medium |
| claude-sonnet-4.5 | $$$ | Medium |
| gpt-5.1-codex | $$$ | Medium |
Tips:
- Use faster models during development
- Reserve powerful models for final evaluations
- Use lightweight models for frequent CI checks
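One way to apply these tips is to pick the model from the stage of work rather than editing the suite each time. In this sketch, `EVAL_STAGE` and `select_agent_model` are hypothetical names invented for illustration; `AGENT_MODEL` is the variable used for environment-based selection earlier on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose a model tier from the stage of work.
# EVAL_STAGE is an invented variable; expected values here are "final" or anything else.
select_agent_model() {
  if [ -n "${AGENT_MODEL:-}" ]; then
    echo "$AGENT_MODEL"        # an explicit override always wins
  elif [ "${EVAL_STAGE:-}" = "final" ]; then
    echo "claude-sonnet-4.5"   # final evaluation: most capable model
  else
    echo "claude-haiku-4.5"    # development and frequent CI checks: fast and cheap
  fi
}
```

A CI job could then run `AGENT_MODEL="$(select_agent_model)" yb run -c suite.yaml` and set `EVAL_STAGE=final` only on release pipelines.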
Best Practices
- Start with defaults - Begin without specifying a model
- Benchmark models - Test the same task across models
- Match complexity - Use powerful models for complex tasks
- Consider cost - Balance quality vs. budget
- Document choices - Explain model selection in suite config
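The "Benchmark models" practice can be scripted by running one suite once per model. This is a rough sketch, assuming the suite reads `${AGENT_MODEL}` as in the environment-based example and that `yb run` signals failure through its exit status. `benchmark_models` and `YB_CMD` are hypothetical names, and where youBencha writes its results is not covered here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the same suite once for each listed model.
# Assumption: `yb run` returns non-zero on failure.
# YB_CMD is an invented hook for substituting a stub command in tests.
benchmark_models() {
  local suite="$1"; shift
  local model failed=0
  for model in "$@"; do
    echo "=== $model ==="
    # Keep going after a failure so every model gets a run.
    AGENT_MODEL="$model" "${YB_CMD:-yb}" run -c "$suite" || {
      echo "FAILED: $model"
      failed=1
    }
  done
  return "$failed"
}
```

For example, `benchmark_models suite.yaml claude-haiku-4.5 claude-sonnet-4.5 gpt-5.1-codex` runs three times and returns non-zero if any run failed.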