
Model Selection

youBencha supports multiple AI models for both agent execution and evaluation. Choose the right model for your use case.

Set the model for the main agent:

agent:
  type: copilot-cli
  model: claude-sonnet-4.5
  config:
    prompt: "Your task here"

Set the model for agentic-judge evaluators:

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      model: gpt-5
      assertions:
        quality: "Code quality is high. Score 0-1."
Claude models:

Model             | Description         | Best For
claude-sonnet-4.5 | Latest Claude       | Complex coding tasks
claude-sonnet-4   | Previous generation | Balanced performance
claude-haiku-4.5  | Fast, lightweight   | Quick evaluations

GPT models:

Model              | Description  | Best For
gpt-5              | Latest GPT   | General purpose
gpt-5.1            | Enhanced GPT | Improved reasoning
gpt-5.1-codex-mini | Code-focused | Simple code tasks
gpt-5.1-codex      | Full Codex   | Complex code generation

Gemini models:

Model                | Description   | Best For
gemini-3-pro-preview | Preview model | Experimental use
Recommended models for agents:

Use Case         | Recommended Model
Complex features | claude-sonnet-4.5
Simple changes   | claude-haiku-4.5
Code refactoring | gpt-5.1-codex
Fast iteration   | claude-haiku-4.5

Recommended models for evaluators:

Use Case           | Recommended Model
Quality assessment | claude-sonnet-4.5
Security review    | gpt-5
Quick checks       | claude-haiku-4.5
Cost optimization  | claude-haiku-4.5
Use the same model for both the agent and the evaluator:

agent:
  type: copilot-cli
  model: claude-sonnet-4.5
  config:
    prompt: "Implement feature"

evaluators:
  - name: agentic-judge
    config:
      type: copilot-cli
      model: claude-sonnet-4.5 # Same model
      assertions:
        quality: "Code quality. Score 0-1."
Or pair a powerful coding model with a faster, cheaper evaluation model:

# Use a powerful model for coding
agent:
  type: copilot-cli
  model: claude-sonnet-4.5
  config:
    prompt: "Implement complex feature"

evaluators:
  # Use a faster model for evaluation
  - name: agentic-judge
    config:
      type: copilot-cli
      model: claude-haiku-4.5 # Faster, cheaper
      assertions:
        basic_quality: "Basic quality check. Score 0-1."

Use environment variables for flexibility:

agent:
  type: copilot-cli
  model: ${AGENT_MODEL}
  config:
    prompt_file: ./prompts/task.md
# Development: fast model
AGENT_MODEL=claude-haiku-4.5 yb run -c suite.yaml
# Production: best model
AGENT_MODEL=claude-sonnet-4.5 yb run -c suite.yaml

Currently, youBencha doesn’t support automatic fallback. If a model is unavailable, the evaluation fails.
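
If you need fallback behavior today, it can be scripted around the CLI. A minimal sketch, assuming yb run exits with a non-zero status when the run fails and that the suite reads its model from ${AGENT_MODEL} as in the example above; the model names are illustrative:

#!/usr/bin/env bash
# Manual fallback: try the preferred model, then retry with a cheaper one.
# Assumes a non-zero exit code from `yb run` signals the failed run.
set -u

if ! AGENT_MODEL=claude-sonnet-4.5 yb run -c suite.yaml; then
  echo "claude-sonnet-4.5 failed; retrying with claude-haiku-4.5" >&2
  AGENT_MODEL=claude-haiku-4.5 yb run -c suite.yaml
fi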

Model             | Relative Cost | Speed
claude-haiku-4.5  | $             | Fast
claude-sonnet-4   | $$            | Medium
claude-sonnet-4.5 | $$$           | Medium
gpt-5.1-codex     | $$$           | Medium

Tips:

  • Use faster models during development
  • Reserve powerful models for final evaluations
  • Use lightweight models for frequent CI checks
Best practices:

  1. Start with defaults - Begin without specifying a model
  2. Benchmark models - Test the same task across models (see the sketch after this list)
  3. Match complexity - Use powerful models for complex tasks
  4. Consider cost - Balance quality vs. budget
  5. Document choices - Explain model selection in the suite config
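
For point 2, one way to benchmark is to run the identical suite once per model via the ${AGENT_MODEL} variable shown earlier. A minimal sketch; the model list is illustrative and assumes the suite config reads ${AGENT_MODEL}:

#!/usr/bin/env bash
# Benchmark: run the same suite against several candidate models.
# Assumes suite.yaml sets model: ${AGENT_MODEL} as in the environment-variable example.
set -u

for model in claude-haiku-4.5 claude-sonnet-4.5 gpt-5.1-codex; do
  echo "=== ${model} ==="
  AGENT_MODEL="${model}" yb run -c suite.yaml
done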