
Results & Reporting

youBencha produces comprehensive results that can be analyzed at several scales: a single run, a batch of test cases, or a time series across many runs.

Quick feedback: Did it work? What changed?

yb run -c suite.yaml
yb report --from .youbencha-workspace/run-*/artifacts/results.json

Cross-comparison across multiple test cases:

  • Identify hardest tasks
  • Calculate pass rate
  • Find common failure patterns

Track performance over time:

  • Detect regressions
  • Analyze trends
  • Forecast budgets

Every evaluation produces a results.json:

{
  "version": "1.0",
  "timestamp": "2024-11-15T10:30:00Z",
  "test_case": {
    "name": "auth-evaluation",
    "description": "Evaluate authentication implementation",
    "repo": "https://github.com/example/repo.git",
    "branch": "main"
  },
  "execution": {
    "duration_ms": 45200,
    "agent_duration_ms": 30000,
    "evaluator_duration_ms": 15200
  },
  "evaluators": [
    {
      "name": "git-diff",
      "status": "passed",
      "metrics": {
        "files_changed": 3,
        "lines_added": 45,
        "lines_removed": 12
      }
    },
    {
      "name": "agentic-judge",
      "status": "passed",
      "metrics": {
        "assertions": {
          "quality": { "score": 0.9, "passed": true }
        }
      }
    }
  ],
  "summary": {
    "overall_status": "passed",
    "passed": 2,
    "failed": 0,
    "duration_ms": 45200
  }
}
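
Because the file is plain JSON, individual fields are one jq call away; for example, to check the overall status:

jq -r '.summary.overall_status' results.json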
Generate a human-readable report:

yb report --from results.json

Output:

📊 youBencha Evaluation Report
==============================

Suite: auth-evaluation
Status: ✅ PASSED
Duration: 45.2s

Evaluator Results:
------------------
✅ git-diff
   Files changed: 3
   Lines added: 45
✅ agentic-judge
   quality: 0.9 (PASS)
For machine-readable output:

yb report --from results.json --format json
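
The JSON output can then be piped straight into other tools (this assumes the report is written to stdout):

yb report --from results.json --format json | jq .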

Calculate overall pass rate:

jq -s '
  length as $total |
  (map(select(.summary.overall_status == "passed")) | length) as $passed |
  ($passed / $total * 100) | floor
' results/*.json

Find evaluators that fail most often:

jq -s '
  [.[].evaluators[] | select(.status == "failed") | .name] |
  group_by(.) |
  map({name: .[0], count: length}) |
  sort_by(-.count)
' results/*.json
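
The same grouping trick applied to test case names, rather than evaluator names, surfaces the hardest tasks (a sketch over the results.json schema above):

jq -s '
  [.[] | select(.summary.overall_status == "failed") | .test_case.name] |
  group_by(.) |
  map({task: .[0], failures: length}) |
  sort_by(-.failures)
' results/*.json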

Calculate mean scores for agentic-judge:

jq -s '
  [.[].evaluators[] |
    select(.name | startswith("agentic-judge")) |
    .metrics.assertions | to_entries[] | .value.score] |
  add / length
' results/*.json

Append results to JSONL file:

post_evaluation:
- name: database
config:
type: json-file
output_path: ./history/evaluations.jsonl
append: true
A simple regression check compares the last two runs:

#!/bin/bash
# Compare the last two runs in the history file
PREV=$(tail -n 2 history/evaluations.jsonl | head -n 1 | jq '.summary.passed')
CURR=$(tail -n 1 history/evaluations.jsonl | jq '.summary.passed')
if [ "$CURR" -lt "$PREV" ]; then
  echo "⚠️ REGRESSION DETECTED"
  echo "Previous: $PREV passed, Current: $CURR passed"
  exit 1
fi
Compute the pass rate over the last 10 runs:

tail -n 10 history/evaluations.jsonl | jq -s '
  map(.summary.overall_status == "passed") |
  map(if . then 1 else 0 end) |
  add / length * 100
'
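
The same slice of history supports rough budget forecasting: averaging summary.duration_ms over recent runs gives a baseline to project from (a sketch using the schema above):

# Average duration in seconds over the last 10 runs
tail -n 10 history/evaluations.jsonl | jq -s '
  map(.summary.duration_ms) | add / length / 1000
'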
Export results to CSV for spreadsheet analysis:

scripts/to-csv.sh

#!/bin/bash
echo "timestamp,suite,status,passed,failed,duration_ms"
jq -r '
  [.timestamp, .test_case.name, .summary.overall_status,
   .summary.passed, .summary.failed, .summary.duration_ms] |
  @csv
' results/*.json
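
Redirect the output to a file and open it in any spreadsheet tool (report.csv is just an example name):

./scripts/to-csv.sh > report.csv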
Summarize all results at a glance:

scripts/summary.sh

#!/bin/bash
TOTAL=$(ls results/*.json | wc -l)
PASSED=$(jq -s '[.[] | select(.summary.overall_status == "passed")] | length' results/*.json)
RATE=$((PASSED * 100 / TOTAL))
echo "📊 Weekly Summary"
echo "Total: $TOTAL | Passed: $PASSED | Rate: ${RATE}%"

Using sparklines:

# Pass/fail trend over the full history (requires the spark utility)
jq -r '.summary.overall_status' history/evaluations.jsonl | \
  sed 's/passed/1/;s/failed/0/' | \
  spark

Create data for Grafana/DataDog:

jq -c '{
  metric: "youbencha.evaluation",
  timestamp: .timestamp,
  value: (if .summary.overall_status == "passed" then 1 else 0 end),
  tags: {
    suite: .test_case.name,
    branch: .test_case.branch
  }
}' results/*.json
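
How you ship these points depends on the backend. As one illustration, Datadog's v1 series API accepts POSTed JSON; the sketch below reshapes each result for that endpoint and assumes a DD_API_KEY environment variable is set:

# Ship one point per result to Datadog (illustrative sketch)
jq -c '{
  series: [{
    metric: "youbencha.evaluation",
    points: [[(.timestamp | fromdateiso8601), (if .summary.overall_status == "passed" then 1 else 0 end)]],
    tags: ["suite:\(.test_case.name)", "branch:\(.test_case.branch)"]
  }]
}' results/*.json | while read -r payload; do
  curl -s -X POST "https://api.datadoghq.com/api/v1/series" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$payload"
done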
Best practices:

  1. Export early - Start building history from day 1
  2. Consistent naming - Use same suite names for tracking
  3. Alert on regressions - Automate detection
  4. Archive raw results - Keep for deep analysis
  5. Regular reviews - Analyze trends weekly
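
To automate item 3, the regression script above can gate CI. A minimal GitHub Actions sketch (the workflow file, script path, and an already-installed yb are all assumptions):

# .github/workflows/youbencha.yml (sketch)
name: youBencha
on: [push]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: yb run -c suite.yaml
      - run: yb report --from .youbencha-workspace/run-*/artifacts/results.json
      - run: ./scripts/check-regression.sh   # hypothetical name for the regression script shown earlier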