Results & Reporting
youBencha produces comprehensive results that can be analyzed at different scales, from single runs to time-series analysis.
Result Types
Section titled “Result Types”Single Run
Section titled “Single Run”Quick feedback: Did it work? What changed?
yb run -c suite.yamlyb report --from .youbencha-workspace/run-*/artifacts/results.jsonSuite of Results
Section titled “Suite of Results”Cross-comparison across multiple test cases:
- Identify hardest tasks
- Calculate pass rate
- Find common failure patterns
Time-Series
Section titled “Time-Series”Track performance over time:
- Detect regressions
- Analyze trends
- Forecast budgets
Results Bundle Schema
Section titled “Results Bundle Schema”Every evaluation produces a results.json:
{ "version": "1.0", "timestamp": "2024-11-15T10:30:00Z",
"test_case": { "name": "auth-evaluation", "description": "Evaluate authentication implementation", "repo": "https://github.com/example/repo.git", "branch": "main" },
"execution": { "duration_ms": 45200, "agent_duration_ms": 30000, "evaluator_duration_ms": 15200 },
"evaluators": [ { "name": "git-diff", "status": "passed", "metrics": { "files_changed": 3, "lines_added": 45, "lines_removed": 12 } }, { "name": "agentic-judge", "status": "passed", "metrics": { "assertions": { "quality": { "score": 0.9, "passed": true } } } } ],
"summary": { "overall_status": "passed", "passed": 2, "failed": 0, "duration_ms": 45200 }}Generating Reports
Section titled “Generating Reports”Markdown Report
Section titled “Markdown Report”yb report --from results.jsonOutput:
📊 youBencha Evaluation Report==============================
Suite: auth-evaluationStatus: ✅ PASSEDDuration: 45.2s
Evaluator Results:------------------✅ git-diff Files changed: 3 Lines added: 45
✅ agentic-judge quality: 0.9 (PASS)JSON Report
Section titled “JSON Report”yb report --from results.json --format jsonAnalyzing Results
Section titled “Analyzing Results”Pass Rate
Section titled “Pass Rate”Calculate overall pass rate:
jq -s ' map(select(.summary.overall_status == "passed")) | length as $passed | length as $total | ($passed / $total * 100) | floor' results/*.jsonHardest Tasks
Section titled “Hardest Tasks”Find evaluators that fail most often:
jq -s ' [.[].evaluators[] | select(.status == "failed") | .name] | group_by(.) | map({name: .[0], count: length}) | sort_by(-.count)' results/*.jsonAverage Scores
Section titled “Average Scores”Calculate mean scores for agentic-judge:
jq -s ' [.[].evaluators[] | select(.name | startswith("agentic-judge")) | .metrics.assertions | to_entries[] | .value.score] | add / length' results/*.jsonTime-Series Analysis
Section titled “Time-Series Analysis”Building History
Section titled “Building History”Append results to JSONL file:
post_evaluation: - name: database config: type: json-file output_path: ./history/evaluations.jsonl append: trueDetecting Regressions
Section titled “Detecting Regressions”#!/bin/bash# Compare last two runsPREV=$(tail -n 2 history.jsonl | head -n 1 | jq '.summary.passed')CURR=$(tail -n 1 history.jsonl | jq '.summary.passed')
if [ "$CURR" -lt "$PREV" ]; then echo "⚠️ REGRESSION DETECTED" echo "Previous: $PREV passed, Current: $CURR passed" exit 1fiTrend Analysis
Section titled “Trend Analysis”# Pass rate over last 10 runstail -n 10 history.jsonl | jq -s ' map(.summary.overall_status == "passed") | map(if . then 1 else 0 end) | add / length * 100'Custom Analysis Scripts
Section titled “Custom Analysis Scripts”Export to CSV
Section titled “Export to CSV”#!/bin/bashecho "timestamp,suite,status,passed,failed,duration_ms"jq -r ' [.timestamp, .test_case.name, .summary.overall_status, .summary.passed, .summary.failed, .summary.duration_ms] | @csv' results/*.jsonSlack Summary
Section titled “Slack Summary”#!/bin/bashTOTAL=$(ls results/*.json | wc -l)PASSED=$(jq -s '[.[] | select(.summary.overall_status == "passed")] | length' results/*.json)RATE=$((PASSED * 100 / TOTAL))
echo "📊 Weekly Summary"echo "Total: $TOTAL | Passed: $PASSED | Rate: ${RATE}%"Visualization
Section titled “Visualization”Simple Charts
Section titled “Simple Charts”Using sparklines:
# Pass/fail trend (requires spark)jq -r '.summary.overall_status' history/*.json | \ sed 's/passed/1/;s/failed/0/' | \ sparkExport for Dashboards
Section titled “Export for Dashboards”Create data for Grafana/DataDog:
jq -c '{ metric: "youbencha.evaluation", timestamp: .timestamp, value: (if .summary.overall_status == "passed" then 1 else 0 end), tags: { suite: .test_case.name, branch: .test_case.branch }}' results/*.jsonBest Practices
Section titled “Best Practices”- Export early - Start building history from day 1
- Consistent naming - Use same suite names for tracking
- Alert on regressions - Automate detection
- Archive raw results - Keep for deep analysis
- Regular reviews - Analyze trends weekly