Security

Security is a critical consideration when running AI agent evaluations. This guide covers youBencha’s security model and best practices.

Security Model

youBencha implements multiple security layers:

Repository validation - Blocks dangerous URLs
Workspace isolation - Isolated execution environments
Controlled execution - Managed command execution
Credentials protection - Secrets are supplied through environment variables, never hardcoded

Repository Validation

Blocked URLs

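youBencha blocks repository URLs that point at localhost or private network addresses, since these could enable server-side request forgery (SSRF) attacks: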

# ❌ Blocked
repo: http://localhost/repo.git
repo: http://127.0.0.1/repo.git
repo: http://192.168.1.100/repo.git
repo: http://10.0.0.1/repo.git

# ✅ Allowed
repo: https://github.com/org/repo.git
repo: https://gitlab.com/org/repo.git
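A check of this kind can be approximated as follows. This is a minimal TypeScript sketch, not youBencha's actual implementation; the function name `isBlockedRepoUrl` is hypothetical. It inspects only literal hosts, so a public hostname that resolves to a private address (DNS rebinding) would need an additional check on the resolved IPs.

```typescript
import { isIP } from "node:net";

// Hypothetical helper (not youBencha's API): true if the repo URL targets
// localhost or a loopback, private, or link-local IP literal.
function isBlockedRepoUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return true; // reject anything that does not parse as a URL
  }
  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return true;

  // The WHATWG URL parser keeps brackets around IPv6 literals, e.g. "[::1]".
  const bare = host.replace(/^\[|\]$/g, "");
  if (isIP(bare) === 6) {
    // ::1 loopback, fc00::/7 unique-local, fe80:: link-local
    return bare === "::1" || /^f[cd]/.test(bare) || bare.startsWith("fe80:");
  }
  if (isIP(bare) !== 4) return false; // a DNS name: only literals are checked here

  const [a, b] = bare.split(".").map(Number);
  return (
    a === 0 ||
    a === 127 || // 127.0.0.0/8 loopback
    a === 10 || // 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) || // 192.168.0.0/16
    (a === 169 && b === 254) // 169.254.0.0/16 link-local (cloud metadata)
  );
}
```

Because `new URL` normalizes IPv4 shorthand for `http`/`https` URLs (for example `127.1` becomes `127.0.0.1`), such variants reach the numeric check in canonical form.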

URL Scheme Requirements

HTTPS is the expected protocol; plain HTTP is accepted only for public hosts, and other schemes are rejected:

Scheme	Allowed
`https://`	✅ Yes
`http://`	⚠️ Public only
`file://`	❌ No
`ssh://`	❌ No (for now)
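The scheme rules in the table can be sketched in TypeScript, assuming the URL is parsed with the WHATWG `URL` class; `schemeDecision` is a hypothetical name. For `http://` it returns `"public-only"`, meaning the host must still pass a private-address check like the one described under Blocked URLs.

```typescript
type SchemeDecision = "allowed" | "public-only" | "rejected";

// Hypothetical mapping of the scheme table; not youBencha's real code.
function schemeDecision(raw: string): SchemeDecision {
  let protocol: string;
  try {
    protocol = new URL(raw).protocol; // e.g. "https:"
  } catch {
    return "rejected"; // not a URL at all
  }
  switch (protocol) {
    case "https:":
      return "allowed";
    case "http:":
      return "public-only"; // caller must still reject private/loopback hosts
    default:
      return "rejected"; // file:, ssh:, git:, and anything unknown
  }
}
```

scp-style addresses such as `git@github.com:org/repo.git` do not parse as URLs, so this sketch rejects them as well.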

Workspace Isolation

Isolation Guarantees

Fresh clone for each evaluation
Separate directory per run
No cross-contamination between runs
Original repo unchanged

Workspace Location

Default: .youbencha-workspace/run-{timestamp}-{hash}/

All operations are scoped to this directory:

.youbencha-workspace/
└── run-20241115-103000-abc123/
    ├── src-modified/     # Agent works here
    ├── src-expected/     # Reference only
    └── artifacts/        # Output only
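To illustrate how a run directory name and the scoping rule could be implemented, here is a TypeScript sketch. The helper names, the use of local time, and the six-hex-character random suffix are assumptions inferred from the example `run-20241115-103000-abc123`. The containment check uses `path.relative` and does not resolve symlinks.

```typescript
import { randomBytes } from "node:crypto";
import * as path from "node:path";

// Hypothetical: builds names like "run-20241115-103000-abc123" (local time assumed).
function runDirName(now: Date = new Date()): string {
  const p = (n: number) => String(n).padStart(2, "0");
  const date = `${now.getFullYear()}${p(now.getMonth() + 1)}${p(now.getDate())}`;
  const time = `${p(now.getHours())}${p(now.getMinutes())}${p(now.getSeconds())}`;
  const hash = randomBytes(3).toString("hex"); // 6 lowercase hex chars
  return `run-${date}-${time}-${hash}`;
}

// Hypothetical guard: true only if `target` resolves inside `runDir`.
function isInside(runDir: string, target: string): boolean {
  const root = path.resolve(runDir);
  const rel = path.relative(root, path.resolve(root, target));
  return rel === "" || (rel !== ".." && !rel.startsWith(".." + path.sep) && !path.isAbsolute(rel));
}
```

Checking `".." + path.sep` rather than a bare `".."` prefix keeps legitimate names such as `..notes` inside the run directory.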

Command Execution Security

Pre-Execution Hooks

Scripts run in a controlled environment:

pre_execution:
  - name: script
    config:
      command: bash
      args: ["-c", "echo $WORKSPACE_DIR"]
      env:
        CUSTOM_VAR: "value"  # Explicit env vars only

Security measures:

Commands run with shell: true for flexibility, so review them as you would any shell script
Only explicitly specified environment variables are passed
The full process.env is never inherited
Timeouts are enforced
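The environment and timeout rules can be sketched with Node's `child_process.spawnSync`. This is a hypothetical runner, not youBencha's code: it passes only `PATH`, `WORKSPACE_DIR`, and the hook's explicit `env` entries, and it calls the configured command directly (Node's `shell` option is left off here because the example hook already invokes `bash -c`).

```typescript
import { spawnSync } from "node:child_process";

interface HookResult {
  stdout: string;
  status: number | null; // exit code, or null if the process was killed
  timedOut: boolean;
}

// Hypothetical hook runner: the child never sees the parent's full process.env.
function runHook(
  command: string,
  args: string[],
  workspaceDir: string,
  extraEnv: Record<string, string>,
  timeoutMs: number,
): HookResult {
  const env: Record<string, string> = {
    PATH: process.env.PATH ?? "/usr/bin:/bin", // assumption: PATH is passed so `bash` can be found
    ...extraEnv, // the hook's explicit `env:` block
    WORKSPACE_DIR: workspaceDir, // set last so a hook cannot override it
  };
  const result = spawnSync(command, args, { env, timeout: timeoutMs, encoding: "utf8" });
  return {
    stdout: result.stdout ?? "",
    status: result.status,
    timedOut: result.signal === "SIGTERM", // spawnSync sends SIGTERM on timeout by default
  };
}
```

The YAML example would correspond to `runHook("bash", ["-c", "echo $WORKSPACE_DIR"], workspaceDir, { CUSTOM_VAR: "value" }, 30_000)`, where the 30-second timeout is an illustrative value.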

Post-Evaluation Hooks

Post-evaluation hooks:

Run in parallel
Have read-only access to results
Cannot modify workspace
Cannot fail the main evaluation, even when they error
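A minimal sketch of these guarantees, assuming hooks are functions that receive the results object (the `PostHook` and `EvalResults` shapes are hypothetical). It runs hooks concurrently with `Promise.allSettled` and hands each a deep-frozen copy of the results; it does not show how the workspace itself is protected.

```typescript
// Hypothetical result shape; youBencha's real results bundle is richer.
interface EvalResults {
  suite: string;
  passed: boolean;
  scores: Record<string, number>;
}

interface PostHook {
  name: string;
  run: (results: Readonly<EvalResults>) => Promise<void> | void;
}

// Structured clone, then freeze every nested object so hooks get a read-only view.
function frozenCopy<T>(value: T): T {
  const copy = structuredClone(value);
  const freeze = (obj: unknown): void => {
    if (obj !== null && typeof obj === "object" && !Object.isFrozen(obj)) {
      Object.freeze(obj);
      for (const child of Object.values(obj)) freeze(child);
    }
  };
  freeze(copy);
  return copy;
}

async function runPostHooks(results: EvalResults, hooks: PostHook[]): Promise<string[]> {
  const snapshot = frozenCopy(results);
  // Start every hook at once; allSettled never rejects, so a failing hook cannot
  // fail the evaluation. Promise.resolve().then also captures synchronous throws.
  const outcomes = await Promise.allSettled(
    hooks.map((hook) => Promise.resolve().then(() => hook.run(snapshot))),
  );
  return outcomes.map((outcome, i) =>
    outcome.status === "fulfilled"
      ? `${hooks[i].name}: ok`
      : `${hooks[i].name}: failed (${String(outcome.reason)})`,
  );
}
```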

Credentials Management

Best Practices

Never hardcode secrets:

# ❌ Bad
post_evaluation:
  - name: webhook
    config:
      url: https://hooks.slack.com/services/T00/B00/XXXX

# ✅ Good
post_evaluation:
  - name: webhook
    config:
      url: ${SLACK_WEBHOOK_URL}

Set secrets as environment variables in your shell:

export SLACK_WEBHOOK_URL="https://hooks.slack.com/..."
export API_TOKEN="secret-token"
yb run -c suite.yaml

In CI/CD, map the platform's secrets to environment variables (GitHub Actions shown):

- name: Run Evaluation
  run: yb run -c suite.yaml
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

Variable Expansion

Use ${VAR} syntax anywhere in the configuration to reference environment variables:

repo: ${REPO_URL}
agent:
  config:
    prompt: "Token: ${API_TOKEN}"
post_evaluation:
  - name: webhook
    config:
      url: ${WEBHOOK_URL}
      headers:
        Authorization: "Bearer ${AUTH_TOKEN}"
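Substitution of this kind can be sketched as a single regular-expression pass. The sketch is hypothetical: youBencha's behavior for unset variables is not documented here, so this version throws (an assumption). GitHub Actions expressions such as `${{ secrets.X }}` are left untouched because the pattern requires an identifier right after `${`.

```typescript
// Hypothetical expander for ${VAR} references; throws on unset variables (assumption).
function expandVars(
  text: string,
  env: Record<string, string | undefined> = process.env,
): string {
  return text.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match: string, name: string) => {
    const value = env[name];
    if (value === undefined) {
      throw new Error(`Environment variable ${name} is not set`);
    }
    return value;
  });
}
```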

Token Scopes

GitHub Token

Minimum required scopes:

repo - Access repositories
read:user - Read user info (for Copilot)

For Copilot CLI, you also need:

An active Copilot subscription
A token with the copilot scope
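For classic personal access tokens, GitHub reports the granted scopes in the `X-OAuth-Scopes` response header, so a pre-flight scope check could look like this TypeScript sketch (Node 18+ for the global `fetch`). The function names and the small parent-scope table are hypothetical. If the header is absent (as may be the case for fine-grained tokens, which do not use OAuth scopes), the sketch returns `null`.

```typescript
// Assumption: a granted parent scope satisfies its child (`user` covers `read:user`).
const PARENT_SCOPES: Record<string, string[]> = { "read:user": ["user"] };

function parseScopes(header: string): string[] {
  return header.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
}

// Returns the required scopes that the granted list does not cover.
function missingScopes(granted: string[], required: string[]): string[] {
  const have = new Set(granted);
  return required.filter(
    (scope) => !have.has(scope) && !(PARENT_SCOPES[scope] ?? []).some((p) => have.has(p)),
  );
}

// Queries the GitHub API; returns null when no X-OAuth-Scopes header is sent.
async function fetchTokenScopes(token: string): Promise<string[] | null> {
  const res = await fetch("https://api.github.com/user", {
    headers: { Authorization: `Bearer ${token}`, Accept: "application/vnd.github+json" },
  });
  const header = res.headers.get("x-oauth-scopes");
  return header === null ? null : parseScopes(header);
}
```

A caller would pass the result of `fetchTokenScopes(process.env.GITHUB_TOKEN ?? "")` to `missingScopes(scopes, ["repo", "read:user"])` and stop early if anything is missing.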

Network Security

Outbound Connections

youBencha makes these external connections:

Purpose	Destination
Clone repo	GitHub/GitLab
AI agent	Copilot API
Webhooks	Your endpoints

Firewall Considerations

Ensure outbound access to:

github.com (repos, API)
api.github.com
copilot-proxy.githubusercontent.com
Your webhook endpoints
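To confirm the firewall allows these hosts before a long run, a plain TCP reachability probe is often enough. The TypeScript sketch below is hypothetical (not a youBencha command); it only tests that a port accepts a connection and does not check TLS or HTTP proxies.

```typescript
import * as net from "node:net";

// Hypothetical probe: resolves true if host:port accepts a TCP connection in time.
function canReach(host: string, port = 443, timeoutMs = 3000): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const done = (ok: boolean) => {
      socket.destroy(); // always release the socket
      resolve(ok); // later calls are no-ops once the promise settles
    };
    socket.setTimeout(timeoutMs, () => done(false));
    socket.once("connect", () => done(true));
    socket.once("error", () => done(false));
  });
}
```

Running `await canReach(host)` for `github.com`, `api.github.com`, and `copilot-proxy.githubusercontent.com` gives a quick pass/fail per host.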

Audit Trail

Logging

Every evaluation logs:

Repository URL and branch
Agent configuration (prompts sanitized)
Evaluator results
Execution timing
Hook execution status
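How prompts are sanitized is not specified here. One plausible approach, sketched hypothetically below, replaces the values of secret-looking environment variables (names matching TOKEN, SECRET, KEY, PASSWORD, or WEBHOOK, an assumed convention) with a placeholder before the prompt is logged. This matters for configurations like the `Token: ${API_TOKEN}` prompt in the Variable Expansion example.

```typescript
const SECRET_NAME = /TOKEN|SECRET|KEY|PASSWORD|WEBHOOK/i; // assumed naming convention

// Hypothetical redactor: masks secret env values before a prompt is written to logs.
function sanitizePrompt(
  prompt: string,
  env: Record<string, string | undefined> = process.env,
): string {
  const secrets = Object.entries(env)
    // Skip very short values (under 4 chars) to avoid masking common substrings.
    .filter((e): e is [string, string] =>
      typeof e[1] === "string" && e[1].length >= 4 && SECRET_NAME.test(e[0]))
    // Longest values first so a secret containing another is masked whole.
    .sort((x, y) => y[1].length - x[1].length);
  let out = prompt;
  for (const [name, value] of secrets) {
    out = out.split(value).join(`[REDACTED:${name}]`);
  }
  return out;
}
```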

Results Retention

Configure result export to keep an audit trail:

post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./audit/evaluations.jsonl
      include_full_bundle: true
      append: true
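With `append: true`, a JSON Lines file grows by one record per evaluation. The TypeScript sketch below shows the append-only pattern; `appendAudit`, `readAudit`, and the record fields are hypothetical and far smaller than a full results bundle.

```typescript
import { appendFileSync, mkdirSync, readFileSync } from "node:fs";
import * as path from "node:path";

// Hypothetical record shape; the real bundle contents are not shown here.
interface AuditRecord {
  timestamp: string;
  repo: string;
  passed: boolean;
}

function appendAudit(outputPath: string, record: AuditRecord): void {
  mkdirSync(path.dirname(outputPath), { recursive: true }); // e.g. ./audit/
  // One JSON object per line; append mode never rewrites earlier entries.
  appendFileSync(outputPath, JSON.stringify(record) + "\n", "utf8");
}

function readAudit(outputPath: string): AuditRecord[] {
  return readFileSync(outputPath, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as AuditRecord);
}
```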

Security Checklist

Use HTTPS for all repositories
Store secrets in environment variables
Review pre-execution scripts for safety
Limit webhook URLs to trusted endpoints
Set appropriate timeouts
Clean up workspaces regularly
Audit evaluation logs periodically
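Workspace cleanup can be automated. In this hypothetical TypeScript sketch, `run-*` directories under the workspace root are deleted when their modification time is older than a cutoff. The age threshold and the reliance on directory mtime (which changes only when entries are added or removed directly inside it) are assumptions, so adapt them to your retention policy.

```typescript
import { readdirSync, rmSync, statSync } from "node:fs";
import * as path from "node:path";

// Hypothetical pruner: removes run-* directories older than maxAgeDays.
function pruneWorkspaces(root: string, maxAgeDays: number, now = Date.now()): string[] {
  const cutoff = now - maxAgeDays * 24 * 60 * 60 * 1000;
  const removed: string[] = [];
  for (const entry of readdirSync(root, { withFileTypes: true })) {
    // Only touch directories that follow the run-{timestamp}-{hash} naming.
    if (!entry.isDirectory() || !entry.name.startsWith("run-")) continue;
    const dir = path.join(root, entry.name);
    if (statSync(dir).mtimeMs < cutoff) {
      rmSync(dir, { recursive: true, force: true });
      removed.push(entry.name);
    }
  }
  return removed;
}
```

Calling `pruneWorkspaces(".youbencha-workspace", 7)` from a scheduled job would keep one week of runs.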

Reporting Security Issues

Found a security vulnerability?

Email: [email protected]

Please include:

Description of the issue
Steps to reproduce
Potential impact