A flexible evaluation framework for running automated tests and benchmarks against Forge Code commands.

Before running evaluations, create a `forgee` symlink to the debug binary:

```bash
# Create symlink in your PATH (e.g., ~/bin or /usr/local/bin)
ln -sf /path/to/code-forge/target/debug/forge ~/forgee

# Or if ~/bin is in your PATH
ln -sf $(pwd)/target/debug/forge ~/bin/forgee
```

Why is this needed? Tasks execute in temporary directories, so relative paths like `../../target/debug/forge` won't work. The `forgee` symlink provides a stable absolute path that works from any directory.

```bash
# Run an evaluation
npm run eval ./evals/create_skill/task.yml

# Set custom log level
LOG_LEVEL=debug npm run eval ./evals/create_skill/task.yml
```

The evaluation system executes commands based on task definitions and validates their output. It supports:
- Parallel execution with configurable concurrency
- Timeout handling for long-running tasks
- Data-driven testing using CSV files
- Output validation using regex patterns
- Debug artifacts stored with timestamps

```
benchmarks/
├── cli.ts                   # Main CLI entry point
├── command-generator.ts     # Command template rendering
├── task-executor.ts         # Task execution with timeout support
├── model.ts                 # TypeScript types for tasks
├── parse.ts                 # CLI argument parsing
└── evals/                   # Evaluation definitions
    └── create_skill/
        ├── task.yml                 # Task definition
        ├── create_skill_tasks.csv   # Test data
        └── debug/                   # Debug outputs (timestamped)
```

To create a new evaluation, start by making a directory for it:

```bash
mkdir -p evals/my_eval
```

The task file defines how to run your evaluation:

```yaml
# Optional: Commands to run before evaluation starts
before_run:
  - cargo build
  - npm install

# Required: Command(s) to execute for each test case
# Single command
run: forgee -p '{{prompt}}'

# Or multiple commands (executed sequentially)
run:
  - echo "Step 1: {{task}}"
  - forgee -p '{{prompt}}'
  - echo "Step 2: Complete"

# Execution configuration
parallelism: 10   # Number of tasks to run in parallel (default: 1)
timeout: 60       # Timeout in seconds (optional)
early_exit: true  # Stop execution when validations pass (optional)

# Optional: Validations to run on output
validations:
  - name: "Check success message"
    type: regex
    regex: \[[0-9:]*\] Skill create-skill

# Required: Data sources for test cases
sources:
  - csv: my_tasks.csv
```

`before_run` (optional): Array of shell commands to execute before running tasks
- Runs sequentially before the main command execution
- Executes in a temporary directory created for the evaluation run
- Useful for building binaries or setting up dependencies

`run` (required): Command(s) to execute for each test case
- Can be a single string or an array of strings
- Commands support template placeholders (e.g., `{{variable}}`)
- Multiple commands are executed sequentially
- If any command fails, subsequent commands are skipped

`parallelism` (optional): Number of tasks to run concurrently (default: 1)

`timeout` (optional): Maximum execution time in seconds per task

`early_exit` (optional): Stop command execution when all validations pass

`validations` (optional): Array of validation rules
- `name`: Human-readable description
- `type`: Validation type. Supported values:
  - `regex`: Match output against a regular expression pattern
  - `shell`: Execute a shell command with output as stdin
- For `regex` type:
  - `regex`: Regular expression pattern to match in output
- For `shell` type:
  - `command`: Shell command to execute (receives task output via stdin)
  - `exit_code`: Expected exit code (default: 0)

`sources` (required): Array of data sources
- Currently supports CSV files: `- csv: filename.csv`
- Future: Command output: `- cmd: command`
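
Taken together, the fields above suggest a natural TypeScript shape for a task definition. Here is a sketch of what that might look like; the names are inferred from this reference, and the authoritative definitions live in model.ts:

```typescript
// Shapes inferred from the field reference above; the authoritative
// definitions live in model.ts and may differ.
type Validation =
  | { name: string; type: "regex"; regex: string }
  | { name: string; type: "shell"; command: string; exit_code?: number };

interface Task {
  before_run?: string[];    // shell commands run once before the tasks
  run: string | string[];   // command template(s), rendered per test case
  parallelism?: number;     // default: 1
  timeout?: number;         // seconds
  early_exit?: boolean;     // stop once all validations pass
  validations?: Validation[];
  sources: Array<{ csv: string }>;
}
```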
Create a CSV file with columns matching your template variables:

```csv
prompt,expected_output
"Create a backup script","backup.sh"
"Generate API client","api_client.py"
```

The column names (e.g., `prompt`, `expected_output`) become template variables you can use in your command with `{{column_name}}`.

Then run the evaluation:

```bash
npm run eval ./evals/my_eval/task.yml
```

Commands support Handlebars template syntax. Variables are populated from CSV columns:

```yaml
run: ./tool --input '{{input_file}}' --format {{format}} --verbose
```

With CSV:

```csv
input_file,format
data1.txt,json
data2.txt,yaml
```
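
Under the hood, each CSV row is rendered into the command template. A minimal sketch of that substitution using the handlebars package (the actual rendering lives in command-generator.ts):

```typescript
import Handlebars from "handlebars";

// One row from the CSV above.
const row = { input_file: "data1.txt", format: "json" };

// Compile the command template once, then render it per row.
const render = Handlebars.compile(
  "./tool --input '{{input_file}}' --format {{format}} --verbose"
);

console.log(render(row));
// => ./tool --input 'data1.txt' --format json --verbose
```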

Each run creates a timestamped debug directory:

```
evals/my_eval/debug/2025-11-23T10-30-45-123Z/
├── task-1.log
├── task-2.log
└── task-3.log
```
Each log file contains the full output (stdout + stderr) from the command execution.
Tasks can have four statuses:

- `passed`: Task completed successfully and passed all validations
- `validation_failed`: Task completed but failed one or more validations
- `timeout`: Task exceeded the timeout limit
- `failed`: Task execution failed (non-zero exit code)
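
Here is a sketch of how a single command execution might map onto these statuses, using Node's built-in exec timeout (the real logic lives in task-executor.ts; `validation_failed` is decided afterwards, once validations run against the output):

```typescript
import { exec } from "node:child_process";

// Hypothetical helper: run one command and classify the outcome.
// "validation_failed" is determined later, after validations run.
function runCommand(
  command: string,
  timeoutSeconds: number
): Promise<"passed" | "timeout" | "failed"> {
  return new Promise((resolve) => {
    exec(command, { timeout: timeoutSeconds * 1000 }, (error) => {
      if (error?.killed) return resolve("timeout"); // killed by the timeout
      if (error) return resolve("failed");          // non-zero exit code
      resolve("passed");
    });
  });
}
```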

Human-readable output (default):

```bash
npm run eval ./evals/my_eval/task.yml
```

Machine-readable JSON output:

```bash
LOG_JSON=1 npm run eval ./evals/my_eval/task.yml | jq .
```

Debug logging:

```bash
LOG_LEVEL=debug npm run eval ./evals/my_eval/task.yml
```

A minimal task that runs one command per CSV row:

```yaml
run: echo "Processing {{name}}"
sources:
  - csv: names.csv
```

names.csv:

```csv
name
Alice
Bob
Charlie
```

A task with parallelism and a timeout:

```yaml
run: ./slow_task --id {{task_id}}
parallelism: 5
timeout: 30
sources:
  - csv: tasks.csv
```

A task with multiple sequential commands:

```yaml
run:
  - echo "Starting task {{id}}"
  - ./process --input {{file}}
  - echo "Task {{id}} complete"
parallelism: 3
timeout: 120
sources:
  - csv: tasks.csv
```
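
The `parallelism` setting caps how many of these tasks are in flight at once. A minimal sketch of such a concurrency limiter (the real scheduling lives in task-executor.ts and may differ):

```typescript
// Run `tasks` with at most `limit` promises in flight at a time.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each worker repeatedly claims the next unstarted task.
  const worker = async () => {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  };

  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, worker)
  );
  return results;
}
```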

A task that validates output with shell commands and regex patterns:

```yaml
run: echo "{{message}}"
parallelism: 3
validations:
  # Using grep to check if output contains specific text
  - name: "Contains 'test' word"
    type: shell
    command: grep -q "test"
    exit_code: 0

  # Count words and ensure it's greater than 2
  - name: "More than 2 words"
    type: shell
    command: test $(wc -w | awk '{print $1}') -gt 2
    exit_code: 0

  # Traditional regex validation (for comparison)
  - name: "Contains test or validation"
    type: regex
    regex: "test|validation"

sources:
  - csv: messages.csv
```
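
A shell validation receives the task's captured output on stdin and passes when the command exits with the expected code. Here is a sketch of that mechanism, assuming Node's child_process (the executor's actual implementation may differ):

```typescript
import { spawn } from "node:child_process";

// Feed `output` to `command` on stdin; pass iff the exit code matches.
function runShellValidation(
  command: string,
  output: string,
  expectedExitCode = 0
): Promise<boolean> {
  return new Promise((resolve, reject) => {
    const child = spawn("sh", ["-c", command], {
      stdio: ["pipe", "ignore", "ignore"],
    });
    child.on("error", reject);
    child.on("close", (code) => resolve(code === expectedExitCode));
    // Ignore EPIPE if the command exits before reading all of stdin
    // (e.g., grep -q stops at the first match).
    child.stdin?.on("error", () => {});
    child.stdin?.end(output);
  });
}

// e.g. runShellValidation('grep -q "test"', taskOutput)
```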

A task that runs tests and validates the result:

```yaml
run: cargo test {{test_name}}
validations:
  - name: "All tests passed"
    type: regex
    regex: test result:\s+ok
sources:
  - csv: tests.csv
```
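
Regex validations are simpler: the pattern is tested against the captured output. A minimal sketch, assuming JavaScript RegExp semantics:

```typescript
// Pass when the pattern matches anywhere in the captured output.
function runRegexValidation(pattern: string, output: string): boolean {
  return new RegExp(pattern).test(output);
}

// e.g. runRegexValidation(String.raw`test result:\s+ok`, cargoTestOutput)
```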

A few best practices:

- Use quotes in commands: when passing CSV values with spaces, wrap them in quotes: `command: forge -p '{{prompt}}'`
- Build before running: use `before_run` to ensure binaries are up-to-date:

  ```yaml
  before_run:
    - cargo build
  ```

- Start with low parallelism: test with `parallelism: 1` first, then increase.
- Set appropriate timeouts to prevent hanging: `timeout: 60 # seconds`
- Check debug logs: when tasks fail, check the debug directory for full output: `cat evals/my_eval/debug/*/task-1.log`

The CLI exits with:
- `0`: All tasks passed
- `1`: One or more tasks failed (excluding timeouts and validation failures)
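
A sketch of that mapping; note that, per the rules above, only hard execution failures flip the exit code:

```typescript
type TaskStatus = "passed" | "validation_failed" | "timeout" | "failed";

// Only "failed" tasks (non-zero exit code) produce exit code 1;
// timeouts and validation failures are reported but do not.
function cliExitCode(statuses: TaskStatus[]): 0 | 1 {
  return statuses.some((status) => status === "failed") ? 1 : 0;
}
```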