# Running Evaluations

## Run an Evaluation
```sh
agentv eval evals/my-eval.yaml
```

Results are written to `.agentv/results/eval_<timestamp>.jsonl`.
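Because the results file is JSONL, each line is a standalone JSON record that can be processed with a few lines of Python. The field names in this sketch (`test_id`, `score`) are illustrative assumptions, not the documented schema; inspect your own output for the actual keys.

```python
import json

# Illustrative results-file contents; real records come from
# .agentv/results/eval_<timestamp>.jsonl and may use different field names.
sample = """\
{"test_id": "case-1", "score": 1.0}
{"test_id": "case-2", "score": 0.5}
"""

# Parse one JSON object per non-empty line.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]
for rec in records:
    print(rec["test_id"], rec["score"])
```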
## Common Options

### Override Target
Run against a different target than the one specified in the eval file:
```sh
agentv eval --target azure_base evals/**/*.yaml
```

### Run Specific Test
Run a single test by ID:
```sh
agentv eval --test-id case-123 evals/my-eval.yaml
```

### Dry Run
Test the harness flow with mock responses (does not call real providers):
```sh
agentv eval --dry-run evals/my-eval.yaml
```

### Output to Specific File
```sh
agentv eval evals/my-eval.yaml --out results/baseline.jsonl
```

### Trace Persistence
Persist full execution traces (tool calls, timing, spans) to disk for debugging and analysis:
```sh
agentv eval evals/my-eval.yaml --trace
```

Traces are written to `.agentv/traces/<timestamp>/<eval-file>.trace.jsonl` as JSONL records containing:
- `testId`: the test identifier
- `startTime` / `endTime`: execution boundaries
- `durationMs`: total execution duration
- `spans`: array of tool invocations with timing and input/output
- `tokenUsage` / `costUsd`: resource consumption
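A trace record with the fields listed above can be summarized in a few lines, for example to compare total duration against time spent inside tool calls. The internal shape of each span (`name`, a per-span `durationMs`) is an assumption for illustration, not a documented schema.

```python
import json

# A trace record shaped like the documented fields; span internals
# (name / durationMs per span) are assumed for this sketch.
trace_line = json.dumps({
    "testId": "case-123",
    "startTime": "2024-01-01T00:00:00Z",
    "endTime": "2024-01-01T00:00:08Z",
    "durationMs": 8000,
    "spans": [
        {"name": "read_file", "durationMs": 1200},
        {"name": "run_tests", "durationMs": 5300},
    ],
    "tokenUsage": 4096,
    "costUsd": 0.012,
})

record = json.loads(trace_line)
span_total = sum(s["durationMs"] for s in record["spans"])
overhead_ms = record["durationMs"] - span_total  # time spent outside tool calls
print(record["testId"], span_total, overhead_ms)
```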
## Workspace Cleanup

When using `workspace_template` or the `workspace` config block, a temporary workspace is created for each test. By default, workspaces are cleaned up on success and preserved on failure for debugging.
```sh
# Always keep workspaces (for debugging)
agentv eval evals/my-eval.yaml --keep-workspaces

# Always clean up workspaces (even on failure)
agentv eval evals/my-eval.yaml --cleanup-workspaces
```

Workspaces are stored at `~/.agentv/workspaces/<eval-run-id>/<test-id>/`.
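To see which preserved workspaces are still on disk after failed runs, you can glob the documented path layout. This is a generic `pathlib` sketch, not part of the CLI; a temporary directory stands in for `~/.agentv/workspaces` so the example is self-contained.

```python
import tempfile
from pathlib import Path

# The real root is ~/.agentv/workspaces/<eval-run-id>/<test-id>/ as documented
# above; a temp directory substitutes for it here.
root = Path(tempfile.mkdtemp())
(root / "run-20240101" / "case-123").mkdir(parents=True)

# Collect (eval-run-id, test-id) pairs for every workspace directory found.
leftovers = [(ws.parent.name, ws.name) for ws in root.glob("*/*") if ws.is_dir()]
print(leftovers)
```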
## Validate Before Running

Check eval files for schema errors without executing them:
```sh
agentv validate evals/my-eval.yaml
```

## Agent-Orchestrated Evals
Run evaluations without API keys by letting an external agent (e.g., Claude Code, Copilot CLI) orchestrate the eval pipeline.
### Overview

```sh
agentv prompt eval evals/my-eval.yaml
```

Outputs a step-by-step orchestration prompt listing all tests and the commands to run for each.
### Get Task Input

```sh
agentv prompt eval input evals/my-eval.yaml --test-id case-123
```

Returns JSON with:
- `input`: an `[{role, content}]` array. File references use absolute paths (`{type: "file", path: "/abs/path"}`) that the agent can read directly from the filesystem.
- `guideline_paths`: files containing additional instructions to prepend to the system message.
- `criteria`: grading criteria for the orchestrator's reference (do not pass these to the candidate).
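Before sending messages to the candidate, an orchestrating agent can expand each `{type: "file", path: ...}` reference in `input` into the file's contents. The helper below is a sketch of that step; the message shapes come from the list above, while the stubbed file reader is a placeholder for actually reading from the filesystem.

```python
import json
from pathlib import Path

def resolve_content(content, read_file=lambda p: Path(p).read_text()):
    """Replace a file reference with the file's text; pass plain content through."""
    if isinstance(content, dict) and content.get("type") == "file":
        return read_file(content["path"])
    return content

# Task input shaped like the documented fields; a stubbed reader keeps the
# sketch self-contained (no real file at /abs/prompt.md is needed).
task = json.loads(
    '{"input": [{"role": "user", '
    '"content": {"type": "file", "path": "/abs/prompt.md"}}]}'
)
messages = [
    {"role": m["role"],
     "content": resolve_content(m["content"],
                                read_file=lambda p: f"<contents of {p}>")}
    for m in task["input"]
]
print(messages[0]["content"])
```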
### Judge the Result

```sh
agentv prompt eval judge evals/my-eval.yaml --test-id case-123 --answer-file response.txt
```

Runs code judges deterministically and returns LLM judge prompts for the agent to execute. Each evaluator in the output has a status:
- `"completed"`: the score is final (e.g., a code judge). Read `result.score`.
- `"prompt_ready"`: LLM grading is required. Send `prompt.system_prompt` and `prompt.user_prompt` to your LLM and parse the JSON response.
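An orchestrating agent can loop over the evaluators and branch on `status` as described above. In this sketch, `call_llm` is a stub standing in for whatever model the agent uses, and the `"name"` key on each evaluator is an illustrative assumption.

```python
def call_llm(system_prompt, user_prompt):
    # Stub: a real orchestrator would send these prompts to its model
    # and parse the JSON grade out of the response.
    return {"score": 0.9}

def collect_scores(evaluators):
    """Gather a final score per evaluator using the two documented statuses."""
    scores = {}
    for ev in evaluators:
        if ev["status"] == "completed":
            scores[ev["name"]] = ev["result"]["score"]   # score is already final
        elif ev["status"] == "prompt_ready":
            graded = call_llm(ev["prompt"]["system_prompt"],
                              ev["prompt"]["user_prompt"])
            scores[ev["name"]] = graded["score"]
    return scores

# Evaluator entries shaped like the statuses documented above.
evaluators = [
    {"name": "code_judge", "status": "completed", "result": {"score": 1.0}},
    {"name": "llm_judge", "status": "prompt_ready",
     "prompt": {"system_prompt": "You are a grader.",
                "user_prompt": "Grade this answer."}},
]
print(collect_scores(evaluators))  # {'code_judge': 1.0, 'llm_judge': 0.9}
```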
## When to Use

| Scenario | Command |
|---|---|
| Have API keys, want end-to-end automation | `agentv eval` |
| No API keys, agent can act as the LLM | `agentv prompt` |
## All Options

Run `agentv eval --help` for the full list of options, including workers, timeouts, output formats, and trace dumping.