# Running Evaluations

## Run an Evaluation
```sh
agentv eval evals/my-eval.yaml
```

Results are written to `.agentv/results/eval_<timestamp>.jsonl`.
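Because the results file is JSONL, each line is a standalone JSON record that can be processed with a few lines of Python. The field names in this sketch (`test_id`, `score`) are illustrative assumptions, not the documented schema; inspect your own output for the actual keys.

```python
import json

# Illustrative results-file contents; real records come from
# .agentv/results/eval_<timestamp>.jsonl and may use different field names.
sample = """\
{"test_id": "case-1", "score": 1.0}
{"test_id": "case-2", "score": 0.5}
"""

# Parse one JSON object per non-empty line.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]
for rec in records:
    print(rec["test_id"], rec["score"])
```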
## Common Options

### Override Target
Run against a different target than the one specified in the eval file:
```sh
agentv eval --target azure_base evals/**/*.yaml
```

### Run Specific Test
Run a single test by ID:
```sh
agentv eval --test-id case-123 evals/my-eval.yaml
```

### Dry Run
Test the harness flow with mock responses (does not call real providers):
```sh
agentv eval --dry-run evals/my-eval.yaml
```

### Output to Specific File
```sh
agentv eval evals/my-eval.yaml --out results/baseline.jsonl
```

### Trace Persistence
Persist full execution traces (tool calls, timing, spans) to disk for debugging and analysis:
```sh
agentv eval evals/my-eval.yaml --trace
```

Traces are written to `.agentv/traces/<timestamp>/<eval-file>.trace.jsonl` as JSONL records containing:
- `testId`: the test identifier
- `startTime` / `endTime`: execution boundaries
- `durationMs`: total execution duration
- `spans`: array of tool invocations with timing and input/output
- `tokenUsage` / `costUsd`: resource consumption
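A trace record with the fields listed above can be summarized in a few lines, for example to compare total duration against time spent inside tool calls. The internal shape of each span (`name`, a per-span `durationMs`) is an assumption for illustration, not a documented schema.

```python
import json

# A trace record shaped like the documented fields; span internals
# (name / durationMs per span) are assumed for this sketch.
trace_line = json.dumps({
    "testId": "case-123",
    "startTime": "2024-01-01T00:00:00Z",
    "endTime": "2024-01-01T00:00:08Z",
    "durationMs": 8000,
    "spans": [
        {"name": "read_file", "durationMs": 1200},
        {"name": "run_tests", "durationMs": 5300},
    ],
    "tokenUsage": 4096,
    "costUsd": 0.012,
})

record = json.loads(trace_line)
span_total = sum(s["durationMs"] for s in record["spans"])
overhead_ms = record["durationMs"] - span_total  # time spent outside tool calls
print(record["testId"], span_total, overhead_ms)
```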
## Workspace Cleanup

When using `workspace_template` or the `workspace` config block, a temporary workspace is created for each test. By default, workspaces are cleaned up on success and preserved on failure for debugging.
```sh
# Always keep workspaces (for debugging)
agentv eval evals/my-eval.yaml --keep-workspaces

# Always clean up workspaces (even on failure)
agentv eval evals/my-eval.yaml --cleanup-workspaces
```

Workspaces are stored at `~/.agentv/workspaces/<eval-run-id>/<test-id>/`.
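To see which preserved workspaces are still on disk after failed runs, you can glob the documented path layout. This is a generic `pathlib` sketch, not part of the CLI; a temporary directory stands in for `~/.agentv/workspaces` so the example is self-contained.

```python
import tempfile
from pathlib import Path

# The real root is ~/.agentv/workspaces/<eval-run-id>/<test-id>/ as documented
# above; a temp directory substitutes for it here.
root = Path(tempfile.mkdtemp())
(root / "run-20240101" / "case-123").mkdir(parents=True)

# Collect (eval-run-id, test-id) pairs for every workspace directory found.
leftovers = [(ws.parent.name, ws.name) for ws in root.glob("*/*") if ws.is_dir()]
print(leftovers)
```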
## Validate Before Running

Check eval files for schema errors without executing them:
```sh
agentv validate evals/my-eval.yaml
```

## Agent-Orchestrated Evals
Run evaluations without API keys by letting an external agent (e.g., Claude Code, Copilot CLI) orchestrate the eval pipeline.
### Overview

```sh
agentv prompt eval evals/my-eval.yaml
```

Outputs a step-by-step orchestration prompt listing all tests and the commands to run for each.
### Get Task Input

```sh
agentv prompt eval input evals/my-eval.yaml --test-id case-123
```

Returns JSON with:
- `input`: an `[{role, content}]` array. File references use absolute paths (`{type: "file", path: "/abs/path"}`) that the agent can read directly from the filesystem.
- `guideline_paths`: files containing additional instructions to prepend to the system message.
- `criteria`: grading criteria for the orchestrator's reference (do not pass these to the candidate).
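Before sending messages to the candidate, an orchestrating agent can expand each `{type: "file", path: ...}` reference in `input` into the file's contents. The helper below is a sketch of that step; the message shapes come from the list above, while the stubbed file reader is a placeholder for actually reading from the filesystem.

```python
import json
from pathlib import Path

def resolve_content(content, read_file=lambda p: Path(p).read_text()):
    """Replace a file reference with the file's text; pass plain content through."""
    if isinstance(content, dict) and content.get("type") == "file":
        return read_file(content["path"])
    return content

# Task input shaped like the documented fields; a stubbed reader keeps the
# sketch self-contained (no real file at /abs/prompt.md is needed).
task = json.loads(
    '{"input": [{"role": "user", '
    '"content": {"type": "file", "path": "/abs/prompt.md"}}]}'
)
messages = [
    {"role": m["role"],
     "content": resolve_content(m["content"],
                                read_file=lambda p: f"<contents of {p}>")}
    for m in task["input"]
]
print(messages[0]["content"])
```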
### Judge the Result

```sh
agentv prompt eval judge evals/my-eval.yaml --test-id case-123 --answer-file response.txt
```

Runs code judges deterministically and returns LLM judge prompts for the agent to execute. Each evaluator in the output has a status:
- `"completed"`: the score is final (e.g., a code judge). Read `result.score`.
- `"prompt_ready"`: LLM grading is required. Send `prompt.system_prompt` and `prompt.user_prompt` to your LLM and parse the JSON response.
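An orchestrating agent can loop over the evaluators and branch on `status` as described above. In this sketch, `call_llm` is a stub standing in for whatever model the agent uses, and the `"name"` key on each evaluator is an illustrative assumption.

```python
def call_llm(system_prompt, user_prompt):
    # Stub: a real orchestrator would send these prompts to its model
    # and parse the JSON grade out of the response.
    return {"score": 0.9}

def collect_scores(evaluators):
    """Gather a final score per evaluator using the two documented statuses."""
    scores = {}
    for ev in evaluators:
        if ev["status"] == "completed":
            scores[ev["name"]] = ev["result"]["score"]   # score is already final
        elif ev["status"] == "prompt_ready":
            graded = call_llm(ev["prompt"]["system_prompt"],
                              ev["prompt"]["user_prompt"])
            scores[ev["name"]] = graded["score"]
    return scores

# Evaluator entries shaped like the statuses documented above.
evaluators = [
    {"name": "code_judge", "status": "completed", "result": {"score": 1.0}},
    {"name": "llm_judge", "status": "prompt_ready",
     "prompt": {"system_prompt": "You are a grader.",
                "user_prompt": "Grade this answer."}},
]
print(collect_scores(evaluators))  # {'code_judge': 1.0, 'llm_judge': 0.9}
```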
## When to Use

| Scenario | Command |
|---|---|
| Have API keys, want end-to-end automation | `agentv eval` |
| No API keys, agent can act as the LLM | `agentv prompt` |
## All Options

Run `agentv eval --help` for the full list of options, including workers, timeouts, output formats, and trace dumping.