Execution Metrics

AgentV provides built-in evaluators for checking execution metrics against thresholds. These are useful for enforcing efficiency constraints without writing custom code.

The execution_metrics evaluator provides declarative threshold-based checks on multiple metrics in a single evaluator.

assert:
  - name: efficiency
    type: execution_metrics
    max_tool_calls: 10             # Maximum tool invocations
    max_llm_calls: 5               # Maximum LLM calls (assistant messages)
    max_tokens: 5000               # Maximum total tokens (input + output)
    max_cost_usd: 0.05             # Maximum cost in USD
    max_duration_ms: 30000         # Maximum execution duration in ms
    target_exploration_ratio: 0.6  # Target ratio of read-only tool calls
    exploration_tolerance: 0.2     # Tolerance for ratio check (default: 0.2)

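  • Only specified thresholds are checked: omit fields you don't care about
  • Score is proportional: hits / (hits + misses), where a hit is a check that passed and a miss is one that failed
  • Missing data counts as a miss: if you set max_tokens but no token data is available, that check fails
  • All max_* thresholds are upper bounds: the measured value must be ≤ the specified threshold. The exploration ratio is instead compared against its target, within the tolerance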
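
The scoring rules above can be sketched in a few lines of Python. This is an illustration of the documented behavior, not AgentV's actual implementation: the function name score_thresholds, the dictionary layout, and the handling of an empty threshold set are all assumptions made for the example.

```python
from typing import Mapping, Optional


def score_thresholds(
    thresholds: Mapping[str, float],
    measured: Mapping[str, Optional[float]],
) -> float:
    """Return hits / (hits + misses) for a set of max_* thresholds.

    Hypothetical helper: both arguments are keyed by the threshold name.
    """
    hits = 0
    misses = 0
    for name, limit in thresholds.items():
        value = measured.get(name)
        # A check is a hit only when data exists and is within the limit.
        # Missing data (None or absent key) counts as a miss.
        if value is not None and value <= limit:
            hits += 1
        else:
            misses += 1
    total = hits + misses
    # Assumption: with no thresholds configured there is nothing to fail,
    # so this sketch returns 1.0. The docs do not specify this case.
    return hits / total if total else 1.0


# Three thresholds configured; token data is unavailable and cost is over budget.
thresholds = {"max_tool_calls": 10, "max_tokens": 5000, "max_cost_usd": 0.05}
measured = {"max_tool_calls": 7, "max_tokens": None, "max_cost_usd": 0.08}
score = score_thresholds(thresholds, measured)  # one hit out of three checks
```

Only the tool-call check passes here, so the score is one third: the missing token data and the exceeded cost each count as a miss.
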
| Option | Type | Description |
| --- | --- | --- |
| max_tool_calls | number | Maximum number of tool invocations |
| max_llm_calls | number | Maximum LLM calls (counts assistant messages) |
| max_tokens | number | Maximum total tokens (input + output combined) |
| max_cost_usd | number | Maximum cost in USD |
| max_duration_ms | number | Maximum execution duration in milliseconds |
| target_exploration_ratio | number | Target ratio of read-only tool calls (0-1) |
| exploration_tolerance | number | Tolerance around target ratio (default: 0.2) |

tests:
  - id: efficient-research
    criteria: Agent researches and summarizes efficiently
    input: Research the topic and provide a summary
    assert:
      - name: efficiency
        type: execution_metrics
        max_tool_calls: 15
        max_llm_calls: 5
        max_tokens: 8000
        max_cost_usd: 0.10
        max_duration_ms: 60000

Check that an agent maintains a good balance between reading (exploration) and writing (action):

assert:
  - name: exploration-balance
    type: execution_metrics
    target_exploration_ratio: 0.6  # 60% of tool calls should be read-only
    exploration_tolerance: 0.2     # Allow ±0.2 around the target (0.4 to 0.8)
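
As a rough sketch of what this check amounts to, the Python below computes the share of read-only tool calls and compares it with the target. It assumes the tolerance is an absolute distance from the target and that a run with no tool calls has no ratio data and therefore fails. The function name and the way calls are counted are hypothetical; how AgentV decides which tools are read-only is not covered here.

```python
def exploration_ratio_ok(
    read_only_calls: int,
    total_calls: int,
    target: float = 0.6,
    tolerance: float = 0.2,
) -> bool:
    """Return True when the read-only share of tool calls is near the target."""
    if total_calls == 0:
        # Assumption: no tool calls means no ratio data, which counts as a miss.
        return False
    ratio = read_only_calls / total_calls
    # A tiny epsilon keeps boundary values such as 0.8 from failing
    # because of floating-point rounding.
    return abs(ratio - target) <= tolerance + 1e-9


balanced = exploration_ratio_ok(7, 10)     # ratio 0.7, inside 0.4 to 0.8
write_heavy = exploration_ratio_ok(2, 10)  # ratio 0.2, below the band
read_only = exploration_ratio_ok(10, 10)   # ratio 1.0, above the band
```

With the values from the example above, any ratio between 0.4 and 0.8 passes. An agent that only reads (ratio 1.0) fails just as one that mostly writes does.
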

For simple single-threshold checks, AgentV also provides dedicated evaluators:

- name: speed
  type: latency
  max_ms: 5000

Fails if execution duration exceeds the threshold.

- name: budget
  type: cost
  max_usd: 0.10

Fails if execution cost exceeds the threshold.

- name: tokens
  type: token_usage
  max_total_tokens: 4000

Fails if total token usage exceeds the threshold.

| Scenario | Recommended Evaluator |
| --- | --- |
| Check multiple metrics at once | execution_metrics |
| Simple single-threshold check | latency, cost, or token_usage |
| Complex custom formulas | code_judge with custom script |

Execution metrics work well alongside semantic evaluators:

tests:
  - id: code-generation
    criteria: Generates correct, efficient code
    input: Write a sorting algorithm
    assert:
      # Semantic quality
      - name: quality
        type: llm_judge
        prompt: ./prompts/code-quality.md
      # Efficiency constraints
      - name: efficiency
        type: execution_metrics
        max_tool_calls: 10
        max_duration_ms: 30000