
# LLM Judges

LLM judges use a language model to evaluate agent responses against custom criteria defined in a prompt file.

Reference an LLM judge in your eval file:

```yaml
assert:
  - name: semantic_check
    type: llm_judge
    prompt: ./judges/correctness.md
```

The prompt file defines evaluation criteria and scoring guidelines. It can be a static markdown template or a dynamic TypeScript/JavaScript template.

Write evaluation instructions as markdown; `{{variable}}` placeholders are interpolated at render time:

```markdown
# Evaluation Criteria

Evaluate the candidate's response to the following question:

**Question:** {{question}}
**Criteria:** {{criteria}}
**Reference Answer:** {{reference_answer}}
**Candidate Answer:** {{answer}}

## Scoring

Score the response from 0.0 to 1.0 based on:

1. Correctness — does the answer match the expected outcome?
2. Completeness — does it address all parts of the question?
3. Clarity — is the response clear and well-structured?
```
| Variable | Source |
| --- | --- |
| `question` | First user message content |
| `criteria` | Test `criteria` field |
| `reference_answer` | Last expected message content |
| `answer` | Last candidate response content |
| `sidecar` | Test sidecar metadata |
| `rubrics` | Test rubrics (if defined) |
| `file_changes` | Unified diff of workspace file changes (when `workspace_template` is configured) |
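
To see where these values come from, consider a minimal test case (a sketch; the exact test-file layout may vary, and the comments map fields to the variables above):

```yaml
# Hypothetical test entry illustrating where judge variables come from.
- input: "What is the capital of France?"     # becomes {{question}}
  expected_output: "Paris is the capital."    # becomes {{reference_answer}}
  criteria: "Must name Paris explicitly."     # becomes {{criteria}}
```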

For dynamic prompt generation, use the `definePromptTemplate` function from `@agentv/eval`:

```ts
#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/eval';

export default definePromptTemplate((ctx) => {
  const rubric = ctx.config?.rubric as string | undefined;
  return `You are evaluating an AI assistant's response.
## Question
${ctx.question}
## Candidate Answer
${ctx.answer}
${ctx.referenceAnswer ? `## Reference Answer\n${ctx.referenceAnswer}` : ''}
${rubric ? `## Evaluation Criteria\n${rubric}` : ''}
Evaluate and provide a score from 0 to 1.`;
});
```
At evaluation time:

1. AgentV renders the prompt template with variables from the test.
2. The rendered prompt is sent to the judge target (configured in `targets.yaml`).
3. The LLM returns a structured evaluation with score, hits, misses, and reasoning.
4. Results are recorded in the output JSONL.
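
Conceptually, a recorded judge result is a single JSONL line along these lines (the field names here are illustrative, not a guaranteed schema; check your output files for the exact shape):

```json
{"name": "semantic_check", "type": "llm_judge", "score": 0.8, "hits": ["identifies the correct answer"], "misses": ["does not show working"], "reasoning": "Correct but incomplete."}
```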

When using TypeScript templates, configure them in YAML with optional config data passed to the script:

```yaml
assert:
  - name: custom-eval
    type: llm_judge
    prompt:
      script: [bun, run, ../prompts/custom-evaluator.ts]
      config:
        rubric: "Your rubric here"
        strictMode: true
```

The `config` object is available as `ctx.config` inside the template function.
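
For example, a template script might branch on the `strictMode` flag from the YAML above (a minimal sketch, assuming the context shape documented below):

```ts
#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/eval';

export default definePromptTemplate((ctx) => {
  // Values from the YAML `config:` block arrive untyped; narrow them here.
  const rubric = ctx.config?.rubric as string | undefined;
  const strict = ctx.config?.strictMode === true;

  return [
    `Evaluate the candidate answer to: ${ctx.question}`,
    `Candidate answer: ${ctx.answer}`,
    rubric ? `Rubric:\n${rubric}` : '',
    strict ? 'Apply the rubric strictly: any unmet criterion caps the score at 0.5.' : '',
    'Return a score from 0 to 1.',
  ].filter(Boolean).join('\n\n');
});
```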

TypeScript templates receive a context object with these fields:

| Field | Type | Description |
| --- | --- | --- |
| `question` | `string` | First user message content |
| `answer` | `string` | Last entry in `output` |
| `referenceAnswer` | `string` | Last entry in `expected_output` |
| `criteria` | `string` | Test `criteria` field |
| `expectedOutput` | `Message[]` | Full resolved expected output |
| `output` | `Message[]` | Full provider output messages |
| `trace` | `TraceSummary` | Execution metrics summary |
| `config` | `object` | Custom config from YAML |
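
Because `output` and `expectedOutput` are full message arrays, a template can show the judge the whole transcript rather than only the final answer. A sketch (assuming each `Message` exposes `role` and `content` string fields, which may differ in practice):

```ts
#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/eval';

export default definePromptTemplate((ctx) => {
  // Flatten every provider message into a readable transcript.
  const transcript = ctx.output
    .map((m) => `${m.role}: ${m.content}`)
    .join('\n');

  return `Score the assistant's final answer in the context of the full transcript.

## Transcript
${transcript}

## Reference Answer
${ctx.referenceAnswer}

Provide a score from 0 to 1.`;
});
```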

Template variables are derived internally through three layers:

What users write in YAML or JSONL:

- `input` — two syntaxes for the same data: a string shorthand or a full message array. `input: "What is 2+2?"` expands to `[{ role: "user", content: "What is 2+2?" }]`. If a test supplies both forms, `input` takes precedence.
- `expected_output` — two syntaxes for the same data: a string shorthand or a full message array. `expected_output: "4"` expands to `[{ role: "assistant", content: "4" }]`. If a test supplies both forms, `expected_output` takes precedence.

After parsing, canonical message arrays replace the shorthand fields:

- `input: TestMessage[]` — canonical resolved input
- `expected_output: TestMessage[]` — canonical resolved expected output

At this layer, the string shorthands no longer exist; `input` and `expected_output` are always message arrays.

Derived strings injected into evaluator prompts:

| Variable | Derivation |
| --- | --- |
| `question` | Content of the first `user`-role entry in `input` |
| `criteria` | Passed through from the test field |
| `reference_answer` | Content of the last entry in `expected_output` |
| `answer` | Content of the last entry in `output` |
| `input` | Full resolved input array, JSON-serialized |
| `expected_output` | Full resolved expected array, JSON-serialized |
| `output` | Full provider output array, JSON-serialized |
| `file_changes` | Unified diff of workspace file changes (when `workspace_template` is configured) |
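
The JSON-serialized variables let a markdown template hand the judge the complete conversation rather than only the extracted strings, for example:

```markdown
## Full Input (JSON)
{{input}}

## Full Provider Output (JSON)
{{output}}
```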

Example flow:

```yaml
# User writes:
input: "What is 2+2?"
expected_output: "The answer is 4"

# Resolved:
input: [{ role: "user", content: "What is 2+2?" }]
expected_output: [{ role: "assistant", content: "The answer is 4" }]

# Derived template variables:
question: "What is 2+2?"
reference_answer: "The answer is 4"
answer: (extracted from provider output at runtime)
```