
# LLM Judges

LLM judges use a language model to evaluate agent responses against custom criteria defined in a prompt file.

Reference an LLM judge in your eval file:

```yaml
assert:
  - name: semantic_check
    type: llm_judge
    prompt: ./judges/correctness.md
```

The prompt file defines evaluation criteria and scoring guidelines. It can be a static markdown template or a dynamic TypeScript/JavaScript template.

Write evaluation instructions as markdown; `{{variable}}` placeholders are interpolated at render time:

```markdown
# Evaluation Criteria

Evaluate the candidate's response to the following question:

**Question:** {{question}}
**Criteria:** {{criteria}}
**Reference Answer:** {{reference_answer}}
**Candidate Answer:** {{answer}}

## Scoring

Score the response from 0.0 to 1.0 based on:

1. Correctness — does the answer match the expected outcome?
2. Completeness — does it address all parts of the question?
3. Clarity — is the response clear and well-structured?
```
| Variable | Source |
| --- | --- |
| `question` | First user message content |
| `criteria` | Test `criteria` field |
| `reference_answer` | Last expected message content |
| `answer` | Last candidate response content |
| `sidecar` | Test sidecar metadata |
| `rubrics` | Test rubrics (if defined) |
| `file_changes` | Unified diff of workspace file changes (when `workspace_template` is configured) |
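
To see where these values come from, consider a minimal test case (a sketch; the exact test-file layout may vary, and the comments map fields to the variables above):

```yaml
# Hypothetical test entry illustrating where judge variables come from.
- input: "What is the capital of France?"     # becomes {{question}}
  expected_output: "Paris is the capital."    # becomes {{reference_answer}}
  criteria: "Must name Paris explicitly."     # becomes {{criteria}}
```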

For dynamic prompt generation, use the `definePromptTemplate` function from `@agentv/eval`:

```ts
#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/eval';

export default definePromptTemplate((ctx) => {
  const rubric = ctx.config?.rubric as string | undefined;
  return `You are evaluating an AI assistant's response.
## Question
${ctx.question}
## Candidate Answer
${ctx.answer}
${ctx.referenceAnswer ? `## Reference Answer\n${ctx.referenceAnswer}` : ''}
${rubric ? `## Evaluation Criteria\n${rubric}` : ''}
Evaluate and provide a score from 0 to 1.`;
});
```
At evaluation time:

1. AgentV renders the prompt template with variables from the test.
2. The rendered prompt is sent to the judge target (configured in `targets.yaml`).
3. The LLM returns a structured evaluation with score, hits, misses, and reasoning.
4. Results are recorded in the output JSONL.
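
Conceptually, a recorded judge result is a single JSONL line along these lines (the field names here are illustrative, not a guaranteed schema; check your output files for the exact shape):

```json
{"name": "semantic_check", "type": "llm_judge", "score": 0.8, "hits": ["identifies the correct answer"], "misses": ["does not show working"], "reasoning": "Correct but incomplete."}
```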

When using TypeScript templates, configure them in YAML with optional config data passed to the script:

```yaml
assert:
  - name: custom-eval
    type: llm_judge
    prompt:
      script: [bun, run, ../prompts/custom-evaluator.ts]
      config:
        rubric: "Your rubric here"
        strictMode: true
```

The `config` object is available as `ctx.config` inside the template function.
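
For example, a template script might branch on the `strictMode` flag from the YAML above (a minimal sketch, assuming the context shape documented below):

```ts
#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/eval';

export default definePromptTemplate((ctx) => {
  // Values from the YAML `config:` block arrive untyped; narrow them here.
  const rubric = ctx.config?.rubric as string | undefined;
  const strict = ctx.config?.strictMode === true;

  return [
    `Evaluate the candidate answer to: ${ctx.question}`,
    `Candidate answer: ${ctx.answer}`,
    rubric ? `Rubric:\n${rubric}` : '',
    strict ? 'Apply the rubric strictly: any unmet criterion caps the score at 0.5.' : '',
    'Return a score from 0 to 1.',
  ].filter(Boolean).join('\n\n');
});
```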

TypeScript templates receive a context object with these fields:

| Field | Type | Description |
| --- | --- | --- |
| `question` | `string` | First user message content |
| `answer` | `string` | Last entry in `output` |
| `referenceAnswer` | `string` | Last entry in `expected_output` |
| `criteria` | `string` | Test `criteria` field |
| `expectedOutput` | `Message[]` | Full resolved expected output |
| `output` | `Message[]` | Full provider output messages |
| `trace` | `TraceSummary` | Execution metrics summary |
| `config` | `object` | Custom config from YAML |
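
Because `output` and `expectedOutput` are full message arrays, a template can show the judge the whole transcript rather than only the final answer. A sketch (assuming each `Message` exposes `role` and `content` string fields, which may differ in practice):

```ts
#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/eval';

export default definePromptTemplate((ctx) => {
  // Flatten every provider message into a readable transcript.
  const transcript = ctx.output
    .map((m) => `${m.role}: ${m.content}`)
    .join('\n');

  return `Score the assistant's final answer in the context of the full transcript.

## Transcript
${transcript}

## Reference Answer
${ctx.referenceAnswer}

Provide a score from 0 to 1.`;
});
```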

Template variables are derived internally through three layers:

What users write in YAML or JSONL:

- `input` — two syntaxes for the same data: a string shorthand or a full message array. `input: "What is 2+2?"` expands to `[{ role: "user", content: "What is 2+2?" }]`. If a test supplies both forms, `input` takes precedence.
- `expected_output` — two syntaxes for the same data: a string shorthand or a full message array. `expected_output: "4"` expands to `[{ role: "assistant", content: "4" }]`. If a test supplies both forms, `expected_output` takes precedence.

After parsing, canonical message arrays replace the shorthand fields:

- `input: TestMessage[]` — canonical resolved input
- `expected_output: TestMessage[]` — canonical resolved expected output

At this layer, the string shorthands no longer exist; `input` and `expected_output` are always message arrays.

Derived strings injected into evaluator prompts:

| Variable | Derivation |
| --- | --- |
| `question` | Content of the first `user`-role entry in `input` |
| `criteria` | Passed through from the test field |
| `reference_answer` | Content of the last entry in `expected_output` |
| `answer` | Content of the last entry in `output` |
| `input` | Full resolved input array, JSON-serialized |
| `expected_output` | Full resolved expected array, JSON-serialized |
| `output` | Full provider output array, JSON-serialized |
| `file_changes` | Unified diff of workspace file changes (when `workspace_template` is configured) |
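
The JSON-serialized variables let a markdown template hand the judge the complete conversation rather than only the extracted strings, for example:

```markdown
## Full Input (JSON)
{{input}}

## Full Provider Output (JSON)
{{output}}
```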

Example flow:

```yaml
# User writes:
input: "What is 2+2?"
expected_output: "The answer is 4"

# Resolved:
input: [{ role: "user", content: "What is 2+2?" }]
expected_output: [{ role: "assistant", content: "The answer is 4" }]

# Derived template variables:
question: "What is 2+2?"
reference_answer: "The answer is 4"
answer: (extracted from provider output at runtime)
```