# Custom Evaluators
AgentV supports multiple evaluator types that can be combined for comprehensive evaluation.
## Evaluator Types

| Type | Description | Use Case |
| --- | --- | --- |
| `code_judge` | Deterministic script (Python/TS/any) | Exact matching, format validation, programmatic checks |
| `llm_judge` | LLM-based evaluation with custom prompt | Semantic evaluation, nuance, subjective quality |
| `rubrics` | Structured rubric evaluator via `assert` | Multi-criterion grading with weights |
## Referencing Evaluators

Evaluators are configured with `assert`, either at the top level (applying to all tests) or per test:
### Top-Level (Default for All Tests)

```yaml
description: My evaluation
assert:
  - name: correctness
    type: llm_judge
    prompt: ./judges/correctness.md

tests:
  - id: test-1 # Uses the top-level evaluator
    ...
```

### Per-Case Override
```yaml
tests:
  - id: test-1
    criteria: Returns valid JSON
    input: Generate a JSON config
    assert:
      - name: json_check
        type: code_judge
        script: ./validators/check_json.py
```
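This page doesn't spell out the `code_judge` script contract. As a minimal sketch, assuming the script receives the candidate output on stdin and signals pass/fail through its exit code (verify the actual interface before relying on this), `check_json.py` could be as simple as:

```python
# check_json.py -- illustrative sketch of a code_judge validator.
# Assumptions (not confirmed by this page): the candidate output
# arrives on stdin, and a zero/nonzero exit code signals pass/fail.
import json
import sys


def main() -> int:
    raw = sys.stdin.read()
    try:
        json.loads(raw)  # any syntactically valid JSON passes
    except json.JSONDecodeError as err:
        print(f"invalid JSON: {err}", file=sys.stderr)
        return 1  # fail
    return 0  # pass


if __name__ == "__main__":
    sys.exit(main())
```

Whatever the real contract is, keeping the script a plain stdin/exit-code filter makes it trivial to smoke-test by hand (`echo '{"ok": true}' | python check_json.py`), in line with the last bullet under Best Practices below.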
### Combining Evaluators

Use multiple evaluators on the same case for comprehensive scoring:
```yaml
tests:
  - id: code-generation
    criteria: Generates correct Python code
    input: Write a sorting function
    assert:
      - type: rubrics
        criteria:
          - Code is syntactically valid
          - Handles edge cases (empty list, single element)
          - Uses appropriate algorithm
      - name: syntax_check
        type: code_judge
        script: ./validators/check_syntax.py
      - name: quality_review
        type: llm_judge
        prompt: ./judges/code_quality.md
```

Each evaluator produces its own score. Results appear in `scores[]` in the output JSONL.
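Only the `scores[]` field is documented here. Purely as an illustration of the general shape (the surrounding field names are assumptions, not part of AgentV's documented output), a result line for the case above might resemble:

```json
{"id": "code-generation", "score": 0.83, "scores": [{"name": "rubrics", "score": 0.9}, {"name": "syntax_check", "score": 1.0}, {"name": "quality_review", "score": 0.6}]}
```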
When a test has multiple evaluators in `assert`, its overall score is the weighted mean of their individual scores:

```
final_score = sum(score_i * weight_i) / sum(weight_i)
```

If `weight` is omitted, it defaults to `1.0` (equal weighting).
If any evaluator has `required: true` (or `required: <threshold>`) and scores below its required threshold, the overall test score is forced to `0`.
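Taken together, the two rules amount to something like the following plain-Python sketch. The dict shape, and the default threshold implied by `required: true`, are assumptions for illustration; this is not AgentV's actual implementation:

```python
# Sketch of the scoring rules above. The result-dict shape and the
# default threshold for `required: true` are illustrative assumptions.
DEFAULT_REQUIRED_THRESHOLD = 1.0  # hypothetical; the page doesn't state it


def aggregate(results: list[dict]) -> float:
    # Rule 2: any required evaluator scoring below its threshold zeroes the test.
    for r in results:
        required = r.get("required")
        if required is None or required is False:
            continue
        threshold = DEFAULT_REQUIRED_THRESHOLD if required is True else float(required)
        if r["score"] < threshold:
            return 0.0
    # Rule 1: weighted mean, with weight defaulting to 1.0.
    total_weight = sum(r.get("weight", 1.0) for r in results)
    return sum(r["score"] * r.get("weight", 1.0) for r in results) / total_weight


# Example: scores 0.9 and 0.5 with weights 2 and 1 -> (0.9*2 + 0.5*1) / 3 = 0.766...
print(aggregate([{"score": 0.9, "weight": 2}, {"score": 0.5}]))
```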
## Best Practices

- Use code judges for deterministic checks: exact value matching, format validation, schema compliance
- Use LLM judges for semantic evaluation: meaning, quality, helpfulness
- Use rubrics for structured multi-criteria grading: when you need weighted, itemized scoring
- Combine evaluator types for comprehensive coverage
- Test code judges locally before running full evaluations