# Custom Evaluators
AgentV supports multiple evaluator types that can be combined for comprehensive evaluation.
## Evaluator Types

| Type | Description | Use Case |
| --- | --- | --- |
| `code_judge` | Deterministic script (Python/TS/any) | Exact matching, format validation, programmatic checks |
| `llm_judge` | LLM-based evaluation with custom prompt | Semantic evaluation, nuance, subjective quality |
| `rubrics` | Structured rubric evaluator via `assert` | Multi-criterion grading with weights |
## Referencing Evaluators

Evaluators are configured with `assert`, either at the top level (applying to all tests) or per test:
### Top-Level (Default for All Tests)

```yaml
description: My evaluation
assert:
  - name: correctness
    type: llm_judge
    prompt: ./judges/correctness.md

tests:
  - id: test-1 # Uses the top-level evaluator
    ...
```

### Per-Case Override
```yaml
tests:
  - id: test-1
    criteria: Returns valid JSON
    input: Generate a JSON config
    assert:
      - name: json_check
        type: code_judge
        script: ./validators/check_json.py
```
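This page doesn't spell out the `code_judge` script contract. As a minimal sketch, assuming the script receives the candidate output on stdin and signals pass/fail through its exit code (verify the actual interface before relying on this), `check_json.py` could be as simple as:

```python
# check_json.py -- illustrative sketch of a code_judge validator.
# Assumptions (not confirmed by this page): the candidate output
# arrives on stdin, and a zero/nonzero exit code signals pass/fail.
import json
import sys


def main() -> int:
    raw = sys.stdin.read()
    try:
        json.loads(raw)  # any syntactically valid JSON passes
    except json.JSONDecodeError as err:
        print(f"invalid JSON: {err}", file=sys.stderr)
        return 1  # fail
    return 0  # pass


if __name__ == "__main__":
    sys.exit(main())
```

Whatever the real contract is, keeping the script a plain stdin/exit-code filter makes it trivial to smoke-test by hand (`echo '{"ok": true}' | python check_json.py`), in line with the last bullet under Best Practices below.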
### Combining Evaluators

Use multiple evaluators on the same case for comprehensive scoring:
```yaml
tests:
  - id: code-generation
    criteria: Generates correct Python code
    input: Write a sorting function
    assert:
      - type: rubrics
        criteria:
          - Code is syntactically valid
          - Handles edge cases (empty list, single element)
          - Uses appropriate algorithm
      - name: syntax_check
        type: code_judge
        script: ./validators/check_syntax.py
      - name: quality_review
        type: llm_judge
        prompt: ./judges/code_quality.md
```

Each evaluator produces its own score. Results appear in `scores[]` in the output JSONL.
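Only the `scores[]` field is documented here. Purely as an illustration of the general shape (the surrounding field names are assumptions, not part of AgentV's documented output), a result line for the case above might resemble:

```json
{"id": "code-generation", "score": 0.83, "scores": [{"name": "rubrics", "score": 0.9}, {"name": "syntax_check", "score": 1.0}, {"name": "quality_review", "score": 0.6}]}
```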
When a test has multiple evaluators in `assert`, its overall score is the weighted mean of their individual scores:

```
final_score = sum(score_i * weight_i) / sum(weight_i)
```

If `weight` is omitted, it defaults to `1.0` (equal weighting).
If any evaluator has `required: true` (or `required: <threshold>`) and scores below its required threshold, the overall test score is forced to `0`.
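Taken together, the two rules amount to something like the following plain-Python sketch. The dict shape, and the default threshold implied by `required: true`, are assumptions for illustration; this is not AgentV's actual implementation:

```python
# Sketch of the scoring rules above. The result-dict shape and the
# default threshold for `required: true` are illustrative assumptions.
DEFAULT_REQUIRED_THRESHOLD = 1.0  # hypothetical; the page doesn't state it


def aggregate(results: list[dict]) -> float:
    # Rule 2: any required evaluator scoring below its threshold zeroes the test.
    for r in results:
        required = r.get("required")
        if required is None or required is False:
            continue
        threshold = DEFAULT_REQUIRED_THRESHOLD if required is True else float(required)
        if r["score"] < threshold:
            return 0.0
    # Rule 1: weighted mean, with weight defaulting to 1.0.
    total_weight = sum(r.get("weight", 1.0) for r in results)
    return sum(r["score"] * r.get("weight", 1.0) for r in results) / total_weight


# Example: scores 0.9 and 0.5 with weights 2 and 1 -> (0.9*2 + 0.5*1) / 3 = 0.766...
print(aggregate([{"score": 0.9, "weight": 2}, {"score": 0.5}]))
```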
## Best Practices

- Use code judges for deterministic checks: exact value matching, format validation, schema compliance
- Use LLM judges for semantic evaluation: meaning, quality, helpfulness
- Use rubrics for structured multi-criteria grading: when you need weighted, itemized scoring
- Combine evaluator types for comprehensive coverage
- Test code judges locally before running full evaluations