LLM-as-judge

An evaluation pattern in which a language model scores or labels outputs, often those produced by other LLMs. It scales the grading of failure modes and task success, but the judge model is itself subject to biases and calibration issues.
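
A minimal sketch of the pattern, assuming a hypothetical call_model callable that wraps whatever LLM client is in use; the rubric wording and the 1-5 JSON verdict format are illustrative, not a fixed standard:

```python
import json

def judge(candidate_output: str, task_description: str, call_model) -> dict:
    """Ask a judge model to grade one candidate output.

    `call_model` is a hypothetical callable that sends a prompt string to an
    LLM and returns its text response; swap in any client you already use.
    """
    rubric = (
        "You are grading another model's answer.\n"
        f"Task: {task_description}\n"
        f"Candidate answer:\n{candidate_output}\n\n"
        "Return JSON with keys 'score' (integer 1-5) and 'reason' (one sentence)."
    )
    raw = call_model(rubric)
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        # Judges do not always follow the requested format; mark as ungradable.
        verdict = {"score": None, "reason": "unparseable judge response"}
    return verdict
```

Asking for a structured verdict rather than free text makes it easier to aggregate scores and to spot-check the judge against human labels, which is how the biases and calibration issues mentioned above are usually caught.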
