We recommend building judges from the following components and properties:
| Component | Description | Example | Guidance |
| --- | --- | --- | --- |
| Model | A strong, instruction-tuned LLM. | GPT-5.4-mini, Gemma-4-31B. | Do not overthink the choice of model. Most modern LLMs are strong instruction-followers, so any foundation model of sufficient size (we recommend at least 30B total parameters as of this writing) should be able to handle a well-defined judge task. Choose something within the latency and cost budget your application requires. |
| Context | Typically a strong system prompt, with no fine-tuning. | “You are evaluating the outputs of another AI model. Your job is to determine if it helped the customer return their order successfully. Evaluate based on three components…” | We recommend against manual prompt engineering to build judges. Use human annotations and an automated prompt optimization tool to automatically build a strong system prompt for the judge model you have selected. |
| Input | If used for evals, typically a single user conversation with the model, including inputs and outputs, or an agent trace. If used for other purposes, typically one record of the unstructured or semi-structured data being analyzed. | User: “Can you help me return order ABC12345?” Model: “I would be happy to help. Can you provide confirmation of delivery and the address it was delivered to?” | Make sure to provide all necessary information to a judge, and do not hide evidence that would be useful in making a decision. You can optionally supplement a judge with web search or other external grounding tools, but these can be hard to audit and highly variable in pulling in necessary information. |
| Output Schema | A decision label, ideally binary or ternary, and a rationale. | `{"rationale": "The model asked for all three required components to assist the user with their return.", "label": "pass"}` | Provide an output schema with rationale first, then label second. Frame the task as a single-label classification problem with as few options as possible. Binary or ternary label sets are ideal. Avoid numerical scores when possible; if needed, use a 1-5 Likert scale. Do not ask the model for a confidence score. |
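
The input and output-schema guidance above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `ALLOWED_LABELS`, `JudgeVerdict`, `format_judge_input`, and `parse_verdict` are hypothetical, the binary `pass`/`fail` label set is an assumption, and the call to the judge model itself is deliberately left out. The sketch only shows how a conversation might be rendered for the judge and how a raw JSON reply might be checked for "rationale first, label second, no extra fields".

```python
import json
from dataclasses import dataclass

# Assumed binary label set; a ternary set would add one more entry.
ALLOWED_LABELS = ("pass", "fail")


@dataclass(frozen=True)
class JudgeVerdict:
    """Parsed judge output: a rationale followed by a single label."""
    rationale: str
    label: str


def format_judge_input(turns):
    """Render (speaker, text) pairs as one plain-text conversation record."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)


def parse_verdict(raw: str) -> JudgeVerdict:
    """Validate a raw JSON reply against the rationale-then-label schema.

    Raises ValueError if the reply is not a JSON object, if the keys are
    not exactly ["rationale", "label"] in that order (which also rejects
    extra fields such as a confidence score), or if the label is unknown.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    # json.loads preserves key order, so this checks ordering as well.
    if list(data.keys()) != ["rationale", "label"]:
        raise ValueError("expected exactly the keys 'rationale' then 'label'")

    rationale, label = data["rationale"], data["label"]
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label must be one of {ALLOWED_LABELS}")

    return JudgeVerdict(rationale=rationale, label=label)


if __name__ == "__main__":
    conversation = format_judge_input([
        ("User", "Can you help me return order ABC12345?"),
        ("Model", "I would be happy to help. Can you provide confirmation "
                  "of delivery and the address it was delivered to?"),
    ])
    print(conversation)

    reply = (
        '{"rationale": "The model asked for all three required components '
        'to assist the user with their return.", "label": "pass"}'
    )
    print(parse_verdict(reply))
```

Rejecting replies with unexpected keys is one way to enforce the "no confidence score" recommendation at parse time; whether to fail hard or retry the judge on a malformed reply is a design choice this sketch does not address.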