- Did the agent complete the user’s requested task?
- Did the response contain unsupported claims?
- Did the system follow the required escalation policy?
- Did the extraction output include the required fields?
- Did the workflow fail because of missing context, incorrect tool use, or faulty model reasoning?
Evals
Where to Start
How to choose a first eval that is narrow enough to build and useful enough to matter.
Start with the smallest eval that can change a decision. For most teams, that means defining a narrow reliability question and building a repeatable way to measure it.
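As a rough illustration of what "a narrow reliability question and a repeatable way to measure it" can look like, here is a minimal sketch for one such question: does an extraction output include its required fields? The field names, the `EvalResult` type, and the sample outputs are all hypothetical and stand in for whatever your own system produces. Treat it as a starting shape, not a recommended harness.

```python
from dataclasses import dataclass

# Hypothetical required fields for an invoice-extraction task (an assumption,
# not something the source specifies).
REQUIRED_FIELDS = ("invoice_id", "total", "due_date")


@dataclass
class EvalResult:
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Guard against an empty dataset so the eval never divides by zero.
        return self.passed / self.total if self.total else 0.0


def has_required_fields(output: dict) -> bool:
    """Return True when every required field is present and non-empty."""
    return all(output.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def run_eval(outputs: list[dict]) -> EvalResult:
    """Score a fixed set of outputs against the single yes/no question."""
    passed = sum(has_required_fields(o) for o in outputs)
    return EvalResult(passed=passed, total=len(outputs))


# Illustrative sample outputs; in practice these would come from saved runs.
sample_outputs = [
    {"invoice_id": "A-1", "total": 120.0, "due_date": "2024-05-01"},
    {"invoice_id": "A-2", "total": 75.5},                            # missing due_date
    {"invoice_id": "", "total": 10.0, "due_date": "2024-06-01"},     # empty invoice_id
    {"invoice_id": "A-4", "total": 0, "due_date": "2024-07-01"},     # zero is a valid value
]

result = run_eval(sample_outputs)
print(f"{result.passed}/{result.total} passed ({result.pass_rate:.0%})")
```

Because the check is deterministic and runs over a fixed set of saved outputs, the same pass rate can be recomputed after every prompt or model change. Comparing it against a threshold the team agrees on in advance is one way to let the number actually change a decision.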
Examples: