Where to Start - Sutro Handbook

Start with the smallest eval that can change a decision. For most teams, that means defining a narrow reliability question and building a repeatable way to measure it. Examples:

Did the agent complete the user’s requested task?
Did the response contain unsupported claims?
Did the system follow the required escalation policy?
Did the extraction output include the required fields?
Did the workflow fail because of missing context, bad tool use, or faulty model reasoning?
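
As an illustration, the extraction question above can be turned into a small, repeatable measurement. This is a minimal sketch, not a prescribed implementation: the field names, the sample data, and the helpers `check_required_fields` and `pass_rate` are all hypothetical and chosen only for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical field names; substitute the ones your extraction task requires.
REQUIRED_FIELDS = ("invoice_id", "total", "due_date")


@dataclass
class EvalResult:
    sample_id: str
    passed: bool
    missing: list[str]


def check_required_fields(sample_id: str, output: dict,
                          required: tuple = REQUIRED_FIELDS) -> EvalResult:
    """Pass only if every required field is present and non-empty."""
    # A field counts as missing when it is absent, None, or an empty string.
    missing = [f for f in required if output.get(f) in (None, "")]
    return EvalResult(sample_id, not missing, missing)


def pass_rate(results: list[EvalResult]) -> float:
    """Fraction of samples that passed; 0.0 for an empty run."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


# Made-up extraction outputs keyed by sample id, for illustration only.
SAMPLES = {
    "s1": {"invoice_id": "A-17", "total": 120.5, "due_date": "2024-07-01"},
    "s2": {"invoice_id": "A-18", "total": None, "due_date": "2024-07-03"},
    "s3": {"invoice_id": "A-19", "due_date": ""},
}

results = [check_required_fields(sid, out) for sid, out in SAMPLES.items()]
rate = pass_rate(results)

for r in results:
    print(r.sample_id, "PASS" if r.passed else f"FAIL missing={r.missing}")
print(f"pass rate: {rate:.2f}")
```

Because the check is deterministic, rerunning it on the same samples gives the same pass rate, so a change in the number reflects a change in the system rather than noise in the measurement.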

Once the first measurement is useful, expand coverage by adding more task-specific checks, judge-backed labels, and production samples.
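
One way to picture this expansion is a small registry of named checks run over a mix of curated and production samples. The sketch below is an assumption-laden illustration: the sample schema (`source`, `needs_escalation`, `escalated`, `judge_label`), the check names, and `run_checks` are hypothetical, and the judge-backed check simply reads a precomputed label where a real judge call would go.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]
Check = Callable[[Sample], bool]


def followed_escalation_policy(sample: Sample) -> bool:
    # Passes unless escalation was required and did not happen.
    return (not sample["needs_escalation"]) or bool(sample["escalated"])


def judge_found_no_unsupported_claims(sample: Sample) -> bool:
    # Stand-in for a judge-backed label: reads a precomputed verdict.
    # A real judge model call would replace this lookup (assumption).
    return sample["judge_label"] == "supported"


# Adding coverage means adding an entry here; the runner does not change.
CHECKS: Dict[str, Check] = {
    "escalation_policy": followed_escalation_policy,
    "no_unsupported_claims": judge_found_no_unsupported_claims,
}


def run_checks(samples: List[Sample], checks: Dict[str, Check]) -> Dict[str, float]:
    """Return the pass rate of each check across all samples."""
    if not samples:
        return {}
    return {
        name: sum(check(s) for s in samples) / len(samples)
        for name, check in checks.items()
    }


# Made-up samples mixing a curated set with production traffic.
SAMPLES: List[Sample] = [
    {"id": "c1", "source": "curated", "needs_escalation": True,
     "escalated": True, "judge_label": "supported"},
    {"id": "c2", "source": "curated", "needs_escalation": True,
     "escalated": False, "judge_label": "supported"},
    {"id": "p1", "source": "production", "needs_escalation": False,
     "escalated": False, "judge_label": "unsupported"},
    {"id": "p2", "source": "production", "needs_escalation": False,
     "escalated": False, "judge_label": "supported"},
]

report = run_checks(SAMPLES, CHECKS)
for name, rate in report.items():
    print(f"{name}: {rate:.2f}")
```

Keeping each check as a plain function that returns a boolean makes it cheap to add a new one without touching the first measurement.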
