Confidence Scores
AI teams often ask models to report a confidence score alongside a prediction. This is usually a symptom of a concern about consistency or reliability. If you find yourself doing this, you probably want to reach for other strategies to increase consistency first. However, calibrated confidence scores can still be extremely useful, especially when used to trigger escalation or to queue outputs for annotation.
Better Sources of Confidence
More useful confidence signals often come from the system around the model:
- Agreement: do parallel samples converge on the same answer?
- Evidence: did the model find the facts required to support the answer?
- Verifier checks: does a separate judge, rule, or retrieval check confirm the output?
- Logprobs: you can sometimes rely on cumulative logprobs to measure a model's confidence in its result. We won’t go into detail here, because it’s only situationally useful and can conflict with other best practices we recommend.
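As a rough illustration of the agreement signal, the sketch below scores a batch of parallel samples by the share that match the majority answer, then uses that share to decide whether to accept or escalate. The function names, the light normalization step, and the 0.8 threshold are all assumptions made for this example, not recommended values; a real system would likely need task-specific answer matching.

```python
from collections import Counter


def agreement_confidence(samples):
    """Return (majority_answer, share of samples that agree with it)."""
    if not samples:
        raise ValueError("need at least one sample")
    # Light normalization (an assumption) so trivial formatting
    # differences are not counted as disagreement.
    normalized = [s.strip().lower() for s in samples]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


def route(samples, threshold=0.8):
    """Accept the majority answer, or escalate when agreement is low.

    The threshold is a hypothetical default; tune it on labeled data.
    """
    answer, score = agreement_confidence(samples)
    decision = "accept" if score >= threshold else "escalate"
    return decision, answer, score


if __name__ == "__main__":
    # Four hypothetical parallel samples for the same prompt.
    print(route(["Paris", "paris ", "Lyon", "Paris"]))
```

The score here is only a proxy: high agreement can still mean the model is consistently wrong, so it works best combined with the evidence and verifier checks listed above.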