Performance Tradeoffs - Sutro Handbook

Balance intelligence and performance for the task at hand

There is no one-size-fits-all guide for choosing a model today, so we recommend basing the choice on several factors. An analytical AI system typically implies a model that runs the same task many times. You should be able to optimize that model’s performance against strong evals you’ve built to validate that it is sufficient for the task. As of this writing, we recommend models with at least ~30B total parameters and internal reasoning capabilities, unless cost or latency needs rule out this size. Below this size, we’ve seen noticeable lapses on out-of-distribution tasks, or weaker inference efficiency (due to longer reasoning traces) that erases most of the expected cost or latency gains. Above this size, returns on quality often diminish for well-defined tasks.

What to optimize

Model choice should be evaluated against the production shape of the workload:

Accuracy on representative task cases
Throughput and latency requirements
Cost per successful task completion
Operational control over batching, scaling, and retries
Stability across task variants and out-of-distribution inputs

The mistake is optimizing only one of these dimensions in isolation. A cheaper model that needs longer traces, more retries, or more human review may not actually be cheaper. A larger model that improves quality by a marginal amount may not be worth the latency or serving cost.
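To make "cost per successful task completion" concrete, here is a minimal sketch of one way to compute it. The `ModelProfile` fields, the retry-until-success assumption, and all numbers are hypothetical illustrations, not measurements or a prescribed method; substitute values from your own evals.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical per-model measurements gathered from your own evals."""
    name: str
    cost_per_attempt: float   # dollars per attempt, including reasoning tokens
    success_rate: float       # fraction of attempts that pass the eval, in (0, 1]
    review_rate: float = 0.0  # fraction of attempts sent to human review
    review_cost: float = 0.0  # dollars per human review


def cost_per_success(m: ModelProfile) -> float:
    """Expected dollars spent per successful completion.

    Assumption: failed attempts are retried independently until one
    succeeds, so the expected number of attempts is 1 / success_rate.
    """
    if not 0 < m.success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    per_attempt = m.cost_per_attempt + m.review_rate * m.review_cost
    return per_attempt / m.success_rate


# Illustrative numbers only.
small = ModelProfile("small", cost_per_attempt=0.004, success_rate=0.50,
                     review_rate=0.10, review_cost=0.05)
large = ModelProfile("large", cost_per_attempt=0.006, success_rate=0.96)

for m in (small, large):
    print(f"{m.name}: ${cost_per_success(m):.5f} per successful task")
```

With these made-up inputs, the model that is cheaper per attempt ends up more expensive per successful task once retries and review are counted, which is the effect described above.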

Use evals to choose

Model selection should be made against task-specific evals rather than general benchmark impressions. The right question is not “which model is best?” It is “which model is sufficient for this workload under the constraints we actually have?”
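One way to turn the "sufficient under constraints" question into a selection rule is sketched below. It assumes you have already run each candidate through your task-specific evals; the `EvalResult` fields, thresholds, and example values are hypothetical, and choosing the cheapest passing model is one reasonable tie-breaker rather than the only one.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class EvalResult:
    """Hypothetical summary of one model's run on a task-specific eval."""
    model: str
    accuracy: float          # on representative task cases
    p95_latency_s: float     # 95th-percentile latency in seconds
    cost_per_success: float  # dollars per successful task completion


def pick_sufficient(
    results: Sequence[EvalResult],
    min_accuracy: float,
    max_p95_latency_s: float,
) -> Optional[EvalResult]:
    """Return the cheapest model that meets every constraint, or None."""
    passing = [
        r for r in results
        if r.accuracy >= min_accuracy and r.p95_latency_s <= max_p95_latency_s
    ]
    # Among sufficient models, prefer the lowest cost per successful task.
    return min(passing, key=lambda r: r.cost_per_success, default=None)


# Illustrative numbers only.
results = [
    EvalResult("small", accuracy=0.88, p95_latency_s=1.2, cost_per_success=0.018),
    EvalResult("medium", accuracy=0.95, p95_latency_s=2.0, cost_per_success=0.007),
    EvalResult("large", accuracy=0.97, p95_latency_s=6.5, cost_per_success=0.020),
]

choice = pick_sufficient(results, min_accuracy=0.93, max_p95_latency_s=3.0)
print(choice.model if choice else "no model meets the constraints")
```

Here the highest-accuracy model is rejected on latency and the cheapest per attempt is rejected on accuracy, so the rule returns the model that is merely sufficient. If no candidate passes, the function returns `None`, which signals that the constraints or the candidate set need revisiting.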
