# The Analytical AI Handbook
> A living FAQ to build, measure, optimize, and scale reliable decision models

This is the agent-readable index for the Sutro Analytical AI Handbook.

## Full Guide

- [Full handbook Markdown](/llms-full.txt)

## Pages

- [The Analytical AI Handbook](/index): A living FAQ to build, measure, optimize, and scale reliable decision models
- [Primitives](/primitives): Learn the fundamental building blocks of analytical AI systems.
- [Classifiers](/primitives/classifiers): The most flexible and broadly applicable analytical AI primitive
- [Types of Classifiers](/primitives/classifiers/types-of-classifiers): Common classifier patterns, including binary, multiclass, multilabel, hierarchical, open-set, and ordinal classifiers.
- [Abstention](/primitives/classifiers/abstention): For when "I don't know" is the best decision a model can make
- [Are judges classifiers?](/primitives/classifiers/are-judges-classifiers): Why LLM judges are a special case of classifiers and when they should be treated as a distinct analytical AI primitive.
- [Extractors](/primitives/extractors): Pull relevant fields and spans from unstructured documents
- [Extraction Verification](/primitives/extractors/extraction-verification): Ground-truthing extractors that contain free-form text
- [Good Extractor Design](/primitives/extractors/good-extractor-design): Unfortunately, we're not talking about Inception
- [Judges](/primitives/judges): A core unit of analytical AI for scaling the judgment of a domain expert.
- [Judge Terminology](/primitives/judges/terminology): The core vocabulary used when discussing LLM judges and candidate models.
- [What's in a Judge?](/primitives/judges/anatomy): The model, context, input, and output schema that make up a typical LLM judge.
- [Types of Judges](/primitives/judges/types): Reliability, quality, sentiment, and intent judges in the judge-design hierarchy.
- [Judges in Evals: Flip Your Intuition](/primitives/judges/intuition): First-principles responses to common objections about using LLMs to judge LLMs.
- [Good Task Design Is All You Need](/primitives/judges/task-design): The design knobs that make LLM judges more reliable, measurable, and useful.
- [Patterns](/patterns): Best practices and battle-tested strategies for analytical AI.
- [Consistency](/patterns/consistency): Boring as a feature
- [Don't be fooled by determinism](/patterns/consistency/determinism): Why absolute determinism is less useful than measured consistency for real-world analytical AI systems.
- [Task Specificity](/patterns/consistency/task-specificity): How specific task instructions, examples, and edge-case guidance make analytical AI systems more consistent.
- [Fine-tuning and RL](/patterns/consistency/fine-tuning-and-rl): When to consider fine-tuning or reinforcement learning for analytical AI tasks, and why prompt optimization should usually come first.
- [Parallel Sampling](/patterns/consistency/parallel-sampling): How parallel model samples and majority voting can improve consistency for repeated analytical AI decisions.
- [Confidence Scores](/patterns/consistency/confidence-scores): How to use confidence signals, agreement checks, and escalation logic without relying on self-reported model confidence.
- [Ensembles](/patterns/consistency/ensembles): How model ensembles can add useful perspectives, and why they are not always the simplest path to consistent AI behavior.
- [Temperature](/patterns/consistency/temperature): How to tune model temperature with evals instead of assuming zero temperature is always best for consistency.
- [Context](/patterns/context): What your model needs to know to get it right.
- [Expert Annotations](/patterns/context/expert-annotation): Model behavior should be grounded in expert-reviewed data and abstracted into generalized rulesets.
- [Why Expert Annotations Matter](/patterns/context/expert-annotation/why-expert-annotations-matter): Model behavior should be grounded in expert-reviewed data, not guessed at from aggregate benchmarks.
- [What Good Annotations Capture](/patterns/context/expert-annotation/what-good-annotations-capture): Useful expert annotations preserve the model input, model output, expert correction, rationale, and metadata needed to improve behavior over time.
- [Which Cases to Annotate](/patterns/context/expert-annotation/which-cases-to-annotate): Annotation quality depends on choosing cases that expose ambiguity, edge behavior, and the expert judgment the model needs to learn.
- [Evals](/patterns/evals): Patterns for measuring AI system behavior, reliability, and quality before and after release.
- [Evals as Outer Loop](/patterns/evals/outer-loop): How evals fit into AI development and post-deployment monitoring.
- [Eval Approaches](/patterns/evals/approaches): Common AI eval approaches and the role each one plays in measurement.
- [Where to Start](/patterns/evals/where-to-start): How to choose a first eval that is narrow enough to build and useful enough to matter.
- [Static Evals vs. Judges](/patterns/evals/static-evals-vs-judges): Why LLM judges make sense alongside static evals when evaluating AI systems over unbounded input spaces.
- [Deployment](/deployment): Choices to make when your models are ready for action.
- [Batch vs. Real-Time Inference](/deployment/batch-vs-real-time-inference): Faster, cheaper, better
- [Model Selection](/deployment/model-selection): Selecting the right model for the task at hand.
- [Open-Source vs. Closed Models](/deployment/model-selection/open-source-vs-closed): How to think about provider choice, ecosystem control, and model ownership for analytical AI systems.
- [Performance Tradeoffs](/deployment/model-selection/performance-tradeoffs): How to balance intelligence, latency, throughput, cost, and reliability when choosing a model.
- [Routers and Ensembles](/deployment/model-selection/routers-and-ensembles): When routing, escalation, majority voting, or multiple-model approaches are worth the complexity.
- [Architectures](/architectures): Coming soon