Just read the f***ing data
In working with customers deploying AI systems, one thing is often clear: nearly all model behavior problems are actually inference-time data problems. If you see a smart model consistently failing at a certain task, it’s typically not because the model was trained poorly, but because you are supplying bad or missing instructions or data for the model to use as context.

Therefore, the first reflex for improving task performance should be tearing into a representative cut of the underlying data the task is being run on. Many teams go overboard at the outset: buying observability products, setting up automated monitoring tools, or integrating off-the-shelf eval products. A more reasonable start is to find a way to get model inputs and outputs into an interface where you and/or a domain expert can review them. Manually reading over just a handful of results will often provide a massive diagnostic lift and help you understand where scaled approaches are worth using.
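To make "get inputs and outputs into a reviewable interface" concrete, here is a minimal sketch that pulls a small random sample of logged calls into a CSV you can open in any spreadsheet. The record shape (`input` and `output` keys), the function names, and the `verdict`/`notes` columns are all assumptions for illustration; adapt them to however your own logs are stored.

```python
import csv
import io
import random


def sample_for_review(records, k=20, seed=0):
    """Return up to k records chosen uniformly at random (reproducible via seed)."""
    rng = random.Random(seed)
    if len(records) <= k:
        return list(records)
    return rng.sample(records, k)


def to_review_csv(records, input_key="input", output_key="output"):
    """Render records as CSV text with empty columns for a reviewer to fill in."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["input", "output", "verdict", "notes"])
    for rec in records:
        # Missing keys become empty cells, which is itself worth noticing in review.
        writer.writerow([rec.get(input_key, ""), rec.get(output_key, ""), "", ""])
    return buf.getvalue()


# Hypothetical logged calls; in practice these would come from your own logs.
logged = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(100)]
sheet = to_review_csv(sample_for_review(logged, k=5))
print(sheet)
```

A spreadsheet is a deliberately low-tech choice here: it lets a domain expert read each input next to its output and jot down what went wrong, with no new tooling to buy or integrate.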