Eval-driven development?
AI engineering is not that different; the only caveat is that you’re not seeking guarantees. That is the tradeoff we make when reaching for non-deterministic systems. So while the idea of eval-driven development has been proposed, we do not wholeheartedly endorse it quite yet. Why? Because foundation models come with some degree of reliability to start with, perhaps even enough to ship in a low-stakes setting. Evals should therefore fill in the remaining gaps rather than define initial behavior. That said, creating evals to identify model failures is extremely important, so we highly encourage incorporating evals early in the development process.

Why should I care about evals?
So the purpose of evals, in our opinion:
- Create a rubric and measurement system against which you can improve an AI system.
- Use that measurement system to improve the AI system during initial development and after production deployment.
In practice, that means being able to answer questions such as:
- Is the system reliably doing the task it was built to do?
- What are the common failure modes within my control?
- Are changes to the AI system’s behavior, such as swapping models, changing prompts, or introducing or removing context, improving or hurting performance?
- Is the system reliable enough for the workflow, user, and risk level it serves?
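To make the questions above concrete, here is a minimal sketch of a measurement system in Python. Everything in it is an illustrative assumption rather than a prescribed design: the `EvalCase` type, the exact-match rubric in `passes`, the `run_eval` helper, and the two stand-in systems (`baseline_system` and `candidate_system`) that return canned answers instead of calling a real model. A real rubric would usually be richer than exact match.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One eval example: an input and the answer we expect (hypothetical type)."""
    prompt: str
    expected: str


def passes(output: str, case: EvalCase) -> bool:
    """Rubric for this sketch: case-insensitive exact match after trimming."""
    return output.strip().lower() == case.expected.strip().lower()


def run_eval(system: Callable[[str], str], cases: list[EvalCase]):
    """Run every case through the system; return (pass_rate, failed_cases)."""
    failures = [c for c in cases if not passes(system(c.prompt), c)]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures


CASES = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("What is the capital of France?", "Paris"),
    EvalCase("What is 3 * 3?", "9"),
]

# Stand-ins for two versions of an AI system (e.g. before and after a prompt
# change). They return canned answers so the sketch runs without a model.
BASELINE_ANSWERS = {
    "What is 2 + 2?": "4",
    "What is the capital of France?": "paris",
    "What is 3 * 3?": "6",
}
CANDIDATE_ANSWERS = {
    "What is 2 + 2?": "4",
    "What is the capital of France?": "Paris",
    "What is 3 * 3?": "9",
}


def baseline_system(prompt: str) -> str:
    return BASELINE_ANSWERS.get(prompt, "")


def candidate_system(prompt: str) -> str:
    return CANDIDATE_ANSWERS.get(prompt, "")


for name, system in [("baseline", baseline_system), ("candidate", candidate_system)]:
    rate, failed = run_eval(system, CASES)
    print(f"{name}: pass rate {rate:.2f}, failed prompts: {[c.prompt for c in failed]}")
```

The pass rate speaks to whether the system reliably does its task, the list of failed cases points at failure modes, and running the same cases against both versions shows whether a change helped or hurt. Whether a given pass rate is good enough still depends on the workflow, user, and risk level.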