“We went from extremely slow updates that were very manual to very fast updates that are very manual. And I think we’re going to move towards even faster updates that are partially or even entirely automatic.”
Evals are the backbone of AI development, helping teams understand not just how well a model performs, but whether it’s actually doing what it needs to do in the real world.
So how should engineers approach evaluating agents as part of building effective agentic products?
In this conversation, Ankur Goyal (Braintrust) shares his perspective on the evolution from traditional software to agents and what has changed along the way, and outlines two types of evals that teams building agentic AI should use: end-to-end evals and individual-step evals.
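To make the distinction concrete, here is a minimal sketch contrasting the two eval types on a toy two-step agent. The agent, its steps, and the test cases are hypothetical illustrations, not Braintrust's API or Ankur's implementation:

```python
def retrieve(question: str) -> str:
    """Step 1: toy retrieval -- returns a canned context for the question."""
    kb = {"capital of france": "Paris is the capital of France."}
    return next((v for k, v in kb.items() if k in question.lower()), "")

def answer(question: str, context: str) -> str:
    """Step 2: toy answer generation from the retrieved context."""
    return context.split(" is")[0] if context else "I don't know"

def agent(question: str) -> str:
    """The full pipeline: retrieve, then answer."""
    return answer(question, retrieve(question))

cases = [{"input": "What is the capital of France?", "expected": "Paris"}]

# End-to-end eval: score only the agent's final output against the expectation.
e2e_score = sum(agent(c["input"]) == c["expected"] for c in cases) / len(cases)

# Individual-step eval: score one step (retrieval) in isolation, so a failure
# can be localized to the step that caused it rather than blamed on the whole pipeline.
retrieval_score = sum(c["expected"] in retrieve(c["input"]) for c in cases) / len(cases)

print(f"end-to-end accuracy: {e2e_score:.0%}, retrieval hit rate: {retrieval_score:.0%}")
```

The point of running both: an end-to-end score tells you whether the product works, while step-level scores tell you where it broke.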
He discusses how to get started with evals using simple prototypes, and how to improve evals by thinking carefully about outputs. He also covers quantitative versus qualitative metrics, and heuristic-based scoring functions.
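As a rough illustration of a heuristic scoring function, the sketch below applies cheap, deterministic checks to an agent's output. The scorer, its criteria, and the weights are hypothetical examples, not a specific library's API:

```python
import json

def heuristic_score(output: str) -> float:
    """Score an agent's JSON output on simple rule-based criteria (0.0 to 1.0)."""
    score = 0.0
    try:
        parsed = json.loads(output)   # 0.5 of the score: output is valid JSON
        score += 0.5
        if "answer" in parsed:        # 0.3: the required field is present
            score += 0.3
        if len(output) < 500:         # 0.2: output stays within a length budget
            score += 0.2
    except json.JSONDecodeError:
        pass                          # unparseable output scores 0.0
    return score

print(heuristic_score('{"answer": "Paris"}'))  # -> 1.0
```

Heuristics like these are fast and deterministic, which makes them a natural complement to slower qualitative judgments.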
And finally, Ankur shares his thoughts on how best-in-class teams manage evals, and what evals look like for agents.