“We went from extremely slow updates that were very manual to very fast updates that are very manual. And I think we’re going to move towards even faster updates that are partially or even entirely automatic.”


Ankur Goyal

Founder and CEO

Evals are the backbone of AI development, helping teams understand not just how well a model performs, but whether it’s actually doing what it needs to do in the real world.

So how should engineers approach evaluating agents as part of building effective agentic products?

In this conversation, Ankur Goyal (Braintrust) traces the evolution from traditional software to agents, explains what has changed, and outlines the two types of evals that teams building agentic AI should use: end-to-end evals and individual-step evals.

He discusses how to get started with evals using simple prototypes, and how to improve them by reasoning carefully about outputs. He also covers quantitative versus qualitative metrics, and scoring functions built from heuristics.
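To make the heuristic-scoring idea concrete, here is a minimal sketch (an illustration, not code from the conversation or from Braintrust's SDK): two hypothetical scorers that grade a model's output with simple deterministic checks, returning a score between 0 and 1.

```python
import json


def json_validity_scorer(output: str) -> float:
    """Heuristic scorer: 1.0 if the output parses as valid JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0


def keyword_scorer(output: str, required: list[str]) -> float:
    """Heuristic scorer: fraction of required keywords found in the output
    (case-insensitive). Returns 1.0 if no keywords are required."""
    if not required:
        return 1.0
    text = output.lower()
    hits = sum(1 for kw in required if kw.lower() in text)
    return hits / len(required)
```

Heuristics like these are cheap and fully reproducible, which makes them a useful first layer before reaching for model-graded or human qualitative evaluation.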

And finally, Ankur shares his thoughts on how best-in-class teams manage evals, and what evals look like for agents.

Watch the Panel Discussion

Corinne Riley

Corinne works with early-stage founders who are creating data and AI products at the infrastructure and application layers.
