E211

Why Your Agent Evaluations Will Fail You (and How to Fix Them Before Production)

June 3

44 mins

Episode Description

Anthropic deprecated Sonnet 3.5. Some of Xelix's pipelines migrated smoothly. Others broke, and customers noticed within hours. What separated the two? Evaluation.

Paul Solomon and James Price Farr have spent 5+ years building AI systems that process millions of invoices for enterprise customers. In this episode, they share the evaluation-first framework that now saves them every time a model changes, an orchestration layer fails, or an agent picks the wrong tool.

Key takeaways:

- Evaluation-first, not evaluation-after: Retrofitting evaluation on an agent already in production is painful. Build your eval pipeline before you build the agent.
- Monitor tool calls, not just outputs: If the agent isn't selecting the right tools, nothing downstream will be correct. Tool-call monitoring is your leading indicator.
- 3 tiers of automation: Not everything needs an agent. Rules-based → single LLM call → agentic system. Pick the simplest tier that solves the problem.
- Extended thinking tames token explosion: After migrating to newer, more verbose models, enabling extended thinking (with a budget) moved reasoning out of expensive output tokens and brought costs back under control.
- Human-in-the-loop by default: Start with human review on every output, then earn trust toward touchless automation as customers gain confidence.
- Pragmatism wins: Use whatever technology works best for the problem. Not every feature needs an LLM.

Recorded live at AWS Summit London.
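As a rough illustration of the "monitor tool calls, not just outputs" takeaway, the sketch below scores an agent on whether it called the expected tools in the expected order. It is not the framework discussed in the episode: the `EvalCase` type, the `tool_call_accuracy` function, and the tool names are all hypothetical, and it assumes tool-call traces have already been captured as plain lists of tool names.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One evaluation scenario (hypothetical structure, for illustration only)."""
    name: str
    expected_tools: list[str]  # tools the agent should call, in order


def tool_call_accuracy(cases: list[EvalCase], traces: dict[str, list[str]]) -> float:
    """Return the fraction of cases whose recorded tool sequence matches exactly."""
    if not cases:
        return 0.0
    hits = sum(1 for c in cases if traces.get(c.name, []) == c.expected_tools)
    return hits / len(cases)


# Hypothetical invoice-processing scenarios and made-up tool names.
cases = [
    EvalCase("duplicate_invoice", ["lookup_invoice", "compare_invoices"]),
    EvalCase("missing_po", ["lookup_invoice", "find_purchase_order"]),
]

# Tool calls recorded from an agent run (hand-written here as sample data).
traces = {
    "duplicate_invoice": ["lookup_invoice", "compare_invoices"],
    "missing_po": ["lookup_invoice"],  # the agent skipped the second tool
}

print(tool_call_accuracy(cases, traces))
```

Running a check like this before and after a model migration gives an early signal that tool selection has drifted, even when final outputs still look plausible.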

With Paul Solomon, Head of AI Engineering at Xelix; with James Price Farr, AI Engineering Team Lead at Xelix

- Xelix — AI-Powered Accounts Payable Platform
- Strands Agents SDK — Open Source
- Amazon Bedrock — Managed LLM Inference
- Amazon Bedrock AgentCore
- Strands Agents — Steering Files and Hooks for Agent Accuracy (Clare Liguori)
- Amazon SageMaker
- Fast.ai — Practical Deep Learning Courses (Book Recommendation)
- The Fifth Risk — Michael Lewis (Book Recommendation)
- Neurosymbolic AI and Automated Reasoning on AWS
- Kiro — AI-Powered Development Environment
