How (un)reliable are AI agents?

Source: ft.com

TL;DR

The story at a glance

A recent Financial Times piece questions how reliable AI agents truly are, highlighting that while they solve tasks more accurately over time, they remain inconsistent and unpredictable across repeated runs. It draws on Princeton research evaluating 14 models from OpenAI, Anthropic and Google on benchmarks such as GAIA for general tasks and τ-bench for customer service simulations. The article is timely amid hype around agentic AI for business use, where failures could carry real costs. This follows months of model releases promising agent advances.[[1]](https://sage.cs.princeton.edu/)[[2]](https://hal.cs.princeton.edu/reliability/)

Key points

Details and context

Princeton's SAgE group defines reliability across four dimensions—consistency, robustness, predictability and safety—measured with 12 metrics, arguing that current benchmarks overstate progress by reporting average accuracy alone. For instance, pass@1 (success on a single attempt) often lags pass@k (success within k attempts), revealing agents that look capable in demos but prove flaky in practice.[[3]](https://www.normaltech.ai/p/new-paper-towards-a-science-of-ai)
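The pass@1 versus pass@k gap can be made concrete with the standard unbiased pass@k estimator used in code-model evaluation (whether SAgE uses this exact formula is an assumption here, but the metric names match):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n recorded attempts of which c
    succeeded, is correct."""
    if n - c < k:
        return 1.0  # fewer failures than samples: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# An agent that succeeds on 3 of 10 recorded runs:
pass_at_k(10, 3, 1)  # pass@1 = 0.3
pass_at_k(10, 3, 5)  # pass@5 is roughly 0.92
```

The same agent looks modest at pass@1 but impressive at pass@5, which is exactly the demo-versus-practice gap the researchers flag.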

This gap stems from agents' probabilistic nature: outputs are non-deterministic, errors compound across multi-step reasoning, and external dependencies such as tools can fail unpredictably. Real-world mishaps, such as agents botching code reviews or making unintended purchases (reported in outlets like Business Insider), illustrate the risks of deploying without safeguards.[[3]](https://www.normaltech.ai/p/new-paper-towards-a-science-of-ai)
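Compounding is easy to see with a back-of-the-envelope model: if each step of a multi-step task succeeds independently with probability p (an idealising assumption; real steps are correlated), the whole task succeeds with probability p raised to the number of steps:

```python
def task_success(p_step: float, n_steps: int) -> float:
    """Chance an n-step agent task completes if every step succeeds
    independently with probability p_step (a simplifying assumption)."""
    return p_step ** n_steps

# Even a 95%-reliable step erodes quickly over long chains:
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps: {task_success(0.95, n):.0%}")
```

At 20 steps a seemingly reliable agent completes the task barely a third of the time, which is why multi-step benchmarks like GAIA expose flakiness that single-turn tests miss.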

Trade-offs emerge with scale: larger models are better calibrated but less consistent. Deployers should add human oversight, circuit breakers and multi-benchmark testing, while researchers need reliability indices that go beyond accuracy.
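A circuit breaker of the kind recommended here might be wired around an agent's tool calls like this (an illustrative sketch; the class and threshold are my own, not from the research):

```python
class CircuitBreaker:
    """Trips after `max_failures` consecutive tool-call failures,
    forcing escalation to a human instead of retrying blindly.
    (Illustrative sketch, not a production implementation.)"""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, tool, *args):
        if self.failures >= self.max_failures:
            raise RuntimeError("circuit open: escalate to human review")
        try:
            result = tool(*args)
        except Exception:
            self.failures += 1  # count the failure, then re-raise
            raise
        self.failures = 0  # any success resets the breaker
        return result
```

On repeated failures the breaker stops the agent from retrying blindly and forces a handoff to a person, trading a little autonomy for predictability.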

Key quotes

"They get it right consistently, not right today and wrong tomorrow on the same thing (Consistency)."[[4]](https://www.ft.com/content/52b15e28-e4d2-4694-8f34-f1c30de7e9d8)

Why it matters

AI agents promise to automate complex workflows in business and customer service, but unreliability risks errors with financial or safety consequences. For companies and users, this means sticking to supervised pilots rather than full autonomy, with costs for fixes and lost trust if scaled prematurely. Watch for vendor benchmarks adding reliability metrics and real-world failure reports, though production adoption may stay cautious until consistency nears 90%.[[2]](https://hal.cs.princeton.edu/reliability/)