How (un)reliable are AI agents?

Source: ft.com

TL;DR

The story at a glance

A recent Financial Times piece questions how reliable AI agents truly are, highlighting that while they solve tasks more accurately over time, they remain inconsistent and unpredictable across repeated runs. It draws on Princeton research evaluating 14 models from OpenAI, Anthropic and Google on benchmarks such as GAIA for general tasks and τ-bench for customer service simulations. The article is timely amid hype around agentic AI for business use, where failures could carry real costs. This follows months of model releases promising agent advances.[[1]](https://sage.cs.princeton.edu/)[[2]](https://hal.cs.princeton.edu/reliability/)

Key points

Details and context

Princeton's SAgE group defines reliability across four dimensions—consistency, robustness, predictability and safety—measured with 12 metrics, arguing that current benchmarks overstate progress by reporting average accuracy alone. For instance, pass@1 (success on a single attempt) often lags pass@k (success within k attempts), revealing agents that look capable in demos but prove flaky in practice.[[3]](https://www.normaltech.ai/p/new-paper-towards-a-science-of-ai)
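The pass@1 versus pass@k gap can be made concrete with the standard unbiased pass@k estimator used in code-model evaluation (whether SAgE uses this exact formula is an assumption here, but the metric names match):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n recorded attempts of which c
    succeeded, is correct."""
    if n - c < k:
        return 1.0  # fewer failures than samples: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# An agent that succeeds on 3 of 10 recorded runs:
pass_at_k(10, 3, 1)  # pass@1 = 0.3
pass_at_k(10, 3, 5)  # pass@5 is roughly 0.92
```

The same agent looks modest at pass@1 but impressive at pass@5, which is exactly the demo-versus-practice gap the researchers flag.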

This gap stems from agents' probabilistic nature: outputs are non-deterministic, errors compound across multi-step reasoning, and external dependencies such as tools can fail unpredictably. Real-world mishaps, such as agents botching code reviews or making unintended purchases (reported in outlets like Business Insider), illustrate the risks of deploying without safeguards.[[3]](https://www.normaltech.ai/p/new-paper-towards-a-science-of-ai)
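Compounding is easy to see with a back-of-the-envelope model: if each step of a multi-step task succeeds independently with probability p (an idealising assumption; real steps are correlated), the whole task succeeds with probability p raised to the number of steps:

```python
def task_success(p_step: float, n_steps: int) -> float:
    """Chance an n-step agent task completes if every step succeeds
    independently with probability p_step (a simplifying assumption)."""
    return p_step ** n_steps

# Even a 95%-reliable step erodes quickly over long chains:
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps: {task_success(0.95, n):.0%}")
```

At 20 steps a seemingly reliable agent completes the task barely a third of the time, which is why multi-step benchmarks like GAIA expose flakiness that single-turn tests miss.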

Trade-offs emerge with scale: larger models are better calibrated but less consistent. Deployers should add human oversight, circuit breakers and multi-benchmark testing, while researchers need reliability indices that go beyond accuracy.
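A circuit breaker of the kind recommended here might be wired around an agent's tool calls like this (an illustrative sketch; the class and threshold are my own, not from the research):

```python
class CircuitBreaker:
    """Trips after `max_failures` consecutive tool-call failures,
    forcing escalation to a human instead of retrying blindly.
    (Illustrative sketch, not a production implementation.)"""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, tool, *args):
        if self.failures >= self.max_failures:
            raise RuntimeError("circuit open: escalate to human review")
        try:
            result = tool(*args)
        except Exception:
            self.failures += 1  # count the failure, then re-raise
            raise
        self.failures = 0  # any success resets the breaker
        return result
```

On repeated failures the breaker stops the agent from retrying blindly and forces a handoff to a person, trading a little autonomy for predictability.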

Key quotes

"They get it right consistently, not right today and wrong tomorrow on the same thing (Consistency)."[[4]](https://www.ft.com/content/52b15e28-e4d2-4694-8f34-f1c30de7e9d8)

Why it matters

AI agents promise to automate complex workflows in business and customer service, but unreliability risks errors with financial or safety consequences. For companies and users, this means sticking to supervised pilots rather than full autonomy, with costs for fixes and lost trust if scaled prematurely. Watch for vendor benchmarks adding reliability metrics and real-world failure reports, though production adoption may stay cautious until consistency nears 90%.[[2]](https://hal.cs.princeton.edu/reliability/)