Google Unveils Inference TPU Amid AI Agent Boom

Source: wsj.com

TL;DR

Google is introducing its eighth-generation TPUs this week at a Las Vegas event, including its first inference-specific version, called TPU 8i, and a separate training chip, called TPU 8t. The company, partnering with Broadcom, has tested the inference chip with AI firms after years of development. Google Cloud executives, including CEO Thomas Kurian and VP Mark Lohmeyer, point to surging demand from AI agents, as inference workloads grow with agentic AI performing tasks such as writing software.[[1]](https://www.wsj.com/tech/ai/google-tpux-inference-chip-7930f2d0)

Details and context

Google's TPUs started in data-center cloud servers and now power models such as the Gemini chatbots and the Nano Banana image generator. Inference differs from training: it needs less raw compute but more memory to avoid delayed responses, especially as agents chain multiple actions.

The company promoted Amin Vahdat to chief technologist of AI infrastructure, overseeing TPUs and models and reporting to Alphabet CEO Sundar Pichai. Customers increasingly want compute closer to their workloads, including outside Google Cloud.

This sharpens competition with Nvidia, whose GPUs excel at training's parallel tasks but are less suited to inference's memory demands.[[1]](https://www.wsj.com/tech/ai/google-tpux-inference-chip-7930f2d0)

Key quotes

“If you don’t have inference, you cannot cover the cost of your training. So eventually inference is going to be at least as big, if not bigger, than the training market.” — Thomas Kurian, chief executive of Google Cloud.[[1]](https://www.wsj.com/tech/ai/google-tpux-inference-chip-7930f2d0)

“What the customers really care about is how can you drive down the latency.” — Mark Lohmeyer, vice president of AI and computing infrastructure for Google Cloud.[[1]](https://www.wsj.com/tech/ai/google-tpux-inference-chip-7930f2d0)

Why it matters

Inference demand from AI agents is exploding, shifting chip design from training-focused compute toward the memory and speed that real-world use requires. Businesses and AI developers gain options to lower costs and latency versus Nvidia's GPUs, with Google expanding TPU sales through Cloud deals. Watch Google's Las Vegas event this week for performance benchmarks and new customer tests, though full commercialization details remain unclear.[[1]](https://www.wsj.com/tech/ai/google-tpux-inference-chip-7930f2d0)

What changed

TPUs previously handled both training and inference in unified designs, such as last year's seventh-generation Ironwood. Google is now splitting the line into a specialized TPU 8t for training and TPU 8i for inference to better match each workload. The change is being announced ahead of this week's Las Vegas event, after years of development and recent testing.[[1]](https://www.wsj.com/tech/ai/google-tpux-inference-chip-7930f2d0)

FAQ

Q: Why is Google focusing on inference-specific chips now?

A: Demand for inference is exploding as businesses use AI agents for tasks such as writing software, which generate 20 to 50 times more transactions per request than chatbots. Without enough memory capacity, these workloads hit bottlenecks that slow responses. Google developed the TPU 8i over several years and recently tested it with AI companies.[[1]](https://www.wsj.com/tech/ai/google-tpux-inference-chip-7930f2d0)

Q: How do TPUs differ from Nvidia GPUs for AI?

A: GPUs provide massive parallel compute that is ideal for training but more than inference needs; inference instead demands more memory to avoid the "memory wall." TPUs such as the new 8i target inference's latency needs directly, and Google positions them as efficient alternatives amid growing external deals.[[1]](https://www.wsj.com/tech/ai/google-tpux-inference-chip-7930f2d0)

Q: What are Google's major TPU customer deals?

A: Anthropic gets access to about one million TPUs, and Meta Platforms has a separate deal. Broadcom partners on chip designs, including recent ones for Anthropic. Interest is rising in on-premises use outside Google Cloud.[[1]](https://www.wsj.com/tech/ai/google-tpux-inference-chip-7930f2d0)

Q: When is the new TPU announcement?

A: Google plans to unveil the eighth-generation TPUs this week at a company event in Las Vegas. The article was updated April 22, 2026, ahead of the print edition on April 23. Testing with AI firms took place in recent months.[[1]](https://www.wsj.com/tech/ai/google-tpux-inference-chip-7930f2d0)