Ironwood and Agent Builder spark the inference-first era
Google Cloud made Ironwood TPUs generally available and upgraded Vertex AI Agent Builder. Together they cut latency and cost for real-world agents, setting up 2026 for sub-second loops and safer autonomy by default.

The news that matters
On November 6, 2025, Google Cloud said Ironwood, its seventh generation Tensor Processing Unit, is entering general availability. The pitch is clear. Ironwood is designed for the age of inference. It targets the workloads that actually answer customer questions, drive internal copilots, and run autonomous agents rather than only the giant training runs that make headlines. Google also updated Vertex AI Agent Builder with security, deployment, and observability features that close the gap between a promising prototype and a production agent that your SREs and risk teams can live with. You can see the hardware milestone in Google’s own announcement that Ironwood TPUs are now GA, and the developer velocity story in coverage of the new Agent Builder upgrades.
This combination marks a tipping point. It is no longer just about bigger models. It is about shrinking the loop time and shaving the cost of every decision an agent takes. Inference first means a production engineer's checklist comes before a research benchmark. Ironwood pushes the ceiling on throughput per watt while Agent Builder raises the floor on day one reliability, safety, and speed to ship.
From training first to inference first
For most of 2023 and 2024, teams behaved as if training scale decided everything. The reasoning was simple. If you could afford the biggest cluster, you could ship the most capable model. Reasonable, but incomplete. The last year taught a different lesson inside enterprises. The hardest problems were not getting a model to speak but getting the full agentic system to act safely, quickly, and at a steady cost under real load. That shift aligns with how enterprises are building the agent trust stack to make autonomy auditable and resilient.
Think about an insurance claims agent. A one minute conversation might involve 6 to 10 model calls, 4 tool invocations, a retrieval step, a policy lookup, a fraud score, and two human visible actions. Multiply that by thousands of concurrent users and you quickly discover that your critical path is not a single model call. It is the orchestration of many short, tightly coupled steps. Inference first optimizes that chain end to end.
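As a back of the envelope (every count and latency below is an illustrative assumption, not a measurement), the critical path math for that conversation might look like this:

```python
# Back-of-the-envelope latency math for one claims-agent conversation.
# Every count and latency here is an illustrative assumption, not a benchmark.
model_calls = 8        # within the 6 to 10 calls in a one minute conversation
tool_calls = 4
retrieval_steps = 1

model_call_ms = 350    # assumed average per model call on the critical path
tool_call_ms = 400     # assumed average per tool invocation
retrieval_ms = 180     # assumed retrieval latency

# Assume roughly half of the model and tool calls overlap through streaming
# or parallel fan-out; the rest sit on the critical path in sequence.
critical_path_ms = (0.5 * model_calls * model_call_ms
                    + 0.5 * tool_calls * tool_call_ms
                    + retrieval_steps * retrieval_ms)
print(f"critical path: {critical_path_ms:.0f} ms")

# Shaving 50 ms from each sequential step compounds across the chain.
sequential_steps = 0.5 * (model_calls + tool_calls) + retrieval_steps
print(f"50 ms saved per step is worth ~{sequential_steps * 50:.0f} ms")
```

The exact numbers matter less than the shape of the math: small per-step savings multiply across the chain, which is why optimizing the loop beats optimizing any single call.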
Hardware and software co design is the lever. The same team that designs the accelerator, the interconnect, and the runtime also designs the observability, the deployment steps, and the safety controls. When those pieces move together, every loop of the agent gets faster and cheaper.
What Ironwood changes in practice
Ironwood is the first Google TPU explicitly optimized for inference at Google Cloud scale. The main implications for production agents are pragmatic:
- Lower tail latency under concurrency. Agents do not fail on average latency. They fail at the 95th percentile, when a user is impatient or a workflow fans out into a burst of tool calls. Ironwood's gains in throughput and high bandwidth memory target those bursts.
- Bigger usable context windows at practical cost. Many agent loops waste time on context juggling. More efficient memory and better performance per chip let you pass richer state without painful trade offs on price or wait time.
- Better power efficiency. That matters if you are building an agent that has to be online at peak hours in multiple regions. Performance per watt has a direct line to your unit economics.
This is not just about running one flashy model. It is about running a steady stream of smaller, coordinated tasks. Ironwood gives you more headroom to keep the loop under a second even as you add grounding, validation, and planning steps that product managers and auditors will ask for.
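A toy simulation makes the tail point concrete. The numbers are synthetic, not Ironwood benchmarks; the only claim is that p95, not the mean, is what breaks a loop budget.

```python
# Synthetic illustration: agents blow their budgets at p95, not at the mean.
import random
import statistics

random.seed(0)
# Assume a step that usually takes ~120 ms but has a heavy tail under bursts.
samples_ms = [random.lognormvariate(4.8, 0.5) for _ in range(10_000)]

mean_ms = statistics.mean(samples_ms)
p95_ms = statistics.quantiles(samples_ms, n=100)[94]   # 95th percentile
print(f"mean = {mean_ms:.0f} ms, p95 = {p95_ms:.0f} ms")

# A loop that chains five such steps looks fine on averages (~5 x mean) but has
# little headroom left in a one second budget once the tails stack up.
```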
What the new Agent Builder changes
Agent Builder’s latest updates focus on the swamp between demo and dependable service:
- Prebuilt plugins including a self heal module. When a tool call fails or times out, the plugin can retry with a different plan or fall back to a known safe path. That reduces escalations to humans and the erratic latencies that follow.
- One command deployment via the Agent Development Kit CLI. Moving from local to cloud runtime becomes a single step. That shortens the dead zones where environments drift and issues hide.
- Deeper observability. Dashboards for first token latency, token throughput, error codes, and tool call traces mean you can tune the exact parts of a loop that cause spikes.
- Safety by default. Model Armor screens inputs, tool calls, and model responses for prompt injection and other abuse patterns. Security Command Center helps track agent assets and policies in one place.
These are table stakes for any team that must pass production readiness reviews. They also shift the culture. With retries, safeguards, and traces available out of the box, developers add guardrails without a lot of custom plumbing. The result is a faster path to scale at lower variance.
Why co design collapses latency and cost
An agent loop is a relay race. The same baton moves through planning, retrieval, tool execution, and response. If any single leg is slow, the whole lap is slow, even if your fastest runner is world class.
- Compute locality. When the accelerator, networking, and runtime are tuned together, more work stays close to memory. That trims serialization overhead and cross service chatter.
- Streaming start. Hardware that makes first token fast changes perception. Users care about responsiveness more than raw throughput. If the model streams sooner and the tool call pipeline keeps the buffer filled, you can deliver usable answers before the full job completes.
- Predictable retries. A self heal plugin is only valuable if the retry cost is bounded. Co design lets you plan for the worst case and keep it inside a strict latency budget.
The result is a flywheel. Shorter loops reduce abandonment and increase the number of tasks an agent can complete per minute. Higher completion rates let you route more work to automation. That brings down average cost and justifies more investment in optimization.
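To make the streaming start point concrete, here is a small sketch that measures time to first token separately from total completion time. The generator stands in for any streaming client; the startup and per token delays are assumptions.

```python
# Measure time to first token separately from total completion time.
# The generator below stands in for any streaming client; delays are assumed.
import time

def fake_stream(tokens, startup_s=0.15, per_token_s=0.03):
    time.sleep(startup_s)        # assumed time before the first token arrives
    for tok in tokens:
        time.sleep(per_token_s)  # assumed steady decode rate
        yield tok

start = time.monotonic()
first_token_ms = None
for i, tok in enumerate(fake_stream("your claim has been approved".split())):
    if i == 0:
        first_token_ms = (time.monotonic() - start) * 1000
total_ms = (time.monotonic() - start) * 1000

print(f"first token: {first_token_ms:.0f} ms, full answer: {total_ms:.0f} ms")
# Users judge the first number; throughput only needs to keep the buffer filled.
```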
The 2026 forecast
The next twelve to eighteen months will define the production patterns that stick.
Sub second agent loops become the baseline
- Most enterprise facing tasks will set a budget of 800 to 1200 milliseconds for an entire loop, not a single model call. You will see agent planners that execute in under 100 milliseconds, retrieval that resolves in under 200 milliseconds, and tool calls that either stream a partial answer or return within 300 to 500 milliseconds.
- The front end will embrace progressive disclosure. Start streaming in under 200 milliseconds. Show partial structured results as tools return. Confirm with users only when it improves accuracy or compliance. This will be especially visible as contact centers cross the chasm to voice-first agents at scale.
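One way to hold that line is to write the budget down as code and check every measured turn against it. The split below is an assumption to tune per workload, not a platform requirement.

```python
# A sketch of an end-to-end loop budget using the illustrative targets above.
LOOP_BUDGET_MS = 1000  # inside the 800 to 1200 ms range for a full turn

STEP_BUDGETS_MS = {
    "planner": 100,
    "retrieval": 200,
    "tool_calls": 450,     # stream a partial answer or return within 300-500 ms
    "model_response": 200,
    "validation": 50,
}
assert sum(STEP_BUDGETS_MS.values()) <= LOOP_BUDGET_MS

def check_turn(measured_ms: dict) -> list[str]:
    """Return the steps that blew their budget in a measured turn."""
    return [step for step, budget in STEP_BUDGETS_MS.items()
            if measured_ms.get(step, 0) > budget]

print(check_turn({"planner": 80, "retrieval": 260, "tool_calls": 400,
                  "model_response": 150, "validation": 30}))
# -> ['retrieval']  (retrieval is the step to trim or parallelize)
```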
Safer autonomy becomes the default
- Screening and red teaming shift left. Teams will insert policy checks before tools execute and after results return. Model Armor class tools will become a required control in regulated environments.
- Memory gets guardrails. Enterprise memory banks will add retention policies, per field encryption, and opt in write permissions. The agent can read broadly but must request privileges to store.
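A minimal sketch of read broadly, write by grant. The interface is hypothetical, not the Vertex AI memory API; a production store would add encryption and retention on top.

```python
# Reads are broad; writes require an explicit per-field grant.
class GuardedMemory:
    def __init__(self, writable_fields: set[str]):
        self._store: dict[str, str] = {}
        self._writable = writable_fields          # opt-in write permissions

    def read(self, key: str) -> str | None:
        return self._store.get(key)               # the agent can read broadly

    def write(self, key: str, value: str) -> None:
        if key not in self._writable:
            raise PermissionError(f"agent lacks write grant for '{key}'")
        self._store[key] = value                  # real stores also encrypt and set retention

memory = GuardedMemory(writable_fields={"preferred_language"})
memory.write("preferred_language", "fr")
try:
    memory.write("payment_details", "redacted")
except PermissionError as err:
    print(err)
```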
Cloud native orchestration patterns win
- State lives in clear places. Teams will standardize on an agent state object that includes user context, turn history, retrieved evidence, and tool outcomes. That state will be versioned and logged for audit.
- Events, not monoliths. The backbone will be event driven. A step emits events that trigger the next step, and everything is observable in a single trace. Fan out and saga patterns replace long blocking calls.
- Multi agent protocols mature. You will see more adoption of agent to agent patterns for composition across vendors and programming languages, with explicit contracts about capabilities and limits. Expect convergence toward a "USB-C of agents" standard for tools and capabilities.
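A compact sketch of the state and event pattern from the first two points above: a versioned state object that emits an event after each step. Field names are assumptions to adapt to your own audit schema.

```python
# Versioned agent state plus event-driven hand-off between steps.
from dataclasses import dataclass, field
import json
import time
import uuid

@dataclass
class AgentState:
    trace_id: str
    user_context: dict
    turn_history: list = field(default_factory=list)
    retrieved_evidence: list = field(default_factory=list)
    tool_outcomes: list = field(default_factory=list)
    version: int = 0

    def record(self, event_type: str, payload: dict) -> dict:
        """Append to state, bump the version, and emit an event for the next step."""
        self.version += 1
        event = {"trace_id": self.trace_id, "type": event_type,
                 "version": self.version, "ts": time.time(), "payload": payload}
        self.turn_history.append(event)
        print(json.dumps(event))   # in production this goes to the event bus and trace store
        return event

state = AgentState(trace_id=str(uuid.uuid4()), user_context={"user": "u-42"})
state.record("retrieval.completed", {"docs": 3})
state.record("tool.completed", {"tool": "policy_lookup", "status": "ok"})
```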
Build vs buy: a pragmatic playbook
Many teams are choosing between an Nvidia centric path and Google’s vertically integrated path. Both can work. The right choice depends on workloads, constraints, and talent. Use this checklist to make the call with eyes open.
Define the job your agents must do
- Interaction profile. Are you building real time assistants and copilots with tight latency budgets, or back office agents that can batch and optimize for throughput? If it is real time and global, bias toward hardware and runtimes that are tuned for first token speed and predictable tail behavior.
- Tool intensity. Count tools. An agent calling three to five systems per turn pays for network and serialization overhead at every step. Favor stacks with strong observability and retry primitives.
- Safety and audit. If you must prove that every tool call was screened, executed inside policy, and recorded, prioritize platforms that ship with centralized safety controls and asset inventories.
Model your unit economics early
- Break down a turn into model tokens, retrieval calls, and tool calls. Assign a latency and cost budget to each part. Identify which parts must be sub second and which can stream or be deferred.
- Use a capacity plan that includes the 95th percentile. Cost and latency blowups hide in tail behavior. Simulate bursty traffic rather than smooth averages.
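A first pass unit economics model can be a few lines of code. Every price and count below is an assumption to replace with your own contract rates and telemetry.

```python
# Rough cost model for one agent turn; all rates are assumed placeholders.
PRICE_PER_1K_INPUT_TOKENS = 0.00030   # assumed, USD
PRICE_PER_1K_OUTPUT_TOKENS = 0.00120  # assumed, USD
RETRIEVAL_CALL_COST = 0.0004          # assumed, USD per call
TOOL_CALL_COST = 0.0010               # assumed, USD per invocation (compute + egress)

def cost_per_turn(input_tokens, output_tokens, retrieval_calls, tool_calls):
    return (input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
            + retrieval_calls * RETRIEVAL_CALL_COST
            + tool_calls * TOOL_CALL_COST)

turn = cost_per_turn(input_tokens=6000, output_tokens=800,
                     retrieval_calls=1, tool_calls=4)

# Track cost per *resolved* task, not per turn: rework and escalations dominate.
turns_per_resolution = 3   # assumed average
rework_rate = 0.15         # assume 15% of resolutions need a second pass
cost_per_resolved = turn * turns_per_resolution * (1 + rework_rate)
print(f"per turn: ${turn:.4f}, per resolved task: ${cost_per_resolved:.4f}")
```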
Evaluate the Nvidia centric path
- Strengths. Breadth of ecosystem and partner support, wide availability across clouds and on premises, mature tooling for training, specialized libraries for high performance inference on graphics processing units.
- Risks and mitigations. Supply constraints at peak demand, greater integration work to stitch safety, observability, and orchestration, and potential variance across clouds. Mitigate with a small set of reference architectures, a hard standard for traces and metrics, and pre negotiated burst capacity.
- Best fit. Teams with deep platform expertise, heterogeneous workloads that must run across multiple vendors, or a long term plan to balance cloud and on premises for data gravity reasons.
Evaluate the Google vertically integrated path
- Strengths. Co designed accelerator, interconnect, runtime, and agent platform. One command deployment, opinionated dashboards, and built in safety controls reduce toil and time to value.
- Risks and mitigations. Tighter coupling to one vendor and a steeper learning curve if your team is steeped in graphics processing unit centric workflows. Mitigate with exit plans for specific components, standard interfaces for tools, and periodic game days that validate portability of state and logs.
- Best fit. Teams that prioritize sub second loops for user facing assistants, want safer defaults without a lot of custom glue, and value a single throat to choke for support.
Make the decision reversible
- Normalize interfaces. Wrap tools behind stable contracts. Keep agent state in a datastore you control. Log traces and metrics in an observability stack that can ingest from multiple runtimes.
- Pilot both paths with the same scenario. Measure first token time, p95 latency, tool failure recovery time, and total cost per resolved task. Choose based on data, not brand.
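The pilot scorecard can be as small as the sketch below. The figures are placeholders, not vendor benchmarks; the point is to compare both stacks on the same scenario and the same metrics.

```python
# Compare two pilot runs on the metrics listed above; lower is better for all four.
METRICS = ["first_token_ms", "p95_latency_ms", "recovery_s", "cost_per_task_usd"]

pilots = {
    "stack_a": {"first_token_ms": 240, "p95_latency_ms": 1100,
                "recovery_s": 4.0, "cost_per_task_usd": 0.021},
    "stack_b": {"first_token_ms": 190, "p95_latency_ms": 950,
                "recovery_s": 2.5, "cost_per_task_usd": 0.024},
}

for metric in METRICS:
    winner = min(pilots, key=lambda stack: pilots[stack][metric])
    row = ", ".join(f"{s}={pilots[s][metric]}" for s in pilots)
    print(f"{metric:>20}: {row}  -> {winner}")
```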
A concrete rollout plan for the next 60 days
Week 1 to 2: Define the loop
- Write a time budget for a single agent turn. Include planner, retrieval, model call, tool execution, and validation. Define the non negotiable limits and the places where you can stream.
- Map required tools and mock every external dependency. Instrument the mocks so you can measure without waiting for real services to be ready.
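Instrumented mocks make the loop measurable before any real service exists. A minimal sketch, with hypothetical tool names:

```python
# Every mocked dependency sleeps for a configurable latency and records it.
import functools
import random
import time

TIMINGS: dict[str, list[float]] = {}

def instrumented_mock(name: str, min_ms: float, max_ms: float):
    """Wrap a canned response in simulated latency and record the timing."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            time.sleep(random.uniform(min_ms, max_ms) / 1000)   # simulated latency
            result = func(*args, **kwargs)
            TIMINGS.setdefault(name, []).append((time.monotonic() - start) * 1000)
            return result
        return wrapper
    return decorator

@instrumented_mock("policy_lookup", min_ms=80, max_ms=200)
def policy_lookup(policy_id: str) -> dict:
    return {"policy_id": policy_id, "status": "active"}   # canned response

policy_lookup("P-100")
print({name: f"{sum(ms) / len(ms):.0f} ms avg" for name, ms in TIMINGS.items()})
```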
Week 3 to 4: Build with guardrails first
- Use the self heal plugin pattern to retry failed tool calls with a new plan or a confirmed fallback procedure. Start with two retries and a strict cut off.
- Enable safety screening on inputs, tool parameters, and outputs. Treat it like input validation in a web service, not an optional filter.
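Putting both items together, a minimal sketch of screen, retry twice, then fall back. The helper names are hypothetical stand-ins, not the Agent Builder plugin or Model Armor APIs.

```python
# Guardrails-first loop: screen parameters, bounded retries, known-safe fallback.
import time

class ToolError(Exception):
    pass

def screen(params: dict) -> dict:
    """Stand-in for input screening; reject obviously unsafe parameters."""
    if any("ignore previous instructions" in str(v).lower() for v in params.values()):
        raise ToolError("blocked by screening policy")
    return params

def call_tool(params: dict) -> dict:
    raise ToolError("upstream timeout")   # simulate a flaky dependency

def fallback(params: dict) -> dict:
    return {"status": "fallback", "detail": "queued for manual review"}

def guarded_call(params: dict, max_retries: int = 2, budget_ms: float = 1500) -> dict:
    clean = screen(params)                 # screening failures are not retried
    deadline = time.monotonic() + budget_ms / 1000
    for _ in range(max_retries + 1):
        if time.monotonic() >= deadline:
            break                          # strict cut off: never blow the latency budget
        try:
            return call_tool(clean)
        except ToolError:
            continue                       # a real plugin might replan with new parameters here
    return fallback(clean)

print(guarded_call({"claim_id": "123"}))
```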
Week 5 to 6: Deploy with observability and load tests
- Use one command deployment to move to a managed runtime. Wire up dashboards for first token latency, 95th percentile latency, token throughput, tool error codes, and retry counts.
- Run chaos tests. Kill a dependency for two minutes. Introduce high latency on a single tool. Confirm that the agent recovers and stays inside the time budget.
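A chaos test can be a few lines: inject latency into one mocked tool and check that p95 stays inside the budget. All numbers below are illustrative assumptions.

```python
# Inject artificial latency into one mocked tool and verify the turn budget holds.
import random
import statistics

random.seed(1)
TURN_BUDGET_MS = 1200
INJECTED_DELAY_MS = 300     # simulate a degraded dependency

def mock_tool(degraded: bool) -> float:
    base_ms = random.uniform(80, 150)
    return base_ms + (INJECTED_DELAY_MS if degraded else 0)

def run_turn(degraded: bool) -> float:
    planner_ms = random.uniform(40, 90)
    retrieval_ms = random.uniform(100, 180)
    tool_ms = mock_tool(degraded)
    model_ms = random.uniform(150, 300)
    return planner_ms + retrieval_ms + tool_ms + model_ms

latencies = [run_turn(degraded=True) for _ in range(500)]
p95 = statistics.quantiles(latencies, n=100)[94]
print(f"p95 under injected latency: {p95:.0f} ms")
assert p95 < TURN_BUDGET_MS, "agent does not stay inside the time budget when a tool degrades"
```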
Week 7 to 8: Harden for scale
- Move to event driven orchestration if you started with a monolithic server. Persist state between steps to allow safe restarts.
- Add regional failover. Keep the agent stateless where possible and rely on managed stores for memory and long term context.
What to watch as you scale
- First token time. This is the heartbeat of perceived speed. If you are above 200 milliseconds on average for your main user journeys, investigate streaming and context trimming.
- Tool call variance. Most production issues hide in external calls. Instrument retries and backoffs. A self healing agent must be predictable, not heroic.
- Safety drift. Every new tool is a new trust boundary. Treat Model Armor class screening policies as code. Version them and test them.
- Cost per resolved task. Do not optimize for token cost in isolation. Many teams save pennies on text and waste dollars on slow tools and rework.
The bottom line
The story is not just that a new chip is fast or that a framework has new buttons. The story is that hardware and software are finally moving together around the real job of modern agents. Ironwood gives you the headroom to keep loops tight without shedding safety. Agent Builder gives you the controls to deploy, observe, and fix agents like any other service. Together they signal an era where sub second loops and safer autonomy are not special features. They are the default. If you build with that assumption in 2025, you will be ready for the users and workloads that define 2026.
And that is the breakthrough worth paying attention to. It is the moment when production teams stop treating agents as experiments and start treating them as systems they can scale, reason about, and trust.
