<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Compute</title><link>https://cloud.google.com/blog/products/compute/</link><description>Compute</description><atom:link href="https://cloudblog.withgoogle.com/blog/products/compute/rss/" rel="self"></atom:link><language>en</language><lastBuildDate>Fri, 10 Apr 2026 15:53:35 +0000</lastBuildDate><image><url>https://cloud.google.com/blog/products/compute/static/blog/images/google.a51985becaa6.png</url><title>Compute</title><link>https://cloud.google.com/blog/products/compute/</link></image><item><title>A developer’s guide to architecting reliable GPU infrastructure at scale</title><link>https://cloud.google.com/blog/products/compute/a-guide-to-architecting-reliable-gpu-infrastructure/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;Editor’s note&lt;/strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;This blog post outlines Google Cloud’s GPU AI/ML infrastructure reliability strategy, and will be updated with links to new community articles as they appear.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As we enter the era of multi-trillion parameter models, computational power has transitioned from a utility to a mission-critical strategic asset. To meet relentless training demand, organizations are no longer just building clusters — they are engineering massive, integrated compute ecosystems comprising hundreds of thousands of high-performance accelerators that are interconnected with an ultra-high-bandwidth networking backplane. At this unprecedented scale, raw performance thrives when it is built upon a foundation of systemic resilience.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In "always-on" mission-critical environments, the statistical probability of hardware variance becomes a primary constraint for reliability. When thousands of GPUs are operating at peak utilization for months at a time, a 0.01% performance fluctuation can trigger a systemic failure. With the cost of training interruptions now measured in millions of dollars and weeks of lost progress, the industry's focus has shifted. The true frontier of training isn't just about the size of the cluster — it’s about the resilient system architecture that powers the next generation of AI workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The core challenge for the industry goes beyond simple hardware fixes; it requires the creation of holistic software and infrastructure frameworks designed to withstand the inevitable disruptions of massive-scale computing. In an environment where AI/ML infrastructure represents a major capital expenditure on a company's balance sheet, partnering with a cloud provider that places a premium on infrastructure reliability is paramount.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Operational realities of AI at scale&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The construction of a supercomputer utilizing hundreds of thousands of advanced GPUs involves significant operational complexity. Maintaining optimal utilization over several months to train a single large language model (LLM) subjects the hardware to high levels of sustained performance that exceed the design parameters of conventional data center equipment. The advent of rack-scale GPU architectures, such as the NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72, has shifted the landscape. Considerations now extend beyond individual machines to entire rack-scale domains: a single issue can impact multiple interconnected trays, requiring coordinated management to keep AI/ML workloads running without disruption.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The business implications of infrastructure instability&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For organizations at the forefront of AI innovation, infrastructure reliability poses a significant commercial risk with substantial economic consequences.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;High cost of failure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A single failure in a massive training job requires restarting from the last checkpoint, wiping out days or even weeks of progress. When infrastructure spend is a major capital expenditure, every failure carries a material cost.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Delayed time-to-market:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; In the fast-moving AI space, being first matters. Every day spent debugging hardware failures is a day of delay in releasing new models while competitors pull ahead. Reliability issues can directly slow down model iteration cycles, delaying product launches and feature updates.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Operational complexities:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Manually managing a large GPU cluster is a resource-intensive task. Companies come to the cloud to reduce the cost of managing the infrastructure. Without systemic reliability investments, operations teams can get overwhelmed by a constant stream of alerts, forced to play "whack-a-mole" to identify, isolate, and replace faulty nodes, leaving little time to plan for future capacity and model demands.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Expensive workarounds to mitigate failure impact:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To achieve a certain level of performance and &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity?e=48754805&amp;amp;_gl=1*9b6bxc*_ga*MjA0OTQyOTQyNi4xNzcyNzc2OTEw*_ga_WH2QY8WWF5*czE3NzI3NzY5MDkkbzEkZzEkdDE3NzI3NzczNzUkajU4JGwwJGgw"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Goodput&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, companies can end up buying 10-20% more hardware than they actually need, purely as a failure buffer.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Quantitative assessment: Key reliability metrics&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond traditional uptime measurements, the primary metrics Google Cloud uses to measure AI infrastructure health and stability are MTBI and Goodput; a short worked example follows the definitions below.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Mean Time Between Interruption (MTBI):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The average time a system runs before encountering an interruption. This includes instance terminations as well as every customer workload interruption that our systems can observe (for example, GPU XID errors).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Goodput:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The amount of useful computational work completed per unit time.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
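&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make these definitions concrete, here is a minimal Python sketch of how the two metrics relate for a single training run. It is purely illustrative: the run length, interruption count, and recovery cost are hypothetical numbers, not Google Cloud data or an API.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Illustrative only: hypothetical numbers, not a Google Cloud tool or API.
run_hours = 720.0                    # 30-day training window
interruptions = 6                    # observed workload interruptions
lost_hours_per_interruption = 4.0    # restart plus replay from last checkpoint

# MTBI: average run time between interruptions.
mtbi_hours = run_hours / interruptions

# Goodput: fraction of wall-clock time spent on useful training work.
lost_hours = interruptions * lost_hours_per_interruption
goodput = (run_hours - lost_hours) / run_hours

print(f"MTBI: {mtbi_hours:.0f} h, Goodput: {goodput:.1%}")  # MTBI: 120 h, Goodput: 96.7%
&lt;/pre&gt;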
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud’s methodology: Engineering systemic resilience&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The objective has shifted from expecting total hardware perfection to engineering systems that demonstrate inherent resilience. We understand that trust in our infrastructure begins with reliability. Our approach is based on four principles:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Proactive prevention:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We’ve integrated hardware validation, real-time telemetry, and automated remediation throughout the infrastructure lifecycle. This systemic shift from reactive troubleshooting to proactive management optimizes the reliability of mission-critical GPU systems at scale.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Continuous monitoring and intelligent detection:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We transform raw data into actionable insights by synthesizing multi-layered telemetry through automated analysis to proactively identify and resolve anomalies. This data-driven approach shifts our infrastructure from reactive maintenance to an intelligent, self-healing system that helps ensure continuous workload stability.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Transparency and control:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;We empower users with full visibility and control over GPU infrastructure health. We provide a comprehensive suite of observability metrics and direct tools, allowing customers to correlate hardware status with their workload Goodput and report faults. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Minimizing disruptions:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Our control plane integrates smart scheduling with predictive health signals to enable improved workload migration via maintenance notifications. If unexpected issues arise, customers can enable automated remediations and fast recovery mechanisms to initiate rapid restoration of service. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We explore these principles in depth in a comprehensive &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;technical deep dive series&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; on Google’s approach to AI/ML infrastructure reliability for Google Cloud GPUs. Check back here as we add links to each installment:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;a href="https://discuss.google.dev/t/proactive-prevention-inside-google-clouds-multi-layered-gpu-qualification-process/337742" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Proactive prevention: Inside Google Cloud's multi-layered GPU qualification process&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Transparency and control: Providing operational transparency and management tools to mitigate GPU workload impact (coming soon)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Continuous monitoring and intelligent detection: Using ML to predict and prevent GPU downtime (coming soon)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Minimizing disruptions: Smart scheduling and fast recovery systems for mission-critical GPU clusters (coming soon)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><pubDate>Thu, 09 Apr 2026 22:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/a-guide-to-architecting-reliable-gpu-infrastructure/</guid><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>A developer’s guide to architecting reliable GPU infrastructure at scale</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/a-guide-to-architecting-reliable-gpu-infrastructure/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Abhijith Prabhudev</name><title>Product Manager, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Abhay Ketkar</name><title>Senior Staff Software Engineer, Google</title><department></department><company></company></author></item><item><title>AI infrastructure efficiency: Ironwood TPUs deliver 3.7x carbon efficiency gains</title><link>https://cloud.google.com/blog/topics/systems/ironwood-tpus-deliver-37x-carbon-efficiency-gains/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;At Google, we are committed to being &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/sustainability/tpus-improved-carbon-efficiency-of-ai-workloads-by-3x?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;transparent about the environmental impact of our AI infrastructure&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, publishing metrics on the lifetime emissions of our chips — from manufacturing to powering these chips in the data center. Today, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;we are updating these metrics for our seventh-generation TPU, Ironwood, which demonstrates an approximately 3.7x improvement in Compute Carbon Intensity (CCI) compared to TPU v5p&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;the previous generation of performance-optimized TPUs&lt;/span&gt;.&lt;/span&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In other words, even as AI drives demand for additional compute resources, our ongoing work to optimize AI hardware is helping to reduce the energy consumption and emissions of AI workloads.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Measuring AI accelerator efficiency: Compute Carbon Intensity (CCI)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To help manage the environmental impact of AI workloads, we monitor the Compute Carbon Intensity (CCI) of our AI accelerator hardware. CCI is defined in &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=11097303" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;An Introduction to Life-Cycle Emissions of Artificial Intelligence Hardware&lt;/span&gt;&lt;/a&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;sup&gt; &lt;/sup&gt;as the estimated amount of CO2 equivalent emitted for every utilized floating-point operation (CO2e/FLOP). This metric provides a holistic, chip-level view by including both the embodied emissions associated with manufacturing, transportation, and data center construction (Scope 3), as well as the operational emissions associated with running these chips in data centers (Scope 1 and 2).&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Ironwood advantage: high performance, low footprint&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google’s TPU CCI continues to improve with each chip generation. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Drawing from empirical data measured in January 2026, Ironwood demonstrates a remarkable 3.7x &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;improvement&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; in CCI relative to TPU v5p. This accelerates efficiency gains from the 1.2x CCI improvement of TPU v5p relative to TPU v4, and demonstrates continued carbon efficiency optimization of Google’s performance-optimized TPU architecture.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These efficiency gains are driven by outsized compute performance increases between TPU generations relative to growth in machine energy consumption and manufacturing emissions.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; In fact, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;fleetwide measurements demonstrate a 5x improvement in utilized FLOPs across generations, from TPU v5p to Ironwood.&lt;/span&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt; Because the performance denominator in our CCI equation (CO2e/FLOP) is scaling faster than emissions, the net carbon cost per operation drops significantly with every new chip.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
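&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A quick back-of-the-envelope check, using only the figures above, shows how this works. Since CCI is emissions divided by utilized FLOPs, a 5x gain in utilized FLOPs combined with a 3.7x CCI improvement implies that lifetime emissions per machine grew only about 1.35x across the generation:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# CCI = lifetime CO2e / utilized FLOPs.
# Published figures: ~5x utilized FLOPs, ~3.7x CCI improvement (v5p to Ironwood).
flops_ratio = 5.0        # Ironwood / v5p utilized FLOPs (fleetwide)
cci_improvement = 3.7    # v5p CCI / Ironwood CCI

# emissions_ratio = flops_ratio * (CCI_new / CCI_old)
implied_emissions_ratio = flops_ratio / cci_improvement
print(f"Implied emissions growth: {implied_emissions_ratio:.2f}x")  # ~1.35x
&lt;/pre&gt;&lt;/div&gt;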
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Oan2vLj.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: center;"&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Figure 1: Ironwood’s accelerating CCI improvement measured on Google’s performance-optimized TPU cohort, considering January 2026 workloads.&lt;/span&gt;&lt;/span&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Operating Google’s TPU fleet more efficiently&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Updated TPU CCI metrics also offer a direct comparison to the measurement we published in 2025. Specifically, from October 2024 to January 2026, Google’s versatile TPU cohort ran more efficiently than we previously reported (a short calculation after the list backs out the prior figures):&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;TPU v5e achieved a 43% reduction in total CCI over 15 months, dropping to 228 gCO2e/EFLOP. This was driven by a 72% increase in average utilization.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium, the sixth-generation TPU, saw a 20% reduction in total CCI over the same time period, bringing its emissions intensity down to 125 gCO2e/EFLOP.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
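&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For readers who want to reconcile these numbers with the 2025 publication, the prior CCI values can be backed out directly from the stated reductions. This is simple arithmetic on the figures in this post, not newly published data:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Backing out the previously published CCI from the stated reductions
# (arithmetic on the figures above, in gCO2e/EFLOP).
v5e_now, v5e_reduction = 228.0, 0.43
trillium_now, trillium_reduction = 125.0, 0.20

v5e_before = v5e_now / (1.0 - v5e_reduction)                  # ~400 gCO2e/EFLOP
trillium_before = trillium_now / (1.0 - trillium_reduction)   # ~156 gCO2e/EFLOP

print(f"TPU v5e:  {v5e_before:.0f} down to {v5e_now:.0f} gCO2e/EFLOP")
print(f"Trillium: {trillium_before:.0f} down to {trillium_now:.0f} gCO2e/EFLOP")
&lt;/pre&gt;&lt;/div&gt;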
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_HRjRsFh.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: center;"&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Figure 2: Google’s versatile TPU cohort demonstrates deployment efficiency gains for the same TPU generations between October 2024 and January 2026.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;5&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span&gt;&lt;span style="vertical-align: baseline;"&gt;These results demonstrate that Google continues to improve the carbon-efficiency of our AI infrastructure. While the massive scale of AI demand requires a significant and growing amount of power, our innovations allow us to deliver substantially more compute performance for every unit of energy consumed.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Decoupling energy and emissions from performance&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To what can we attribute these improvements? Beyond Ironwood’s raw hardware capabilities, these CCI gains are further enabled by deep software and system-level optimizations across our infrastructure:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Software efficiency (MoE):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The widespread adoption of sparse architectures, such as Mixture of Experts (MoE), routes computation only to necessary parameters. This drastically reduces the active FLOPs required per inference or training step without sacrificing model capacity or quality (a rough arithmetic sketch of this effect follows the list).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Lower precision math (FP8):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By heavily leveraging 8-bit floating-point (FP8) formats, we effectively double compute throughput and halve memory bandwidth requirements compared to 16-bit formats. This lets us maintain output quality while substantially decreasing the energy cost per mathematical operation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Workload mix and intelligent scheduling:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Advanced fleet orchestration continuously balances the workload mix across our infrastructure. By intelligently scheduling tasks, we ensure high continuous utilization rates, optimize duty cycles, and minimize the carbon penalty of idle power draw.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
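&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The MoE effect in the first bullet is easy to quantify. The sketch below uses hypothetical parameter and expert counts, not those of any specific Google model, to show why routing cuts the active FLOPs per token:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Hypothetical MoE shape, not any specific model.
total_expert_params = 1.0e12   # parameters held in experts
experts_total = 64
experts_active = 4             # experts routed per token

# Only routed experts run per token (shared layers ignored here).
active_fraction = experts_active / experts_total
print(f"Active expert parameters per token: {total_expert_params * active_fraction:.1e} "
      f"({active_fraction:.1%} of capacity)")
&lt;/pre&gt;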
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Scale sustainably with Google Cloud&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI’s trajectory requires infrastructure that can scale exponentially without an equivalent surge in carbon emissions. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;The 3.7x carbon efficiency improvement from TPU v5p to Ironwood demonstrates that we can achieve greater compute density while minimizing the growth of our energy and environmental footprint through deliberate hardware and software codesign.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; To learn more and get started with Ironwood, register your interest with &lt;/span&gt;&lt;a href="https://cloud.google.com/resources/ironwood-tpu-interest?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;this form&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sub&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;1. Following the methodology published in an &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=11097303" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;August 2025 technical report&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;, we quantified the full lifecycle emissions of TPU hardware as a point-in-time snapshot across Google’s generations of TPUs as of January 2026. The functional unit for this study is one AI computer deployed in the data center, which includes one or more accelerator trays (containing TPUs) connected to one host tray (i.e., a computing server). Peripheral components beyond the tray (e.g., rack, shelf, and network equipment) and auxiliary computing and storage resources are excluded from the calculation of embodied and operational emissions. We include the electricity used in data center cooling in operational emissions. To estimate operational emissions from electricity consumption of running workloads, we used a one-month sample of observed machine power data from our entire TPU fleet, applying Google’s 2024 average fleetwide carbon intensity. To estimate embodied emissions from manufacturing, transportation, and retirement, we performed a life-cycle assessment of the hardware. Data center construction emissions were estimated based on Google’s disclosed 2024 carbon footprint. These findings do not represent model-level emissions, nor are they a complete quantification of Google’s AI emissions. Based on the TPU location of a specific workload, CCI results of specific workloads may vary.&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;2. The authors would like to thank and acknowledge the co-authors of this paper for their important contributions to enable these results: Ian Schneider, Hui Xu, Stephan Benecke, Parthasarathy Ranganathan, and Cooper Elsworth.&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;3. This comparison considers the utilized FLOPS (BF16) between deployed TPU v5p and Ironwood chips in Google’s fleet in January 2026. This trend is consistent with the improvement in peak FLOPS (BF16) between v5p (459 teraFLOPS) and Ironwood (2,307 teraFLOPS).&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;4. The GHG protocol offers two accounting standards for operational emissions. Results presented here consider market-based emissions, which include the impact of carbon-free energy purchases. Location-based accounting, which excludes carbon-free energy purchases, would raise operational CCI to 793, 712, and 195 gCO2e/EFLOP, respectively. The ratio of CCI improvements would be at a similar level, and Ironwood’s embodied CCI would drop from 23% to 8% of its total CCI.&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;5. 
To ensure a fair comparison across varying TPU utilizations, this analysis replicates the propensity score weighting methodology from the &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/iel8/40/11236092/11097303.pdf" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;August 2025 technical report&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; and compares January 2026 results to the results published in 2025. This statistical technique adjusts for duty cycle variations to balance the comparison of TPUs during a given time period. This empirical methodology results in small variations in calculated CCI between temporal periods, reflecting fluctuations in real-world energy consumption and hardware utilization across the global infrastructure. &lt;/span&gt;&lt;/sub&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 06 Apr 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/ironwood-tpus-deliver-37x-carbon-efficiency-gains/</guid><category>Compute</category><category>Sustainability</category><category>TPUs</category><category>Systems</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>AI infrastructure efficiency: Ironwood TPUs deliver 3.7x carbon efficiency gains</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/ironwood-tpus-deliver-37x-carbon-efficiency-gains/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Keguo (Tim) Huang</name><title>Senior Data Scientist, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>David Patterson</name><title>Google Distinguished Engineer, Google</title><department></department><company></company></author></item><item><title>A developer’s guide to training with Ironwood TPUs</title><link>https://cloud.google.com/blog/products/compute/training-large-models-on-ironwood-tpus/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The transition toward trillion-parameter AI models has created an exponential demand for computational resources, testing the limits of traditional infrastructure. The seventh-generation Ironwood TPU features Google’s custom-designed AI infrastructure: It is engineered to scale as a holistic system supporting pods of up to 9,216 chips by combining Inter-Chip Interconnect (ICI), Optical Circuit Switch (OCS), Data Center Network (DCN) and massive aggregated High Bandwidth Memory (HBM) capacity. In addition, Ironwood features an integrated &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;co-design&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; between hardware architecture and software, introducing innovations such as compiler-centric XLA and Python-native kernels via &lt;/span&gt;&lt;a href="https://docs.jax.dev/en/latest/pallas/index.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Pallas&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. 
Together, these features significantly scale organizations’ capacity to train and serve sophisticated frontier models, optimizing the entire AI lifecycle and enabling sustained high performance.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_YpVMWLp.max-1000x1000.jpg"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This technical overview explores the specific methods and tools within the JAX and MaxText ecosystems designed to refine training efficiency and reach peak performance on Ironwood hardware.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Key optimization strategies for Ironwood&lt;/span&gt;&lt;/h2&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;1. Leverage native FP8 with MaxText&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ironwood is the first TPU generation with native 8-bit floating point (FP8) support in its Matrix Multiply Units (MXUs). By utilizing FP8 precision for weights, activations, and gradients, users can theoretically double throughput compared to Brain Floating Point 16 (BF16). When FP8 recipes are configured correctly, increased efficiency is achievable without compromising model quality. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To implement these FP8 training recipes, users can start with the &lt;/span&gt;&lt;a href="https://github.com/google/qwix" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Qwix&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; library. This functionality is enabled by specifying the relevant flags within the MaxText configuration.  &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;See our blog post, &lt;/span&gt;&lt;a href="https://discuss.google.dev/t/inside-the-optimization-of-fp8-training-on-ironwood/336681" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inside the optimization of FP8 training on Ironwood&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, in the Google Developer forums for more details.&lt;/span&gt;&lt;/p&gt;
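&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Since the exact MaxText flag names vary by release, the sketch below illustrates only the underlying storage math in plain JAX (assuming a recent JAX build with float8 dtypes): casting a BF16 weight matrix to FP8 halves its footprint. Real FP8 training uses scaled quantization recipes, such as those in Qwix, rather than a bare cast.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import jax.numpy as jnp

# Same weight matrix in BF16 and FP8 (e4m3); illustration only,
# not a production FP8 training recipe.
w_bf16 = jnp.ones((4096, 4096), dtype=jnp.bfloat16)
w_fp8 = w_bf16.astype(jnp.float8_e4m3fn)

print(w_bf16.nbytes / 2**20, "MiB in BF16")  # 32.0 MiB
print(w_fp8.nbytes / 2**20, "MiB in FP8")    # 16.0 MiB
&lt;/pre&gt;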
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;2. Accelerate with Tokamax kernels&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://github.com/openxla/tokamax/tree/main" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Tokamax&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is a library of high-performance JAX kernels optimized for TPUs. These kernels are designed to mitigate specific bottlenecks through the following mechanisms:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Splash Attention&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This mechanism addresses the I/O limitations inherent in standard attention processes. By maintaining computations within on-chip SRAM, it is particularly effective for processing long context lengths where memory bandwidth typically becomes a constraint. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Megablox Grouped Matrix Multiplication (GMM)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This manages the “ragged” tensors (data structures with inconsistent row lengths that typically create hardware idle time) often found in Mixture of Experts (MoE) models. By utilizing GMM, the system avoids inefficient padding and ensures higher utilization of the MXU (a padding-waste sketch follows this list).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Kernel tuning&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The Tokamax library includes &lt;/span&gt;&lt;a href="https://github.com/openxla/tokamax/blob/main/tokamax/experimental/utils/tuning/tpu/README.md" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;utilities&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for hyperparameter optimization. These tools allow for the adjustment of tile sizes and other configurations to align with the specific memory hierarchy of the Ironwood TPU.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
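&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To see why ragged expert batches matter, consider the padding arithmetic below. The group sizes are hypothetical and the snippet says nothing about Tokamax internals; it only quantifies the waste a grouped matmul avoids:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Hypothetical ragged routing result: tokens assigned to each expert.
tokens_per_expert = [970, 430, 55, 610]
pad_to = max(tokens_per_expert)   # naive path pads every group to the max

useful = sum(tokens_per_expert)
padded = pad_to * len(tokens_per_expert)
print(f"MXU work wasted on padding: {1 - useful / padded:.0%}")  # ~47%
# A grouped matmul runs each expert at its true size, avoiding this waste.
&lt;/pre&gt;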
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;3. Offload collectives to SparseCore&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The fourth-generation SparseCores in Ironwood are processors specifically designed to manage irregular memory access patterns. By using specific &lt;/span&gt;&lt;a href="https://github.com/AI-Hypercomputer/maxtext/blob/c0abc4c0c0a98e02413d7b6c669927d013467045/benchmarks/xla_flags_library.py#L70-L116" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;XLA flags&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, users can offload collective communication operations—such as &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;All-Gather&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Reduce-Scatter&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;—directly to the SparseCore.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This offloading mechanism allows the TensorCores to remain dedicated to primary model computations while communication tasks execute in parallel. This functional overlap is a critical strategy for hiding communication latency and ensuring consistent data throughput to the MXUs.&lt;/span&gt;&lt;/p&gt;
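&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;On TPU, XLA flags of this kind are typically passed through the LIBTPU_INIT_ARGS environment variable before JAX initializes. The sketch below shows the mechanism only; the flag string is a placeholder, since the actual SparseCore offload flags should be copied from the linked MaxText file for your libtpu release:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import os

# Set XLA/libtpu flags before JAX touches the TPU. The flag below is a
# PLACEHOLDER: copy the real SparseCore offload flags from the linked
# xla_flags_library.py for your release.
os.environ["LIBTPU_INIT_ARGS"] = (
    os.environ.get("LIBTPU_INIT_ARGS", "") + " --placeholder_sparsecore_flag"
)

import jax  # import only after the environment is configured
&lt;/pre&gt;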
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;4. Fine-tune the memory pipeline on VMEM&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;VMEM, a critical part of the TPU memory architecture, is a fast on-chip SRAM that is designed to optimize kernel performance. You can improve overall execution speed by tuning the allocation of VMEM between the current operation and future weight prefetch. For example, increasing the VMEM reserved for the current scope allows larger kernel tile sizes, which can improve kernel performance by removing potential memory stalls.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Refer to &lt;/span&gt;&lt;a href="https://docs.jax.dev/en/latest/pallas/tpu/pipelining.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;TPU Pipelining&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for more on TPU memory architecture.&lt;/span&gt;&lt;/p&gt;
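&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a concrete but version-dependent example, the scoped VMEM budget is commonly adjusted through the xla_tpu_scoped_vmem_limit_kib flag; treat the name and value below as a starting point to verify against your XLA release rather than a guaranteed interface:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import os

# Raise the VMEM reserved for the current scope (value in KiB).
# Verify this flag against your XLA/libtpu version before relying on it.
os.environ["LIBTPU_INIT_ARGS"] = (
    os.environ.get("LIBTPU_INIT_ARGS", "") +
    " --xla_tpu_scoped_vmem_limit_kib=65536"
)
&lt;/pre&gt;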
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;5. Choose optimal sharding strategies&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Lastly, MaxText supports various parallelism techniques, all available on TPUs. The best choice depends on model size, architecture (dense vs. MoE), and sequence length. Selecting a proper sharding strategy can improve model performance (a minimal sketch follows the list):&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Fully Sharded Data Parallelism (FSDP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This is the preferred strategy for training large models that exceed the memory capacity of a single chip. FSDP shards model weights, gradients, and optimizer states across multiple chips. Increasing the per-device batch size and introducing more compute can hide the latency of the All-Gather operations and improve efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tensor Parallelism (TP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Shards individual tensors. Given Ironwood's high arithmetic intensity, TP is most effective for very large model dimensions. Leveraging TP with a dimension of 2 can take advantage of the fast die-to-die interconnect on Ironwood's dual-chiplet design.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Expert Parallelism (EP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Helpful for MoE models to distribute experts across devices.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Context Parallelism (CP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Necessary for very long sequences, sharding activations along the sequence dimension.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hybrid approaches&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Combining strategies is often required to balance compute, memory, and communication on large-scale runs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
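&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here is the promised minimal JAX sketch of a hybrid mesh: 8-way FSDP combined with the TP=2 layout that maps onto Ironwood's dual-chiplet die-to-die link. The axis names, mesh shape, and shardings are illustrative; MaxText drives all of this from its own configuration:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assumes a 16-chip slice: 8-way FSDP x 2-way tensor parallelism.
devices = mesh_utils.create_device_mesh((8, 2))
mesh = Mesh(devices, axis_names=("fsdp", "tensor"))

# FSDP shards weights along the fsdp axis; TP splits the output features.
w_sharding = NamedSharding(mesh, P("fsdp", "tensor"))
x_sharding = NamedSharding(mesh, P("fsdp", None))
&lt;/pre&gt;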
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;See the &lt;/span&gt;&lt;a href="https://discuss.google.dev/t/optimizing-frontier-model-training-on-tpu-v7x-ironwood/336983/2" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Optimizing Frontier Model Training on TPU v7x Ironwood&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; post in the Developer forums for more detail on techniques 2-5 above.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong style="vertical-align: baseline;"&gt;The Ironwood advantage: System-level performance&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These optimization techniques, coupled with Ironwood's architectural strengths like the high-speed 3D Torus Inter-Chip Interconnect (ICI) and massive HBM capacity, create a highly performant platform for training frontier models. The tight co-design across hardware, compilers (XLA), and frameworks (JAX, MaxText) ensures you can extract maximum performance from your AI infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to accelerate your AI journey? Explore the resources below to dive deeper into each optimization method.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Further reading&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;a href="https://discuss.google.dev/t/inside-the-optimization-of-fp8-training-on-ironwood/336681" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inside the optimization of FP8 training on Ironwood&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://discuss.google.dev/t/optimizing-frontier-model-training-on-tpu-v7x-ironwood/336983/2" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Optimizing Frontier Model Training on TPU v7x Ironwood&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sub&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;A special thanks to &lt;/span&gt;&lt;em&gt;&lt;span data-rich-links='{"per_n":"Hina Jajoo","per_e":"hjajoo@google.com","type":"person"}' style="vertical-align: baseline;"&gt;Hina Jajoo&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;span data-rich-links='{"per_n":"Amanda Liang","per_e":"amandaliang@google.com","type":"person"}' style="vertical-align: baseline;"&gt;Amanda Liang&lt;/span&gt;&lt;/em&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; for their contributions to this blog post.&lt;/span&gt;&lt;/sub&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 23 Mar 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/training-large-models-on-ironwood-tpus/</guid><category>AI &amp; Machine Learning</category><category>TPUs</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>A developer’s guide to training with Ironwood TPUs</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/training-large-models-on-ironwood-tpus/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Lillian Yu</name><title>Product Strategy &amp; Operations</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Liat Berry</name><title>Product Manager, Google TPUs</title><department></department><company></company></author></item><item><title>Google Cloud and NVIDIA expand AI innovation across industries at GTC 2026</title><link>https://cloud.google.com/blog/products/compute/google-cloud-ai-infrastructure-at-nvidia-gtc-2026/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The era of agentic AI is fundamentally changing enterprise infrastructure needs. As organizations build systems capable of dynamic reasoning and autonomous execution, the underlying infrastructure must evolve as well. Scaling these agentic workloads alongside massive mixture-of-experts (MoE) architectures demands a deeply optimized co-engineered stack.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To meet these demands, we’ve built the Google Cloud AI Hypercomputer, an AI-optimized infrastructure-as-a-service offering that integrates performance-optimized hardware, leading software, open frameworks, and flexible consumption models into a single, cohesive system to deliver ultra-low-latency, high-throughput, and cost-effective inference. To give our customers even more options within this integrated architecture, we are expanding our partnership with NVIDIA.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This week at NVIDIA GTC 2026, Google Cloud and NVIDIA are expanding our partnership with a wave of new announcements, showcasing a co-engineered AI infrastructure foundation:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure and hardware&lt;/strong&gt;&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Strong momentum for Google Cloud G4 VMs, powered by NVIDIA RTX PRO&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; 6000 Blackwell Server Edition&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Preview of flexible, fractional G4 VMs using NVIDIA vGPU technology — a first in the industry for NVIDIA RTX PRO&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; 6000 Blackwell Server Edition&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Upcoming support for NVIDIA Vera Rubin NVL72 Platform&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Software and platform&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;NVIDIA Dynamo integration with GKE Inference Gateway&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Enhanced NVIDIA support across Vertex AI Training and Model Garden&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Ecosystem&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Kaggle competition for NVIDIA Nemotron on G4 VMs&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Launch of a dedicated public sector AI startup accelerator program&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let’s take a closer look at the announcements.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Accelerating AI workloads with G4 VMs&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;G4 VMs, powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, are built to power a diverse spectrum of high-performance workloads — from advanced spatial computing to complete AI development lifecycles. For instance, companies like Otto Group &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;One.O &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;and WPP use the G4 to run physically accurate simulations and real-time 3D rendering at scale.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond simulation, the G4 also shines in model fine-tuning and inference, particularly for models ranging from 30B to more than 100B parameters. By leveraging 4-bit floating point (FP4) precision and Google’s peer-to-peer (P2P) communication, customers are achieving &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/g4-vms-p2p-fabric-boosts-multi-gpu-workloads"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;higher throughput for model serving and considerable latency reductions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, enabling a new class of real-time, multimodal AI agents and highly responsive generative AI applications.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here are some examples of how customers are already leveraging the performance and efficiency of G4 VMs to accelerate their most demanding workloads:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Google Cloud’s G4 VMs give us the scalable GPU backbone we need to push billions of miles of photorealistic simulation through our pipeline. The 4x lift in throughput means our ML teams can iterate faster, train on richer data, and validate edge cases long before our models ever see the real world.&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;– Sony Mohapatra, Director, AI/ML Engineering, General Motors&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="vertical-align: baseline;"&gt;“&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Now with G4 VMs powered by NVIDIA Blackwell, we're pushing our multimodal models even further — faster inference, better reliability, instant replies across languages. The goal stays the same: making voice agents that work at enterprise scale without compromise. We are excited to keep building together and see what our customers deploy with this.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;– Mati Staniszewski, Cofounder, ElevenLabs&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Google Cloud G4 VMs provide the computational backbone for our Robotic Coordination Layer, allowing us to synchronize autonomous fleets across our logistics centers with millisecond precision. By simulating complex warehouse environments in a high-fidelity digital twin, we can optimize our entire supply chain virtually before a single robot moves on the floor.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; – &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Dr. Stefan Borsutzky, CEO of Otto Group One.O&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“After transitioning to G4 VMs, we achieved a 50% reduction in processing latency and 6x increase in throughput just by updating our Terraform scripts. It’s rare to get that kind of performance boost for our core workloads without adding any operational overhead.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; – Alfonso Acosta, Head of Engineering, Imgix&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Introducing fractional G4 VMs &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are excited to announce the preview of fractional G4 VMs, providing a highly efficient and cost-effective entry point for AI and graphics workloads. These new configurations, using NVIDIA virtual GPU (vGPU) technology, allow you to leverage the power of the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs in flexible, smaller increments, so you can right-size your infrastructure to match the specific demands of your applications.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;“Enterprises need unprecedented flexibility to scale complex, agentic AI workloads. With Google Cloud, we’re introducing fractional G4 VMs powered by NVIDIA RTX PRO 6000 to let customers right‑size GPU capacity and maximize ROI. Together with our co‑engineered stack – from NVIDIA NeMo on Vertex AI to NVIDIA Dynamo with GKE – we’re delivering an open, high‑performance platform for next‑generation reasoning and MoE models.” &lt;/span&gt;&lt;/em&gt;&lt;span style="vertical-align: baseline;"&gt;– Ian Buck, VP / General Manager, Hyperscale and HPC, NVIDIA&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By providing more granular access to advanced hardware, fractional G4 VMs let you optimize resource allocation and reduce overhead without sacrificing performance. You can now select from additional GPU slice sizes for your specific needs:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;1/2 GPU:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Ideal for more intensive tasks such as LLM inference, robotics sensor simulation, and high-fidelity 3D rendering.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;1/4 GPU:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Optimized for mainstream workloads, including mid-range creative design, video transcoding, and real-time data visualization.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;1/8 GPU:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Great for lightweight applications such as remote desktops, productivity tools, and entry-level streaming services.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
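&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For rough capacity planning, the sketch below divides the RTX PRO 6000 Blackwell’s 96 GB of GPU memory proportionally across the slice sizes above. This is back-of-the-envelope math only; the memory actually presented to each slice is determined by the NVIDIA vGPU profile and may differ from a straight proportional split.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Back-of-the-envelope sizing for fractional G4 slices. Assumes the card's
# 96 GB of GPU memory is split proportionally across slices; the actual
# per-slice memory is set by the NVIDIA vGPU profile and may differ.
FULL_GPU_MEMORY_GB = 96

SLICE_FRACTIONS = {
    "1/2 GPU (LLM inference, sensor simulation, 3D rendering)": 0.5,
    "1/4 GPU (creative design, transcoding, visualization)": 0.25,
    "1/8 GPU (remote desktops, productivity apps, streaming)": 0.125,
}

for label, fraction in SLICE_FRACTIONS.items():
    print(f"{label}: ~{FULL_GPU_MEMORY_GB * fraction:.0f} GB GPU memory")
&lt;/code&gt;&lt;/pre&gt;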
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These flexible G4 size portfolio let you:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Right-size infrastructure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Precisely match GPU capacity to application demands, ranging from lightweight remote desktops to intensive data processing.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Maximize cost efficiency:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Lower operational overhead by utilizing — and paying for — only the fractional GPU resources you need for specific tasks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Scale diverse workloads:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Power a broad spectrum of innovation, from high-fidelity creative design and streaming to complex robotics simulations and real-time inference.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These fractional G4 VMs can be managed by Google Kubernetes Engine (GKE), allowing developers to use advanced container binpacking to achieve even higher price-performance and resource utilization. When managed through Dynamic Workload Scheduler, you can set fallback priorities for fractional slices. This significantly improves obtainability by allowing the scheduler to automatically find available GPU configurations for each workload.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;The G4 vGPU’s flexible sizing allows us to precisely tailor compute resources to the scale of each molecular simulation, ensuring maximum efficiency across our drug discovery pipeline. This granular control means our researchers can seamlessly pivot between smaller workflows and massive parallel processing without being constrained by fixed hardware configurations.&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;– Shane Brauner, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;EVP, CIO, Schrödinger&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Scaling AI Hypercomputer with NVIDIA Vera Rubin NVL72&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on our deep engineering partnership with NVIDIA, we’re proud to support the successor to NVIDIA Blackwell architecture, the recently announced NVIDIA Vera Rubin platform. We plan to be among the first cloud providers to offer NVIDIA Vera Rubin NVL72 rack-scale systems in the second half of 2026, integrating them into our AI Hypercomputer architecture to empower the next generation of reasoning and agentic AI. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Delivering efficiency across the AI infrastructure stack &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As part of our commitment to a fully open ecosystem, we are excited to announce the integration of Dynamo and GKE &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This integration provides a modular, open-source control plane across the application layer and the hardware. By combining Dynamo with Inference Gateway on GKE, teams can tailor their infrastructure to their exact needs, allowing them to extract the maximum ROI from accelerators, accelerate time-to-market for new AI models, and future-proof their deployments.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can learn to maximize performance for massive MoE architectures through new &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/scaling-moe-inference-with-nvidia-dynamo-on-google-cloud-a4x?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;advanced scaling recipes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for A4X VMs (powered by NVIDIA GB200 NVL72 and Dynamo). These configurations show how to overcome memory and interconnect bottlenecks when running AI inference workloads on AI Hypercomputer.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are also enhancing resource obtainability through the Dynamic Workload Scheduler, with Calendar Mode and Flex Start for A4X and A4X Max (powered by NVIDIA GB300 NVL72), as well as new Flex Start support for G4 VMs. Dynamic Workload Scheduler lets you reserve the precise capacity that you need, or use flexible start windows. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Snap, a long-time Google Cloud customer, achieved significant cost savings by migrating two of its primary data processing pipelines to Google Cloud G2 VMs powered by NVIDIA L4 Tensor Core GPUs. This was made possible by leveraging Spark on GKE alongside NVIDIA’s new cuDF libraries, which automated the optimization of its shuffle-heavy workloads for optimal GPU efficiency. &lt;/span&gt;&lt;a href="https://www.nvidia.com/gtc/session-catalog/sessions/gtc26-s81678/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Learn more at GTC session S81678.&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Advancing Vertex AI training and Model Garden &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are meeting the demands of next-generation AI with two major infrastructure advancements to &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/docs/training/training-clusters/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI training clusters&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. First, support for &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;A4X VM domains&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; lets you leverage Vertex AI’s managed infrastructure and framework capabilities for massive-scale training on &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA GB200 NVL72 &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;rack-scale systems. To ensure these intensive workloads remain uninterrupted, new hardware resiliency capabilities let you apply configurable, proactive fault detection scans, which identify and mitigate potential hardware issues before they can disrupt critical “hero” training runs. These capabilities enable higher goodput and helps ensure that multi-week training jobs stay on track without costly restarts.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“We are setting a new standard for the agentic enterprise — delivering highly capable, consistent, accurate, and responsive AI agents with Google and NVIDIA. By leveraging Vertex AI training clusters on &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;NVIDIA GB200 NVL72&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; to power our Agentforce 360 Platform, we’ve eliminated infrastructure bottlenecks to keep our GPUs fully saturated. This high-performance, resilient architecture allows our researchers to focus on innovation at scale, driving substantial gains for our most complex reasoning workloads.” - &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Silvio Savarese, Chief Scientist, Salesforce&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At the same time, we continue to broaden Vertex AI Model Garden with support for &lt;/span&gt;&lt;a href="https://console.cloud.google.com/vertex-ai/publishers/nvidia/model-garden/nemotron-3-super" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NVIDIA’s Nemotron 3&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; family of open models. These include the Nemotron 3 Nano, featuring one-click deployment to simplify integration into private VPCs. We’ve also expanded our catalog to include the NVIDIA Nemotron 3 Super 120B model for immediate access to high-performance, large-scale reasoning. To maximize the value of these models, we’ve integrated NVIDIA’s latest performance libraries directly into Vertex AI to optimize popular open-source models on NVIDIA TensorRT-LLM. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To enable the community to get hands-on with NVIDIA Nemotron on Google Cloud, we &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;are also launching the NVIDIA Nemotron model reasoning challenge on Kaggle, powered by &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;G4 VMs. The competition invites the community to improve Nemotron 3 Nano’s reasoning accuracy on a new benchmark using techniques such as prompting, synthetic data generation, data curation, and fine-tuning – all running on cost-efficient G4 infrastructure so participants can iterate quickly and share their methods with the broader ecosystem. To learn more and register, &lt;/span&gt;&lt;a href="https://www.kaggle.com/competitions/nvidia-nemotron-model-reasoning-challenge" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;visit the Kaggle competition page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Empowering public sector AI startups &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To foster continued innovation within the ecosystem, Google Public Sector and NVIDIA are launching an AI startup accelerator program. This year-long initiative will support a select cohort of AI-focused Independent Software Vendors (ISVs) building solutions for the public sector.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Participants gain dual access to both NVIDIA Inception and Google Cloud’s ISV accelerator resources. Kicking off at GTC and continuing through Google Cloud Next, this joint program will equip emerging technology leaders with the co-engineered infrastructure, technical guidance, and go-to-market support required to scale mission-critical public sector applications. To learn more about the program, please complete the &lt;/span&gt;&lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSci71lEfkHJKb9wVN2UmXVGaOk3DeB84mW5dve8ulo9kl60pg/viewform" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;interest form&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Additional cohorts will be selected and announced in the future.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Co-engineering collaboration powers every layer of the AI stack&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The transition to complex, agentic AI demands more than just raw compute. It requires a fully optimized, co-engineered stack. By integrating flexible hardware like fractional G4 instances and the upcoming Vera Rubin platform into our AI Hypercomputer architecture, and pairing it with deep software co-engineering, we provide the scale, resilience, and efficiency you need to turn your most ambitious AI visions into reality.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Coming to GTC? Stop by booth #513 to learn more and talk to our team. And you can always learn more about our collaboration with NVIDIA at &lt;/span&gt;&lt;a href="http://cloud.google.com/NVIDIA"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;cloud.google.com/NVIDIA&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 16 Mar 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/google-cloud-ai-infrastructure-at-nvidia-gtc-2026/</guid><category>AI &amp; Machine Learning</category><category>Partners</category><category>Compute</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Google_Cloud_NVIDIA_Hero_Image_for_GTC26_Blo.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Google Cloud and NVIDIA expand AI innovation across industries at GTC 2026</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Google_Cloud_NVIDIA_Hero_Image_for_GTC26_Blo.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/google-cloud-ai-infrastructure-at-nvidia-gtc-2026/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Lohmeyer</name><title>VP and GM, AI and Computing Infrastructure</title><department></department><company></company></author></item><item><title>H4D VMs, now GA, deliver exceptional performance and scaling for HPC workloads</title><link>https://cloud.google.com/blog/products/compute/h4d-vms-now-ga/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we’re announcing  the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;general availability of H4D VMs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, our latest high performance computing (HPC)-optimized VM, powered by the 5th Generation AMD EPYC&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;™ processors&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. H4D VMs deliver exceptional performance, scalability, and value for industries like manufacturing, health care and life sciences, weather forecasting, and electronic design automation (EDA).&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; H4D supports orchestration via Cluster Toolkit with Slurm and via Google Kubernetes Engine (GKE). Each approach allows for near-instant deployment and scaling of demanding workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For the first time, the Google Cloud CPU portfolio features a VM family with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;C&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;loud Remote Direct Memory Access (RDMA).&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;H4D’s RDMA is on the &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium network adapter&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and lets you scale single-node H4D performance to multiple nodes, accelerating large production workloads. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Faster time to solution across domains and scales&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Powered by the high core density of the 5th Gen AMD EPYC CPU and Google’s innovative, low-latency &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/introducing-falcon-a-reliable-low-latency-hardware-transport"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Falcon hardware transport&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; H4D VMs enable you to iterate and discover faster than ever before.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We demonstrated H4D performance through a series of industry-standard benchmarks, showing its capabilities across diverse domains and problem sizes.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Healthcare and life sciences&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For researchers in healthcare and life sciences (HCLS), H4D VMs accelerate complex molecular simulations critical to scientific discovery. Compared to our previous C2D VMs, H4D VMs deliver up to a 4.3X speedup running LAMMPs (LJ benchmark) at 96 VMs, delivering 95% parallel efficiency on 18k cores. For drug discovery, we demonstrated a 5.8X speed-up using GROMACS (water_33m) at 32 VMs delivering 72% parallel efficiency on 6k cores. H4D also delivers further scalability, which we demonstrated by running the LAMMPS LJ benchmark on 192 VMs (&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;~37k cores) while maintaining 92% parallel efficiency (see Figure 3).&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_JTLuwUW.max-1000x1000.jpg"
        
          alt="1-Figuer1&amp;amp;2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--medium
      
      
        h-c-grid__col
        
        h-c-grid__col--4 h-c-grid__col--offset-4
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_RA1vjLg.jpg"
        
          alt="2-Figuer3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Manufacturing&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For manufacturing, H4D VMs help engineers shorten design cycles, run larger simulations, and iterate faster by delivering a strong performance boost for mission-critical Computer-Aided Engineering (CAE) workflows. Compared to our previous C2D VMs when running complex Computational Fluid Dynamics (CFD) simulations, H4D VMs deliver a 4.1X speedup running Ansys Fluent (F1_RaceCar_140m benchmark) on 32 VMs with 85% parallel efficiency. When running open-source OpenFOAM  (Motorbike_100m), we demonstrated a 5.2X speedup over C2D using 16 VMs and achieving superlinear parallel efficiency of 122%.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/3_9YSJuty.jpg"
        
          alt="3-Figuer4&amp;amp;5"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;A new standard for HPC price/performance&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;H4D VMs are designed to deliver the best price-performance for HPC workloads on Google Cloud by pairing superior performance with flexible consumption models. H4D supports Dynamic Workload Scheduler (DWS), which adapts to your workflow with Flex Start mode for just-in-time capacity and Calendar mode for guaranteed reservations. This allows you to access compute for as low as 3 cents per core-hour without long-term commitments. The resulting performance and cost efficiencies over previous generation VMs are detailed in Figures 6 and 7. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/4_VFxG3YM.jpg"
        
          alt="4-Figuer6"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/5_FKrLh4Z.jpg"
        
          alt="5-Figuer7"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Comprehensive HPC management&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To manage and deploy large, dense clusters of H4D VMs, you can leverage Google Cloud’s &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/cluster-capabilities"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Director&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which offers advanced maintenance capabilities (you can sign up for the preview &lt;/span&gt;&lt;a href="https://forms.gle/dppWNms5DF44gCwV9" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) alongside the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/cluster-toolkit/docs/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Toolkit&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for rapid cluster deployment  via turnkey system blueprints. For job and workload management, H4D VMs integrate with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/batch/docs/get-started"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Batch&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Google Cloud’s fully managed, cloud-native service that handles queuing, scheduling, and resource provisioning. Additionally, there’s support for &lt;/span&gt;&lt;a href="https://cloud.google.com/products/dws/pricing?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DWS&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which can be used in both Calendar mode for future reservations and Flex Start mode for time-limited, on-demand usage.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;What customers and partners are saying&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/jump.max-1000x1000.jpg"
        
          alt="jump"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;i&gt;“We were able to test the H4D platform in early access at&lt;/i&gt; &lt;a href="https://www.jumptrading.com/"&gt;&lt;i&gt;Jump Trading&lt;/i&gt;&lt;/a&gt;&lt;i&gt;, and were extremely impressed with the results. The successful testing process demonstrated that H4D offers the performance, stability, and efficiency we require for demanding, high-volume operations. We see up to 50% better price/performance compared to prior generation machines and are now accelerating integration with our critical grid workloads on Google Cloud."&lt;/i&gt; &lt;b&gt;- Alex Davies, Chief Technology Officer &amp;amp; Benjamin Stromski, HPC Linux Engineering, Jump Trading&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/hmx_labs.max-1000x1000.jpg"
        
          alt="hmx labs"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;i&gt;“There lingers, especially in large-scale and compute-intensive domains, the idea that the fastest systems can only be built on premises and run on bare metal hardware. Terms such as ‘hypervisor tax” are often thrown around as justification for operating with bare metal. Our testing paints a different picture. The Google H4D VM performs better on our financial risk benchmark than the bare metal top of stack AMD CPU of the same generation."&lt;/i&gt; &lt;b&gt;- Hamza Mian/CEO, HMxLabs&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/totalcare.max-1000x1000.jpg"
        
          alt="totalcare"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;i&gt;"As a leading provider of managed HPC solutions for the demanding CAE and manufacturing sectors, our evaluation of the H4D platform was focused heavily on its ability to handle our clients' largest, most tightly-coupled simulation workloads. We are extremely impressed with the results. The testing confirmed that the underlying RDMA fabric exhibits the outstanding low-latency and high-bandwidth performance required for massive parallel processing. This level of interconnect efficiency is non-negotiable for speeding up critical manufacturing simulations like crash testing and CFD. H4D has proven itself to be a true accelerator for high-throughput engineering workloads, and we are excited about its potential to redefine the performance ceiling for HPC in the engineering world."&lt;/i&gt; &lt;b&gt;- Rodney Mach/President, TotalCAE&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Google.max-1000x1000.jpg"
        
          alt="Google"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;b&gt;&lt;i&gt;“&lt;/i&gt;&lt;/b&gt;&lt;i&gt;The new H4D instances are a significant step forward for our demanding next-generation TPU simulation workloads. We've seen a 30% performance improvement across a variety of EDA benchmarks compared to C2D, demonstrating the strong single core performance of H4D. This directly translates to faster development cycles and allows our engineering teams to iterate more quickly”&lt;/i&gt;&lt;b&gt; - Trevor Switkowski, Technical Lead of Chip Design Methodology, Google Cloud&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Experience H4D today&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;H4D is now available in &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;us-central1-a (Iowa), europe-west4-b (Netherlands) and asia-southeast1-a (Singapore)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with additional regions coming soon. Check regional availability on our &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/regions-zones#available"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Regions and Zones page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and deploy your most demanding HPC workloads by leveraging &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/create-vm-with-rdma"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud RDMA&lt;/span&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;. &lt;/strong&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sub&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;The following configurations were run for the above benchmarks: LAMMPS version 20250722, GROMACS version 2023.1, OpenFOAM version 2312, Ansys Fluent version 2024R1. All runs used IntelMPI 2021.17.2. C2D/C3D/C4D used TCP, H4D used RDMA with RXM &amp;amp; SAR_LIMIT=2G. All runs used the full ppn (processes-per-node) available on each platform (56, 180, and 192 for C2D, C3D, and C4D/H4D respectively). Ansys Fluent runs used 168 ppn on H4D and variable ppn for C4D. SMT off for all. Cost comparison across single nodes of H4D-highmem-192 with DWS Flex Start price, c3d-standard-360 and c2d-standard-112 OD price.&lt;/span&gt;&lt;/em&gt;&lt;/sub&gt;&lt;/p&gt;
&lt;p&gt;&lt;sub&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Parallel efficiency and optimal node count depend on input size and communication patterns, and therefore vary across workloads.&lt;/span&gt;&lt;/em&gt;&lt;/sub&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 04 Mar 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/h4d-vms-now-ga/</guid><category>HPC</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>H4D VMs, now GA, deliver exceptional performance and scaling for HPC workloads</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/h4d-vms-now-ga/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Aysha Keen</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Felix Schürmann</name><title>Senior HPC Technologist</title><department></department><company></company></author></item><item><title>Simpler billing, clearer savings: A FinOps guide to updated spend-based CUDs</title><link>https://cloud.google.com/blog/topics/cost-management/a-finops-professionals-guide-to-updated-spend-based-cuds/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Optimizing cloud spend is one of the most rewarding aspects of FinOps — and committed use discounts (CUDs) remain one of the most effective levers to pull.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In July 2025, we began rolling out &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-multiprice"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;updates to the spend-based CUD model&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to make it easier to understand your costs and savings, expand coverage to new SKUs (including Cloud Run and H3/M-series VMs), and offer increased flexibility. These changes are now available to all customers. Let’s dive into how this new model simplifies your FinOps practice.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;1. What is the spend-based CUD data change all about? &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The most important shift is the move from a credit-based system to a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;direct discounted price model using &lt;/strong&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-multiprice#consumption-model-intro"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;consumption models.&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Under the old &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;credits model&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, you committed to an hourly on-demand amount. To find your &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;savings&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (the actual cost reduction realized), you had to use three different numbers: the full on-demand cost, the commitment fee, and the offsetting credit.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;1. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The old math:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;$10.00 (On-demand) + $5.50 (Commitment fee) - $10.00 (Credit) = $5.50 (Net Cost)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Savings = $10.00 (On-demand) - $5.50 (Net costs) = $4.50&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the new &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-multiprice#consumption-model-intro"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;direct discount model&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, you don’t need to do that math to calculate your net costs. You commit directly to the net, discounted spend amount. Your usage is simply billed at that discounted rate.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;2. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The new math:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;ol style="list-style-type: lower-alpha;"&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;$5.50 (Discounted costs)&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Savings = $10.00 (On-demand) - $5.50 (Discounted costs) = $4.50&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;  &lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can now see your net cost at a glance, and calculating the savings only requires comparing the on-demand price ($10.00) to your new discounted cost ($5.50), which equals &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;$4.50/hr.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;2. How do I validate my savings before and after the changes?  &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The unified &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/billing/docs/how-to/analyze-cuds"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;CUD Analysis tool&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is your best resource for auditing the migration or performing deep-dives on your spend. CUD Analysis for the new spend-based CUD model&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; allows you to quickly verify the savings you are getting with the new model, and you can use this tool to compare that the savings didn’t change between the old and the new model. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can validate your savings by following these steps:&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;1. Identify the date when the migration took place; you can see the migration date in the billing overview page.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_jzjRx1j.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;2. Go to CUD Analysis to validate the savings before and after the migration. &lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;3. To quantify costs from before the migration:&lt;/span&gt;&lt;/p&gt;
&lt;ol style="list-style-type: lower-alpha;"&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Filter the view for one day before the migration, in this case &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Oct. 26, 2025.&lt;/strong&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;Select a CUD Product, for example &lt;strong style="vertical-align: baseline;"&gt;Cloud SQL CUD.&lt;/strong&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;In our example, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;we&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;paid a $50.35 CUD fee to get a $69.12 credit. When you subtract that fee from the credit, your actual take-home &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;savings were $18.77&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_2jbhCzc.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;4. To validate costs after the migration&lt;/span&gt;&lt;/p&gt;
&lt;ol style="list-style-type: lower-alpha;"&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Change the date to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Oct. 28, 2025&lt;/strong&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;Under the new model, you pay the discounted rates upfront. Your dashboard will reflect a Net Cost of $50.35, compared to the $69.12 on-demand cost, clearly showing your &lt;strong style="vertical-align: baseline;"&gt;$18.77 in savings.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_nQjMUwd.max-1000x1000.png"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In addition, this release also includes &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-verify-discounts#example_cost_reports"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;an update to &lt;/span&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cost Reports&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to include “Savings Programs,” which accurately reflects your actual net savings ($18.77 in our example above), rather than gross credit. &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;When comparing pre- and post-migration data in Cost Reports, ensure you include both usage SKUs and commitment fee SKUs to capture the full scope of the commitment.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;3. What other capabilities are in the new CUD Analysis?&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond support for the new model, the new &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/billing/docs/how-to/analyze-cuds"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;CUD Analysis tool&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; offers deeper visibility into your CUD coverage and CUD utilization. You can now analyze your CUDs with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;hourly data granularity&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for up to 30 days. This is a major improvement for FinOps teams, as daily averages often hide underutilization spikes that occur during specific hours.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_HLosdOT.max-1000x1000.png"
        
          alt="4"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="rirdr"&gt;CUD Analysis: Compute Flexible CUD coverage analysis&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_9A7ZjUx.max-1000x1000.png"
        
          alt="5"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="rirdr"&gt;CUD Analysis: Per CUD purchase utilization visibility&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you want to use your own data analysis tools, we offer a new &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/billing/docs/how-to/export-data-bigquery-tables/cud-export"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;spend-based CUD metadata export&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;that lets you manage your spend-based CUDs programmatically. You can use this export to join with the Billing BigQuery Export datasets to run in-depth, programmatic analysis on all your commitment data. You can also export &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/billing/docs/how-to/analyze-cuds#download_your_report"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;a CSV from the CUD Analysis view&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to see the raw data for every resource and its price without needing the full BigQuery export.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;4. How much commitment should I buy? &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-recommender"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;CUD recommendations&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; are the primary tool for determining how much of a commitment to purchase. We recently enhanced our Compute Flexible CUD commitment recommendations to provide greater accuracy by including data from GKE, Cloud Run, Cloud Run Functions, and Compute Engine. Additionally, CUD scenario modeling allows you to adjust these suggestions in real-time. You can adjust coverage thresholds, filter out specific dates with irregular usage, or extend the lookback analysis window up to 180 days to identify the exact commitment level that aligns with your specific risk profile.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/6_MpUcC4f.max-1000x1000.png"
        
          alt="6"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="rirdr"&gt;CUD scenario modeling: experiment with multiple options to identify your ideal CUD strategy&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;5. Is there anything else I should know about Flex CUDs? &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the release of the new spend-based model, we’ve addressed the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;reporting limitation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; affecting customers who use a combination of &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/committed-use-discounts-overview#spend_based"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Flex CUDs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and GKE/Cloud Run CUDs. Previously, our analysis tools were unable to accurately identify the source of specific credits, leading to discrepancies in KPI metrics like savings, coverage, and utilization. Under the new spend-based CUD model, this limitation has been corrected, so your CUD analysis now provides an accurate, granular view of your savings per Google Cloud service.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To begin navigating the updated spend-based model, visit the Billing console. You can learn more in our documentation:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;a href="https://cloud.google.com/docs/cuds-multiprice"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Enhancements to the Spend-based CUD program &lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://cloud.google.com/docs/cuds-multiprice-datamodel"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Insights into the multi-price data model&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-verify-discounts"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Verify your savings post-migration&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/compute/expanded-coverage-for-compute-flex-cuds/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;Save more with expanded coverage for Compute Flex CUDs&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;Compute Flexible Committed Use Discounts (Flex CUDs) now cover memory-optimized and HPC VM families and Cloud Run.&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Thu, 12 Feb 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/cost-management/a-finops-professionals-guide-to-updated-spend-based-cuds/</guid><category>Compute</category><category>Cost Management</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Simpler billing, clearer savings: A FinOps guide to updated spend-based CUDs</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/cost-management/a-finops-professionals-guide-to-updated-spend-based-cuds/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Alfonso Hernandez</name><title>Sr. Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Rahul Sharma</name><title>Sr. Product Manager</title><department></department><company></company></author></item><item><title>High-performance inference meets serverless compute with NVIDIA RTX PRO 6000 on Cloud Run</title><link>https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Running large-scale inference models can involve significant operational toil, including cluster management and manual VM maintenance. One solution is to leverage a serverless compute platform to abstract away the underlying infrastructure. Today, we’re bringing the serverless experience to high-end inference with support for &lt;/span&gt;&lt;a href="https://www.nvidia.com/en-us/data-center/rtx-pro-6000-blackwell-server-edition/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NVIDIA RTX PRO™ 6000 Blackwell Server Edition GPUs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; on Cloud Run. Now in preview, you can deploy massive models like Gemma 3 27B or Llama 3.1 70B with the 'deploy and forget' experience you’ve come to expect from Cloud Run. No reservations. No cluster management. Just code.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A powerful GPU platform&lt;/strong&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_qqUpivV.max-1000x1000.jpg"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The NVIDIA RTX PRO 6000 Blackwell GPU provides a huge leap in performance compared to the NVIDIA L4 GPU, bringing 96GB vGPU memory, 1.6 TB/s of bandwidth and support for FP4 and FP6. This means you can serve up to 70B+ parameter models without having to manage any underlying infrastructure. Cloud Run lets you attach a NVIDIA RTX PRO 6000 Blackwell GPU to your Cloud Run service, job, or worker pools, on demand, with no reservations required. Here are some ways you can use the NVIDIA RTX PRO 6000 Blackwell GPU to accelerate your business:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Generative AI and inference:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; With its FP4 precision support, the NVIDIA RTX PRO 6000 Blackwell GPU’s high-efficiency compute accelerates LLM fine-tuning and inference, letting you create real-time generative AI applications such as multi-modal and text-to-image creation models. By &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/services/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;running your model on Cloud Run services&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, you can also take advantage of rapid startup and scaling, going from zero instances to having a GPU with drivers installed under 5 seconds. When traffic eventually scales down zero and no more requests are being received, Cloud Run automatically scales your GPU instances down to zero.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;Fine-tuning and offline inference&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: NVIDIA RTX PRO 6000 Blackwell GPUs can be used with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/jobs/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run jobs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to fine-tune your model, and their fifth-generation NVIDIA Tensor Cores can work alongside AI models to help accelerate rendering pipelines and enhance content creation.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tailored scaling for specialized workloads&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Use &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/workerpools/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GPU-enabled worker pools&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to apply granular control over your GPU workers, whether you need to dynamically scale based on custom external metrics or manually provision "always-on" instances for complex, stateful processing.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We built Cloud Run to be the simplest way to run production-ready, GPU-accelerated tasks. Some highlights of Cloud Run include: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Managed GPUs with flexible compute: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Run pre-installs the necessary NVIDIA drivers so you can focus on your code. Cloud Run instances using NVIDIA RTX PRO 6000 Blackwell GPUs can configure up to 44 vCPU and 176GB of RAM.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Production-grade reliability:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By default, Cloud Run offers zonal redundancy, helping to ensure enough capacity for your service to be resilient to a zonal outage; this also applies to Cloud Run with GPUs. Alternatively, you can turn off zonal redundancy and benefit from a lower price for best-effort failover of your GPU workloads in case of a zonal outage.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tight integration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Cloud Run works natively with the rest of Google Cloud. You can load massive model weights by mounting Cloud Storage buckets as local volumes, or use &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/iap/docs/enabling-cloud-run"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Identity-Aware Proxy (IAP)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to secure traffic that’s bound for a Cloud Run service.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
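&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a minimal sketch of that last point, assuming the current gcloud volume flags, the following attaches a Cloud Storage bucket of model weights to an existing service; the service name, bucket name, volume name, and mount path are illustrative placeholders.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Mount a Cloud Storage bucket into a Cloud Run service as a read-only volume.
# "my-service" and "my-model-bucket" are placeholder names.
gcloud run services update my-service \
  --region us-central1 \
  --add-volume name=model-weights,type=cloud-storage,bucket=my-model-bucket,readonly=true \
  --add-volume-mount volume=model-weights,mount-path=/models&lt;/code&gt;&lt;/pre&gt;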
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The NVIDIA RTX PRO 6000 Blackwell GPU is available in preview on demand with availability in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;us-central1&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;europe-west4&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, and limited availability in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;asia-south2&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;asia-southeast1&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. You can deploy your first service using &lt;/span&gt;&lt;a href="https://ollama.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ollama&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, one of the easiest ways to run open models, on Cloud Run with NVIDIA RTX PRO 6000 GPUs enabled:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud beta run deploy my-service  \\\r\n--image ollama/ollama --port 11434 \\\r\n--cpu 20 --memory 80Gi \\\r\n--gpu-type nvidia-rtx-pro-6000 \\\r\n--no-gpu-zonal-redundancy \\\r\n--region us-central1&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7fa0a1b63070&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
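&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Two flags in this command are worth calling out: &lt;code&gt;--no-gpu-zonal-redundancy&lt;/code&gt; opts into the lower-priced, best-effort failover mode described above, and &lt;code&gt;--gpu-type&lt;/code&gt; selects the RTX PRO 6000. Depending on your container, you may also need to request a GPU count explicitly (for example, with the &lt;code&gt;--gpu&lt;/code&gt; flag).&lt;/span&gt;&lt;/p&gt;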
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more details, check out our updated &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/services/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/services/gpu-best-practices"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI inference best practices&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 02 Feb 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><category>Serverless</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>High-performance inference meets serverless compute with NVIDIA RTX PRO 6000 on Cloud Run</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>James Ma</name><title>Sr. Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Oded Shahar</name><title>Sr. Engineering Manager</title><department></department><company></company></author></item><item><title>Unlock 2x better price-performance with Axion-based N4A VMs, now generally available</title><link>https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;January 27, 2026: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The N4A is now generally available. You can get started by deploying &lt;/span&gt;&lt;a href="http://console.cloud.google.com/compute/instancesAdd"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;N4A from the Google Cloud console&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Decision makers and builders today face a constant challenge: managing rising cloud costs while delivering the performance their customers demand. As applications evolve to use scale-out microservices and handle ever-growing data volumes, organizations need maximum efficiency from their underlying infrastructure to support their growing general-purpose workloads.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image5_bCjzyyQ.max-1000x1000.png"
        
          alt="image5"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To meet this need, we’re excited to announce our latest Axion-based virtual machine series: N4A, available in preview on Compute Engine, Google Kubernetes Engine (GKE), Dataproc, and Batch, with support in Dataflow and other services coming soon. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A is the most cost-effective N-series VM to date, delivering &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;up to 2x better price-performance and 80% better performance-per-watt &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;than comparable current-generation x86-based VMs. This makes it easier for customers to further optimize the Total Cost of Ownership (TCO) for a broad range of general-purpose workloads. We see this with cloud-native businesses running scale-out web servers and microservices on GKE, enterprise teams managing backend application servers and mid-sized databases, and engineering organizations operating large CI/CD build farms. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud, we co-design our compute offerings with storage, networking and software at every layer of the stack, from orchestrators to runtimes, to deliver exceptional system-level performance and cost-efficiency. N4A’s breakthrough price-performance is powered by our latest-generation Google Axion Processors, built on the Arm® Neoverse® N3 compute core, Google &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/dynamic-resource-management"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Resource Management&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (DRM) technology, and &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Google Cloud’s custom-designed hardware and software system that offloads networking and storage processing to free up the CPU. Titanium is part of Google Cloud’s vertically integrated software stack — from the custom silicon in our servers to our planet-scale network traversing &lt;/span&gt;&lt;a href="https://cloud.google.com/about/locations"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;7.75 million kilometers of terrestrial and subsea fiber&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; across 42 regions — that is engineered to maximize efficiency and provide the ultra-low latency and high bandwidth to customers at global scale.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Redefining general-purpose compute and enabling AI inference&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A is engineered for versatility, with a feature set to support your general-purpose and CPU-based AI workloads. It comes in predefined and custom shapes, with up to 64 vCPUs and 512GB of DDR5 memory in high-cpu (2GB of memory per vCPU), standard (4GB per vCPU), and high-memory (8GB per vCPU) configurations, with instance networking bandwidth of up to 50 Gbps. N4A VMs support our latest-generation Hyperdisk storage options, including Hyperdisk Balanced, Hyperdisk Throughput, and Hyperdisk ML (coming later), providing up to 160K IOPS and 2.4 GB/s of throughput per instance. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A performs well across a range of industry-standard benchmarks that represent the key workloads our customers run every day. For example, relative to comparable current-generation x86-based VM offerings, N4A delivers up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;105%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; better price-performance for &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;compute-bound workloads&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;90%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; better price-performance for &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;scale-out web servers&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;85%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; better price-performance for &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Java applications&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, and up to&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; 20%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; better price-performance for general-purpose databases.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_q9MnCJ1.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="dxvss"&gt;Footnote: As of October 2025. Performance based on the estimated SPECrate®2017_int_base, estimated SPECjbb2015, MySQL Transactions/minute (RO), and Google internal Nginx Reverse Proxy benchmark scores run in production on comparable latest-generation generally-available VMs with general purpose storage types. Price-performance claims based on published and upcoming list prices for Google Cloud.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the real world, early adopters are seeing dramatic price-performance improvements from the new N4A instances.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_3I8oyl8.max-1000x1000.jpg"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="59dyk"&gt;&lt;i&gt;"At ZoomInfo, we operate a massive data intelligence platform where efficiency is paramount. Our core data processing pipelines, which are critical for delivering timely insights to our customers, run extensively on Dataflow and Java services in GKE. In our preview of the new N4A instances, we measured a 60% improvement in price-performance for these key workloads compared to their x86-based counterparts. This allows us to scale our platform more efficiently and deliver more value to our customers, faster."&lt;/i&gt; - &lt;b&gt;Sergei Koren, Chief Infrastructure Architect, ZoomInfo​&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_nDU2gjP.max-1000x1000.jpg"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="xulw1"&gt;&lt;i&gt;“Organizations today need performance, efficiency, flexibility, and scale to meet the computing demands of the AI era; this requires the close collaboration and co-design that is at the heart of our partnership with Google Cloud. As N4A redefines cost-efficiency, customers gain a new level of infrastructure optimization, enabling enterprises to choose the right infrastructure for their workload requirements with Arm and Google Cloud.”&lt;/i&gt; - &lt;b&gt;Bhumik Patel, Director, Server Ecosystem Development, Infrastructure Business, Arm&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Granular control with Custom Machine Types and Hyperdisk&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A key advantage of our N-series VMs has always been flexibility, and with N4A, we are bringing one of our most popular features to the Axion family for the first time: Custom Machine Types (&lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/instances/creating-instance-with-custom-machine-type"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;CMT&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;). Instead of fitting your workload into a predefined shape, CMTs on N4A let you independently configure the amount of vCPU and memory to meet your application's unique needs. This ability to right-size your instances means you pay only for the resources you use, minimizing waste and optimizing your total cost of ownership.&lt;/span&gt;&lt;/p&gt;
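&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a minimal sketch, assuming N4A custom shapes follow the same machine-type naming scheme as other N-series VMs (&lt;code&gt;FAMILY-custom-VCPUS-MEMORY_MB&lt;/code&gt;), creating a right-sized instance might look like this; the instance name, zone, and shape are illustrative placeholders.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create an N4A VM with a custom shape: 8 vCPUs and 32 GB (32768 MB) of memory.
# Instance name, zone, and shape are placeholders.
gcloud compute instances create my-n4a-vm \
  --zone us-central1-a \
  --machine-type n4a-custom-8-32768&lt;/code&gt;&lt;/pre&gt;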
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This same principle of matching resources to your specific workload applies to storage. N4A VMs feature support for our latest generation of &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/disks/hyperdisks"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperdisk&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, allowing you to select the right storage profile for your application's needs, as shown in the example after this list:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk Balanced:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Offers an optimal mix of performance and cost for the majority of general-purpose workloads, with up to 160K IOPs per N4A VM.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk Throughput:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Delivers up to 2.4GiBps of max throughput for bandwidth-intensive analytics workloads like Hadoop or Kafka, providing high-capacity storage at an excellent value.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk ML &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;(post GA)&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Purpose-built for AI/ML workloads, allows you to attach a single disk containing your model weights or datasets to up to 32 N4A instances simultaneously for large-scale inference or training tasks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk Storage Pools&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Instead of provisioning capacity and performance on a per-volume basis, allows you to provision performance and capacity in aggregate, &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/cost-saving-strategies-when-migrating-to-google-cloud-compute?e=48754805#:~:text=2.%20Optimize%20your%20block%20storage%20selections"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;further optimizing costs by up to 50%&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and simplifying management.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
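&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As an illustrative sketch of the list above, provisioning a Hyperdisk Balanced volume with explicit performance targets and attaching it to an N4A VM might look like the following; the disk and instance names, zone, size, and the IOPS/throughput values are placeholders, not recommendations.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create a Hyperdisk Balanced volume with decoupled, provisioned performance
# (placeholder values), then attach it to an existing N4A VM.
gcloud compute disks create my-hdb-disk \
  --zone us-central1-a \
  --type hyperdisk-balanced \
  --size 500GB \
  --provisioned-iops 10000 \
  --provisioned-throughput 600

gcloud compute instances attach-disk my-n4a-vm \
  --zone us-central1-a \
  --disk my-hdb-disk&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;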
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_ZB4gdHF.max-1000x1000.jpg"
        
          alt="4"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="7cqx3"&gt;&lt;i&gt;"At Vimeo, we have long relied on Custom Machine Types to efficiently manage our massive video transcoding platform. Our initial tests on the new Axion-based N4A instances have been very compelling, unlocking a new level of efficiency. We've observed a 30% improvement in performance for our core transcoding workload compared to comparable x86 VMs. This points to a clear path for improving our unit economics and scaling our services more profitably, without changing our operational model."&lt;/i&gt; - &lt;b&gt;Joe Peled, Sr. Director of Hosting &amp;amp; Delivery Ops, Vimeo&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A growing Arm-based Axion portfolio for customer choice&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C-series VMs are designed for workloads that require consistently high performance, e.g., medium-to-large-scale databases and in-memory caches. Alongside them, N-series VMs have been a key Compute Engine pillar, offering a balance of price-performance and flexibility, and lowering the cost of running workloads with variable resource needs such as scale-out Java/GKE workloads. We released our first Axion-based machine series, &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/try-c4a-the-first-google-axion-processor?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4A&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, in October 2024; N4A complements C4A, providing a range of Google Axion instances suited to your workloads’ precise needs. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;On top of that, GKE unlocks significant price-performance advantages by orchestrating Axion-based C4A and N4A machine types. GKE leverages &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-custom-compute-classes"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Custom Compute Classes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to provision and mix these machine types, matching workloads to the right hardware. This automated, heterogeneous cluster management allows teams to optimize their total cost of ownership across their entire application stack.&lt;/span&gt;&lt;/p&gt;
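&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A minimal sketch of such a class, assuming the GKE custom compute class API and an illustrative priority order (prefer C4A, fall back to N4A); the class name is a placeholder:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Define a custom compute class that prefers Axion C4A nodes and falls back to N4A.
cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: axion-first   # placeholder name
spec:
  priorities:
  - machineFamily: c4a
  - machineFamily: n4a
  whenUnsatisfiable: ScaleUpAnyway
EOF&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Workloads then opt in with a &lt;code&gt;cloud.google.com/compute-class: axion-first&lt;/code&gt; node selector, and GKE provisions whichever machine family in the priority list has capacity.&lt;/span&gt;&lt;/p&gt;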
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Also &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;joining the Axion family is C4A.metal&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Google Cloud’s first Axion bare metal instance that helps builders meet use cases that require access to the underlying physical server to run specialized applications in a non-virtualized environment, such as automotive systems development, workloads with strict licensing requirements, and Android software development. &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4A.metal will be available in preview soon&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Supported by the broad and mature Arm ecosystem, adopting Axion is easier than ever, and the combination of C4A and N4A can help you lower the total cost of running your business, without compromising on performance or workload-specific requirements&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;N4A for cost optimization and flexibility.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Deliberately engineered for general-purpose workloads that need a balance of price and performance, including scale-out web servers, microservices, containerized applications, open-source databases, batch, data analytics, development environments, data preparation and AI/ML experimentation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;C4A for consistently high performance, predictability, and control.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Powering workloads where every microsecond counts, such as medium- to large-scale databases, in-memory caches, cost-effective AI/ML inference, and high-traffic gaming servers. C4A delivers consistent performance, offering a controlled maintenance experience for mission-critical workloads, networking bandwidth up to 100 Gbps, and next-generation Titanium Local SSD storage. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_m4GINGe.max-1000x1000.jpg"
        
          alt="5"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="7cqx3"&gt;&lt;i&gt;"Migrating to Google Cloud's Axion portfolio gave us a critical competitive advantage. We slashed our compute consumption by 20% while maintaining low and stable latency with C4A instances, such as our Supply-Side Platform (SSP) backend service. Additionally, C4A enabled us to leverage Hyperdisk with precisely the IOPS we need for our stateful workloads, regardless of instance size. This flexibility gives us the best of both worlds - allowing us to win more ad auctions for our clients while significantly improving our margins. We're now testing the N4A family by running some of our key workloads that require the most flexibility, such as our API relay service. We are happy to share that several applications running in production are consuming 15% less CPU compared to our previous infrastructure, reducing our costs further, while ensuring that the right instance backs the workload characteristics required.”&lt;/i&gt; - &lt;b&gt;Or Ben Dahan, Cloud &amp;amp; Software Architect at Rise&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started with N4A today&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A is available in the following Google Cloud regions: &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;us-central1 (Iowa)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;us-east4 (Virginia)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, us-east1 (South Carolina), us-west1 (Oregon), asia-southeast1 (Singapore), europe-west1 (Belgium), europe-west2 (London), &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;europe-west3 (Frankfurt) &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;and &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;europe-west4 (Netherlands)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; with more regions to follow. Learn&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; more about N4A &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/general-purpose-machines#n4a_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here in documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;; deploy N4A &lt;/span&gt;&lt;a href="http://console.cloud.google.com/compute/instancesAdd"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here in the console&lt;/span&gt;&lt;/a&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 27 Jan 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview/</guid><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Unlock 2x better price-performance with Axion-based N4A VMs, now generally available</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nate Baum</name><title>Senior Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mo Farhat</name><title>Group Product Manager</title><department></department><company></company></author></item><item><title>Scaling WideEP Mixture-of-Experts inference with Google Cloud A4X (GB200) and NVIDIA Dynamo</title><link>https://cloud.google.com/blog/products/compute/scaling-moe-inference-with-nvidia-dynamo-on-google-cloud-a4x/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As organizations transition from standard LLMs to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;massive Mixture-of-Experts (MoE) &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;architectures like DeepSeek-R1, the primary constraint has shifted from raw compute density to communication latency and memory bandwidth. Today, we’re releasing two new validated recipes designed to help customers overcome the infrastructure bottlenecks of the agentic AI era. 
These new recipes provide clear steps to optimize both throughput and latency, built on the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;A4X machine series&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; powered by &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA GB200 NVL72&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA Dynamo&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, which extend the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/ai-inference-recipe-using-nvidia-dynamo-with-ai-hypercomputer?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;reference architecture&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; we published in September 2025 for disaggregated inference on A3 Ultra (NVIDIA H200) VMs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’re bringing the best of both worlds to AI infrastructure by combining the multi-layered scalability of Google Cloud’s AI infrastructure with the rack-scale acceleration of the A4X. These recipes are part of a broader collaboration between our organizations that includes investments in important inference infrastructure like &lt;/span&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Resource Allocation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (DRA) and &lt;/span&gt;&lt;a href="https://gateway-api-inference-extension.sigs.k8s.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Highlights of the updated reference architecture include: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Google Cloud’s A4X machine series, powered by NVIDIA GB200 NVL72, creating a single 72-GPU compute domain connected with fifth-generation NVIDIA NVLink.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Serving architecture:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; NVIDIA Dynamo functions as the distributed runtime, managing KV cache state and kernel scheduling across the rack-scale fabric.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Performance: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For 8K/1K input sequence length (ISL)/ output sequence length (OSL) , we achieved &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;over 6K total tokens/sec/GPU&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in throughput-optimized configurations and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;10ms inter-token latency (ITL)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in latency-optimized configurations.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span&gt;&lt;strong style="vertical-align: baseline;"&gt;Deployment:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Two new recipes are available today in &lt;/span&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/blob/main/inference/a4x/disaggregated-serving/dynamo/README.md" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;this repo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for deploying this stack on Google Cloud using Google Kubernetes Engine (GKE) for orchestration.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The modern inference stack&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To achieve exascale performance, inference cannot be treated as a monolithic workload. It requires a modular architecture where every layer is optimized for specific throughput and latency targets. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;The AI Hypercomputer inference stack consists of three distinct layers:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure layer:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The physical compute, networking, and storage fabric (e.g., A4X).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Serving layer:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The specific model architecture and the optimized execution kernels (e.g., NVIDIA Dynamo, NVIDIA TensorRT-LLM, Pax) and runtime environment managing request scheduling, KV cache state, and distributed coordination.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Orchestration layer:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The control plane for resource lifecycle management, scaling, and fault tolerance (e.g., Kubernetes).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the reference architecture detailed below, we focus on a specific, high-performance instantiation of this stack designed for the NVIDIA ecosystem. We combine the A4X at the infrastructure layer with NVIDIA Dynamo at the model serving layer, orchestrated by GKE.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Infrastructure layer: The A4X rack-scale architecture&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-a4x-vms-powered-by-nvidia-gb200-gpus?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A4X launch announcement&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in February 2025, we described how the A4X VM addresses bandwidth constraints by implementing the GB200 NVL72 architecture, which fundamentally alters the topology available to the scheduler.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Unlike previous generations where NVLink domains were bound by the server chassis (typically 8 GPUs), the A4X exposes a unified fabric, with:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;72 NVIDIA Blackwell GPUs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; interconnected via the NVLink Switch System that enables the 72 GPUs to operate as one giant GPU with unified shared memory&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;130TB/s aggregate bandwidth&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, enabling all-to-all communication with latency profiles comparable to on-board memory access (72 GPUs x 1.8 TB/s/GPU)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Native NVFP4 support:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Blackwell Tensor Cores support 4-bit floating point precision, effectively doubling throughput relative to FP8 for compatible model layers. We used &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;FP8 Precision Scaling&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for this benchmark to support configuration and comparison with previously published results.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Serving layer: NVIDIA Dynamo&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Hardware of this scale requires a runtime capable of managing distributed state without introducing synchronization overhead. NVIDIA Dynamo serves as this distributed inference runtime. It moves beyond simple model serving to coordinate the complex lifecycle of inference requests across the underlying infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The serving layer optimizes utilization on the A4X through these specific mechanisms:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Wide Expert Parallelism (WideEP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Traditional MoE serving shards experts within a single node (typically 8 GPUs), leading to load imbalances when specific experts become "hot." We use the A4X's unified fabric to distribute experts across the full 72-GPU rack. This WideEP configuration absorbs bursty expert activation patterns by balancing the load across a massive compute pool, helping to ensure that no single GPU becomes a straggler.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Deep Expert Parallelism (&lt;/strong&gt;&lt;a href="https://github.com/deepseek-ai/DeepEP" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;DeepEP&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: While WideEP distributes the experts, DeepEP optimizes the critical "dispatch" and "combine" communication phases. DeepEP accelerates the high-bandwidth all-to-all operations required to route tokens to their assigned experts. This approach minimizes the synchronization overhead that typically bottlenecks MoE inference at scale.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Disaggregated request processing:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Dynamo decouples the compute-bound prefill phase from the memory-bound decode phase. On the A4X, this allows the scheduler to allocate specific GPU groups within the rack to prefill (maximizing tensor core saturation) while other GPUs handle decode (maximizing memory bandwidth utilization), preventing resource contention.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Global KV cache management:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Dynamo maintains a global view of the KV cache state. Its routing logic directs requests to the specific GPU holding the relevant context, minimizing redundant computation and cache migration.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;JIT kernel optimization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The runtime leverages NVIDIA Blackwell-specific kernels, performing just-in-time fusion of operations to reduce memory-access overhead during the generation phase.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Orchestration layer: Mapping software to hardware&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While the A4X provides the physical fabric and Dynamo provides the runtime logic, the orchestration layer is responsible for mapping the software requirements to the hardware topology. For rack-scale architectures like the GB200 NVL72, container orchestration needs to evolve beyond standard scheduling. By making the orchestrator explicitly aware of the physical NVLink domains, we can fully unlock the platform’s performance and help ensure optimal workload placement.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;GKE enforces this hardware-software alignment through these specific mechanisms:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Rack-level atomic scheduling:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; With GB200 NVL72, the "unit of compute" is no longer a single GPU or a single node; the entire rack is the new fundamental building block of accelerated computing. We use GKE capacity reservations with specific affinity settings to target a reserved block of A4X infrastructure that guarantees dense deployment. By consuming this reservation, GKE helps ensure that all pods comprising a Dynamo instance land on the specific, physically contiguous rack hardware required to establish the NVLink domain, providing the hard topology guarantee needed for WideEP and DeepEP.&lt;/span&gt;&lt;/p&gt;
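&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A sketch of consuming such a reservation when creating a GKE node pool; the cluster, reservation, and machine type names here are illustrative assumptions rather than prescriptive values:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create an A4X node pool pinned to a specific capacity reservation so that all
# nodes land on the reserved, physically contiguous rack. Names are placeholders.
gcloud container node-pools create a4x-pool \
  --cluster my-inference-cluster \
  --region us-central1 \
  --machine-type a4x-highgpu-4g \
  --reservation-affinity specific \
  --reservation my-a4x-reservation&lt;/code&gt;&lt;/pre&gt;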
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Low-latency model loading via GCS FUSE: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Serving massive MoE models requires loading terabytes of weights into high-bandwidth memory (HBM). Traditional approaches that download weights to local disk incur unacceptable "cold start" latencies. We leverage the &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GCS FUSE CSI Driver&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to mount model weights directly from Google Cloud Storage as a local file system. This allows the Dynamo runtime to "lazy load" the model, streaming data chunks directly into GPU memory on demand. This approach eliminates the pre-download phase, significantly reducing the time-to-ready for new inference replicas and enabling faster auto-scaling in response to traffic bursts.&lt;/span&gt;&lt;/p&gt;
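&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For instance, a pod that streams weights from a bucket through the GCS FUSE CSI driver might be declared as follows; the pod name, image, and bucket are placeholders:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Mount a Cloud Storage bucket of model weights via the GCS FUSE CSI driver.
# The gke-gcsfuse/volumes annotation enables the driver's sidecar for this pod.
cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: dynamo-worker            # placeholder name
  annotations:
    gke-gcsfuse/volumes: "true"
spec:
  containers:
  - name: worker
    image: us-docker.pkg.dev/my-project/serving/worker:latest  # placeholder image
    volumeMounts:
    - name: model-weights
      mountPath: /models
      readOnly: true
  volumes:
  - name: model-weights
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: my-weights-bucket   # placeholder bucket
EOF&lt;/code&gt;&lt;/pre&gt;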
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Kernel-bypass networking (GPUDirect RDMA): &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;To maximize the aggregate 130 TB/s bandwidth of the A4X, the networking stack must minimize CPU and I/O involvement. We configure the GKE cluster to enable&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;GPUDirect RDMA&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;over the Titanium network adapter. By injecting specific NCCL topology configurations and enabling IPC_LOCK capabilities in the container, we allow the application to bypass the OS kernel and perform Direct Memory Access (DMA) operations between the GPU and the network interface. This configuration offloads the NVIDIA Grace CPU from data path management, so that networking I/O does not become a bottleneck during high-throughput token generation.&lt;/span&gt;&lt;/p&gt;
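&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The IPC_LOCK capability mentioned above is granted per container; a minimal sketch of adding it to an existing deployment (the deployment name is a placeholder):&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Grant IPC_LOCK so the NCCL/RDMA stack can pin (mlock) GPU communication
# buffers in memory for kernel-bypass DMA transfers. Names are placeholders.
kubectl patch deployment dynamo-worker --type json -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/securityContext",
   "value": {"capabilities": {"add": ["IPC_LOCK"]}}}
]'&lt;/code&gt;&lt;/pre&gt;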
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Performance validation&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We observed the following when assessing the scaling characteristics of an 8K/1K workload on DeepSeek-R1 (FP8) with SGLang for two distinct optimization targets. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Throughput-optimized configuration&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Setup:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; All 72 GPUs utilizing DeepEP. 10 prefill nodes with 5 workers (TP8) and 8 decode nodes with 1 worker (TP32).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Result:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We sustained over &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;6K total tokens/sec/GPU&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (1.5K output tokens/sec/GPU), which matches the performance published by InferenceMAX (&lt;/span&gt;&lt;a href="https://github.com/InferenceMAX/InferenceMAX/actions/runs/20356790608/job/58493812121" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;source&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
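&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For context, 6K total tokens/sec/GPU across the 72-GPU NVLink domain works out to roughly 432K total tokens per second for the rack as a whole.&lt;/span&gt;&lt;/p&gt;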
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Latency-optimized configuration&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Setup:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; 8 GPUs (two nodes) without DeepEP. 1 prefill node with 1 prefill worker (TP4) and 1 decode node with 1 decode worker (TP4). &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Result:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We sustained a median Inter-Token Latency (ITL) of &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;10ms&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; at a concurrency of 4, which matches the performance published by InferenceMAX (&lt;/span&gt;&lt;a href="https://github.com/InferenceMAX/InferenceMAX/actions/runs/20413316138/job/58653323053" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;source&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Looking ahead&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As models evolve from static chat interfaces to complex, multi-turn reasoning agents, the requirements for inference infrastructure will continue to shift. We are actively updating and releasing benchmarks and recipes as we invest across all three layers of the AI inference stack to meet these demands:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure layer&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/now-shipping-a4x-max-vertex-ai-training-and-more?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;recently released A4X Max&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is based on the NVIDIA GB300 NVL72 in a single 72 GPU rack configuration, bringing 1.5X more NVFP4 FLOPs, 1.5X more GPU memory, and 2X higher network bandwidth compared to A4X. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Serving layer&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: We are actively exploring deeper integrations with components of NVIDIA Dynamo, e.g., pairing KV Block Manager with Google Cloud remote storage, funneling Dynamo metrics into our Cloud Monitoring dashboards for enhanced observability, and leveraging GKE Custom Compute Classes (CCC) for better capacity and obtainability, as well as setting a new baseline with FP4 precision.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Orchestration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: We plan to incorporate additional optimizations into these tests, e.g. &lt;/span&gt;&lt;a href="https://gateway-api-inference-extension.sigs.k8s.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; as the intelligent inference scheduling component, following the design patterns established in the llm-d &lt;/span&gt;&lt;a href="https://llm-d.ai/docs/guide" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;well-lit paths&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. We aim to provide a centralized mechanism for sophisticated traffic orchestration — handling request prioritization, queuing, and multi-model routing before the workload ever reaches the serving-layer runtime.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Whether you are deploying massive MoE models or architecting the next generation of reasoning agents, this stack provides the exascale foundation required to turn frontier research into production reality. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Get started today&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud, we’re committed to providing the most open, flexible, and performant infrastructure for your AI workloads. With full support for the NVIDIA Dynamo suite — from intelligent routing and scaling to the latest NVIDIA AI infrastructure — we provide a complete, production-ready solution for serving LLMs at scale.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We updated our deployment repository with two specific recipes for the A4X machine class: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/blob/main/inference/a4x/disaggregated-serving/dynamo/README.md#32-sglang-deployment-with-deepep-72-gpus" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Recipe for throughput optimized&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; - 72 GPUs with DeepEP&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/blob/main/inference/a4x/disaggregated-serving/dynamo/README.md#sglang-wo-deepep" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Recipe for latency optimized&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; - 8 GPUs without DeepEP&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We look forward to seeing what you build!&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 22 Jan 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/scaling-moe-inference-with-nvidia-dynamo-on-google-cloud-a4x/</guid><category>AI &amp; Machine Learning</category><category>AI Hypercomputer</category><category>GKE</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Scaling WideEP Mixture-of-Experts inference with Google Cloud A4X (GB200) and NVIDIA Dynamo</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/scaling-moe-inference-with-nvidia-dynamo-on-google-cloud-a4x/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Sean Horgan</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ling Lin</name><title>Software Engineer</title><department></department><company></company></author></item><item><title>Simplify VM OS agent management at scale: Introducing VM Extensions Manager</title><link>https://cloud.google.com/blog/products/compute/introducing-vm-extensions-manager/</link><description>&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_d395npc.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you're an IT administrator, you know that managing Operating System (OS) agents (Google calls them extensions) across a large fleet of VM instances can be complex and frustrating. Indeed, this operational overhead can be a major barrier to adopting extension-based services on VM fleets, despite the fact that they unlock powerful application-level capabilities.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To solve this problem, we’re excited to announce the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;preview of&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;VM Extensions Manager&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, a new capability integrated directly into the Compute Engine API that simplifies installing and managing these Google-provided extensions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;VM Extensions Manager provides a centralized, policy-driven framework for managing the entire lifecycle of Google Cloud extensions on your VM instances. Instead of relying on manual scripts, startup scripts, or other bespoke solutions, you can now define a policy to ensure all your VM instances — both existing and new — conform to that state, reducing operational overhead from months to hours.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;How to get started with VM Extensions Manager&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;VM Extensions Manager is integrated directly into the &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;compute.googleapis.com&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; API, meaning there are no new APIs to discover or enable. You can get started in minutes.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Define your extension policy&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;First, define a policy that specifies the desired state of your extensions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For the preview, you can create &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;zonal policies&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; at the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Project level&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This policy targets VM instances within a single, specific zone. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Over the coming months, we’ll expand support to include &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;global policies&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, as well as policies at the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Organization&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Folder levels&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This will allow you to build a flexible hierarchy of policies (using priorities) to manage your extension on your enterprise fleet from a single control plane.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can create this policy directly from the Google Cloud console: &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_2Dllyl3.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Demo of Creating VM Extension policy using Cloud Console&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/3_Bayaqjl.gif"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Select your extensions&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;In the policy, you select the Google Cloud extensions you want to manage. For the preview, VM Extensions Manager supports several critical Google Cloud extensions, including:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/logging/docs/agent/ops-agent/agent-vmem-policies"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Ops Agent&lt;/strong&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;(ops-agent): The Ops Agent is the primary agent for collecting telemetry from your Compute Engine instances.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/workload-manager/docs/evaluate/set-up-agent-for-sap"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Agent for SAP&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (sap-extension): Google Cloud's Agent for SAP is provided by Google Cloud for the support and monitoring of SAP workloads running on Compute Engine instances and Bare Metal Solution servers.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/agent-for-compute-workloads"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Agent for Compute Workload&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (workload-extension): The Agent for Compute Workloads lets you monitor and evaluate workloads running on Compute Engine.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We'll be adding support for more extension-based services in the coming months.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can choose to pin a specific extension version or, keep it empty (the default) to get the latest extension installed. If you choose the default, VM Extensions Manager automatically handles the rollout of new versions as they are released — no more waiting to access new features and improvements.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Roll out global policy with more control&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;VM Extensions Manager gives you control over how global policy changes are deployed across many zones with rollout speeds. Zonal policies don't offer rollout speeds; they are enforced instantaneously when the VMs are online.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In coming weeks, we will expand support for global policy via gcloud first and update the documentation with relevant information. UI updates will follow in coming months. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At preview, however, global policy lets you select two distinct rollout speeds:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;SLOW&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt; (Recommended):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This is the default option, designed for safety. It orchestrates a zone-by-zone rollout (within the scope of the policy) with a built-in wait time between waves, minimizing the potential blast radius of a problematic change over a period of time, by default 5 days. This is perfect for standard maintenance and updates.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;FAST&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This option eliminates the wait time between waves, executing the change across the entire fleet across zones as quickly as possible. It is intended for urgent use cases, such as deploying a critical security patch in a "break-glass" emergency scenario across all VMs in all zones.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
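&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the three steps above concrete, here is a minimal command-line sketch of a policy definition. The command group, flag names, and version value below are illustrative assumptions rather than the documented VM Extensions Manager surface; consult the documentation for the real syntax.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Hypothetical sketch only: the command group and flags are assumptions.

# Steps 1-2: a zonal, project-level policy that manages the Ops Agent
# (pinned to an example version) and the Agent for Compute Workloads
# (version left empty, so the latest release is installed automatically).
gcloud compute extension-policies create my-fleet-policy \
    --project=my-project \
    --zone=us-central1-a \
    --extensions=ops-agent=2.54.0,workload-extension=

# Step 3 (global policies, once available): choose a rollout speed.
# SLOW staggers the change zone by zone; FAST applies it everywhere at once.
gcloud compute extension-policies update my-fleet-policy \
    --rollout-speed=SLOW
&lt;/pre&gt;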
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once you save the policy, VM Extensions Manager takes over. The underlying progressive rollout engine manages the complex orchestration, and you can monitor its progress.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A flexible system for standardization and control&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;VM Extensions Manager is designed to bring standardization and control to extensions on your VM fleets. You can start today by applying zonal policies to your projects to ensure extensions are correctly installed on VM instances in the correct zones.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get started defining Extension policies for your Compute Engine VM instances, read the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/vm-extensions/about-vm-extension-manager"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to create your first policy. We're excited to see how you use VM Extensions Manager to standardize, secure, and simplify the management of your VM fleet.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 05 Jan 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/introducing-vm-extensions-manager/</guid><category>Management Tools</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Simplify VM OS agent management at scale: Introducing VM Extensions Manager</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/introducing-vm-extensions-manager/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Omkar Suram</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mike Columbus</name><title>CE Director, Northam Platform Specialists</title><department></department><company></company></author></item><item><title>Automate AI and HPC clusters with Cluster Director, now generally available</title><link>https://cloud.google.com/blog/products/compute/cluster-director-is-now-generally-available/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The complexity of the infrastructure behind AI training and high performance computing (HPC) workloads can really slow teams down. At Google Cloud, where we work with some of the world’s largest AI research teams, we see it everywhere we go: researchers hampered by complex configuration files, platform teams struggling to manage GPUs with home-grown scripts, and operational leads battling the constant, unpredictable hardware failures that derail multi-week training runs. Access to raw compute isn't enough. To operate at the cutting edge, you need &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;reliability&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; that survives hardware failures, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;orchestration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; that respects topology, and a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;lifecycle&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; management strategy that adapts to evolving needs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are delivering on those requirements with the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;General Availability (GA) of &lt;/strong&gt;&lt;a href="https://cloud.google.com/products/cluster-director" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Director&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Preview&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; of Cluster Director support for Slurm on &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine?utm_source=google&amp;amp;utm_medium=cpc&amp;amp;utm_campaign=na-CA-all-en-dr-bkws-all-all-trial-e-dr-1710134&amp;amp;utm_content=text-ad-none-any-DEV_c-CRE_772382725406-ADGP_Hybrid+%7C+BKWS+-+EXA+%7C+Txt-AppMod-GKE-Kubernetes+Engine-KWID_335784956140-kwd-335784956140&amp;amp;utm_term=KW_kubernetes+google-ST_kubernetes+google&amp;amp;gclsrc=aw.ds&amp;amp;gad_source=1&amp;amp;gad_campaignid=22976548925&amp;amp;gclid=Cj0KCQiAgP_JBhD-ARIsANpEMxxNCV54Smw89kgAplcXoolCw8LdVBSA9buRDhHT_4QlTybV4LZoqKIaAqJcEALw_wcB&amp;amp;e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Kubernetes Engine (GKE)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director (GA) is a managed infrastructure service designed to meet the rigorous demands of modern supercomputing. It replaces fragile DIY tooling with a robust topology-aware control plane that handles the entire lifecycle of Slurm clusters, from the first deployment to the thousandth training run. &lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;We are expanding Cluster Director to support Slurm on GKE (Preview), designed to give you the best of both worlds: the familiar precision of high-performance scheduling and the automated scale of Kubernetes. It achieves this by treating GKE node pools as a direct compute resource for your Slurm cluster, allowing you to scale your workloads with Kubernetes' power without changing your existing Slurm workflows.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director, now GA&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director offers advanced capabilities at each phase of the cluster lifecycle, spanning preparation (Day 0), where infrastructure design and capacity are determined; deployment (Day 1), where the cluster is automatically deployed and configured; and monitoring (Day 2), where performance, health, and optimization are continuously tracked.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This holistic approach ensures that you get the benefits of fully configurable infrastructure while automating lower-level operations so your compute resources are always optimized, reliable, and available. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;So, what does all this cost? That’s the best part. There's no extra charge to use Cluster Director. You only pay for the underlying Google Cloud resources — your compute, storage, and networking.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;How Cluster Director supports each phase of deployment&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Day 0: Preparation &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Standing up a cluster typically involves weeks of planning, wrangling Terraform, and debugging the network. Cluster Director changes the ‘Day 0’ experience entirely, with tools for designing infrastructure topology that’s optimized for your workload requirements. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/1_gBjYYUA.gif"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To streamline your Day 0 setup, Cluster Director provides:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Reference architectures:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We’ve codified Google’s internal best practices into reusable cluster templates, enabling you to spin up standardized, validated clusters in minutes. This helps ensure that every team in your organization is using the same security standards for their deployments and deploying on infrastructure that is configured correctly by default — right down to the network topology and storage mounting. &lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Guided configuration:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We know that having too many options can lead to configuration paralysis. The Cluster Director control plane guides you through a&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;streamlined setup flow. You select your resources, and our system handles the complex backend mapping, ensuring that storage tiers, network fabrics, and compute shapes are compatible and optimized before you deploy.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Broad hardware support:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Cluster Director offers &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/cluster-director/docs/compute"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;full support&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for large-scale AI systems, including Google Cloud’s &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;A4X and A4X Max VMs powered by &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;NVIDIA GB200 and GB300 GPUs, and versatile CPUs such as &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;N2 VMs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for cost-effective login nodes and debugging partitions.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Flexible consumption options:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Cluster Director integrates with your preferred procurement strategy, with support for &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/reservations-overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Reservations&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for guaranteed capacity during critical training runs, &lt;/span&gt;&lt;a href="https://cloud.google.com/products/dws/pricing?e=48754805&amp;amp;hl=en"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Workload Scheduler&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; Flex-start &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;for dynamic scaling, or &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/spot-vms?e=48754805&amp;amp;hl=en"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Spot VMs&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for opportunistic low-cost runs.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
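&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director wires these consumption options up for you during guided configuration, but the same choices map onto familiar Compute Engine provisioning concepts. The stand-alone sketch below illustrates two of them with plain gcloud compute commands; the instance names, zone, machine types, and reservation name are placeholders, and Dynamic Workload Scheduler Flex-start is requested through its own provisioning flow rather than these flags.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Guaranteed capacity: target a specific reservation for a critical run.
gcloud compute instances create training-node-1 \
    --zone=us-central1-a \
    --machine-type=n2-standard-8 \
    --reservation-affinity=specific \
    --reservation=my-training-reservation

# Opportunistic low-cost capacity: Spot VMs for interruptible runs.
gcloud compute instances create dev-node-1 \
    --zone=us-central1-a \
    --machine-type=n2-standard-8 \
    --provisioning-model=SPOT
&lt;/pre&gt;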
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Google Cloud's Cluster Director is optimized for managing large-scale AI and HPC environments. It complements the power and performance of NVIDIA's accelerated computing platform. Together, we're providing customers with a simplified, powerful, and scalable solution to tackle the next generation of computing challenges.&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Dave Salvator, Director of Accelerated Computing Products, NVIDIA&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Day 1: Deployment &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Deploying hardware is one thing, but maximizing performance is another thing entirely. Day 1 is the execution phase, where your configuration transforms into a fully operational cluster. The good news is that Cluster Director doesn't just provision VMs, it validates that your software and hardware components are healthy, properly networked, and ready to accept the first workload.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_MyVTseY.gif"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To ensure a high-performance deployment, Cluster Director automates:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Getting a clean "bill of health":&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Before your job ever touches a GPU, Cluster Director runs a rigorous suite of health checks, including &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;DCGMI&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; diagnostics and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NCCL&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; performance validation, to verify the integrity of the network, storage, and accelerators.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Keeping accelerators fed with data:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Storage throughput is often the silent killer of training efficiency. That’s why Cluster Director fully supports Google Cloud Managed Lustre with selectable performance tiers, allowing you to attach high-throughput parallel storage directly to your compute nodes, so your GPUs are never starved for data.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Maximizing Interconnect Performance:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To achieve peak scaling, Cluster Director implements topology-aware scheduling and compact placement policies. By utilizing dense reservations on Google’s non-blocking fabric, the system ensures that your distributed workloads are placed on the shortest physical path possible, minimizing tail latency and maximizing collective communication (NCCL) speeds from the get-go.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
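&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For readers who want to reproduce a "bill of health" by hand, the checks Cluster Director automates correspond to well-known tools: DCGM's diagnostic suite and the NCCL performance tests. A minimal sketch, assuming DCGM and the nccl-tests binaries are installed on a GPU node:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# GPU diagnostics: run DCGM's extended (level 3) diagnostic on the node.
dcgmi diag -r 3

# Collective-communication validation: sweep all-reduce sizes from 8 B
# to 8 GB across all 8 local GPUs and report bus bandwidth (nccl-tests).
./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8
&lt;/pre&gt;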
&lt;p style="padding-left: 40px;"&gt;&lt;span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Cluster Director is an amazing product, which has enabled me to spin up a ready to use Nvidia GPU cluster with Slurm, including all networking, routing, and high performance network file-system for large-scale distributed model training within less than an hour. The cluster was immediately ready to run our containerizedAI training workloads with excellent throughput with only minimal customization effort."&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - Dr. Florian Eyben, Head of AI Foundation Models &amp;amp; Speech Technology, Agile Robots SE, Munich, Germany&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Day 2: Monitoring&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The reality of AI and HPC infrastructure is that hardware fails and requirements change. A rigid cluster is an inefficient cluster. As you move into the ongoing “Day 2” operational phase, you need to maintain cluster health, maximize utilization and performance. Cluster Director provides a control plane equipped for the complexities of long-term operations. Today we are introducing new &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;active cluster management&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; capabilities to handle the messy reality of Day 2 operations.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/3_VSuBKiw.gif"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;New active cluster management capabilities include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Topology-level visibility:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; You can’t orchestrate what you can’t see. Cluster Director’s observability graphs and topology grids let you visualize your entire fleet, spot thermal throttles or interconnect issues, and optimize job placement based on physical proximity.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;One-click remediation:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; When a node degrades, you shouldn't have to SSH in to debug it. Cluster Director allows you to replace faulty nodes with a single click directly from the Google Cloud console. The system handles the draining, teardown, and replacement, returning your cluster to full capacity in minutes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Adaptive infrastructure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; When your research needs change, so should your cluster. You can now modify active clusters, with activities such as adding or removing storage filesystems, on the fly, without tearing down the cluster or interrupting ongoing work.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director support for Slurm on GKE, now in preview&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Innovation thrives in the open. Google, the creator of Kubernetes, and SchedMD, the developers behind Slurm, have long championed the open-source technologies that power the world's most advanced computing. For years, NVIDIA and SchedMD have worked in lockstep to optimize GPU scheduling, introducing foundational features like the Generic Resource (GRES) framework and Multi-Instance GPU (MIG) support that are essential for modern AI. By acquiring SchedMD, NVIDIA is doubling down on its commitment to Slurm as a vendor-neutral standard, ensuring that the software powering the world's fastest supercomputers remains open, performant, and perfectly tuned for the future of accelerated computing.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on this foundation of accelerated computing, Google is deepening its collaboration with SchedMD to answer a fundamental industry challenge: how to bridge the gap between cloud-native orchestration and high-performance scheduling. We are excited to announce the Preview of Cluster Director support for Slurm on GKE, utilizing SchedMD’s Slinky offering.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This initiative brings together the two standards of the infrastructure world. By running a native Slurm cluster directly on top of GKE, we are amplifying the strengths of both communities:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Researchers &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;get the uncompromised Slurm interface and batch capabilities, such as &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sbatch&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;squeue&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, that have defined HPC for decades.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Platform teams&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; gain the operational velocity that GKE, with its auto-scaling, self-healing, and bin-packing, brings to the table.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
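&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The practical payoff is that existing batch scripts carry over unchanged. Below is a minimal example of the kind of job a researcher would submit with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sbatch&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;; the resource shape and training command are placeholders, not specific to Slurm on GKE:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --nodes=2                  # backed by GKE node pools under Slinky
#SBATCH --gpus-per-node=8
#SBATCH --time=04:00:00
#SBATCH --output=%x-%j.out

# One launcher task per node; srun places tasks across the allocation.
srun --ntasks-per-node=1 python train.py --config config.yaml
&lt;/pre&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Submit with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sbatch train.sh&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and monitor with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;squeue&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, exactly as on any other Slurm cluster.&lt;/span&gt;&lt;/p&gt;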
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Slurm on GKE is strengthened by our long-standing partnership with SchedMD, which helps create a unified, open, and powerful foundation for the next generation of AI and HPC workloads. &lt;/span&gt;&lt;a href="https://forms.gle/LaV116jNy2CvAnNV8" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Request preview access now&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Try Cluster Director today&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to start using Cluster Director for your AI and HPC cluster automation? &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Learn more about the end-to-end capabilities in &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/cluster-director/docs"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style="vertical-align: baseline;"&gt;Activate &lt;/span&gt;&lt;a href="http://console.cloud.google.com/cluster-director"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Director&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the console.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><pubDate>Wed, 17 Dec 2025 18:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/cluster-director-is-now-generally-available/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Automate AI and HPC clusters with Cluster Director, now generally available</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/cluster-director-is-now-generally-available/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ilias Katsardis</name><title>Sr. Product Manager, Cluster Director, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Jason Monden</name><title>Group Product Manager, AI Infrastructure, Google Cloud</title><department></department><company></company></author></item><item><title>Google named a Leader in The Forrester Wave™: AI Infrastructure Solutions, Q4 2025</title><link>https://cloud.google.com/blog/products/compute/forrester-wave-ai-infrastructure-solutions-q4-2025-leader/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For most organizations, the question is no longer &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;if&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; they will use AI, but &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;how&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to scale it from a promising prototype into a production-grade service that drives business outcomes. In this age of inference, competitive advantage is defined by your ability to serve useful information to users around the world at the lowest possible cost. As you move from demos to production deployments at scale, you need to simplify infrastructure operations with integrated systems that provide the latest AI software and accelerator hardware platforms, while keeping costs and architectural complexity low. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Yesterday, Forrester released &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The Forrester Wave™: AI Infrastructure Solutions, Q4 2025&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; report, evaluating 13 vendors, and we believe their findings validate our commitment to solving these core challenges. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google received the highest score of all vendors in the Current Offering category &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;and received the highest possible score in 16 out of 19 evaluation criteria, including, but not limited to: Vision, Architecture, Training, Inferencing, Efficiency, and Security.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/resources/content/2025-forrester-wave-ai-infrastructure"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Access the full report&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The Forrester Wave™: AI Infrastructure Solutions, Q4 2025&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Accelerating time-to-value with an integrated system&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Enterprises don’t run AI in a vacuum. They need to integrate it with a diverse range of applications and databases while adhering to stringent security protocols. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Forrester recognized Google Cloud’s strategy of co-design by giving us the highest possible score in the Efficiency and Scalability criteria:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Google pursues a strategy of silicon-infrastructure co-design. It develops TPUs to improve inference efficiency and NVIDIA GPUs for access to broader ecosystem compatibility. Google designs TPUs to integrate tightly with its networking fabric, giving customers high bandwidth and low latency for inference at scale.”&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For over two decades, we have operated some of the world's largest services, from Google Search and YouTube to Maps, where their unprecedented scale required us to solve problems that no one else had. We couldn't simply buy the platform and infrastructure we needed; we had to invent it. This led to a decade-long journey of deep, system-level co-design, building everything from our custom network fabric and specialized accelerators to frontier models, all under one roof. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The result was an integrated supercomputing system, AI Hypercomputer, which has paid significant dividends for our customers. It supports a wide range of AI-optimized hardware, allowing you to optimize for granular, workload-level objectives — whether that's higher throughput, lower latency, faster time-to-results, or lower TCO. That means you can use our custom &lt;/span&gt;&lt;a href="https://cloud.google.com/tpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Tensor Processing Units&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (TPUs), the latest &lt;/span&gt;&lt;a href="https://cloud.google.com/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NVIDIA GPUs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, or both, backed by a system that tightly integrates accelerators with networking and storage for exceptional performance and efficiency. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;It’s&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; also why today, leading generative AI companies such as Anthropic, Lightricks, and LG AI Research trust Google Cloud to power their most demanding AI workloads.&lt;sup&gt;1&lt;/sup&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This system-level integration lays the foundation for speed, but operational complexity could still slow you down. To accelerate your time-to-market, we provide multiple ways to deploy and manage AI infrastructure, abstracting away the heavy lifting regardless of your preferred workflow. Google Kubernetes Engine (GKE) Autopilot automates management for containerized applications, helping customers like LiveX.AI reduce operational costs by 66%. Similarly, Cluster Director simplifies deployment for Slurm-based environments, enabling customers like LG AI Research to slash setup time from 10 days to under one day. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Managing AI cost and complexity&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Forrester gave Google Cloud the highest scores possible in the Pricing Flexibility and Transparency criterion. The price of compute is only one part of the AI infrastructure cost equation. A complete view should also account for development costs, downtime and inefficient resource utilization. We offer optionality at every layer of the stack to provide the flexibility businesses demand.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Flexible consumption:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Dynamic Workload Scheduler allows you to secure compute at up to 50% savings, by ensuring you only pay for the capacity you need, when you need it.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Load balancing&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: GKE Inference Gateway improves throughput by using AI-aware routing to balance requests across models, preventing bottlenecks and ensuring servers aren't sitting idle.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Eliminating data bottlenecks&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Anywhere Cache co-locates data with compute, reducing read latency by up to 96% and eliminating the "integration tax" of moving data. By using Anywhere Cache together with our unified data platform BigQuery, you can avoid latency and egress fees while keeping your accelerators fed with data. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Mitigating strategic risk through flexibility and choice&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are also committed to enabling customer choice across accelerators, frameworks and multicloud environments. This isn’t new for us. Our deep experience with Kubernetes, which we developed then open-sourced, taught us that open ecosystems are the fastest path to innovation and provide our customers with the most flexibility. We are bringing that same ethos to the AI era by actively contributing to the tools you already use.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Open source frameworks and hardware portability:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We continue to support open frameworks such as PyTorch, JAX, and Keras. We’ve also directly addressed concerns about workload portability on custom silicon by investing in TPU support for vLLM, allowing developers to easily switch between TPUs and GPUs (or use both) with only minimal configuration changes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hybrid and multicloud flexibility:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Our commitment to choice extends to where you run your applications. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Distributed Cloud&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; brings our services to on-premises, edge and cloud locations, while &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Cross-Cloud Network&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; securely connects applications and users with high-speed connectivity between your environments and other clouds. This powerful combination means you're no longer locked into a specific environment; you can easily migrate workloads and apply uniform management practices, streamlining operations, and mitigating the risk of lock-in.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Systems you can rely on&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When your entire business model depends on the availability of AI services, infrastructure uptime is critical. Google Cloud's global infrastructure is engineered for enterprise-grade reliability, an approach rooted in our history as the birthplace of Site Reliability Engineering (SRE).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We operate one of the world's largest private software-defined networks, handling approximately 25% of global internet egress traffic. Unlike providers that rely on the public internet, we keep your traffic on Google’s own fiber to improve speed, reliability, and latency. This global backbone is powered by our Jupiter data center fabric, which scales to 13 Petabits/sec of bandwidth, delivering 50x greater reliability than previous generations — to say nothing of other providers. Finally, to improve cluster-level fault tolerance, we employ capabilities like &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/elastic-training-and-optimized-checkpointing-improve-ml-goodput"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;elastic training and multi-tier checkpointing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which allow jobs to continue uninterrupted, by dynamically resizing the cluster around failed nodes while minimizing the time to recovery.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Building on a secure foundation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our approach is to secure AI from the ground up. In fact, Google Cloud maintains a leading track record for cloud security. Independent analysis from cloudvulndb.org (2024-2025) shows that our platform has up to 70% fewer critical and high vulnerabilities compared to the other two leading cloud providers. We were also the first in the industry to publish an AI/ML Privacy Commitment, which guarantees that we do not use your data to train our models. With those safeguards in place, security is integrated into the foundation of Google Cloud, based on the zero-trust principles that protect Google’s own services:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;A hardware root of trust:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Our custom Titan chips, as part of our Titanium architecture, create a verifiable hardware root of trust. We recently extended this with Titanium Intelligence Enclaves for &lt;/span&gt;&lt;a href="https://blog.google/technology/ai/google-private-ai-compute/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Private AI Compute&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, allowing you to process sensitive data in a hardened, isolated, and encrypted environment.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Built-in AI security:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/security-command-center"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Security Command Center (SCC)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; natively integrates with our infrastructure, providing &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/introducing-ai-protection-security-for-the-ai-era"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Protection&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; by automatically discovering assets, preventing security issues, detecting active threats with frontline &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/threat-intelligence"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Threat Intelligence&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and discovering known and unknown risks before attackers can exploit them.  &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Sovereign solutions:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We enable you to meet stringent data residency, operational control, and software sovereignty requirements through solutions like &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Data Boundary&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This is complemented by flexible options like partner-operated sovereign controls and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Distributed Cloud&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for air-gapped needs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Platform controls for AI and agent governance: &lt;/strong&gt;&lt;a href="https://cloud.google.com/vertex-ai"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides the essential governance layer for the enterprise builder to deploy models and agents at scale. This trust is anchored in Google Cloud’s secure-by-default infrastructure, utilizing platform controls like &lt;/span&gt;&lt;a href="https://cloud.google.com/security/vpc-service-controls"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;VPC Service Controls (VPC-SC)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kms/docs/cmek"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Customer-Managed Encryption Keys (CMEK)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to sandbox environments and protect sensitive data, and Agent Identity for granular IAM permissions. At the platform level, Vertex AI and &lt;/span&gt;&lt;a href="https://cloud.google.com/products/agent-builder"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Builder&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; integrate &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/model-armor"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Model Armor&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to provide runtime protection against emergent agentic threats, such as prompt injection and data exfiltration. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
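&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the CMEK control above concrete, here is a minimal sketch using the Vertex AI Python SDK. The project, region, key ring, key, and bucket names are placeholder assumptions; resources created after the init call inherit the customer-managed key.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Minimal sketch: applying a customer-managed encryption key (CMEK) to
# Vertex AI resources. Project, key, and bucket names are placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    # Resources created in this session are encrypted with this key.
    encryption_spec_key_name=(
        "projects/my-project/locations/us-central1/"
        "keyRings/my-ring/cryptoKeys/my-key"
    ),
)

# Example: a dataset created now inherits the key automatically.
dataset = aiplatform.TabularDataset.create(
    display_name="governed-dataset",
    gcs_source="gs://my-bucket/data.csv",
)
&lt;/pre&gt;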
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Delivering continuous AI innovation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are honored to be recognized as a Leader in The Forrester Wave™ report, which we believe validates decades of R&amp;amp;D and our approach to building ultra-scale AI infrastructure. Look to us to continue on this path of system-level innovation as we help you convert the promise of AI into a reality.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Access the full report:&lt;/strong&gt; &lt;a href="https://cloud.google.com/resources/content/2025-forrester-wave-ai-infrastructure"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;The Forrester Wave™: AI Infrastructure Solutions, Q4 2025&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;1. IDC Business Value Snapshot, Sponsored by Google Cloud, The Business Value of Google Cloud AI Hypercomputer, US53855425, October 2025&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 17 Dec 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/forrester-wave-ai-infrastructure-solutions-q4-2025-leader/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Google named a Leader in The Forrester Wave™: AI Infrastructure Solutions, Q4 2025</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/forrester-wave-ai-infrastructure-solutions-q4-2025-leader/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Lohmeyer</name><title>VP and GM, AI and Computing Infrastructure</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Saurabh Tiwary</name><title>VP &amp; GM, Cloud AI</title><department></department><company></company></author></item><item><title>AI agents are here. Is your infrastructure ready?</title><link>https://cloud.google.com/blog/products/compute/idc-on-the-ai-efficiency-gap/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;&lt;strong&gt;Editor’s note&lt;/strong&gt;: Today we hear from Dave McCarthy of IDC about a total cost of ownership crisis for AI infrastructure — and what you can do about it. Read on for his insights.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The AI landscape is undergoing a seismic shift. For the past few years, the industry has been focused on the massive, resource-intensive process of training generative AI models. But the focus is now rapidly pivoting to a new, even larger challenge: inference.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Inference — the process of using a trained model to make real-time predictions — is no longer just one part of the AI lifecycle; it is quickly becoming the dominant workload. In a recent IDC global survey of over 1,300 AI decision-makers, inference was already cited as the largest AI workload segment, accounting for 47% of all AI operations.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This dominance is driven by the sheer volume of real-world applications. While a model is trained periodically, it is used for inference non-stop, with every user query, API call, and recommendation. It is also critical to recognize that this inference surge will be distributed across hybrid environments. According to IDC survey respondents, 63% of workloads will reside in the cloud, which remains the standard for scalable applications like content creation and chatbots. In contrast, 37% will be deployed on on-premises infrastructure, usually related to use cases such as robotics and other systems that interact directly with the physical world.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, a new factor is set to multiply this demand: the rise of autonomous and semi-autonomous AI agents.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These "agentic workflows" represent the next logical step in AI, where models don't just respond to a single prompt but execute complex, multi-step tasks. An AI agent might be asked to "plan a trip to Paris," requiring it to perform dozens of interconnected operations: browsing for flights, checking hotel availability, comparing reviews, and mapping locations. Each of these steps is an inference operation, creating a cascade of requests that must be orchestrated across different systems.&lt;/span&gt;&lt;/p&gt;
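&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The multiplication effect is easy to see in miniature. The following is a deliberately toy sketch, not a real agent framework: call_model stands in for any inference endpoint, and the step and phase names simply mirror the trip-planning example above.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Hypothetical sketch: one agentic request fans out into many inference calls.
# call_model() stands in for any hosted-model endpoint; names are illustrative.

def call_model(prompt):
    """Placeholder for a single inference operation (one model call)."""
    return f"model output for: {prompt}"

def plan_trip(destination):
    steps = [
        f"browse flights to {destination}",
        f"check hotel availability in {destination}",
        f"compare reviews for hotels in {destination}",
        f"map locations near {destination}",
    ]
    results = []
    inference_calls = 0
    for step in steps:
        # Each sub-task may itself need several calls (tool choice,
        # extraction, summarization), so the fan-out compounds.
        for phase in ("decide tool", "extract results", "summarize"):
            results.append(call_model(f"{phase}: {step}"))
            inference_calls += 1
    print(f"1 user request became {inference_calls} inference operations")
    return results

plan_trip("Paris")  # prints: 1 user request became 12 inference operations
&lt;/pre&gt;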
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This surge in demand is exposing a critical vulnerability for many organizations: the AI efficiency gap.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The TCO crisis in an age of agents&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The AI efficiency gap is the difference between the theoretical performance of an AI stack and the actual, real-world performance achieved. This gap is the source of a Total Cost of Ownership (TCO) crisis, and it’s driven by system-wide inefficiencies.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our research shows that more than half (54.3%) of organizations use multiple AI frameworks and hardware platforms. While this flexibility seems beneficial, it has a staggering downside: 92% of these organizations report a negative effect on efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This fragmented "patchwork" approach, stitched together from disparate and non-optimized services, creates a ripple effect of problems:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;41.6% reported increased compute costs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Redundant processes and poor utilization drive up spending.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;40.4% reported increased engineering complexity&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Teams spend more time managing the fragmented stack than delivering value.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;40.0% reported increased latency&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Bottlenecks in one part of the system (like storage or networking) degrade the overall performance of an application.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The core problem is that organizations are paying for expensive, high-performance accelerators, but are failing to keep them busy. Our data shows that 29% of all AI budget waste is tied to inference. This waste is a direct result of idle GPU time (cited by 29.4% of respondents) and inefficient use of resources (22.3%).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When an expensive accelerator is idle, it’s often waiting for data from a slow storage system or for the application server to prepare the next request. This is a system-level failure, not a component failure.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This failure is often compounded by significant hurdles in data management, which serves as the fuel for these AI engines. Survey respondents highlighted three primary challenges contributing to this gap: 47.7% struggle with ensuring data quality and governance, 45.6% grapple with data storage management and related costs, and 44.1% cite the complexity and time required for data cleaning and preparation. When data pipelines cannot keep pace with high-speed accelerators, the entire infrastructure becomes inefficient.&lt;/span&gt;&lt;/p&gt;
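&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A back-of-the-envelope calculation shows why the idle time described above dominates unit economics. The hourly rate and throughput figures below are illustrative assumptions, not survey data; the point is how quickly cost per token grows as the busy fraction falls.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Back-of-the-envelope sketch of how idle time inflates inference unit cost.
# The hourly rate and utilization figures below are illustrative assumptions.

hourly_rate = 40.0          # assumed cost of one accelerator VM, $/hour
tokens_per_second = 5000    # assumed serving throughput when busy

def cost_per_million_tokens(utilization):
    """Effective $ per 1M tokens at a given busy fraction (0..1)."""
    busy_seconds_per_hour = 3600 * utilization
    tokens_per_hour = tokens_per_second * busy_seconds_per_hour
    return hourly_rate / (tokens_per_hour / 1_000_000)

for util in (0.9, 0.5, 0.3):
    print(f"{util:.0%} utilized: ${cost_per_million_tokens(util):.3f} per 1M tokens")

# At 30% utilization the same hardware costs 3x more per token than at 90%,
# which is the system-level waste the survey respondents describe.
&lt;/pre&gt;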
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Closing the gap: From fragmented stacks to integrated systems&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To scale cost-effectively in the age of AI agents, we must stop thinking about individual components and start focusing on system-level design.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;An agentic workflow, for example, requires tight coordination between two distinct types of compute:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;General-purpose compute&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This is the operational backbone. It runs the application servers, orchestrates the workflow, pre-processes data, and handles all the logic around the model.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Specialized accelerators&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This is the high-performance engine that runs the AI model itself.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a fragmented environment, these two sides are inefficiently connected, and latency skyrockets. The path forward is an optimized architecture where the software, networking, storage, and compute — both general-purpose and specialized — are designed to work as a single, cohesive system.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This holistic approach is the only sustainable way to manage the TCO of AI. It redefines the goal away from simply buying faster accelerators and toward improving the overall "price-performance" and "unit economics" of the entire end-to-end workflow. By eliminating bottlenecks and maximizing the utilization of every resource, organizations can finally close the efficiency gap.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Organizations are actively shifting strategies to capture this value. Our survey indicates that 28.9% of respondents are prioritizing model optimization techniques, while 26.3% are partnering with AI service providers to navigate this complexity. Additionally, 25% are investing in training to upskill their teams, ensuring they can increase the value of their AI investments.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The age of inference is here, and the age of agents is right behind it. This next wave of innovation will be won not by the organizations with the most powerful accelerators, but by those who build the most efficient, integrated, and cost-effective systems to power them.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A message from Google Cloud&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;We sponsored this IDC research to help IT leaders navigate the critical shift to the "Age of Inference." We recognize that the "efficiency gap" identified here — driven by fragmented stacks and idle resources — is the primary barrier to sustainable ROI. That is why we created AI Hypercomputer: an integrated supercomputer system designed to deliver exceptional performance and efficiency for demanding AI workloads. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;IDC surveyed 1,300 global IT leaders to uncover how they are designing their stack for maximum efficiency and ROI. Get your free copy of the whitepaper to learn more: &lt;/span&gt;&lt;a href="https://cloud.google.com/resources/content/ai-efficiency-gap"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;The AI Efficiency Gap: From TCO Crisis to Optimized Cost and Performance&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 11 Dec 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/idc-on-the-ai-efficiency-gap/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>AI agents are here. Is your infrastructure ready?</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/idc-on-the-ai-efficiency-gap/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Dave McCarthy</name><title>Research Vice President, Cloud and Edge Infrastructure Services, IDC</title><department></department><company></company></author></item><item><title>Nutanix NC2 is now officially supported on Google Cloud</title><link>https://cloud.google.com/blog/topics/partners/nutanix-nc2-generally-available-google-cloud/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are thrilled to announce Nutanix Cloud Clusters (NC2) is generally available on Google Cloud.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;NC2 on Google Cloud is designed to help you migrate and modernize specialized, regulated, and mission-critical applications without refactoring your workloads or compromising on performance. This partnership brings the power of Google Cloud’s infrastructure and advanced AI models to your hybrid cloud while preserving data residency, connectivity, and operational consistency. You can now run your Nutanix Hybrid Cloud directly on &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/bare-metal-instances"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Compute Engine&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"The General Availability of Nutanix Cloud Clusters (NC2) on Google Cloud is a significant milestone empowering our joint customers to become AI-ready. We are excited to extend the simplicity and resilience of Nutanix NC2 onto Google Cloud's high-performance workload-optimized compute. Nutanix on Google Cloud enables our customers to migrate and modernize their critical workloads while unlocking the full power of Google’s industry-leading data and AI capabilities." &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- Saveen Pakala, VP, Product Management, Hybrid Cloud, Nutanix&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Nutanix and Google Cloud allow you to maximize agility and minimize disruption for your critical applications. By combining NC2’s enterprise flexibility with Google Cloud’s power, you gain three core advantages. First, your workloads run on Compute Engine’s dynamically scalable, workload-optimized infrastructure. Nutanix NC2 supports Compute Engine bare metal instances in the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/storage-optimized-machines#z3_machine_types"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Z3&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/general-purpose-machines#c4_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; families. These are powered by the &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium offload system&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and leverage &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/disks/local-ssd"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium SSDs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for low-latency, high-throughput storage performance, hosted in Google Cloud with global reach, enterprise-grade security, and a commitment to sustainability. Second, you accelerate AI innovation by co-locating your data with machine learning services like Gemini Enterprise and Vertex AI. Finally, you can save costs by dynamically scaling capacity and utilizing committed use discounts (CUDs) and Flex CUDs.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Key use cases to accelerate your cloud journey&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The integration of NC2 on Google Cloud offers flexible, strategic options for hybrid cloud operations. Beyond consolidation and cost control, these capabilities set the stage for true modernization:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Seamless workload migration: Move entire applications between your on-premises Nutanix environment and Google Cloud without re-factoring or re-architecting. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;This capability saves significant time during data center consolidation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Consistent operations: Maintain the &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;same management plane, security policies, and automation&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; across your private data center and Google Cloud, which dramatically reduces operational complexity and training costs.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Disaster recovery (DR): Leverage Google Cloud as a robust and cost-efficient recovery target. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Usage of a minimal “pilot light” cluster reduces compute costs, so you scale up only when a disaster event occurs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Capacity bursting: Instantly add capacity in the cloud to handle seasonal demands, VDI workloads, development/test &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;cycles&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;, or requirements from mergers and acquisitions (M&amp;amp;A).&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;License portability: Protect your software investments by easily moving your existing &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Nutanix software licenses&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; to Google Cloud as your business needs evolve.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Like many others, we are always on a journey to modernize and shift to achieve the best outcomes for our customers. Nutanix Cloud Clusters (NC2) on Google Cloud brings us a solid platform to continue our hybrid cloud expansion. Our ability to seamlessly run workloads on-premises and on NC2 on Google Cloud without having to re-factor is increasingly valuable as we continue our modernization journey. We look forward to continuing our strong partnership with Google Cloud and Nutanix.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- VP of IT at a global oil &amp;amp; gas company based in Oklahoma&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The architecture &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;NC2 on Compute Engine simplifies building a hybrid cloud by deploying the Nutanix Cloud Infrastructure (NCI) software stack, including the Acropolis Hypervisor (AHV), directly onto high-performance Compute Engine infrastructure.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_dJgDPX1.max-1000x1000.png"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The key components of the solution include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Compute Engine instances:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; NC2 runs on &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/bare-metal-instances"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Compute Engine bare metal instances&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the recently introduced C4 and Z3 machine families.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;These powerful instances provide the foundation with high-density compute, memory, local NVMe storage, and high network bandwidth.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div align="center"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table&gt;&lt;colgroup&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Machine Family &lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;GCE Machine Type&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;vCPUs&lt;/strong&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;Memory&lt;/strong&gt;  &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;Storage&lt;/strong&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;Processor&lt;/strong&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Z3, Storage Optimized &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;z3-highmem-192-highlssd-metal&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;192&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;1,536 GB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;72 TB NVMe Local SSD&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Intel Sapphire Rapids&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4, General Purpose &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;c4-highmem-288-lssd-metal&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;288&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;1,080 GB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;18 TB NVMe Local SSD&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Intel Granite Rapids&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4, General Purpose &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;c4-standard-288-lssd-metal&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;288&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;2,232 GB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;18 TB NVMe Local SSD&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Intel Granite Rapids&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Simplified networking :&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; NC2 runs entirely within your existing Google Cloud Virtual Private Cloud (&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;VPC&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;). Built-in Nutanix Flow Virtual Networking for overlay is integrated to reduce hybrid cloud complexity. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Unified management:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The entire environment, both on-premises and in Google Cloud, is managed through the familiar&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Prism Central&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;console, simplifying day-to-day operations and skill requirements for your IT teams.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Easy procurement:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Later this month, you’ll be able to purchase Nutanix NC2 licensing directly from &lt;/span&gt;&lt;a href="https://cloud.google.com/marketplace?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Marketplace&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; . This offers a single, unified billing experience for both your Google Cloud infrastructure and Nutanix NC2, in one simple process. A key benefit is the ability to use your existing Google Cloud spend commitments for Nutanix NC2 software. This helps you maximize your investment and streamline your financial operations, providing more value from your cloud budget.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
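&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The shapes in the machine-family table above are ordinary Compute Engine machine types. NC2 clusters themselves are deployed through Nutanix tooling, but purely to illustrate the machine-type naming, the sketch below requests a Z3 bare metal instance with the Compute Engine Python client library. The project, zone, network, and boot image are placeholder assumptions; an actual NC2 node runs the Nutanix software stack, not a stock OS image.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Illustrative sketch only: requesting one of the bare metal shapes from the
# table with the Compute Engine Python client. Project, zone, network, and
# image are placeholders; actual NC2 nodes are provisioned by Nutanix tooling.
from google.cloud import compute_v1

project, zone = "my-project", "us-central1-a"

instance = compute_v1.Instance()
instance.name = "bare-metal-demo"
instance.machine_type = (
    f"zones/{zone}/machineTypes/z3-highmem-192-highlssd-metal"
)

boot_disk = compute_v1.AttachedDisk(
    boot=True,
    auto_delete=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_image="projects/debian-cloud/global/images/family/debian-12",
    ),
)
instance.disks = [boot_disk]
instance.network_interfaces = [
    compute_v1.NetworkInterface(network="global/networks/default")
]

operation = compute_v1.InstancesClient().insert(
    project=project, zone=zone, instance_resource=instance
)
operation.result()  # waits for the provisioning operation to finish
&lt;/pre&gt;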
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Connect your data to Google Cloud AI and analytics&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A significant modernization opportunity comes from connecting your stable, trusted Nutanix workloads with Google Cloud's powerful data and AI tools. Your applications running on NC2 can tap directly into services like &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;BigQuery&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with low latency, enabling you to:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Derive deeper business value:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Easily send application log data, transactional records, and other operational data from your Nutanix VMs to BigQuery for real-time, scalable data warehousing and complex analysis.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Build custom machine learning models:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Use Vertex&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;AI to create, deploy, and manage custom ML models that analyze data generated by your core applications (e.g., predictive maintenance or fraud detection).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Use conversational AI:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Quickly build and deploy conversational agents using technologies like Dialogflow that interact directly with the application data residing on your NC2 cluster.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
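&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As referenced above, here is a minimal sketch of shipping operational rows from an NC2-hosted application to BigQuery with the Python client library. The project, dataset, table, and field names are placeholder assumptions; the target table must already exist with a matching schema.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Minimal sketch: streaming operational rows from an NC2-hosted app into
# BigQuery. Project, dataset, table, and field names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # placeholder project ID
table_id = "my-project.app_logs.transactions"    # placeholder table

rows = [
    {"order_id": "A-1001", "amount_usd": 42.50, "status": "settled"},
    {"order_id": "A-1002", "amount_usd": 17.25, "status": "pending"},
]

# Streaming insert; the table must already exist with a matching schema.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("insert failed:", errors)
else:
    print("rows are queryable in BigQuery immediately")
&lt;/pre&gt;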
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Ready to simplify your cloud operations?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;NC2 on Google Cloud is currently available across 17 Google Cloud regions, with a planned expansion continuing through 2026. For precise details on regional and zonal availability, please check the official &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/bare-metal-instances#regions_zones"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Compute Engine bare metal regional availability&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; documentation, and reference the &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/all-pricing?e=48754805&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Compute Engine pricing page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for infrastructure costs. To learn more about the solution, try taking a &lt;/span&gt;&lt;a href="https://cloud.nutanixtestdrive.com/login?type=nc2gcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;test drive&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or visit the &lt;/span&gt;&lt;a href="https://cloud.google.com/find-a-partner/partner/nutanix-inc"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Nutanix partner page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Later this month, you will also be able to explore NC2 on Google Cloud licensing through the Google Cloud Marketplace.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 09 Dec 2025 14:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/partners/nutanix-nc2-generally-available-google-cloud/</guid><category>Compute</category><category>Infrastructure</category><category>Partners</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Nutanix NC2 is now officially supported on Google Cloud</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/partners/nutanix-nc2-generally-available-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yarden Halperin</name><title>Product Manager, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ziv Kalmanovich</name><title>Group Product Manager, Google Cloud</title><department></department><company></company></author></item><item><title>Running high-scale reinforcement learning (RL) for LLMs on GKE</title><link>https://cloud.google.com/blog/products/compute/run-high-scale-rl-for-llms-on-gke/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As Large Language Models (LLMs) evolve, Reinforcement Learning (RL) is becoming the crucial technique for aligning powerful models with human preferences and complex task objectives.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;However, enterprises that need to implement and scale RL for LLMs are facing infrastructure challenges. The primary hurdles include memory contention from concurrently hosting multiple large models (such as the actor, critic, reward, and reference models) and the iterative switching between high-latency inference generation and high-throughput training phases.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This blog details Google Cloud's full-stack, integrated approach, from custom TPU hardware to the GKE orchestration layer — and shares how you can solve the hybrid, high-stakes demands of RL at scale.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A quick primer: Reinforcement Learning (RL) for LLMs&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;RL is a continuous feedback loop that combines elements of both training and inference. At a high level, the RL loop for LLMs functions as follows:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The LLM generates a response to a given prompt.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;A "reward model" (often trained on human preferences) assigns a quantitative score, or reward, to the output.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;An RL algorithm (e.g., DPO, GRPO) uses this reward signal to update the LLM's parameters, adjusting its policy to generate higher-rewarding outputs in subsequent interactions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This cycle of generation, evaluation, and optimization continually improves the LLM's performance based on predefined objectives.&lt;/span&gt;&lt;/p&gt;
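&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The loop is easiest to see in miniature. The following toy sketch is a rough REINFORCE-style stand-in rather than DPO or GRPO: it replaces the LLM with a one-parameter Gaussian sampler and the reward model with a hand-written preference. Every name in it is illustrative.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Toy sketch of the loop described above: generate, score, update.
# The "policy" is one scalar (the mean of a Gaussian sampler) and the reward
# model is a hand-written preference; illustrative stand-ins, not DPO/GRPO.
import random

policy_mean = 0.0      # stands in for the LLM's parameters
TARGET = 3.0           # outputs near this value are "preferred"

def generate():
    # Step 1: the model samples a response.
    return random.gauss(policy_mean, 1.0)

def reward(response):
    # Step 2: a reward model scores the output (closer to TARGET is better).
    return -abs(response - TARGET)

baseline = -TARGET     # running average reward, used as a simple baseline

for step in range(2000):
    response = generate()
    r = reward(response)
    baseline = 0.99 * baseline + 0.01 * r
    advantage = r - baseline
    # Step 3: REINFORCE-style update for a Gaussian policy mean:
    # move toward samples that scored above the running baseline.
    policy_mean += 0.02 * advantage * (response - policy_mean)

print(f"policy mean after training: {policy_mean:.2f}  (target was {TARGET})")
&lt;/pre&gt;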
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;RL workloads are hybrid and cyclical. The main goal of RL is not to minimize error (as in training) or to produce fast predictions (as in inference), but to maximize reward through iterative interaction. The primary constraint for the RL workload is not just computational power, but also system-wide efficiency: specifically, minimizing aggregate sampler latency and maximizing the speed of weight copying for efficient end-to-end step time.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud's full-stack approach to RL&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Solving these system-wide challenges requires an integrated approach. You can't just have fast hardware or a good orchestrator; you need every layer of the stack to work together. Here is how our full-stack approach is built to solve the specific demands of RL:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Flexible, high-performance compute (TPUs and GPUs):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Instead of locking customers into one path, we provide two high-performance options. Our &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;TPU stack&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is a vertically integrated, JAX-native solution where our custom hardware (excelling at matrix operations) is co-designed with our post-training libraries (MaxText and Tunix). In parallel, we fully support the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA GPU ecosystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, partnering with NVIDIA on optimized NeMo RL recipes so customers can leverage their existing expertise directly on GKE.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Holistic, full-stack optimization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We integrate optimization from the bare metal up. This includes our custom TPU accelerators, high-throughput storage (Managed Lustre, Google Cloud Storage), and — critically — the orchestration and scheduling that GKE provides. By optimizing the entire stack, we can attack the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;system-wide&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; latencies that bottleneck hybrid RL workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Leadership in open-source:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; RL infrastructure is complex and built on a wide range of tools. Our leadership starts with open-sourcing Kubernetes and extends to active partnerships with orchestrators like Ray. We contribute to key projects like vLLM, develop open-source solutions like llm-d for cost-effective serving, and open-source our own high-performance MaxText and Tunix libraries. This helps ensure you can integrate the best tools for the job, not just the ones from a single vendor.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;4. Proven, mega-scale orchestration:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Post-training RL can require compute resources that rival pre-training. This requires an orchestration layer that can manage massive, distributed jobs as a single unit. GKE AI mega-clusters support up to 65,000 nodes today, and we are heavily investing in multi-cluster solutions like MultiKueue to scale RL workloads beyond the limits of a single cluster.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Running RL workloads on GKE&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Existing GKE infrastructure is well-suited for demanding RL workloads and provides several infrastructure-level efficiencies. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The image below outlines the architecture and key recommendations for implementing RL at scale. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_HnbQkXW.max-1000x1000.png"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="drc60"&gt;Figure : GKE infrastructure for running RL&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At the base, the infrastructure layer provides the foundational hardware, including supported compute types (CPUs, GPUs, and TPUs). You can use the Run:ai model streamer to accelerate the model streaming for all three compute types. High performance storage (Managed Lustre, Cloud Storage) can be used for storage needs for RL. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The middle layer is the managed Kubernetes layer powered by GKE, which handles resource orchestration, resource obtainability (using Spot VMs or Dynamic Workload Scheduler), autoscaling, placement, job queuing, job scheduling, and more at mega scale.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Finally, the open frameworks layer runs on top of GKE, providing the application and execution environment. This includes managed support for open-source tools such as KubeRay and Slurm, as well as the gVisor sandbox for secure, isolated task execution.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Building RL workflow&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before creating an RL workload, you must first identify a clear use case. With that objective defined, you then architect the core components: selecting the algorithm (e.g., DPO, GRPO), the model server (like vLLM or SGLang), the target GPU/TPU hardware, and other critical configurations.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Next, you can provision a GKE cluster configured with Workload Identity, Cloud Storage FUSE, and DCGM metrics. For robust batch processing, install the Kueue and JobSet APIs. We recommend deploying Ray as the orchestrator on top of this GKE stack. From there, you can launch the NeMo RL container, configure it for your GRPO job, and begin monitoring its execution; a minimal sketch of handing a queued Job to Kueue follows below. For the detailed implementation steps and source code, please refer to this &lt;/span&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/RL/a4/recipes/qwen2.5-1.5b/nemoRL" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
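&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here is that sketch, using the Kubernetes Python client on a cluster with Kueue installed. The queue name, container image, command, and GPU count are placeholder assumptions; an actual run would use the NeMo RL container and configuration from the repository linked above. The kueue.x-k8s.io/queue-name label is what routes the Job through Kueue's admission and quota machinery.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Minimal sketch: submitting a queued training Job on a GKE cluster that has
# Kueue installed. Queue name, image, command, and resources are placeholders.
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl credentials for the GKE cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(
        name="grpo-demo",
        # This label hands the Job to Kueue for admission and scheduling.
        labels={"kueue.x-k8s.io/queue-name": "rl-queue"},
    ),
    spec=client.V1JobSpec(
        # Kueue unsuspends the Job once quota in the queue is available.
        suspend=True,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="us-docker.pkg.dev/my-project/rl/nemo-rl:latest",
                        command=["python", "train_grpo.py"],
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "8"},
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
&lt;/pre&gt;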
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Getting started with RL&lt;/strong&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Run RL on GPUs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Try the RL recipe on TPUs using &lt;/span&gt;&lt;a href="https://maxtext.readthedocs.io/en/latest/tutorials/grpo_with_pathways.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText and Pathways&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for GRPO algorithm, or if you use GPUs, try the &lt;/span&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/RL/a4/recipes" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NemoRL recipes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Partner with the open-source ecosystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Our leadership in AI is built on open standards like Kubernetes, llm-d, Ray, MaxText or Tunix. We invite you to partner with us to build the future of AI together. Come contribute to llm-d! Join the &lt;/span&gt;&lt;a href="https://llm-d.ai/docs/community" rel="noopener" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, 'Open Sans', 'Helvetica Neue', sans-serif;" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;llm-d community&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, check out the repository on GitHub, and help us define the future of open-source LLM serving.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;</description><pubDate>Mon, 10 Nov 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/run-high-scale-rl-for-llms-on-gke/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Running high-scale reinforcement learning (RL) for LLMs on GKE</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/run-high-scale-rl-for-llms-on-gke/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Poonam Lamba</name><title>Senior Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Bogdan Berce</name><title>Software Engineer</title><department></department><company></company></author></item><item><title>N4D now GA: Gain up to 3.5x price-performance for scale-out workloads</title><link>https://cloud.google.com/blog/products/compute/n4d-vms-based-on-amd-turin-now-ga/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In today's competitive environment, IT leaders are faced with supporting application scale, rolling out more features, and enabling high-bar customer experiences. This creates a direct and complex challenge: finding the right balance between performance and total cost of ownership (TCO) for the general-purpose workloads that power everyday business operations.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are announcing the general availability of the N4D machine series, the latest addition to Google Compute Engine’s cost-optimized, general-purpose portfolio. Addressing a wide range of workloads, such as web and application servers, data analytics platforms, and containerized microservices, N4D provides a flexible and price-performant solution.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The N4D machine series combines Google's &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; infrastructure with 5th Gen &lt;/span&gt;&lt;a href="https://www.amd.com/en/products/processors/server/epyc/9005-series.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AMD EPYC™ “Turin” processors&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, delivering up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;3.5x the throughput for web-serving workloads&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; vs. the previous-generation N2D. N4D offers predefined shapes of up to 96 vCPUs and 768 GB of DDR5 memory, up to 50 Gbps of networking bandwidth, and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/disks/hyperdisks"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperdisk&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; Balanced and Throughput storage. To deliver blended cost savings, N4D allows you to move beyond rigid instance sizing for both compute and storage, with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/creating-instance-with-custom-machine-type"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Custom Machine Types&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to independently configure the exact number of vCPUs and amount of memory, complemented with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/disks/hyperdisks"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperdisk&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for tuning disk storage performance and capacity. For the most demanding general purpose workloads, pair N4D together with the consistently high performance of &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/c4d-vms-unparalleled-performance-for-business-workloads?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4D&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
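&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To illustrate the independent sizing described above, here is a minimal sketch with the Compute Engine Python client. It assumes the standard custom machine-type URI pattern (FAMILY-custom-VCPUS-MEMORY_MB) applies to N4D; the project, zone, image, and exact shape are placeholder assumptions.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Sketch: requesting an N4D shape sized independently for vCPU and memory,
# assuming the standard custom machine-type URI pattern applies to N4D.
# Project, zone, and boot image are placeholders.
from google.cloud import compute_v1

project, zone = "my-project", "us-central1-a"

instance = compute_v1.Instance()
instance.name = "webtier-1"
# 10 vCPUs and 20 GB of memory, rather than the nearest predefined shape:
instance.machine_type = f"zones/{zone}/machineTypes/n4d-custom-10-20480"

boot = compute_v1.AttachedDisk(
    boot=True,
    auto_delete=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_image="projects/debian-cloud/global/images/family/debian-12",
        # N4D pairs with Hyperdisk; Balanced is the general-purpose choice.
        disk_type=f"zones/{zone}/diskTypes/hyperdisk-balanced",
    ),
)
instance.disks = [boot]
instance.network_interfaces = [
    compute_v1.NetworkInterface(network="global/networks/default")
]

compute_v1.InstancesClient().insert(
    project=project, zone=zone, instance_resource=instance
)
&lt;/pre&gt;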
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud provides workload-optimized infrastructure to ensure the right resources are available for every task. Titanium in particular, with its multi-tier offloads and security capabilities, is foundational to that infrastructure. Titanium offloads networking and storage processing to free up the CPU, and its dedicated SmartNIC manages all I/O, ensuring the AMD EPYC cores are reserved exclusively for your application. Titanium is part of Google Cloud’s vertically integrated stack — from the custom silicon in our servers to our &lt;/span&gt;&lt;a href="https://cloud.google.com/about/locations"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;planet-scale network&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; traversing 7.75 million kilometers of terrestrial and subsea fiber across 42 regions — that is engineered to maximize efficiency and provide the ultra-low latency and high bandwidth to customers at global scale.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A new standard for price-performance&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The N4D machine series doesn’t just inch past the previous N2D generation; it sprints, delivering up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;50% higher price-performance&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for general computing workloads and up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;70% better price-performance&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for Java workloads. For web-serving workloads, N4D leverages Titanium and AMD’s Turin processors to drive incredible throughput. This results in up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;3.5x the price-performance&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; vs. N2D, driving faster response times and a better overall experience for your end users.&lt;/span&gt;&lt;/p&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_2hTLTQA.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="4x0iy"&gt;As of October 2025. Performance based on the estimated SPECrate®2017_int_base, estimated SPECjbb2015, and Google internal Nginx Reverse Proxy benchmark scores run in production. Price-performance claims based on published and estimated list prices for Google Cloud.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Chronosphere.max-1000x1000.jpg"
        
          alt="Chronosphere"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="f72bn"&gt;&lt;i&gt;“Our edge proxy fleet and internal data pipelines observed a&lt;/i&gt; &lt;b&gt;&lt;i&gt;3-4x performance improvemen&lt;/i&gt;&lt;/b&gt;&lt;i&gt;t on Google Cloud's N4D instances compared to N2D. Our benchmarks also show N4D processes the same workload with significantly greater consistency while using just a fraction of the CPU. This leap in price-performance allows us to efficiently scale our general-purpose workloads, and fits neatly in our fleet alongside more specific Google compute products we leverage.”&lt;/i&gt; - Matt Schallert, Member of Technical Staff, Chronosphere&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/MediaGo.max-1000x1000.jpg"
        
          alt="MediaGo"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="f72bn"&gt;&lt;i&gt;“A&lt;/i&gt; &lt;b&gt;&lt;i&gt;10% increase in throughput while cutting costs by up to 50%&lt;/i&gt;&lt;/b&gt;&lt;i&gt; is a massive win for TCO optimization. That's what we achieved on Google Cloud's N4D machine series. For MediaGo, this efficiency is critical. It allows our AI-driven advertising platform to scale more cost-effectively, directly supporting our mission to maximize ROI for our global partners.”&lt;/i&gt; - MediaGo&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/phoronix.max-1000x1000.jpg"
        
          alt="phoronix"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="f72bn"&gt;&lt;i&gt;"The move from N2D to N4D is a significant generational leap. This&lt;/i&gt;&lt;b&gt;&lt;i&gt; 144.14% performance uplift over 152 tests&lt;/i&gt;&lt;/b&gt;&lt;i&gt; is a testament to Google's Titanium, unlocking the full potential of the new AMD EPYC 'Turin' processors. For those looking for the best possible price-performance in Google Cloud, the N4D instances are a clear winner."&lt;/i&gt; - Michael Larabel, Founder and Principal Author, Phoronix (Read the full study &lt;a href="https://www.phoronix.com/review/google-cloud-n4d-amd-epyc-turin"&gt;here&lt;/a&gt;.)&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/amd_LIvoHWP.max-1000x1000.jpg"
        
          alt="amd"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="f72bn"&gt;&lt;i&gt;"With the launch of the new N4D instances, Google Cloud now offers&lt;/i&gt; &lt;b&gt;&lt;i&gt;the most comprehensive portfolio based on our 5th Gen AMD EPYC processors&lt;/i&gt;&lt;/b&gt;&lt;i&gt;, marking a significant milestone in our strategic partnership. N4D machine series combines the leading performance of AMD CPUs with the uniqueness of Google's Custom Machine Types to deliver a remarkable uplift in price-performance, flexibility, and cost-optimization for everyday workloads. Our benchmark tests confirm this, showing measured performance gains of up to 75% over the previous generation N2D machine series for media encode and transcode workloads."&lt;/i&gt; – Ryan Rodman, Sr Director, Cloud Business Group, AMD&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Complementing C4D machine series&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Earlier this year, we introduced our general-purpose &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/c4d-vms-unparalleled-performance-for-business-workloads"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4D machine series&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; built on the same underlying processor as N4D. Its consistently high performance and enterprise features like advanced maintenance support, larger shapes, and our next-gen Titanium Local SSDs, make C4D a great fit for critical workloads. In fact, customers such as &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/c4d-vms-unparalleled-performance-for-business-workloads?e=48754805#:~:text=%E2%80%9CSilk%20has%20tested,D%20Officer%2C%20Silk"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Silk&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/c4d-vms-unparalleled-performance-for-business-workloads?e=48754805#:~:text=%22We%20are%20constantly,Engineer%2C%20Chess.com"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Chess.com&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; report greater than 40% improvement in performance with C4D over prior generations. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But critical applications are only part of the story. A modern cloud architecture must also run countless general-purpose workloads where flexibility and price-performance are key. That’s why we designed N4D — as a complement to C4D. By leveraging C4D and N4D in tandem, you unlock the full spectrum of enterprise features, performance, flexibility, and cost-optimization, choosing:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;C4D for consistent performance:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This is your solution for the most demanding, latency-sensitive applications. With up to 200 Gbps networking, Local SSD support along with larger shapes up to 384 vCPUs and bare metal options, C4D delivers predictable, high-end performance for large databases, high-traffic ad and game servers, and demanding AI/ML inference workloads.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;N4D for flexible cost-optimization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This is the engine for the vast majority of your general-purpose workloads. N4D’s leading price-performance, low cost, and flexibility allow you to slash TCO for applications like web servers, microservices, and development environments.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This approach is already delivering real-world results, allowing customers like Verve to optimize their business from both ends.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/verve.max-1000x1000.jpg"
        
          alt="verve"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="f72bn"&gt;"&lt;i&gt;With Google's Gen4 AMD portfolio, we can optimize for both revenue and cost simultaneously.&lt;/i&gt; &lt;b&gt;&lt;i&gt;C4D provides the consistent peak performance we need for our core ad servers&lt;/i&gt;&lt;/b&gt; &lt;i&gt;— 81% faster than C3D — which directly translates to more revenue from higher fill-rates (successful bid/ask matching). Meanwhile,&lt;/i&gt; &lt;b&gt;&lt;i&gt;N4D delivers an incredible 2x performance and price-performance over N2D for everyday workloads&lt;/i&gt;&lt;/b&gt;&lt;i&gt;, including scale-out microservices with GKE, enabling us to grow while slashing our overall TCO. This 'Better Together' strategy allows us to use the consistently peak performance of C4D for our mission-critical services and the flexible, cost-efficient N4D to aggressively reduce TCO everywhere else — a level of optimization that simply isn't possible with a single VM type elsewhere.” -&lt;/i&gt; Pablo Loschi, Principal Systems Engineer at Verve&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The Custom Machine Type and Hyperdisk advantage&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Custom Machine Types are a key differentiator for Google Cloud, letting you go beyond predefined "T-shirt sizes". Instead of forcing your workload into a box, you can tailor the infrastructure to fit your workload's needs, saving on cost. For instance, a memory-intensive workload requiring 16 vCPUs and 70 GB of RAM might typically be placed on a predefined N4D-highmem-16 shape, forcing you to pay for unused resources. With CMTs, you provision the exact 16 vCPU and 70 GB configuration, eliminating that waste and achieving up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;17% cost savings&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With shapes of up to 96 vCPUs and 768 GB of DDR5 memory, the combination of Custom Machine Types and N4D lets you dial in the exact resources you need with flexible vCPU-to-memory ratios along with extended memory support. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/symbotic.max-1000x1000.jpg"
        
          alt="symbotic"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="f72bn"&gt;&lt;i&gt;“At Symbotic, our vision is to revolutionize the global supply chain with an AI-powered robotics platform built for scale and efficiency. This demands an infrastructure that is both powerful and scalable. Google Cloud's N4D VMs, powered by AMD's latest EPYC processors, delivered exactly that. We observed a&lt;/i&gt; &lt;b&gt;&lt;i&gt;significant 40% performance uplift&lt;/i&gt;&lt;/b&gt; &lt;i&gt;compared to the previous N2D generation, allowing us to cut&lt;/i&gt;&lt;b&gt;&lt;i&gt; our CPU footprint in half&lt;/i&gt;&lt;/b&gt; &lt;i&gt;with no change in simulation speed or fidelity.&lt;/i&gt; &lt;i&gt;The ability to pair these gains with Custom Machine Types&lt;/i&gt; &lt;i&gt;— a capability unique to Google Cloud — is a game-changer. It allows us to&lt;/i&gt; &lt;b&gt;&lt;i&gt;precisely sculpt our infrastructure to our workloads&lt;/i&gt;&lt;/b&gt;&lt;i&gt; and gain a significant TCO advantage versus other cloud offerings.”&lt;/i&gt; - Dan Inbar, Chief Information Officer, Symbotic&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This granular control and TCO advantage extends beyond compute to your storage. Just as Custom Machine Types let you break free from fixed vCPU-to-memory ratios, &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/disks/hyperdisks"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperdisk&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; unbundles storage performance from capacity, letting you independently tune capacity and performance to precisely match your workload’s block storage requirements.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is further enhanced by &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/hyperdisk-storage-pools-is-now-generally-available"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperdisk Storage Pools&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for Hyperdisk Balanced volumes, which let you provision performance and capacity in aggregate, rather than managing each volume individually. The result is simpler management, higher efficiency, an easier path for modernizing SAN workloads — all this while helping you lower your storage TCO by as much as &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/hyperdisk-storage-pools-is-now-generally-available?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;30-50%&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started with N4D today&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Adopting the latest N4D VM series is easy, particularly if you use &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Kubernetes Engine (GKE)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, where our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/adopt-new-vm-series-with-gke-compute-classes-flexible-cuds?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;custom compute classes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; remove the operational hurdles of migrating workloads to new hardware. Just add N4D to your prioritized list of VM types to ensure your workloads have the performance and flexibility they need to scale.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4D is now available in us-central1 (Iowa), us-east1 (South Carolina), us-west1 (Oregon), us-west4 (Las Vegas), europe-west1 (Belgium), and europe-west4 (Netherlands). &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Check for the latest availability on our&lt;/span&gt; &lt;a href="https://cloud.google.com/compute/docs/regions-zones#available"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Regions and Zones page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and deploy your first instance today in the &lt;/span&gt;&lt;a href="https://console.cloud.google.com/"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud console&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or with GKE. Learn more about N4D details here in &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/general-purpose-machines#n4d_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;1. 9xx5C-044 - Testing by AMD Performance Labs as of 10/21/2025. N4D-standard-16 score comparison to N2D-standard-16 running FFmpeg v6.1.1 benchmark (average of 2x encode and 2x transcode) on Ubuntu24.04LTS OS with 6.8.0-1021-gcp kernel, SMT On.&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Performance uplift (normalized to N2D):&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Ffmpeg_raw_vp9&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;                  1.76&lt;br/&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Ffmpeg_h264_vp9&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;               1.76&lt;br/&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Ffmpeg_raw_h264&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;               1.71&lt;br/&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Ffmpeg_vp9_h264&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;               1.76&lt;br/&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;FFmpeg average&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;                  1.75&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud performance results presented are based on the test date in the configuration. Results may vary due to changes to the underlying configuration, and other conditions such as the placement of the VM and its resources, optimizations by the cloud service provider, accessed cloud regions, co-tenants, and the types of other workloads exercised at the same time on the system&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 10 Nov 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/n4d-vms-based-on-amd-turin-now-ga/</guid><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>N4D now GA: Gain up to 3.5x price-performance for scale-out workloads</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/n4d-vms-based-on-amd-turin-now-ga/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Sarthak Sharma</name><title>Product Manager</title><department></department><company></company></author></item><item><title>Announcing Ironwood TPUs General Availability and new Axion VMs to power the age of inference</title><link>https://cloud.google.com/blog/products/compute/ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today’s frontier models, including Google’s Gemini, Veo, Imagen, and Anthropic’s Claude train &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;and serve o&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;n Tensor Processing Units (TPUs). For many organizations, the focus is shifting from training these models to powering useful, responsive interactions with them. Constantly shifting model architectures, the rise of agentic workflows, plus near-exponential growth in demand for compute, define this new &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;age of inference&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. In particular, agentic workflows that require orchestration and tight coordination between general-purpose compute and ML acceleration are creating new opportunities for custom silicon and vertically co-optimized system architectures. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We have been preparing for this transition for some time and today, we are announcing the availability of three new products built on custom silicon that deliver exceptional performance, lower costs, and enable new capabilities for inference and agentic workloads:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Ironwood&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;our seventh generation TPU, will be generally available in the coming weeks&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. Ironwood is purpose-built for the most demanding workloads: from large-scale model training and complex reinforcement learning (RL) to high-volume, low-latency AI inference and model serving. It offers a 10X peak performance improvement over &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;TPU v5p and &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;more than 4X better performance per chip for both training and inference workloads compared to TPU v6e (Trillium), making Ironwood our most powerful and energy-efficient custom silicon to date.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;New Arm&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;®&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;-based Axion instances. N4A&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, our most cost-effective N series virtual machine to date, is &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;now in preview&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. N4A offers up to 2x better price-performance than comparable current-generation x86-based VMs. We are also pleased to announce &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;C4A metal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;our first Arm-based bare metal instance&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;will be &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;coming soon in preview.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=aQxcomQDHcw"
      data-glue-modal-trigger="uni-modal-aQxcomQDHcw-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        

        &lt;div class="article-video__aspect-image"
          style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/Ironwood.max-1000x1000.jpg);"&gt;
          &lt;span class="h-u-visually-hidden"&gt;youtube video&lt;/span&gt;
        &lt;/div&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal-aQxcomQDHcw-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="aQxcomQDHcw"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=aQxcomQDHcw"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ironwood and these new Axion instances are just the latest in a long history of custom silicon innovation at Google, including TPUs, Video Coding Units (VCU) for YouTube, and five generations of Tensor chips for mobile. In each case, we build these processors to enable breakthroughs in performance that are only possible through deep, system-level co-design, with model research, software, and hardware development under one roof. This is how we built the first TPU ten years ago, which in turn unlocked the invention of the Transformer eight years ago — the very architecture that powers most of modern AI. It has also influenced more recent advancements like our &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; architecture, and advanced &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/enabling-1-mw-it-racks-and-liquid-cooling-at-ocp-emea-summit?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;liquid cooling&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; that we’ve deployed at GigaWatt scale with fleet-wide uptime of ~99.999% since 2020.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_E4cJ2SM.max-1000x1000.png"
        
          alt="1"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="wdacc"&gt;Pictured: An Ironwood board showing three Ironwood TPUs connected to liquid cooling.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_BWW5xwl.max-1000x1000.jpg"
        
          alt="2"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="wdacc"&gt;Pictured: Third-generation Cooling Distribution Units, providing liquid cooling to an Ironwood superpod.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Ironwood: The fastest path from model training to planet-scale inference&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The early response to Ironwood is &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;overwhelmingly enthusiastic. Anthropic is compelled by the impressive price-performance gains that accelerate their path from training massive Claude models to serving them to millions of users. In fact, Anthropic &lt;/span&gt;&lt;a href="https://www.googlecloudpresscorner.com/2025-10-23-Anthropic-to-Expand-Use-of-Google-Cloud-TPUs-and-Services" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;plans to access up to 1 million TPUs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Anthropic.max-1000x1000.jpg"
        
          alt="Anthropic"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="fembv"&gt;&lt;i&gt;"Our customers, from Fortune 500 companies to startups, depend on Claude for their most critical work. As demand continues to grow exponentially, we're increasing our compute resources as we push the boundaries of AI research and product development. Ironwood’s improvements in both inference performance and training scalability will help us scale efficiently while maintaining the speed and reliability our customers expect."&lt;/i&gt; – &lt;b&gt;James Bradbury, Head of Compute, Anthropic&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ironwood is being used by organizations of all sizes and across industries:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/lightricks.max-1000x1000.jpg"
        
          alt="lightricks"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="fembv"&gt;&lt;i&gt;“Our mission at Lightricks is to define the cutting edge of open creativity, and that demands AI infrastructure that eliminates friction and cost at scale. We relied on Google Cloud TPUs and its massive ICI domain to achieve our breakthrough training efficiency for LTX-2, our leading open-source multimodal generative model. Now, as we enter the age of inference, our early testing makes us highly enthusiastic about Ironwood. We believe that Ironwood will enable us to create more nuanced, precise, and higher-fidelity image and video generation for our millions of global customers."&lt;/i&gt; - &lt;b&gt;Yoav HaCohen, PhD, Director of Foundational Generative AI Research, Lightricks&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/essential_ai.max-1000x1000.jpg"
        
          alt="essential ai"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="fembv"&gt;&lt;i&gt;“At Essential AI, our mission is to build powerful, open frontier models. We need massive, efficient scale, and Google Cloud's Ironwood TPUs deliver exactly that. The platform was incredibly easy to onboard, allowing our engineers to immediately leverage its power and focus on accelerating AI breakthroughs."&lt;/i&gt; - &lt;b&gt;Philip Monk, Infrastructure Lead, Essential AI&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;System-level design maximizes inference performance, reliability, and cost &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;TPUs are a key component of &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/ai-hypercomputer"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, our integrated supercomputing system that brings together compute, networking, storage, and software to improve system-level performance and efficiency. At the macro level, according to a recent IDC report, AI Hypercomputer customers achieved on average 353% three-year ROI, 28% lower IT costs, and 55% more efficient IT teams.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ironwood TPUs will help customers push the limits of scale and efficiency even further. When you deploy TPUs, the system connects each individual chip to each other, creating a pod — allowing the interconnected TPUs to work as a single unit. With Ironwood, we can scale up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;9,216 chips in a superpod&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; linked with breakthrough Inter-Chip Interconnect (ICI) networking at 9.6 Tb/s. This massive connectivity allows thousands of chips to quickly communicate with each other and access a staggering 1.77 Petabytes of shared High Bandwidth Memory (HBM), overcoming data bottlenecks for even the most demanding models.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_WZEo7he.max-1000x1000.png"
        
          alt="TPU"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="wdacc"&gt;Pictured: Part of an Ironwood superpod, directly connecting 9,216 Ironwood TPUs in a single domain.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At that scale, services demand uninterrupted availability. That’s why our Optical Circuit Switching (OCS) technology acts as a dynamic, reconfigurable fabric, instantly routing around interruptions to restore the workload while your services keep running. And when you need more power, Ironwood scales across pods into clusters of hundreds of thousands of TPUs.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_fFI906U.max-1000x1000.png"
        
          alt="4"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="wdacc"&gt;Pictured: Jupiter data center network enables the connection of multiple Ironwood superpods into clusters of hundreds of thousands of TPUs.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The AI Hypercomputer advantage: Hardware and software co-designed for faster, more efficient outcomes&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;On top of this hardware is a co-designed software layer, where our goal is to maximize Ironwood’s massive processing power and memory, and make it easy to use throughout the AI lifecycle. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;To improve fleet efficiency and operations, we’re excited to announce that TPU customers can now benefit from &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Cluster Director capabilities&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in Google Kubernetes Engine. This includes advanced maintenance and topology awareness for intelligent scheduling and highly resilient clusters.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;For pre-training and post-training, we’re also sharing&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; new enhancements to &lt;/strong&gt;&lt;a href="https://maxtext.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;a high-performance, open source LLM framework, to make it easier to implement training and reinforcement learning optimization techniques, such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;For inference, we recently announced enhanced support for TPUs in &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/in-q3-2025-ai-hypercomputer-adds-vllm-tpu-and-more"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;vLLM&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, allowing developers to switch between GPUs and TPUs, or run both, with only a few minor configuration changes, and &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GKE Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which intelligently load balances across TPU servers to reduce time-to-first-token (TTFT) latency by up to 96% and serving costs by up to 30%.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
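&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the vLLM point above concrete, here is a minimal, hypothetical sketch of offline batch inference with vLLM's Python API. The same script runs whether the backend is GPU or TPU, since vLLM detects the platform at startup; the model name and sampling values are illustrative assumptions.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Sketch: portable offline inference with vLLM. The backend (GPU or TPU)
# is detected automatically, so the code is unchanged across platforms.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain optical circuit switching in one paragraph."], params
)
for output in outputs:
    print(output.outputs[0].text)
&lt;/pre&gt;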
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our software layer is what enables AI Hypercomputer’s high performance and reliability for training, tuning, and serving demanding AI workloads at scale. Thanks to deep integrations across the stack — from data-center-wide hardware optimizations to open software and managed services— Ironwood TPUs are our most powerful and energy-efficient TPUs to date. Learn more about our approach to hardware and software co-design &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.  &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Axion: Redefining general-purpose compute &lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building and serving modern applications requires both highly specialized accelerators and powerful, efficient general-purpose compute. This was our vision for Axion, our custom Arm Neoverse®-based CPUs, which we designed to deliver compelling performance, cost and energy efficiency for everyday workloads. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are expanding our Axion portfolio with:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;N4A&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;preview&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;), our second general-purpose Axion VM, which is ideal for microservices, containerized applications, open-source databases, batch, data analytics, development environments, experimentation, data preparation and web serving jobs that make AI applications possible. Learn more about N4A &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;C4A metal (in preview soon), &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;our first Arm-based bare-metal instance, which provides dedicated physical servers for specialized workloads such Android development, automotive in-car systems, software with strict licensing requirements, scale test farms, or running complex simulations. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Learn more about C4A metal &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_nH8lIVk.max-1000x1000.png"
        
          alt="5"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With today's announcements, the Axion portfolio now includes three powerful options, N4A, C4A and C4A metal. Together, the C and N series allow you to lower the total cost of running your business without compromising on performance or workload-specific requirements.&lt;br/&gt;&lt;br/&gt;&lt;/span&gt;&lt;/p&gt;
&lt;div align="center"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table style="width: 100%;"&gt;&lt;colgroup&gt;&lt;col style="width: 23.7818%;"/&gt;&lt;col style="width: 21.0548%;"/&gt;&lt;col style="width: 55.1634%;"/&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Axion-based Instance&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Optimized for&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Key Features&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A (preview)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Price-performance and flexibility&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Up to 64 vCPUs, 512GB of DDR5 Memory, and 50 Gbps networking, with support for Custom Machine Types, Hyperdisk Balanced and Throughput storage.&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4A Metal (in preview soon) &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Specialized workloads, such as Hypervisors and native Arm development&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Up to 96 vCPUs, 768GB of DDR5 Memory, Hyperdisk storage and up to 100Gbps of networking &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4A&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Consistently high performance&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Up to 72 vCPUs, 576GB of DDR5 Memory, 100Gbps of Tier 1 networking, Titanium SSD with up to 6TB of local capacity, advanced maintenance controls and support for Hyperdisk Balanced, Throughput, and Extreme.&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Axion’s inherent efficiency also makes it a valuable option for modern AI workflows. While specialized accelerators like Ironwood handle the complex task of model serving, Axion excels at the operational backbone: supporting high-volume data preparation, ingestion, and running application servers that host your intelligent applications. Axion is already translating into customer impact:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_ZB4gdHF.max-1000x1000.jpg"
        
          alt="4"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="fembv"&gt;&lt;i&gt;"At Vimeo, we have long relied on Custom Machine Types to efficiently manage our massive video transcoding platform. Our initial tests on the new Axion-based N4A instances have been very compelling, unlocking a new level of efficiency. We've observed a 30% improvement in performance for our core transcoding workload compared to comparable x86 VMs. This points to a clear path for improving our unit economics and scaling our services more profitably, without changing our operational model."&lt;/i&gt; - &lt;b&gt;Joe Peled, Sr. Director of Hosting &amp;amp; Delivery Ops, Vimeo&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_3I8oyl8.max-1000x1000.jpg"
        
          alt="2"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="fembv"&gt;&lt;i&gt;"At ZoomInfo, we operate a massive data intelligence platform where efficiency is paramount. Our core data processing pipelines, which are critical for delivering timely insights to our customers, run extensively on Dataflow and Java services in GKE. In our preview of the new N4A instances, we measured a 60% improvement in price-performance for these key workloads compared to their x86-based counterparts. This allows us to scale our platform more efficiently and deliver more value to our customers, faster." -&lt;/i&gt; &lt;b&gt;Sergei Koren, Chief Infrastructure Architect, ZoomInfo&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_m4GINGe.max-1000x1000.jpg"
        
          alt="5"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="fembv"&gt;&lt;i&gt;"Migrating to Google Cloud's Axion portfolio gave us a critical competitive advantage. We slashed our compute consumption by 20% while maintaining low and stable latency with C4A instances, such as our Supply-Side Platform (SSP) backend service. Additionally, C4A enabled us to leverage Hyperdisk with precisely the IOPS we need for our stateful workloads, regardless of instance size. This flexibility gives us the best of both worlds - allowing us to win more ad auctions for our clients while significantly improving our margins. We're now testing the N4A family by running some of our key workloads that require the most flexibility, such as our API relay service. We are happy to share that several applications running in production are consuming 15% less CPU compared to our previous infrastructure, reducing our costs further, while ensuring that the right instance backs the workload characteristics required.”&lt;/i&gt; - &lt;b&gt;Or Ben Dahan, Cloud &amp;amp; Software Architect, Rise&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;A powerful combination for AI and everyday computing&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To thrive in an era with constantly shifting model architectures, software, and techniques, you need a combination of &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;purpose-built AI accelerators&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for model training and serving, alongside &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;efficient, general-purpose CPUs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for the everyday workloads, including the workloads that support those AI applications. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ultimately, whether you use Ironwood and Axion together or mix and match them with the other &lt;/span&gt;&lt;a href="https://cloud.google.com/products/compute"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;compute options&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; available on AI Hypercomputer, this system-level approach gives you the ultimate flexibility and capability for the most demanding workloads. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Sign up to test &lt;/strong&gt;&lt;a href="https://cloud.google.com/resources/ironwood-tpu-interest?utm_source=cgc-blog&amp;amp;utm_medium=blog&amp;amp;utm_campaign=FY25-Q2-global-ENT33820-website-cs-ironwood-tpu-interest&amp;amp;utm_content=ironwood_announcement_blog&amp;amp;utm_term=ironwood"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Ironwood&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;, &lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;Axion &lt;/strong&gt;&lt;a href="https://forms.gle/HYY5FWRKewYuDMB27" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;N4A&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;, or &lt;/strong&gt;&lt;a href="https://forms.gle/tzYAWwMBBhkkR4yHA" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;C4A metal&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; today.&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 06 Nov 2025 13:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_WZEo7he.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Announcing Ironwood TPUs General Availability and new Axion VMs to power the age of inference</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/3_WZEo7he.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amin Vahdat</name><title>SVP and Chief Technologist, AI and Infrastructure</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Lohmeyer</name><title>VP and GM, AI and Computing Infrastructure</title><department></department><company></company></author></item><item><title>Announcing Axion C4A metal: Arm-based Axion instances for specialized use cases</title><link>https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are thrilled to announce C4A metal, our first bare metal instance running on Google Axion processors, available in preview soon. C4A metal is designed for specialized workloads that require direct hardware access and Arm®-native compatibility. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Organizations running environments such as Android development, automotive simulation, CI/CD pipelines, security workloads, and custom hypervisors can now bring those workloads to Google Cloud without the performance overhead and complexity of nested virtualization.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4A metal instances, like other Axion instances, are built on the standard Arm architecture, so your applications and operating systems compiled for Arm remain portable across your cloud, on-premises, and edge environments, protecting your development investment. C4A metal offers 96 vCPUs, 768 GB of DDR5 memory, and up to 100 Gbps of networking bandwidth, with full support for Google Cloud Hyperdisk, including the Hyperdisk Balanced, Extreme, Throughput, and ML block storage options.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud provides workload-optimized infrastructure to ensure the right resources are available for every task. C4A metal, like the &lt;/span&gt;&lt;a href="https://cloud.google.com/products/axion?e=48754805&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Axion virtual machine family&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, is powered by &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a key component for multi-tier offloads and security that is foundational to our infrastructure. Titanium's custom-designed silicon offloads networking and storage processing to free up the CPU, and its dedicated SmartNIC manages all I/O, ensuring that Axion cores are reserved exclusively for your application's performance. Titanium is part of Google Cloud’s vertically integrated software stack — from the custom silicon in our servers to our planet-scale network traversing &lt;/span&gt;&lt;a href="https://cloud.google.com/about/locations"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;7.75 million kilometers of terrestrial and subsea fiber&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; across 42 regions — that is engineered to maximize efficiency and provide the ultra-low latency and high bandwidth to customers at global scale.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Architectural parity for automotive workloads&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Automotive customers can benefit from the Arm architecture’s performance, efficiency, and flexible design for in-vehicle systems such as infotainment and Advanced Driver Assistance Systems (ADAS). Axion C4A metal instances enable architectural parity between test environments and production silicon, allowing automotive technology providers to validate their software on the same Arm Neoverse instruction set architecture (ISA) used in production electronic control units (ECUs). This significantly reduces the risk of late-stage integration failures. For performance-sensitive tasks, these customers can execute demanding virtual hardware-in-the-loop (vHIL) simulations with the consistent, low-latency performance of physical hardware, ensuring test results are reliable and accurate. Finally, C4A metal lets providers move beyond the constraints of a physical lab, by dynamically scaling entire test farms and transforming them from fixed capital expenses into flexible operational ones.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_nDU2gjP.max-1000x1000.jpg"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="clg9v"&gt;&lt;i&gt;“In the era of AI-defined vehicles, the accelerating pace and complexity of technology are pushing us to rethink traditional linear approaches to software development. Google Cloud’s introduction of Axion C4A metal is a major step forward in this journey. By offering full architectural parity on Arm between test environments and physical silicon, customers can benefit from accelerated development cycles, enabling continuous integration and compliance for a variety of specialized use cases."&lt;/i&gt; - &lt;b&gt;Dipti Vachani, Senior Vice President and General Manager, Automotive Business, Arm&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/qnx.max-1000x1000.jpg"
        
          alt="qnx"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="clg9v"&gt;&lt;i&gt;“Our partners and customers rely on QNX to deliver the safety, security, reliability, and real-time performance required for their most mission-critical systems — from advanced driver assistance to digital cockpits. As the Software-Defined Vehicle era continues to gain momentum, decoupling software development from physical hardware is no longer optional — it’s essential for innovation at scale. The launch of Google Cloud’s C4A-metal instances on Axion introduces a powerful ARM-based bare metal platform that we are eager to test and support as this will enable transformative cloud infrastructure benefits for our automotive ecosystem.” -&lt;/i&gt; &lt;b&gt;Grant Courville, Senior Vice President, Products and Strategy, QNX&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/qualcomm.max-1000x1000.jpg"
        
          alt="qualcomm"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="clg9v"&gt;&lt;i&gt;“The future of automotive mobility demands unprecedented speed and precision in practice and development. For automakers and suppliers leveraging the Snapdragon Digital Chassis platform, aligning their cloud development and testing environments to ensure parity with the Snapdragon SoCs in the vehicle is absolutely crucial for efficiency and quality. We are excited about Google Cloud’s commitment to this segment — offering C4A-metal instances with Axion is a massive leap forward, giving the automotive ecosystem a true 1:1 physical to virtual environment in the cloud. This breakthrough significantly reduces integration challenges, slashes validation time, and allows our partners to unleash AI-driven features to market faster at scale.”&lt;/i&gt; - &lt;b&gt;Laxmi Rayapudi, VP, Product Management, Qualcomm Technologies, Inc.&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Align test and production for Android development&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Android platform was built for Arm-based processors, the standard for virtually all mobile devices. By running development and testing pipelines on the bare-metal instances of Axion processors with C4A metal, Android developers can benefit from native performance, eliminating emulation overhead such as slow instruction-by-instruction translation layers. In addition, they can significantly reduce latency for Android build toolchains and automated test systems, leading to faster feedback cycles. C4A metal also solves the performance challenges of nested virtualization, making it a great platform for scalable Cuttlefish (Cloud Android) environments. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once available, developers can deploy scalable Cuttlefish environment farms on top of C4A metal instances with an &lt;/span&gt;&lt;a href="https://github.com/googlecloudplatform/horizon-sdv" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;upcoming release of Horizon&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or by directly leveraging &lt;/span&gt;&lt;a href="https://github.com/google/cloud-android-orchestration/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Android Orchestration&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. C4A metal allows these virtual devices to run directly on the physical hardware, providing the performance needed to build and manage large, high-fidelity test farms for true continuous testing.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Bare metal access without compromise&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a cloud offering, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;C4A metal enables a lower total cost of ownership by replacing the entire lifecycle of physical hardware procurement and management with a predictable operational expense. This eliminates the direct capital expenditures of purchasing servers, along with the associated operational costs of hardware maintenance contracts, power, cooling, and physical data center space. You can programmatically provision and de-provision instances to match your exact testing demands, ensuring you are not paying for an over-provisioned fleet of servers sitting idle waiting for peak development cycles.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Operating as standard compute resources within your Virtual Private Cloud (VPC), C4A metal instances inherit and leverage the same security policies, audit logging, and network controls as virtual machines. Instances are designed to appear as physical servers to your toolchain and support common monitoring and security agents, allowing for straightforward integration with your existing Google Cloud environments. This integration extends to storage, where network-attached Hyperdisk allows you to manage persistent disks using the same snapshot and resizing tools your teams already use for your virtual machine fleet.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/chainguard.max-1000x1000.jpg"
        
          alt="chainguard"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="clg9v"&gt;&lt;i&gt;“For our build system, true isolation is paramount. Running on Google Cloud’s new C4A metal instance on Axion enables us to isolate our package builds with a strong hypervisor security boundary without compromising on build performance."&lt;/i&gt; - &lt;b&gt;Matthew Moore, Founder and CTO, Chainguard, Inc&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Better together: the Axion C and N series&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The addition of C4A metal to the Arm-based Axion portfolio allows customers to lower TCO by matching the right infrastructure to every workload. While Axion &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/general-purpose-machines#c4a_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4A virtual machines&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; optimize for consistently high performance and &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;N4A virtual machines&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (now in preview) optimize for price-performance and flexibility, C4A metal addresses the critical need for direct hardware access by specialized applications that require a non-virtualized Arm environment.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For example, an Android development company could create a highly efficient CI/CD pipeline by using C4A virtual machines for the build farm. For large-scale testing, they could use C4A metal to run Cuttlefish virtual devices directly on the physical hardware, eliminating nested virtualization overhead. To enable even higher fidelity, they can run Cuttlefish hybrid devices on C4A metal, reusing the system images from their physical hardware. Concurrently, supporting infrastructure such as CI/CD orchestrators and artifact repositories could run on cost-effective N4A instances, using Custom Machine Types to right-size resources and minimize operational expenses.&lt;/span&gt;&lt;/p&gt;
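&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a rough illustration of what provisioning such a pipeline programmatically could look like, here is a hedged sketch using the google-cloud-compute Python client. The project, zone, machine type, and image names are placeholder assumptions (C4A metal shape names have not been published yet), so verify them against current documentation.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from google.cloud import compute_v1

# Hedged sketch: provision one Arm-based build node for the CI/CD
# farm described above. All names below are illustrative assumptions.
PROJECT, ZONE = "my-project", "us-central1-a"

instance = compute_v1.Instance(
    name="android-build-node-1",
    machine_type=f"zones/{ZONE}/machineTypes/c4a-standard-16",
    disks=[
        compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                # An Arm64 image, to match the Axion CPU.
                source_image="projects/debian-cloud/global/images/family/debian-12-arm64",
            ),
        )
    ],
    network_interfaces=[
        compute_v1.NetworkInterface(network="global/networks/default")
    ],
)

client = compute_v1.InstancesClient()
operation = client.insert(project=PROJECT, zone=ZONE,
                          instance_resource=instance)
operation.result()  # block until the instance is created
&lt;/code&gt;&lt;/pre&gt;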
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Coming soon to preview&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4A metal is scheduled for preview soon. Please fill &lt;/span&gt;&lt;a href="https://docs.google.com/forms/d/1iPfHMoGBHVDs_5zXohLCXjJWyEVASEjA2BZLqd3mtsI/edit#responses" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;this form&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to sign up for early access and additional updates. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 06 Nov 2025 13:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm/</guid><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Announcing Axion C4A metal: Arm-based Axion instances for specialized use cases</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yarden Halperin</name><title>Product Manager, Google Cloud</title><department></department><company></company></author></item><item><title>From silicon to softmax: Inside the Ironwood AI stack</title><link>https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As machine learning models continue to scale, a specialized, co-designed hardware and software stack is no longer optional, it’s critical. &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ironwood&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, our latest generation Tensor Processing Unit (TPU), is the cutting-edge hardware behind advanced models like Gemini and Nano Banana, from massive-scale training to high-throughput, low-latency inference. This blog details the core components of Google's AI software stack that are woven into Ironwood, demonstrating how this deep co-design unlocks performance, efficiency, and scale. We cover the JAX and PyTorch ecosystems, the XLA compiler, and the high-level frameworks that make this power accessible.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;1. The co-designed foundation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Foundation models today have trillions of parameters that require computation at ultra-large scale. We designed the Ironwood stack from the silicon up to meet this challenge.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The core philosophy behind the Ironwood stack is system-level co-design, treating the entire TPU pod not as a collection of discrete accelerators, but as a single, cohesive supercomputer. This architecture is built on a custom interconnect that enables massive-scale Remote Direct Memory Access (RDMA), allowing thousands of chips to exchange data directly at high bandwidth and low latency, bypassing the host CPU. A full Ironwood superpod offers a total of 1.77 PB of directly accessible HBM capacity; each chip carries eight stacks of HBM3E, providing 192 GiB of capacity and a peak HBM bandwidth of 7.4 TB/s.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Unlike general-purpose parallel processors, TPUs are Application-Specific Integrated Circuits (ASICs) built for one purpose: accelerating large-scale AI workloads. The deep integration of compute, memory, and networking is the foundation of their performance. At a high level, the TPU consists of two parts:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hardware core&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The TPU core is centered around a dense&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; Matrix Multiply Unit (MXU)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for matrix operations, complemented by a powerful &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Vector Processing Unit (VPU)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for element-wise operations (activations, normalizations) and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;SparseCores&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for scalable embedding lookups. This specialized hardware design is what delivers Ironwood's 42.5 Exaflops of FP8 compute.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Software target&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This hardware design is explicitly targeted by the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Accelerated Linear Algebra (XLA) compiler&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, using a software co-design philosophy that &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;combines the broad benefits of whole-program optimization with the precision of hand-crafted custom kernels. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;XLA's compiler-centric approach provides a powerful performance baseline by fusing operations into optimized kernels that saturate the MXU and VPU. This approach delivers good "out of the box" performance with broad framework and model support. This general-purpose optimization is then complemented by custom kernels &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;(detailed below in the Pallas section)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to achieve peak performance on specific model-hardware combinations. This dual-pronged strategy is a fundamental tenet of the co-design.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The figure below shows the layout of the Ironwood chip:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Z5xATZ3.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This specialized design extends to the connectivity between TPU chips for massive scale-up and scale-out for a total of 88473.6 Tbps (11059.2TB/s) for a complete Ironwood superpod. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The building block: Cubes and ICI.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Each physical Ironwood host has four TPU chips. A single rack of these hosts has 64 Ironwood chips and forms a “cube”. Within this cube, every chip is connected via multiple high-speed &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Inter-Chip Interconnect (ICI)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; links that form a direct 3D Torus topology. This creates an extremely dense, all-to-all network fabric, enabling massive bandwidth and low latency for distributed operations within the cube.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Scaling with OCS: Pods and Superpods&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To scale beyond a single cube, multiple cubes are connected using an &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Optical Circuit Switch (OCS) &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;network.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;This is&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;a dynamic, reconfigurable optical network that connects entire cubes, allowing the system to scale from a small "pod" (e.g., a 256-chip Ironwood pod with four cubes) to a massive "superpod" (e.g., a 9,216-chip system with 144 cubes). This OCS-based topology is key to fault tolerance. If a cube or link fails, the OCS fabric manager instructs the OCS to optically bypass the unhealthy unit and establish new, complete optical circuits connecting only the healthy cubes, swapping in a designated spare. This dynamic reconfigurability allows for both resilient operation and the provisioning of efficient "slices" of any size. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;For the largest-scale systems, into the hundreds of thousands of chips, multiple superpods can then be connected via a standard Data-Center Network (DCN).&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Chips can be configured in different “slices” with different OCS topologies as shown below.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_VdZkL7j.max-1000x1000.jpg"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Each chip is connected to 6 other chips in the 3D torus and provides 3 distinct axes for parallelism. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_KvozMKZ.max-1000x1000.png"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ironwood delivers this performance while focusing on power efficiency, allowing AI workloads to run more cost-effectively. Ironwood perf/watt is 2x relative to Trillium, our previous-generation TPU. Our advanced liquid cooling solutions and optimized chip design can reliably sustain up to twice the performance of standard air cooling even under continuous, heavy AI workloads. Ironwood is nearly 30x more power efficient than our first Cloud TPU from 2018 and is our most power-efficient chip to date. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_UxXCPJg.max-1000x1000.jpg"
        
          alt="4"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It’s the software stack's job to translate high-level code into optimized instructions that leverage the full power of the hardware. The stack supports two primary frameworks: the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;JAX&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; ecosystem, which offers maximum performance and flexibility, as well as &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;PyTorch&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; on TPUs, which provides a native experience for the PyTorch community.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Optimizing the entire AI lifecycle&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We use the principle of a co-designed Ironwood hardware and software stack to deliver maximum performance and efficiency across every phase of model development, with specific hardware and software capabilities tuned for each stage.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Pre-training&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This phase demands sustained, massive-scale computation. A &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;full 9,216-chip Ironwood superpod&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; leverages the OCS and ICI fabric to operate as a single, massive parallel processor, achieving maximum sustained FLOPS utilization through different data formats. Running a job of this magnitude also requires resilience, which is managed by high-level software frameworks like &lt;/span&gt;&lt;a href="https://maxtext.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, detailed in Section 3.3, that handle fault tolerance and checkpointing transparently.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Post-training (Fine-tuning and alignment)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This stage includes diverse, FLOPS-intensive tasks like supervised fine-tuning (SFT) and Reinforcement Learning (RL), all requiring rapid iteration. RL, in particular, introduces complex, heterogeneous compute patterns. This stage often requires two distinct types of jobs to run concurrently: &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;high-throughput, inference-like sampling&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to generate new data (often called 'actor rollouts'), and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;compute-intensive, training-like 'learner' steps&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; that perform the gradient-based updates. Ironwood’s high-throughput, low-latency network and flexible OCS-based slicing are ideal for this type of rapid experimentation, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;efficiently managing the different hardware demands of both sampling and gradient-based updates&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. In Section 3.3, we discuss how we provide optimized software on Ironwood — including reference implementations and libraries — to make these complex fine-tuning and alignment workflows easier to manage and execute efficiently.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Inference (serving)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: In production, models must deliver low-latency predictions with high throughput and cost-efficiency. Ironwood is specifically engineered for this, with its large on-chip memory and compute power optimized for both the large-batch "prefill" phase and the memory-bandwidth-intensive "decode" phase of large generative models. To make this power easily accessible, we’ve optimized state-of-the-art serving engines. At launch, we’ve enabled &lt;/span&gt;&lt;a href="https://blog.vllm.ai/2025/10/16/vllm-tpu.html" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;vLLM&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, detailed in Section 3.3, providing the community with a top-tier, open-source solution that maximizes inference throughput on Ironwood.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;3. The software ecosystem for TPUs&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The TPU stack, and Ironwood’s stack in particular, is designed to be modular, allowing developers to operate at the level of abstraction they need. In this section, we focus on the compiler/runtime, framework, and AI stack libraries.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3.1 The JAX path: Performance and composability&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;JAX is a high-performance numerical computing system co-designed with the TPU architecture. It provides a familiar NumPy-like API backed by powerful function transformations:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;jit&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;(Just-in-Time compilation)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Uses the XLA compiler to fuse operations into a single, optimized kernel for efficient TPU execution.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;grad&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;(automatic differentiation)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Automatically computes gradients of Python functions, the fundamental mechanism for model training.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;shard_map&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;(parallelism)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The primitive for expressing distributed computations, allowing explicit control over how functions and data are sharded across a mesh of TPU devices, directly mapping to the ICI/OCS topology.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This compositional approach allows developers to write clean, Pythonic code that JAX and XLA transform into highly parallelized programs optimized for TPU hardware. JAX is what Google DeepMind and other Google teams use to build, train, and serve a wide variety of models. &lt;/span&gt;&lt;/p&gt;
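&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a minimal sketch of how these transformations compose (the toy loss function and the 'batch' axis name are ours, not from a production codebase):&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

# A toy mean-squared-error loss for a linear model.
def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

# grad derives the gradient function; jit compiles it via XLA into
# fused kernels.
grad_fn = jax.jit(jax.grad(loss))
g = grad_fn(jnp.zeros(3), jnp.ones((5, 3)), jnp.ones(5))

# shard_map runs a function per shard across a device mesh, mapping
# the logical 'batch' axis onto the physical device topology.
mesh = Mesh(np.array(jax.devices()), axis_names=('batch',))

def local_sum(x):
    # Each device reduces its own shard; psum combines the shards.
    return jax.lax.psum(jnp.sum(x), axis_name='batch')

sharded_sum = shard_map(local_sum, mesh=mesh,
                        in_specs=P('batch'), out_specs=P())
total = sharded_sum(jnp.arange(4.0 * jax.device_count()))
&lt;/code&gt;&lt;/pre&gt;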
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For most developers, these primitives are abstracted by high-level frameworks, like &lt;/span&gt;&lt;a href="https://maxtext.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, built upon a foundation of composable, production-proven libraries:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://optax.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Optax&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: A flexible gradient processing and optimization library (e.g., AdamW)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://orbax.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Orbax&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: A library for asynchronous checkpointing of distributed arrays across large TPU slices&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://qwix.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Qwix&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: A JAX quantization library supporting Quantization Aware Training (QAT) and Post-Training Quantization (PTQ)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://metrax.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Metrax&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: A library for collecting and processing evaluation metrics in a distributed setting&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://github.com/google/tunix" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Tunix&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: A high-level library for orchestrating post-training jobs&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://github.com/AI-Hypercomputer/ml-goodput-measurement" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Goodput&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: A library for measuring and monitoring real-time ML training efficiency, providing a detailed breakdown of badput (e.g., initialization, data loading, checkpointing)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
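&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a hedged sketch of how the first of these composes with JAX, here is a single Optax-based training step; the tiny model, data shapes, and learning rate are placeholders:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import jax
import jax.numpy as jnp
import optax

# Placeholder parameters for a tiny linear model.
params = {'w': jnp.zeros((8, 1)), 'b': jnp.zeros(1)}

def loss(p, x, y):
    return jnp.mean((x @ p['w'] + p['b'] - y) ** 2)

tx = optax.adamw(learning_rate=1e-3)  # the AdamW optimizer noted above
opt_state = tx.init(params)

@jax.jit
def train_step(params, opt_state, x, y):
    grads = jax.grad(loss)(params, x, y)
    updates, opt_state = tx.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state
&lt;/code&gt;&lt;/pre&gt;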
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3.2 The PyTorch path: A native eager experience&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To bring Ironwood's power to the PyTorch community, we are developing a new, native PyTorch experience complete with support for a “native eager mode”, which executes operations immediately as they are called. Our goal is to provide a more natural and developer-friendly way to access Ironwood's scale, minimizing the code changes and level of effort required to adapt models for TPUs. This approach is designed to make the transition from local experimentation to large-scale training more straightforward.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This new framework is built on three core principles to ensure a truly PyTorch-native environment:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Full eager mode:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Enables the rapid prototyping, debugging, and research workflows that developers expect from PyTorch.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Standard distributed APIs:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Leverages the familiar &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;torch.distributed&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; API, built on &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;DTensor&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, for scaling training workloads across TPU slices.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Idiomatic compilation:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Uses &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;torch.compile&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; as the single, unified path to JIT compilation, utilizing XLA as its backend to trace the graph and compile it into efficient TPU machine code.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This ensures the transition from local experimentation to large-scale distributed training is a natural extension of the standard PyTorch workflow. &lt;/span&gt;&lt;/p&gt;
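&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A rough sketch of that workflow, shown with stock PyTorch: the model is a placeholder, and TPU backend selection is deliberately omitted here since that wiring is still evolving.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch

# A placeholder model; in eager mode, each op runs immediately.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.GELU(),
    torch.nn.Linear(256, 10),
)

# torch.compile is the single JIT entry point; per the design above,
# XLA serves as the compilation backend on TPU.
compiled = torch.compile(model)

x = torch.randn(32, 128)
out = compiled(x)  # first call compiles; later calls reuse the cache
&lt;/code&gt;&lt;/pre&gt;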
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3.3 Frameworks: MaxText, PyTorch on TPU, and vLLM&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While JAX and PyTorch provide the computational primitives, scaling to thousands of chips is a supercomputer management problem. High-level frameworks handle the complexities of resilience, fault tolerance, and infrastructure orchestration.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://maxtext.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; (JAX)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: MaxText is an open-source, high-performance LLM pre-training and post-training solution written in pure Python and JAX. MaxText demonstrates optimized training on its library of popular OSS models like DeepSeek, Qwen, gpt-oss, Gemma, and more. Whether users are pre-training large Mixture-of-Experts (MoE) models from scratch, or leveraging the latest Reinforcement Learning (RL) techniques on an OSS model, MaxText provides tutorials and APIs to make things easy. For scalability and resiliency, MaxText leverages &lt;/span&gt;&lt;a href="https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Pathways&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which was originally developed by Google DeepMind and now provides TPU users with differentiated capabilities like elastic training and multi-host inference during RL. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;PyTorch on TPU&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: We recently shared our proposal about our PyTorch native experience on TPUs at &lt;/span&gt;&lt;a href="https://events.linuxfoundation.org/pytorch-conference/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Pytorch Conference 2025&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, including an early preview of training on TPU with minimal code changes. In addition to the framework itself, we are working with the community (&lt;/span&gt;&lt;a href="http://goo.gle/torch-xla-rfc" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;RFC&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;), investing in reproducible recipes, reference implementations, and migration tools to enable PyTorch users to use their favorite frameworks on TPUs. Expect further updates as this work matures.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;vLLM TPU (Serving): &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;vLLM TPU is now powered by &lt;/span&gt;&lt;a href="https://github.com/vllm-project/tpu-inference" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;tpu-inference&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, an expressive and powerful new hardware plugin that unifies JAX and PyTorch under a single lowering path – meaning both frameworks are translated to optimized TPU code through one common, shared backend. This new unified backend is not only faster than the previous generation of vLLM TPU but also offers broader model coverage. This integration provides more flexibility to JAX and PyTorch users, running PyTorch models performantly with no code changes while also extending native JAX support, all while retaining the standard vLLM user experience and interface.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
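&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For reference, here is the standard vLLM offline-inference interface that vLLM TPU retains; the model name and prompt are examples only:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from vllm import LLM, SamplingParams

# Standard vLLM API; per the post, the same interface applies on TPU.
llm = LLM(model="google/gemma-2b")  # example model
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Summarize the benefits of OCS slicing."], params)
for out in outputs:
    print(out.outputs[0].text)
&lt;/code&gt;&lt;/pre&gt;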
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3.4 Extreme performance: Custom kernels via Pallas&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While XLA is powerful, cutting-edge research often requires novel algorithms (e.g., new attention mechanisms, custom padding to handle dynamic ragged tensors, and other optimizations for custom MoE models) that the XLA compiler cannot yet optimize.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The JAX ecosystem solves this with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Pallas&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, a JAX-native kernel programming language embedded directly in Python. Pallas presents a unified, Python-first experience, dramatically reducing cognitive load and accelerating the iteration cycle. Other platforms lack this unified, in-Python approach, forcing developers to fragment their workflow. To optimize these operations, they must drop into a disparate ecosystem of lower-level tools—from DSLs like Triton and CuTe to raw CUDA C++ and PTX. This introduces significant mental overhead by forcing developers to manually manage memory, streams, and kernel launches, pulling them out of their Python-based environment.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is a clear example of co-design. Developers use Pallas to explicitly manage the accelerator's memory hierarchy, defining how "tiles" of data are staged from HBM into the extremely fast on-chip SRAM to be operated on by the MXUs. Pallas has two main parts: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Pallas:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The developer defines the high-level algorithmic structure and memory logistics in Python.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Mosaic:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This compiler backend translates the Pallas definition into optimized TPU machine code. It handles operator fusion, determines optimal tiling strategies, and generates software pipelines to perfectly overlap data transfers (HBM-to-SRAM) with computation (on the MXUs), with the sole objective of saturating the compute units.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Because Pallas kernels are JAX-traceable, they are fully compatible with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;jit&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;vmap&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;grad&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. This stack provides Python-native extensibility for both JAX and PyTorch, as PyTorch users can consume Pallas-optimized kernels without ever leaving the native PyTorch API. Pallas kernels for PyTorch and JAX models, on both TPU and GPU, are available via &lt;/span&gt;&lt;a href="https://github.com/openxla/tokamax" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Tokamax&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, the ML ecosystem’s first multi-framework, multi-hardware kernel library.&lt;/span&gt;&lt;/p&gt;
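&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make this concrete, here is a minimal Pallas kernel in the canonical element-wise-add pattern; real kernels additionally specify grids and block specs to control the HBM-to-SRAM tiling described above:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

# Kernel body: reads and writes Refs staged into fast on-chip memory.
def add_kernel(x_ref, y_ref, o_ref):
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit  # Pallas kernels are traceable, so they compose with jit/vmap/grad.
def add(x, y):
    # On a machine without an accelerator, pass interpret=True to pallas_call.
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

print(add(jnp.arange(8.0), jnp.arange(8.0)))
&lt;/code&gt;&lt;/pre&gt;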
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3.5 Performance engineering: Observability and debugging&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Ironwood stack includes a full suite of tools for performance analysis, bottleneck detection, and debugging, allowing developers to fully optimize their workloads and operate large-scale clusters reliably:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud TPU metrics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Exposes key system-level counters (FLOPS, HBM bandwidth, ICI traffic) to Google Cloud Monitoring that can then be exported to popular monitoring tools like Prometheus. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;TensorBoard&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Visualizes training metrics (loss, accuracy) and hosts the XProf profiler UI.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;XProf (OpenXLA Profiler)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The essential toolset for deep performance analysis. It captures detailed execution data from both the host-CPU and all TPU devices, providing:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li style="list-style-type: none;"&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Trace Viewer&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A microsecond-level timeline of all operations, showing execution, collectives, and "bubbles" (idle time).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Input Pipeline Analyzer&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Diagnoses host-bound vs. compute-bound bottlenecks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Op Profile:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Ranks all XLA/HLO operations by execution time to identify expensive kernels.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Memory Profiler&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Visualizes HBM usage over time to debug peak memory and fragmentation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Debugging Tools:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li style="list-style-type: none;"&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;JAX Debugger (&lt;/strong&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;jax.debug&lt;/strong&gt;&lt;/code&gt;&lt;strong style="vertical-align: baseline;"&gt;):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Enables &lt;/span&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;print&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and breakpoints from within &lt;/span&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;jit&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;-compiled functions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;TPU Monitoring Library:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A real-time diagnostic dashboard (analogous to &lt;/span&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;nvidia-smi&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;) for live debugging of HBM utilization, MXU activity, and running processes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
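&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A brief sketch that exercises two of these tools: printing from inside a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;jit&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;-compiled function with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;jax.debug&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, and capturing a trace for the XProf UI in TensorBoard (the trace directory is an arbitrary example):&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import jax
import jax.numpy as jnp

@jax.jit
def step(x):
    # jax.debug.print works from inside jit-compiled functions.
    jax.debug.print("max abs value = {}", jnp.max(jnp.abs(x)))
    return x @ x

# Capture a trace viewable in TensorBoard's XProf profiler UI.
with jax.profiler.trace("/tmp/tpu-trace"):
    step(jnp.ones((2048, 2048))).block_until_ready()
&lt;/code&gt;&lt;/pre&gt;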
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond performance optimization, developers and infra admins can view fleet efficiency and goodput metrics at various levels (e.g., job, reservation) to ensure maximum utilization of their TPU infrastructure.  &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;4. Conclusion&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Ironwood stack is a complete, system-level co-design, from the silicon to the software. It delivers performance through a dual-pronged strategy: the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;XLA compiler&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; provides broad, "out-of-the-box" optimization, while the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Pallas and Mosaic stack&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; enables hand-tuned kernel performance.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This entire co-designed platform is accessible to all developers, providing first-class, native support for both the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;JAX &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;and the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;PyTorch ecosystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. Whether you are pre-training a massive model, running complex RL alignment, or serving at scale, Ironwood provides a direct, resilient, and high-performance path from idea to supercomputer.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Get started today with &lt;/span&gt;&lt;a href="https://docs.vllm.ai/projects/tpu/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;vLLM on TPU&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for inference and &lt;/span&gt;&lt;a href="https://maxtext.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for pre-training and post-training.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 06 Nov 2025 13:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>From silicon to softmax: Inside the Ironwood AI stack</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Diwakar Gupta</name><title>Distinguished Engineer, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Manoj Krishnan</name><title>Principal Engineer</title><department></department><company></company></author></item><item><title>Expanding our NVIDIA partnership: Now shipping A4X Max, Vertex AI Training, and more</title><link>https://cloud.google.com/blog/products/compute/now-shipping-a4x-max-vertex-ai-training-and-more/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today's AI models are moving from billions to trillions of parameters, and are capable of complex, multi-modal reasoning. This leap in sophistication demands a new class of purpose-built infrastructure and software to handle the immense computational and memory requirements of these next-generation models.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud, we’re committed to empowering developers and organizations to build and deploy what's next in AI. Today, we are excited to deepen our partnership with NVIDIA with a suite of new capabilities that strengthens our platform for the entire AI lifecycle:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;New A4X Max instances&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; powered by NVIDIA’s GB300 NVL72, purpose-built for multimodal AI reasoning tasks&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Google Kubernetes Engine (GKE), now supporting &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dynamic Resource Allocation Kubernetes Network Driver &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;(&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;DRANET)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;boosting bandwidth in distributed AI/ML workloads&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE Inference Gateway&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;now integrating with&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;NVIDIA NeMo Guardrails &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI Model Garden&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to feature NVIDIA Nemotron models&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/new-capabilities-in-vertex-ai-training-for-large-scale-training?e=48754805"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Training&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; recipes on top of the NVIDIA NeMo Framework and NeMo-RL&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let’s take a closer look at these developments.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A4X Max with NVIDIA GB300 GPUs&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A4X Max is now shipping in production.  These new instances, powered by NVIDIA GB300 NVL72, are optimized for the most demanding, multimodal AI reasoning workloads.  A4X Max includes 72 Blackwell Ultra GPUs and 36 NVIDIA Grace CPUs connected with NVIDIA’s fifth-generation high-speed GPU interconnect NVIDIA NVLink to function as a single, unified compute platform with shared memory and high-bandwidth communication. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Together with Google's Titanium ML adapter and Google Cloud's Jupiter network fabric, A4X Max is purpose-built to scale to tens of&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; thousands of GPUs in non-blocking, rail-optimized clusters. Compared to A4X powered by NVIDIA GB200 NVL72, A4X Max delivers 2x the network bandwidth on each system.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A4X Max leverages Google Cloud’s &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/managed-slurm-and-other-cluster-director-enhancements"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Director&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, letting you combine optimized compute, networking, and Google’s storage offerings into a cohesive, performant, and easily managed environment. Cluster Director manages the complete lifecycle of A4X Max clusters — from provisioning and topology-aware placement across the NVL72 domains, to providing powerful observability and resiliency capabilities. It integrates with optimized storage solutions like Managed Lustre, while a managed pre-configured Slurm environment offers fault-tolerant scalable job scheduling for A4X Max. Cluster Director also provides deep observability into job and system performance across the GPUs, NVLink and DC networking fabrics. To maximize throughput, Cluster Director helps ensure high reliability with features like automatic straggler detection and in-job recovery. Cluster Director capabilities like topology aware scheduling, maintenance management, and faulty node reporting are also available transparently through Google Kubernetes Engine (GKE), enabling customers to stay in the GKE environment while running A4X Max.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;What all this this means for your workloads:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Optimized reasoning and inference: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;With its 72-GPU NVLink domain, delivering 1.5x FP4 FLOPs, 1.5x HBM memory capacity, and 2x the network bandwidth compared to A4X, A4X Max is specifically designed for low-latency inference, especially for the largest reasoning models. When integrated with GKE Inference Gateway, you benefit from prefix-aware load balancing, improving Time to First Token latency for prefix-heavy workloads. Disaggregated serving can also be enabled to further optimize performance. This is achieved by leveraging Inference Gateway, llm-d, and vLLM together, resulting in significant throughput improvements.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Enhanced training and serving performance:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; With more than 1.4 exaflops per GB300 NVL72 system, A4X Max offers a 4x increase in LLM training and serving performance compared to A3 VMs powered by NVIDIA H100 GPUs. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Maximum scalability and parallelization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Based on RDMA over Converged Ethernet (RoCE), A4X Max’s networking fabric delivers low-latency high-performance GPU-to-GPU collectives for distributed training and disaggregated serving workloads. By leveraging a new data-center-scaling design, A4X Max clusters can be 2x larger compared to A4X clusters. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
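&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The request sketch below illustrates the prefix-heavy traffic pattern that prefix-aware load balancing is designed for. It is a sketch, not a reference client: the gateway address and model name are hypothetical placeholders, and it assumes the backing vLLM servers expose their usual OpenAI-compatible API through the gateway.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Sketch of prefix-heavy traffic that benefits from prefix-aware routing.
# The gateway address and model name are hypothetical placeholders; assumes
# the backing vLLM servers expose an OpenAI-compatible API via the gateway.
import requests

GATEWAY = "http://inference-gateway.example.internal/v1/chat/completions"
SYSTEM_PROMPT = "You are a support agent for ExampleCo. Policies: ..."

for question in ["How do I reset my password?", "What is the refund policy?"]:
    resp = requests.post(GATEWAY, json={
        "model": "my-served-model",
        "messages": [
            # Every request shares the same long system prompt. A prefix-aware
            # balancer can route it to a replica whose KV cache already holds
            # that prefix, improving Time to First Token.
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    })
    print(resp.json()["choices"][0]["message"]["content"])
&lt;/pre&gt;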
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The preview of A4X Max instances comes on the heels of our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/g4-vms-powered-by-nvidia-rtx-6000-blackwell-gpus-are-ga"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;new G4 VMs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, and support for NVIDIA Omniverse libraries. Taken together, these offerings underscore our commitment to delivering an end-to-end platform for every AI workload, while our deepening partnership with NVIDIA provides you with a powerful, comprehensive ecosystem to build what's next in AI.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Increased RDMA performance with GKE DRANET&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we’re deploying managed DRANET into production, starting with A4X Max. By enabling topology-aware scheduling of GPUs and RDMA network interface cards, DRANET &lt;/span&gt;&lt;a href="https://github.com/google/dranet/blob/main/site/static/docs/kubernetes_network_driver_model_dranet_paper.pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;boosts bus bandwidth&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for all-gather and all-reduce operations in distributed AI/ML workloads. This translates to improved cost efficiency due to better VM utilization. It does this by scheduling GKE Pods on nodes where the RDMA device and the GPU have the best possible connectivity. DRANET also simplifies RDMA management by making RDMA devices first-class, native resources within GKE. Learn more about DRANET for GKE &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/networking/introducing-managed-dranet-in-google-kubernetes-engine"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE and NVIDIA NeMo Guardrails&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As organizations deploy their AI models into production, they must ensure their safety, security, and responsible behavior. Today, we are announcing the integration of NVIDIA NeMo Guardrails with &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GKE Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, an extension to GKE Gateway for serving generative AI applications.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;GKE Inference Gateway optimizes model serving with features like model-aware routing and autoscaling, while NeMo Guardrails add a critical layer of safety, preventing models from engaging in undesirable topics or responding to malicious prompts. Together, they offer a secure, scalable, and manageable inference solution to speed up your AI initiatives.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI Model Garden to feature NVIDIA Nemotron models&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To give developers greater choice and performance,&lt;/span&gt;&lt;a href="https://cloud.google.com/model-garden"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Model Garden&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; will soon have support for NVIDIA’s Nemotron family of open models as NVIDIA NIM microservices. This integration — starting with the upcoming availability of the NVIDIA Llama Nemotron Super v1.5 model — will give developers and organizations access to the NVIDIA’s latest open-weight models directly within Vertex AI. With a Vertex AI managed deployment, you can rapidly develop and deploy custom AI agents powered by Nemotron models, all while maintaining control over performance, cost, and compliance. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Models deployed through Vertex AI offer the following benefits :&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Granular control&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; over your deployments, with the ability to optimize for performance or cost by selecting from a wide range of machine types and Google Cloud regions. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Robust &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;security&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; by deploying models entirely within your own VPC and adhering to your VPC-SC policies. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Incredible &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;ease of use &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;— you can discover, license, and deploy these cutting-edge models in just a few clicks. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI Training with NVIDIA NeMo Integration&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/new-capabilities-in-vertex-ai-training-for-large-scale-training?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Training&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides the essential control and flexibility enterprises need to adapt foundation models to their proprietary data. To accelerate the creation of highly accurate, proprietary models, we are announcing expanded capabilities in Vertex AI Training that simplify and accelerate the path to developing large-scale models.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Customers benefit from a fully managed and resilient Slurm environment that simplifies large-scale training. Automated resiliency features improve cluster uptime. Our comprehensive data-science tooling removes much of the guesswork from complex model development. Finally, curated and optimized pre-training and post-training recipes built on top of standardized frameworks like NVIDIA NeMo and NeMo-RL empower builders to move from a novel idea to a production-ready, domain-specialized model with greater speed and efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Take the next steps&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These updates enhance the capabilities and flexibility of our Google Cloud platform for running AI workloads. You can choose between the flexibility and control of infrastructure as a service (IaaS) with Google Compute Engine or GKE with Cluster Director; or the fully managed, end-to-end experience of Vertex AI, which provides a secure, scalable, and simplified workflow to train, tune, and manage models.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Together, t&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;hese infrastructure innovations represent a significant step forward in our mission to provide a complete platform for AI development and deployment. The combination of Google Cloud’s infrastructure and NVIDIA’s latest technology provides a solid foundation for building the next generation of AI applications.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get started with the A4X Max preview, please contact your Google Cloud sales representative. Vertex AI Training, meanwhile, has everything you need to transform your models into proprietary assets that define your business advantage. To deploy and manage AI models at scale with enterprise-grade security and efficiency, &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;learn how GKE Inference Gateway can help you serve inference workloads&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. We are excited to see what you will build.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 28 Oct 2025 18:30:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/now-shipping-a4x-max-vertex-ai-training-and-more/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Expanding our NVIDIA partnership: Now shipping A4X Max, Vertex AI Training, and more</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/now-shipping-a4x-max-vertex-ai-training-and-more/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Lohmeyer</name><title>VP and GM, AI and Computing Infrastructure</title><department></department><company></company></author></item></channel></rss>