<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Compute</title><link>https://cloud.google.com/blog/products/compute/</link><description>Compute</description><atom:link href="https://cloudblog.withgoogle.com/blog/products/compute/rss/" rel="self"></atom:link><language>en</language><lastBuildDate>Fri, 10 Apr 2026 15:53:35 +0000</lastBuildDate><image><url>https://cloud.google.com/blog/products/compute/static/blog/images/google.a51985becaa6.png</url><title>Compute</title><link>https://cloud.google.com/blog/products/compute/</link></image><item><title>A developer’s guide to architecting reliable GPU infrastructure at scale</title><link>https://cloud.google.com/blog/products/compute/a-guide-to-architecting-reliable-gpu-infrastructure/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;Editor’s note&lt;/strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;This blog post outlines Google Cloud’s GPU AI/ML infrastructure reliability strategy, and will be updated with links to new community articles as they appear.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As we enter the era of multi-trillion parameter models, computational power has transitioned from a utility to a mission-critical strategic asset. To meet relentless training demand, organizations are no longer just building clusters — they are engineering massive, integrated compute ecosystems comprising hundreds of thousands of high-performance accelerators that are interconnected with an ultra-high-bandwidth networking backplane. At this unprecedented scale, raw performance thrives when it is built upon a foundation of systemic resilience.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In "always-on" mission-critical environments, the statistical probability of hardware variance becomes a primary constraint for reliability. When thousands of GPUs are operating at peak utilization for months at a time, a 0.01% performance fluctuation can trigger a systemic failure. With the cost of training interruptions now measured in millions of dollars and weeks of lost progress, the industry's focus has shifted. The true frontier of training isn't just about the size of the cluster — it’s about the resilient system architecture that powers the next generation of AI workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The core challenge for the industry goes beyond simple hardware fixes; it requires the creation of holistic software and infrastructure frameworks designed to withstand the inevitable disruptions of massive-scale computing. In an environment where AI/ML infrastructure represents a major capital expenditure on a company's balance sheet, partnering with a cloud provider that places a premium on infrastructure reliability is paramount.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Operational realities of AI at scale&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The construction of a supercomputer utilizing hundreds of thousands of advanced GPUs involves significant operational complexity. Maintaining optimal utilization over several months to train a single large language model (LLM) subjects the hardware to high levels of sustained performance that exceed the design parameters of conventional data center equipment. The advent of rack-scale GPU architectures, such as the NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72, has shifted the landscape. Considerations now extend beyond individual machines to entire rack-scale domains: a single issue can impact multiple interconnected trays, requiring coordinated management to keep AI/ML workloads running without disruption.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The business implications of infrastructure instability&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For organizations at the forefront of AI innovation, infrastructure reliability poses a significant commercial risk with substantial economic consequences.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;High cost of failure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A single failure in a massive training job requires restarting from the last checkpoint, wiping out days or even weeks of progress. When infrastructure spend is a major capital expenditure, every failure carries a material cost.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Delayed time-to-market:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; In the fast-moving AI space, being first matters. Every day spent debugging hardware failures is a day of delay in releasing new models while competitors pull ahead. Reliability issues can directly slow down model iteration cycles, delaying product launches and feature updates.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Operational complexities:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Manually managing a large GPU cluster is a resource-intensive task. Companies come to the cloud to reduce the cost of managing the infrastructure. Without systemic reliability investments, operations teams can get overwhelmed by a constant stream of alerts, forced to play "whack-a-mole" to identify, isolate, and replace faulty nodes, leaving little time to plan for future capacity and model demands.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Expensive workarounds to mitigate failure impact:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To achieve a certain level of performance and &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity?e=48754805&amp;amp;_gl=1*9b6bxc*_ga*MjA0OTQyOTQyNi4xNzcyNzc2OTEw*_ga_WH2QY8WWF5*czE3NzI3NzY5MDkkbzEkZzEkdDE3NzI3NzczNzUkajU4JGwwJGgw"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Goodput&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, companies can end up buying 10-20% more hardware than they actually need, purely as a failure buffer.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Quantitative assessment: Key reliability metrics&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond traditional uptime measurements, the primary metrics Google Cloud uses to measure AI infrastructure health and stability are MTBI and Goodput; a short worked example follows the definitions below.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Mean Time Between Interruption (MTBI):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The average time a system runs before encountering an interruption. This includes instance terminations as well as every customer workload interruption that our systems can observe (for example, GPU XID errors).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Goodput:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The amount of useful computational work completed per unit time.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
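&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make these definitions concrete, here is a minimal Python sketch of how the two metrics relate for a single training run. It is purely illustrative: the run length, interruption count, and recovery cost are hypothetical numbers, not Google Cloud data or an API.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Illustrative only: hypothetical numbers, not a Google Cloud tool or API.
run_hours = 720.0                    # 30-day training window
interruptions = 6                    # observed workload interruptions
lost_hours_per_interruption = 4.0    # restart plus replay from last checkpoint

# MTBI: average run time between interruptions.
mtbi_hours = run_hours / interruptions

# Goodput: fraction of wall-clock time spent on useful training work.
lost_hours = interruptions * lost_hours_per_interruption
goodput = (run_hours - lost_hours) / run_hours

print(f"MTBI: {mtbi_hours:.0f} h, Goodput: {goodput:.1%}")  # MTBI: 120 h, Goodput: 96.7%
&lt;/pre&gt;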
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud’s methodology: Engineering systemic resilience&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The objective has shifted from expecting total hardware perfection to engineering systems that demonstrate inherent resilience. We understand that trust in our infrastructure begins with reliability. Our approach is based on four principles:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Proactive prevention:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We’ve integrated hardware validation, real-time telemetry, and automated remediation throughout the infrastructure lifecycle. This systemic shift from reactive troubleshooting to proactive management optimizes the reliability of mission-critical GPU systems at scale.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Continuous monitoring and intelligent detection:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We transform raw data into actionable insights by synthesizing multi-layered telemetry through automated analysis to proactively identify and resolve anomalies. This data-driven approach shifts our infrastructure from reactive maintenance to an intelligent, self-healing system that helps ensure continuous workload stability.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Transparency and control:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;We empower users with full visibility and control over GPU infrastructure health. We provide a comprehensive suite of observability metrics and direct tools, allowing customers to correlate hardware status with their workload Goodput and report faults. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Minimizing disruptions:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Our control plane integrates smart scheduling with predictive health signals to enable improved workload migration via maintenance notifications. If unexpected issues arise, customers can enable automated remediations and fast recovery mechanisms to initiate rapid restoration of service. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We explore these principles in depth in a comprehensive &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;technical deep dive series&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; on Google’s approach to AI/ML infrastructure reliability for Google Cloud GPUs. Check back here as we add links to each installment:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;a href="https://discuss.google.dev/t/proactive-prevention-inside-google-clouds-multi-layered-gpu-qualification-process/337742" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Proactive prevention: Inside Google Cloud's multi-layered GPU qualification process&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Transparency and control: Providing operational transparency and management tools to mitigate GPU workload impact (coming soon)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Continuous monitoring and intelligent detection: Using ML to predict and prevent GPU downtime (coming soon)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Minimizing disruptions: Smart scheduling and fast recovery systems for mission-critical GPU clusters (coming soon)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><pubDate>Thu, 09 Apr 2026 22:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/a-guide-to-architecting-reliable-gpu-infrastructure/</guid><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>A developer’s guide to architecting reliable GPU infrastructure at scale</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/a-guide-to-architecting-reliable-gpu-infrastructure/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Abhijith Prabhudev</name><title>Product Manager, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Abhay Ketkar</name><title>Senior Staff Software Engineer, Google</title><department></department><company></company></author></item><item><title>AI infrastructure efficiency: Ironwood TPUs deliver 3.7x carbon efficiency gains</title><link>https://cloud.google.com/blog/topics/systems/ironwood-tpus-deliver-37x-carbon-efficiency-gains/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;At Google, we are committed to being &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/sustainability/tpus-improved-carbon-efficiency-of-ai-workloads-by-3x?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;transparent about the environmental impact of our AI infrastructure&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, publishing metrics on the lifetime emissions of our chips — from manufacturing to powering these chips in the data center. Today, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;we are updating these metrics for our seventh-generation TPU, Ironwood, which demonstrates an approximately 3.7x improvement in Compute Carbon Intensity (CCI) compared to TPU v5p&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;the previous generation of performance-optimized TPUs&lt;/span&gt;.&lt;/span&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In other words, even as AI drives demand for additional compute resources, our ongoing work to optimize AI hardware is helping to reduce the energy consumption and emissions of AI workloads.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Measuring AI accelerator efficiency: Compute Carbon Intensity (CCI)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To help manage the environmental impact of AI workloads, we monitor the Compute Carbon Intensity (CCI) of our AI accelerator hardware. CCI is defined in &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=11097303" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;An Introduction to Life-Cycle Emissions of Artificial Intelligence Hardware&lt;/span&gt;&lt;/a&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;sup&gt; &lt;/sup&gt;as the estimated amount of CO2 equivalent emitted for every utilized floating-point operation (CO2e/FLOP). This metric provides a holistic, chip-level view by including both the embodied emissions associated with manufacturing, transportation, and data center construction (Scope 3), as well as the operational emissions associated with running these chips in data centers (Scope 1 and 2).&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Ironwood advantage: high performance, low footprint&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google’s TPU CCI continues to improve with each chip generation. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Drawing from empirical data measured in January 2026, Ironwood demonstrates a remarkable 3.7x &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;improvement&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; in CCI relative to TPU v5p. This accelerates efficiency gains from the 1.2x CCI improvement of TPU v5p relative to TPU v4, and demonstrates continued carbon efficiency optimization of Google’s performance-optimized TPU architecture.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These efficiency gains are driven by outsized compute performance increases between TPU generations relative to growth in machine energy consumption and manufacturing emissions.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; In fact, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;fleetwide measurements demonstrate a 5x improvement in utilized FLOPs across generations, from TPU v5p to Ironwood.&lt;/span&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt; Because the performance denominator in our CCI equation (CO2e/FLOP) is scaling faster than emissions, the net carbon cost per operation drops significantly with every new chip.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
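&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A quick back-of-the-envelope check, using only the figures above, shows how this works. Since CCI is emissions divided by utilized FLOPs, a 5x gain in utilized FLOPs combined with a 3.7x CCI improvement implies that lifetime emissions per machine grew only about 1.35x across the generation:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# CCI = lifetime CO2e / utilized FLOPs.
# Published figures: ~5x utilized FLOPs, ~3.7x CCI improvement (v5p to Ironwood).
flops_ratio = 5.0        # Ironwood / v5p utilized FLOPs (fleetwide)
cci_improvement = 3.7    # v5p CCI / Ironwood CCI

# emissions_ratio = flops_ratio * (CCI_new / CCI_old)
implied_emissions_ratio = flops_ratio / cci_improvement
print(f"Implied emissions growth: {implied_emissions_ratio:.2f}x")  # ~1.35x
&lt;/pre&gt;&lt;/div&gt;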
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Oan2vLj.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: center;"&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Figure 1: Ironwood’s accelerating CCI improvement measured on Google’s performance-optimized TPU cohort, considering January 2026 workloads.&lt;/span&gt;&lt;/span&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Operating Google’s TPU fleet more efficiently&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Updated TPU CCI metrics also offer a direct comparison to the measurement we published in 2025. Specifically, from October 2024 to January 2026, Google’s versatile TPU cohort ran more efficiently than we previously reported (a short calculation after the list backs out the prior figures):&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;TPU v5e achieved a 43% reduction in total CCI over 15 months, dropping to 228 gCO2e/EFLOP. This was driven by a 72% increase in average utilization.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium, the sixth-generation TPU, saw a 20% reduction in total CCI over the same time period, bringing its emissions intensity down to 125 gCO2e/EFLOP.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
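&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For readers who want to reconcile these numbers with the 2025 publication, the prior CCI values can be backed out directly from the stated reductions. This is simple arithmetic on the figures in this post, not newly published data:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Backing out the previously published CCI from the stated reductions
# (arithmetic on the figures above, in gCO2e/EFLOP).
v5e_now, v5e_reduction = 228.0, 0.43
trillium_now, trillium_reduction = 125.0, 0.20

v5e_before = v5e_now / (1.0 - v5e_reduction)                  # ~400 gCO2e/EFLOP
trillium_before = trillium_now / (1.0 - trillium_reduction)   # ~156 gCO2e/EFLOP

print(f"TPU v5e:  {v5e_before:.0f} down to {v5e_now:.0f} gCO2e/EFLOP")
print(f"Trillium: {trillium_before:.0f} down to {trillium_now:.0f} gCO2e/EFLOP")
&lt;/pre&gt;&lt;/div&gt;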
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_HRjRsFh.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: center;"&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Figure 2: Google’s versatile TPU cohort demonstrates deployment efficiency gains for the same TPU generations between October 2024 and January 2026.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;5&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span&gt;&lt;span style="vertical-align: baseline;"&gt;These results demonstrate that Google continues to improve the carbon-efficiency of our AI infrastructure. While the massive scale of AI demand requires a significant and growing amount of power, our innovations allow us to deliver substantially more compute performance for every unit of energy consumed.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Decoupling energy and emissions from performance&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To what can we attribute these improvements? Beyond Ironwood’s raw hardware capabilities, these CCI gains are further enabled by deep software and system-level optimizations across our infrastructure:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Software efficiency (MoE):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The widespread adoption of sparse architectures, such as Mixture of Experts (MoE), routes computation only to necessary parameters. This drastically reduces the active FLOPs required per inference or training step without sacrificing model capacity or quality (a rough arithmetic sketch of this effect follows the list).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Lower precision math (FP8):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By heavily leveraging 8-bit floating-point (FP8) formats, we effectively double compute throughput and halve memory bandwidth requirements compared to 16-bit formats. This lets us maintain output quality while substantially decreasing the energy cost per mathematical operation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Workload mix and intelligent scheduling:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Advanced fleet orchestration continuously balances the workload mix across our infrastructure. By intelligently scheduling tasks, we ensure high continuous utilization rates, optimize duty cycles, and minimize the carbon penalty of idle power draw.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
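&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The MoE effect in the first bullet is easy to quantify. The sketch below uses hypothetical parameter and expert counts, not those of any specific Google model, to show why routing cuts the active FLOPs per token:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Hypothetical MoE shape, not any specific model.
total_expert_params = 1.0e12   # parameters held in experts
experts_total = 64
experts_active = 4             # experts routed per token

# Only routed experts run per token (shared layers ignored here).
active_fraction = experts_active / experts_total
print(f"Active expert parameters per token: {total_expert_params * active_fraction:.1e} "
      f"({active_fraction:.1%} of capacity)")
&lt;/pre&gt;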
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Scale sustainably with Google Cloud&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI’s trajectory requires infrastructure that can scale exponentially without an equivalent surge in carbon emissions. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;The 3.7x carbon efficiency improvement from TPU v5p to Ironwood demonstrates that we can achieve greater compute density while minimizing the growth of our energy and environmental footprint through deliberate hardware and software codesign.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; To learn more and get started with Ironwood, register your interest with &lt;/span&gt;&lt;a href="https://cloud.google.com/resources/ironwood-tpu-interest?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;this form&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sub&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;1. Following the methodology published in an &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=11097303" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;August 2025 technical report&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;, we quantified the full lifecycle emissions of TPU hardware as a point-in-time snapshot across Google’s generations of TPUs as of January 2026. The functional unit for this study is one AI computer deployed in the data center, which includes one or more accelerator trays (containing TPUs) connected to one host tray (i.e., a computing server). Peripheral components beyond the tray (e.g., rack, shelf, and network equipment) and auxiliary computing and storage resources are excluded from the calculation of embodied and operational emissions. We include the electricity used in data center cooling in operational emissions. To estimate operational emissions from electricity consumption of running workloads, we used a one-month sample of observed machine power data from our entire TPU fleet, applying Google’s 2024 average fleetwide carbon intensity. To estimate embodied emissions from manufacturing, transportation, and retirement, we performed a life-cycle assessment of the hardware. Data center construction emissions were estimated based on Google’s disclosed 2024 carbon footprint. These findings do not represent model-level emissions, nor are they a complete quantification of Google’s AI emissions. Based on the TPU location of a specific workload, CCI results of specific workloads may vary.&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;2. The authors would like to thank and acknowledge the co-authors of this paper for their important contributions to enable these results: Ian Schneider, Hui Xu, Stephan Benecke, Parthasarathy Ranganathan, and Cooper Elsworth.&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;3. This comparison considers the utilized FLOPS (BF16) between deployed TPU v5p and Ironwood chips in Google’s fleet in January 2026. This trend is consistent with the improvement in peak FLOPS (BF16) between v5p (459 teraFLOPS) and Ironwood (2,307 teraFLOPS).&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;4. The GHG protocol offers two accounting standards for operational emissions. Results presented here consider market-based emissions, which include the impact of carbon-free energy purchases. Location-based accounting, which excludes carbon-free energy purchases, would raise operational CCI to 793, 712, and 195 gCO2e/EFLOP, respectively. The ratio of CCI improvements would be at a similar level, and Ironwood’s embodied CCI would drop from 23% to 8% of its total CCI.&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;5. 
To ensure a fair comparison across varying TPU utilizations, this analysis replicates the propensity score weighting methodology from the &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/iel8/40/11236092/11097303.pdf" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;August 2025 technical report&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; and compares January 2026 results to the results published in 2025. This statistical technique adjusts for duty cycle variations to balance the comparison of TPUs during a given time period. This empirical methodology results in small variations in calculated CCI between temporal periods, reflecting fluctuations in real-world energy consumption and hardware utilization across the global infrastructure. &lt;/span&gt;&lt;/sub&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 06 Apr 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/ironwood-tpus-deliver-37x-carbon-efficiency-gains/</guid><category>Compute</category><category>Sustainability</category><category>TPUs</category><category>Systems</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>AI infrastructure efficiency: Ironwood TPUs deliver 3.7x carbon efficiency gains</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/ironwood-tpus-deliver-37x-carbon-efficiency-gains/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Keguo (Tim) Huang</name><title>Senior Data Scientist, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>David Patterson</name><title>Google Distinguished Engineer, Google</title><department></department><company></company></author></item><item><title>A developer’s guide to training with Ironwood TPUs</title><link>https://cloud.google.com/blog/products/compute/training-large-models-on-ironwood-tpus/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The transition toward trillion-parameter AI models has created an exponential demand for computational resources, testing the limits of traditional infrastructure. The seventh-generation Ironwood TPU features Google’s custom-designed AI infrastructure: It is engineered to scale as a holistic system supporting pods of up to 9,216 chips by combining Inter-Chip Interconnect (ICI), Optical Circuit Switch (OCS), Data Center Network (DCN) and massive aggregated High Bandwidth Memory (HBM) capacity. In addition, Ironwood features an integrated &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;co-design&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; between hardware architecture and software, introducing innovations such as compiler-centric XLA and Python-native kernels via &lt;/span&gt;&lt;a href="https://docs.jax.dev/en/latest/pallas/index.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Pallas&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. 
Together, these features significantly scale organizations’ capacity to train and serve sophisticated frontier models, optimizing the entire AI lifecycle and enabling sustained high performance.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_YpVMWLp.max-1000x1000.jpg"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This technical overview explores the specific methods and tools within the JAX and MaxText ecosystems designed to refine training efficiency and reach peak performance on Ironwood hardware.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Key optimization strategies for Ironwood&lt;/span&gt;&lt;/h2&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;1. Leverage native FP8 with MaxText&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ironwood is the first TPU generation with native 8-bit floating point (FP8) support in its Matrix Multiply Units (MXUs). By utilizing FP8 precision for weights, activations, and gradients, users can theoretically double throughput compared to Brain Floating Point 16 (BF16). When FP8 recipes are configured correctly, increased efficiency is achievable without compromising model quality. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To implement these FP8 training recipes, users can start with the &lt;/span&gt;&lt;a href="https://github.com/google/qwix" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Qwix&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; library. This functionality is enabled by specifying the relevant flags within the MaxText configuration.  &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;See our blog post, &lt;/span&gt;&lt;a href="https://discuss.google.dev/t/inside-the-optimization-of-fp8-training-on-ironwood/336681" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inside the optimization of FP8 training on Ironwood&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, in the Google Developer forums for more details.&lt;/span&gt;&lt;/p&gt;
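&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Since the exact MaxText flag names vary by release, the sketch below illustrates only the underlying storage math in plain JAX (assuming a recent JAX build with float8 dtypes): casting a BF16 weight matrix to FP8 halves its footprint. Real FP8 training uses scaled quantization recipes, such as those in Qwix, rather than a bare cast.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import jax.numpy as jnp

# Same weight matrix in BF16 and FP8 (e4m3); illustration only,
# not a production FP8 training recipe.
w_bf16 = jnp.ones((4096, 4096), dtype=jnp.bfloat16)
w_fp8 = w_bf16.astype(jnp.float8_e4m3fn)

print(w_bf16.nbytes / 2**20, "MiB in BF16")  # 32.0 MiB
print(w_fp8.nbytes / 2**20, "MiB in FP8")    # 16.0 MiB
&lt;/pre&gt;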
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;2. Accelerate with Tokamax kernels&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://github.com/openxla/tokamax/tree/main" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Tokamax&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is a library of high-performance JAX kernels optimized for TPUs. These kernels are designed to mitigate specific bottlenecks through the following mechanisms:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Splash Attention&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This mechanism addresses the I/O limitations inherent in standard attention processes. By maintaining computations within on-chip SRAM, it is particularly effective for processing long context lengths where memory bandwidth typically becomes a constraint. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Megablox Grouped Matrix Multiplication (GMM)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This manages the “ragged” tensors (data structures with inconsistent row lengths that typically create hardware idle time) often found in Mixture of Experts (MoE) models. By utilizing GMM, the system avoids inefficient padding and ensures higher utilization of the MXU (a padding-waste sketch follows this list).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Kernel tuning&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The Tokamax library includes &lt;/span&gt;&lt;a href="https://github.com/openxla/tokamax/blob/main/tokamax/experimental/utils/tuning/tpu/README.md" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;utilities&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for hyperparameter optimization. These tools allow for the adjustment of tile sizes and other configurations to align with the specific memory hierarchy of the Ironwood TPU.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
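&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To see why ragged expert batches matter, consider the padding arithmetic below. The group sizes are hypothetical and the snippet says nothing about Tokamax internals; it only quantifies the waste a grouped matmul avoids:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Hypothetical ragged routing result: tokens assigned to each expert.
tokens_per_expert = [970, 430, 55, 610]
pad_to = max(tokens_per_expert)   # naive path pads every group to the max

useful = sum(tokens_per_expert)
padded = pad_to * len(tokens_per_expert)
print(f"MXU work wasted on padding: {1 - useful / padded:.0%}")  # ~47%
# A grouped matmul runs each expert at its true size, avoiding this waste.
&lt;/pre&gt;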
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;3. Offload collectives to SparseCore&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The fourth-generation SparseCores in Ironwood are processors specifically designed to manage irregular memory access patterns. By using specific &lt;/span&gt;&lt;a href="https://github.com/AI-Hypercomputer/maxtext/blob/c0abc4c0c0a98e02413d7b6c669927d013467045/benchmarks/xla_flags_library.py#L70-L116" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;XLA flags&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, users can offload collective communication operations—such as &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;All-Gather&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Reduce-Scatter&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;—directly to the SparseCore.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This offloading mechanism allows the TensorCores to remain dedicated to primary model computations while communication tasks execute in parallel. This functional overlap is a critical strategy for hiding communication latency and ensuring consistent data throughput to the MXUs.&lt;/span&gt;&lt;/p&gt;
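&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;On TPU, XLA flags of this kind are typically passed through the LIBTPU_INIT_ARGS environment variable before JAX initializes. The sketch below shows the mechanism only; the flag string is a placeholder, since the actual SparseCore offload flags should be copied from the linked MaxText file for your libtpu release:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import os

# Set XLA/libtpu flags before JAX touches the TPU. The flag below is a
# PLACEHOLDER: copy the real SparseCore offload flags from the linked
# xla_flags_library.py for your release.
os.environ["LIBTPU_INIT_ARGS"] = (
    os.environ.get("LIBTPU_INIT_ARGS", "") + " --placeholder_sparsecore_flag"
)

import jax  # import only after the environment is configured
&lt;/pre&gt;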
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;4. Fine-tune the memory pipeline on VMEM&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;VMEM, a critical part of the TPU memory architecture, is a fast on-chip SRAM that is designed to optimize kernel performance. You can improve overall execution speed by tuning the allocation of VMEM between the current operation and future weight prefetch. For example, increasing the VMEM reserved for the current scope allows larger kernel tile sizes, which can improve kernel performance by removing potential memory stalls.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Refer to &lt;/span&gt;&lt;a href="https://docs.jax.dev/en/latest/pallas/tpu/pipelining.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;TPU Pipelining&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for more on TPU memory architecture.&lt;/span&gt;&lt;/p&gt;
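&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a concrete but version-dependent example, the scoped VMEM budget is commonly adjusted through the xla_tpu_scoped_vmem_limit_kib flag; treat the name and value below as a starting point to verify against your XLA release rather than a guaranteed interface:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import os

# Raise the VMEM reserved for the current scope (value in KiB).
# Verify this flag against your XLA/libtpu version before relying on it.
os.environ["LIBTPU_INIT_ARGS"] = (
    os.environ.get("LIBTPU_INIT_ARGS", "") +
    " --xla_tpu_scoped_vmem_limit_kib=65536"
)
&lt;/pre&gt;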
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;5. Choose optimal sharding strategies&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Lastly, MaxText supports various parallelism techniques, all available on TPUs. The best choice depends on model size, architecture (dense vs. MoE), and sequence length. Selecting a proper sharding strategy can improve model performance (a minimal sketch follows the list):&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Fully Sharded Data Parallelism (FSDP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This is the preferred strategy for training large models that exceed the memory capacity of a single chip. FSDP shards model weights, gradients, and optimizer states across multiple chips. Increasing the per-device batch size and introducing more compute can hide the latency of the All-Gather operations and improve efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tensor Parallelism (TP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Shards individual tensors. Given Ironwood's high arithmetic intensity, TP is most effective for very large model dimensions. Leveraging TP with a dimension of 2 can take advantage of the fast die-to-die interconnect on Ironwood's dual-chiplet design.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Expert Parallelism (EP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Helpful for MoE models to distribute experts across devices.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Context Parallelism (CP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Necessary for very long sequences, sharding activations along the sequence dimension.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hybrid approaches&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Combining strategies is often required to balance compute, memory, and communication on large-scale runs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
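&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here is the promised minimal JAX sketch of a hybrid mesh: 8-way FSDP combined with the TP=2 layout that maps onto Ironwood's dual-chiplet die-to-die link. The axis names, mesh shape, and shardings are illustrative; MaxText drives all of this from its own configuration:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assumes a 16-chip slice: 8-way FSDP x 2-way tensor parallelism.
devices = mesh_utils.create_device_mesh((8, 2))
mesh = Mesh(devices, axis_names=("fsdp", "tensor"))

# FSDP shards weights along the fsdp axis; TP splits the output features.
w_sharding = NamedSharding(mesh, P("fsdp", "tensor"))
x_sharding = NamedSharding(mesh, P("fsdp", None))
&lt;/pre&gt;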
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;See the &lt;/span&gt;&lt;a href="https://discuss.google.dev/t/optimizing-frontier-model-training-on-tpu-v7x-ironwood/336983/2" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Optimizing Frontier Model Training on TPU v7x Ironwood&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; post in the Developer forums for more detail on techniques 2-5 above.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong style="vertical-align: baseline;"&gt;The Ironwood advantage: System-level performance&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These optimization techniques, coupled with Ironwood's architectural strengths like the high-speed 3D Torus Inter-Chip Interconnect (ICI) and massive HBM capacity, create a highly performant platform for training frontier models. The tight co-design across hardware, compilers (XLA), and frameworks (JAX, MaxText) ensures you can extract maximum performance from your AI infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to accelerate your AI journey? Explore the resources below to dive deeper into each optimization method.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Further reading&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;a href="https://discuss.google.dev/t/inside-the-optimization-of-fp8-training-on-ironwood/336681" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inside the optimization of FP8 training on Ironwood&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://discuss.google.dev/t/optimizing-frontier-model-training-on-tpu-v7x-ironwood/336983/2" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Optimizing Frontier Model Training on TPU v7x Ironwood&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sub&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;A special thanks to &lt;/span&gt;&lt;em&gt;&lt;span data-rich-links='{"per_n":"Hina Jajoo","per_e":"hjajoo@google.com","type":"person"}' style="vertical-align: baseline;"&gt;Hina Jajoo&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;span data-rich-links='{"per_n":"Amanda Liang","per_e":"amandaliang@google.com","type":"person"}' style="vertical-align: baseline;"&gt;Amanda Liang&lt;/span&gt;&lt;/em&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; for their contributions to this blog post.&lt;/span&gt;&lt;/sub&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 23 Mar 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/training-large-models-on-ironwood-tpus/</guid><category>AI &amp; Machine Learning</category><category>TPUs</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>A developer’s guide to training with Ironwood TPUs</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/training-large-models-on-ironwood-tpus/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Lillian Yu</name><title>Product Strategy &amp; Operations</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Liat Berry</name><title>Product Manager, Google TPUs</title><department></department><company></company></author></item><item><title>Google Cloud and NVIDIA expand AI innovation across industries at GTC 2026</title><link>https://cloud.google.com/blog/products/compute/google-cloud-ai-infrastructure-at-nvidia-gtc-2026/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The era of agentic AI is fundamentally changing enterprise infrastructure needs. As organizations build systems capable of dynamic reasoning and autonomous execution, the underlying infrastructure must evolve as well. Scaling these agentic workloads alongside massive mixture-of-experts (MoE) architectures demands a deeply optimized co-engineered stack.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To meet these demands, we’ve built the Google Cloud AI Hypercomputer, an AI-optimized infrastructure-as-a-service offering that integrates performance-optimized hardware, leading software, open frameworks, and flexible consumption models into a single, cohesive system to deliver ultra-low-latency, high-throughput, and cost-effective inference. To give our customers even more options within this integrated architecture, we are expanding our partnership with NVIDIA.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This week at NVIDIA GTC 2026, Google Cloud and NVIDIA are expanding our partnership with a wave of new announcements, showcasing a co-engineered AI infrastructure foundation:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure and hardware&lt;/strong&gt;&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Strong momentum for Google Cloud G4 VMs, powered by NVIDIA RTX PRO&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; 6000 Blackwell Server Edition&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Preview of flexible, fractional G4 VMs using NVIDIA vGPU technology — a first in the industry for NVIDIA RTX PRO&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; 6000 Blackwell Server Edition&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Upcoming support for NVIDIA Vera Rubin NVL72 Platform&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Software and platform&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;NVIDIA Dynamo integration with GKE Inference Gateway&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Enhanced NVIDIA support across Vertex AI Training and Model Garden&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Ecosystem&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Kaggle competition for NVIDIA Nemotron on G4 VMs&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Launch of a dedicated public sector AI startup accelerator program&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let’s take a closer look at the announcements.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Accelerating AI workloads with G4 VMs&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;G4 VMs, powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, are built to power a diverse spectrum of high-performance workloads — from advanced spatial computing to complete AI development lifecycles. For instance, companies like Otto Group &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;One.O &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;and WPP use the G4 to run physically accurate simulations and real-time 3D rendering at scale.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond simulation, the G4 also shines in model fine-tuning and inference, particularly for models ranging from 30B to more than 100B parameters. By leveraging 4-bit floating point (FP4) precision and Google’s peer-to-peer (P2P) communication, customers are achieving &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/g4-vms-p2p-fabric-boosts-multi-gpu-workloads"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;higher throughput for model serving and considerable latency reductions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, enabling a new class of real-time, multimodal AI agents and highly responsive generative AI applications.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here are some examples of how customers are already leveraging the performance and efficiency of G4 VMs to accelerate their most demanding workloads:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Google Cloud’s G4 VMs give us the scalable GPU backbone we need to push billions of miles of photorealistic simulation through our pipeline. The 4x lift in throughput means our ML teams can iterate faster, train on richer data, and validate edge cases long before our models ever see the real world.&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;– Sony Mohapatra, Director, AI/ML Engineering, General Motors&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="vertical-align: baseline;"&gt;“&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Now with G4 VMs powered by NVIDIA Blackwell, we're pushing our multimodal models even further — faster inference, better reliability, instant replies across languages. The goal stays the same: making voice agents that work at enterprise scale without compromise. We are excited to keep building together and see what our customers deploy with this.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;– Mati Staniszewski, Cofounder, ElevenLabs&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Google Cloud G4 VMs provide the computational backbone for our Robotic Coordination Layer, allowing us to synchronize autonomous fleets across our logistics centers with millisecond precision. By simulating complex warehouse environments in a high-fidelity digital twin, we can optimize our entire supply chain virtually before a single robot moves on the floor.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; – &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Dr. Stefan Borsutzky, CEO of Otto Group One.O&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“After transitioning to G4 VMs, we achieved a 50% reduction in processing latency and 6x increase in throughput just by updating our Terraform scripts. It’s rare to get that kind of performance boost for our core workloads without adding any operational overhead.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; – Alfonso Acosta, Head of Engineering, Imgix&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Introducing fractional G4 VMs &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are excited to announce the preview of fractional G4 VMs, providing a highly efficient and cost-effective entry point for AI and graphics workloads. These new configurations, using NVIDIA virtual GPU (vGPU) technology, allow you to leverage the power of the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs in flexible, smaller increments, so you can right-size your infrastructure to match the specific demands of your applications.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;“Enterprises need unprecedented flexibility to scale complex, agentic AI workloads. With Google Cloud, we’re introducing fractional G4 VMs powered by NVIDIA RTX PRO 6000 to let customers right‑size GPU capacity and maximize ROI. Together with our co‑engineered stack – from NVIDIA NeMo on Vertex AI to NVIDIA Dynamo with GKE – we’re delivering an open, high‑performance platform for next‑generation reasoning and MoE models.” &lt;/span&gt;&lt;/em&gt;&lt;span style="vertical-align: baseline;"&gt;– Ian Buck, VP / General Manager, Hyperscale and HPC, NVIDIA&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By providing more granular access to advanced hardware, fractional G4 VMs let you optimize resource allocation and reduce overhead without sacrificing performance. You can now select from additional GPU slice sizes for your specific needs:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;1/2 GPU:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Ideal for more intensive tasks such as LLM inference, robotics sensor simulation, and high-fidelity 3D rendering.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;1/4 GPU:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Optimized for mainstream workloads, including mid-range creative design, video transcoding, and real-time data visualization.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;1/8 GPU:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Great for lightweight applications such as remote desktops, productivity tools, and entry-level streaming services.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
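&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For rough capacity planning, the sketch below divides the RTX PRO 6000 Blackwell’s 96 GB of GPU memory proportionally across the slice sizes above. This is back-of-the-envelope math only; the memory actually presented to each slice is determined by the NVIDIA vGPU profile and may differ from a straight proportional split.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Back-of-the-envelope sizing for fractional G4 slices. Assumes the card's
# 96 GB of GPU memory is split proportionally across slices; the actual
# per-slice memory is set by the NVIDIA vGPU profile and may differ.
FULL_GPU_MEMORY_GB = 96

SLICE_FRACTIONS = {
    "1/2 GPU (LLM inference, sensor simulation, 3D rendering)": 0.5,
    "1/4 GPU (creative design, transcoding, visualization)": 0.25,
    "1/8 GPU (remote desktops, productivity apps, streaming)": 0.125,
}

for label, fraction in SLICE_FRACTIONS.items():
    print(f"{label}: ~{FULL_GPU_MEMORY_GB * fraction:.0f} GB GPU memory")
&lt;/code&gt;&lt;/pre&gt;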
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These flexible G4 size portfolio let you:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Right-size infrastructure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Precisely match GPU capacity to application demands, ranging from lightweight remote desktops to intensive data processing.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Maximize cost efficiency:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Lower operational overhead by utilizing — and paying for — only the fractional GPU resources you need for specific tasks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Scale diverse workloads:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Power a broad spectrum of innovation, from high-fidelity creative design and streaming to complex robotics simulations and real-time inference.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These fractional G4 VMs can be managed by Google Kubernetes Engine (GKE), allowing developers to use advanced container binpacking to achieve even higher price-performance and resource utilization. When managed through Dynamic Workload Scheduler, you can set fallback priorities for fractional slices. This significantly improves obtainability by allowing the scheduler to automatically find available GPU configurations for each workload.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;The G4 vGPU’s flexible sizing allows us to precisely tailor compute resources to the scale of each molecular simulation, ensuring maximum efficiency across our drug discovery pipeline. This granular control means our researchers can seamlessly pivot between smaller workflows and massive parallel processing without being constrained by fixed hardware configurations.&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;– Shane Brauner, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;EVP, CIO, Schrödinger&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Scaling AI Hypercomputer with NVIDIA Vera Rubin NVL72&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on our deep engineering partnership with NVIDIA, we’re proud to support the successor to NVIDIA Blackwell architecture, the recently announced NVIDIA Vera Rubin platform. We plan to be among the first cloud providers to offer NVIDIA Vera Rubin NVL72 rack-scale systems in the second half of 2026, integrating them into our AI Hypercomputer architecture to empower the next generation of reasoning and agentic AI. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Delivering efficiency across the AI infrastructure stack &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As part of our commitment to a fully open ecosystem, we are excited to announce the integration of Dynamo and GKE &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This integration provides a modular, open-source control plane across the application layer and the hardware. By combining Dynamo with Inference Gateway on GKE, teams can tailor their infrastructure to their exact needs, allowing them to extract the maximum ROI from accelerators, accelerate time-to-market for new AI models, and future-proof their deployments.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can learn to maximize performance for massive MoE architectures through new &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/scaling-moe-inference-with-nvidia-dynamo-on-google-cloud-a4x?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;advanced scaling recipes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for A4X VMs (powered by NVIDIA GB200 NVL72 and Dynamo). These configurations show how to overcome memory and interconnect bottlenecks when running AI inference workloads on AI Hypercomputer.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are also enhancing resource obtainability through the Dynamic Workload Scheduler, with Calendar Mode and Flex Start for A4X and A4X Max (powered by NVIDIA GB300 NVL72), as well as new Flex Start support for G4 VMs. Dynamic Workload Scheduler lets you reserve the precise capacity that you need, or use flexible start windows. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Snap, a long-time Google Cloud customer, achieved significant cost savings by migrating two of its primary data processing pipelines to Google Cloud G2 VMs powered by NVIDIA L4 Tensor Core GPUs. This was made possible by leveraging Spark on GKE alongside NVIDIA’s new cuDF libraries, which automated the optimization of its shuffle-heavy workloads for optimal GPU efficiency. &lt;/span&gt;&lt;a href="https://www.nvidia.com/gtc/session-catalog/sessions/gtc26-s81678/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Learn more at GTC session S81678.&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Advancing Vertex AI training and Model Garden &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are meeting the demands of next-generation AI with two major infrastructure advancements to &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/docs/training/training-clusters/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI training clusters&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. First, support for &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;A4X VM domains&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; lets you leverage Vertex AI’s managed infrastructure and framework capabilities for massive-scale training on &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA GB200 NVL72 &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;rack-scale systems. To ensure these intensive workloads remain uninterrupted, new hardware resiliency capabilities let you apply configurable, proactive fault detection scans, which identify and mitigate potential hardware issues before they can disrupt critical “hero” training runs. These capabilities enable higher goodput and helps ensure that multi-week training jobs stay on track without costly restarts.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“We are setting a new standard for the agentic enterprise — delivering highly capable, consistent, accurate, and responsive AI agents with Google and NVIDIA. By leveraging Vertex AI training clusters on &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;NVIDIA GB200 NVL72&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; to power our Agentforce 360 Platform, we’ve eliminated infrastructure bottlenecks to keep our GPUs fully saturated. This high-performance, resilient architecture allows our researchers to focus on innovation at scale, driving substantial gains for our most complex reasoning workloads.” - &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Silvio Savarese, Chief Scientist, Salesforce&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At the same time, we continue to broaden Vertex AI Model Garden with support for &lt;/span&gt;&lt;a href="https://console.cloud.google.com/vertex-ai/publishers/nvidia/model-garden/nemotron-3-super" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NVIDIA’s Nemotron 3&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; family of open models. These include the Nemotron 3 Nano, featuring one-click deployment to simplify integration into private VPCs. We’ve also expanded our catalog to include the NVIDIA Nemotron 3 Super 120B model for immediate access to high-performance, large-scale reasoning. To maximize the value of these models, we’ve integrated NVIDIA’s latest performance libraries directly into Vertex AI to optimize popular open-source models on NVIDIA TensorRT-LLM. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To enable the community to get hands-on with NVIDIA Nemotron on Google Cloud, we &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;are also launching the NVIDIA Nemotron model reasoning challenge on Kaggle, powered by &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;G4 VMs. The competition invites the community to improve Nemotron 3 Nano’s reasoning accuracy on a new benchmark using techniques such as prompting, synthetic data generation, data curation, and fine-tuning – all running on cost-efficient G4 infrastructure so participants can iterate quickly and share their methods with the broader ecosystem. To learn more and register, &lt;/span&gt;&lt;a href="https://www.kaggle.com/competitions/nvidia-nemotron-model-reasoning-challenge" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;visit the Kaggle competition page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Empowering public sector AI startups &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To foster continued innovation within the ecosystem, Google Public Sector and NVIDIA are launching an AI startup accelerator program. This year-long initiative will support a select cohort of AI-focused Independent Software Vendors (ISVs) building solutions for the public sector.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Participants gain dual access to both NVIDIA Inception and Google Cloud’s ISV accelerator resources. Kicking off at GTC and continuing through Google Cloud Next, this joint program will equip emerging technology leaders with the co-engineered infrastructure, technical guidance, and go-to-market support required to scale mission-critical public sector applications. To learn more about the program, please complete the &lt;/span&gt;&lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSci71lEfkHJKb9wVN2UmXVGaOk3DeB84mW5dve8ulo9kl60pg/viewform" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;interest form&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Additional cohorts will be selected and announced in the future.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Co-engineering collaboration powers every layer of the AI stack&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The transition to complex, agentic AI demands more than just raw compute. It requires a fully optimized, co-engineered stack. By integrating flexible hardware like fractional G4 instances and the upcoming Vera Rubin platform into our AI Hypercomputer architecture, and pairing it with deep software co-engineering, we provide the scale, resilience, and efficiency you need to turn your most ambitious AI visions into reality.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Coming to GTC? Stop by booth #513 to learn more and talk to our team. And you can always learn more about our collaboration with NVIDIA at &lt;/span&gt;&lt;a href="http://cloud.google.com/NVIDIA"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;cloud.google.com/NVIDIA&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 16 Mar 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/google-cloud-ai-infrastructure-at-nvidia-gtc-2026/</guid><category>AI &amp; Machine Learning</category><category>Partners</category><category>Compute</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Google_Cloud_NVIDIA_Hero_Image_for_GTC26_Blo.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Google Cloud and NVIDIA expand AI innovation across industries at GTC 2026</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Google_Cloud_NVIDIA_Hero_Image_for_GTC26_Blo.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/google-cloud-ai-infrastructure-at-nvidia-gtc-2026/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Lohmeyer</name><title>VP and GM, AI and Computing Infrastructure</title><department></department><company></company></author></item><item><title>H4D VMs, now GA, deliver exceptional performance and scaling for HPC workloads</title><link>https://cloud.google.com/blog/products/compute/h4d-vms-now-ga/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we’re announcing  the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;general availability of H4D VMs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, our latest high performance computing (HPC)-optimized VM, powered by the 5th Generation AMD EPYC&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;™ processors&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. H4D VMs deliver exceptional performance, scalability, and value for industries like manufacturing, health care and life sciences, weather forecasting, and electronic design automation (EDA).&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; H4D supports orchestration via Cluster Toolkit with Slurm and via Google Kubernetes Engine (GKE). Each approach allows for near-instant deployment and scaling of demanding workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For the first time, the Google Cloud CPU portfolio features a VM family with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;C&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;loud Remote Direct Memory Access (RDMA).&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;H4D’s RDMA is on the &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium network adapter&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and lets you scale single-node H4D performance to multiple nodes, accelerating large production workloads. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Faster time to solution across domains and scales&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Powered by the high core density of the 5th Gen AMD EPYC CPU and Google’s innovative, low-latency &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/introducing-falcon-a-reliable-low-latency-hardware-transport"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Falcon hardware transport&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; H4D VMs enable you to iterate and discover faster than ever before.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We demonstrated H4D performance through a series of industry-standard benchmarks, showing its capabilities across diverse domains and problem sizes.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Healthcare and life sciences&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For researchers in healthcare and life sciences (HCLS), H4D VMs accelerate complex molecular simulations critical to scientific discovery. Compared to our previous C2D VMs, H4D VMs deliver up to a 4.3X speedup running LAMMPs (LJ benchmark) at 96 VMs, delivering 95% parallel efficiency on 18k cores. For drug discovery, we demonstrated a 5.8X speed-up using GROMACS (water_33m) at 32 VMs delivering 72% parallel efficiency on 6k cores. H4D also delivers further scalability, which we demonstrated by running the LAMMPS LJ benchmark on 192 VMs (&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;~37k cores) while maintaining 92% parallel efficiency (see Figure 3).&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_JTLuwUW.max-1000x1000.jpg"
        
          alt="1-Figuer1&amp;amp;2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--medium
      
      
        h-c-grid__col
        
        h-c-grid__col--4 h-c-grid__col--offset-4
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_RA1vjLg.jpg"
        
          alt="2-Figuer3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Manufacturing&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For manufacturing, H4D VMs help engineers shorten design cycles, run larger simulations, and iterate faster by delivering a strong performance boost for mission-critical Computer-Aided Engineering (CAE) workflows. Compared to our previous C2D VMs when running complex Computational Fluid Dynamics (CFD) simulations, H4D VMs deliver a 4.1X speedup running Ansys Fluent (F1_RaceCar_140m benchmark) on 32 VMs with 85% parallel efficiency. When running open-source OpenFOAM  (Motorbike_100m), we demonstrated a 5.2X speedup over C2D using 16 VMs and achieving superlinear parallel efficiency of 122%.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/3_9YSJuty.jpg"
        
          alt="3-Figuer4&amp;amp;5"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;A new standard for HPC price/performance&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;H4D VMs are designed to deliver the best price-performance for HPC workloads on Google Cloud by pairing superior performance with flexible consumption models. H4D supports Dynamic Workload Scheduler (DWS), which adapts to your workflow with Flex Start mode for just-in-time capacity and Calendar mode for guaranteed reservations. This allows you to access compute for as low as 3 cents per core-hour without long-term commitments. The resulting performance and cost efficiencies over previous generation VMs are detailed in Figures 6 and 7. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/4_VFxG3YM.jpg"
        
          alt="4-Figuer6"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/5_FKrLh4Z.jpg"
        
          alt="5-Figuer7"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Comprehensive HPC management&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To manage and deploy large, dense clusters of H4D VMs, you can leverage Google Cloud’s &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/cluster-capabilities"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Director&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which offers advanced maintenance capabilities (you can sign up for the preview &lt;/span&gt;&lt;a href="https://forms.gle/dppWNms5DF44gCwV9" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) alongside the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/cluster-toolkit/docs/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Toolkit&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for rapid cluster deployment  via turnkey system blueprints. For job and workload management, H4D VMs integrate with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/batch/docs/get-started"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Batch&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Google Cloud’s fully managed, cloud-native service that handles queuing, scheduling, and resource provisioning. Additionally, there’s support for &lt;/span&gt;&lt;a href="https://cloud.google.com/products/dws/pricing?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DWS&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which can be used in both Calendar mode for future reservations and Flex Start mode for time-limited, on-demand usage.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;What customers and partners are saying&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/jump.max-1000x1000.jpg"
        
          alt="jump"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;i&gt;“We were able to test the H4D platform in early access at&lt;/i&gt; &lt;a href="https://www.jumptrading.com/"&gt;&lt;i&gt;Jump Trading&lt;/i&gt;&lt;/a&gt;&lt;i&gt;, and were extremely impressed with the results. The successful testing process demonstrated that H4D offers the performance, stability, and efficiency we require for demanding, high-volume operations. We see up to 50% better price/performance compared to prior generation machines and are now accelerating integration with our critical grid workloads on Google Cloud."&lt;/i&gt; &lt;b&gt;- Alex Davies, Chief Technology Officer &amp;amp; Benjamin Stromski, HPC Linux Engineering, Jump Trading&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/hmx_labs.max-1000x1000.jpg"
        
          alt="hmx labs"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;i&gt;“There lingers, especially in large-scale and compute-intensive domains, the idea that the fastest systems can only be built on premises and run on bare metal hardware. Terms such as ‘hypervisor tax” are often thrown around as justification for operating with bare metal. Our testing paints a different picture. The Google H4D VM performs better on our financial risk benchmark than the bare metal top of stack AMD CPU of the same generation."&lt;/i&gt; &lt;b&gt;- Hamza Mian/CEO, HMxLabs&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/totalcare.max-1000x1000.jpg"
        
          alt="totalcare"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;i&gt;"As a leading provider of managed HPC solutions for the demanding CAE and manufacturing sectors, our evaluation of the H4D platform was focused heavily on its ability to handle our clients' largest, most tightly-coupled simulation workloads. We are extremely impressed with the results. The testing confirmed that the underlying RDMA fabric exhibits the outstanding low-latency and high-bandwidth performance required for massive parallel processing. This level of interconnect efficiency is non-negotiable for speeding up critical manufacturing simulations like crash testing and CFD. H4D has proven itself to be a true accelerator for high-throughput engineering workloads, and we are excited about its potential to redefine the performance ceiling for HPC in the engineering world."&lt;/i&gt; &lt;b&gt;- Rodney Mach/President, TotalCAE&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Google.max-1000x1000.jpg"
        
          alt="Google"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;b&gt;&lt;i&gt;“&lt;/i&gt;&lt;/b&gt;&lt;i&gt;The new H4D instances are a significant step forward for our demanding next-generation TPU simulation workloads. We've seen a 30% performance improvement across a variety of EDA benchmarks compared to C2D, demonstrating the strong single core performance of H4D. This directly translates to faster development cycles and allows our engineering teams to iterate more quickly”&lt;/i&gt;&lt;b&gt; - Trevor Switkowski, Technical Lead of Chip Design Methodology, Google Cloud&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Experience H4D today&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;H4D is now available in &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;us-central1-a (Iowa), europe-west4-b (Netherlands) and asia-southeast1-a (Singapore)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with additional regions coming soon. Check regional availability on our &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/regions-zones#available"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Regions and Zones page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and deploy your most demanding HPC workloads by leveraging &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/create-vm-with-rdma"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud RDMA&lt;/span&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;. &lt;/strong&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sub&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;The following configurations were run for the above benchmarks: LAMMPS version 20250722, GROMACS version 2023.1, OpenFOAM version 2312, Ansys Fluent version 2024R1. All runs used IntelMPI 2021.17.2. C2D/C3D/C4D used TCP, H4D used RDMA with RXM &amp;amp; SAR_LIMIT=2G. All runs used the full ppn (processes-per-node) available on each platform (56, 180, and 192 for C2D, C3D, and C4D/H4D respectively). Ansys Fluent runs used 168 ppn on H4D and variable ppn for C4D. SMT off for all. Cost comparison across single nodes of H4D-highmem-192 with DWS Flex Start price, c3d-standard-360 and c2d-standard-112 OD price.&lt;/span&gt;&lt;/em&gt;&lt;/sub&gt;&lt;/p&gt;
&lt;p&gt;&lt;sub&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Parallel efficiency and optimal node count depend on input size and communication patterns, and therefore vary across workloads.&lt;/span&gt;&lt;/em&gt;&lt;/sub&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 04 Mar 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/h4d-vms-now-ga/</guid><category>HPC</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>H4D VMs, now GA, deliver exceptional performance and scaling for HPC workloads</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/h4d-vms-now-ga/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Aysha Keen</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Felix Schürmann</name><title>Senior HPC Technologist</title><department></department><company></company></author></item><item><title>Simpler billing, clearer savings: A FinOps guide to updated spend-based CUDs</title><link>https://cloud.google.com/blog/topics/cost-management/a-finops-professionals-guide-to-updated-spend-based-cuds/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Optimizing cloud spend is one of the most rewarding aspects of FinOps — and committed use discounts (CUDs) remain one of the most effective levers to pull.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In July 2025, we began rolling out &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-multiprice"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;updates to the spend-based CUD model&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to make it easier to understand your costs and savings, expand coverage to new SKUs (including Cloud Run and H3/M-series VMs), and offer increased flexibility. These changes are now available to all customers. Let’s dive into how this new model simplifies your FinOps practice.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;1. What is the spend-based CUD data change all about? &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The most important shift is the move from a credit-based system to a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;direct discounted price model using &lt;/strong&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-multiprice#consumption-model-intro"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;consumption models.&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Under the old &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;credits model&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, you committed to an hourly on-demand amount. To find your &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;savings&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (the actual cost reduction realized), you had to use three different numbers: the full on-demand cost, the commitment fee, and the offsetting credit.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;1. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The old math:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;$10.00 (On-demand) + $5.50 (Commitment fee) - $10.00 (Credit) = $5.50 (Net Cost)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Savings = $10.00 (On-demand) - $5.50 (Net costs) = $4.50&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the new &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-multiprice#consumption-model-intro"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;direct discount model&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, you don’t need to do that math to calculate your net costs. You commit directly to the net, discounted spend amount. Your usage is simply billed at that discounted rate.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;2. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The new math:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;ol style="list-style-type: lower-alpha;"&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;$5.50 (Discounted costs)&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Savings = $10.00 (On-demand) - $5.50 (Discounted costs) = $4.50&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;  &lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can now see your net cost at a glance, and calculating the savings only requires comparing the on-demand price ($10.00) to your new discounted cost ($5.50), which equals &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;$4.50/hr.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;2. How do I validate my savings before and after the changes?  &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The unified &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/billing/docs/how-to/analyze-cuds"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;CUD Analysis tool&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is your best resource for auditing the migration or performing deep-dives on your spend. CUD Analysis for the new spend-based CUD model&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; allows you to quickly verify the savings you are getting with the new model, and you can use this tool to compare that the savings didn’t change between the old and the new model. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can validate your savings by following these steps:&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;1. Identify the date when the migration took place; you can see the migration date in the billing overview page.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_jzjRx1j.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;2. Go to CUD Analysis to validate the savings before and after the migration. &lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;3. To quantify costs from before the migration:&lt;/span&gt;&lt;/p&gt;
&lt;ol style="list-style-type: lower-alpha;"&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Filter the view for one day before the migration, in this case &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Oct. 26, 2025.&lt;/strong&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;Select a CUD Product, for example &lt;strong style="vertical-align: baseline;"&gt;Cloud SQL CUD.&lt;/strong&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;In our example, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;we&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;paid a $50.35 CUD fee to get a $69.12 credit. When you subtract that fee from the credit, your actual take-home &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;savings were $18.77&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_2jbhCzc.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;4. To validate costs after the migration&lt;/span&gt;&lt;/p&gt;
&lt;ol style="list-style-type: lower-alpha;"&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Change the date to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Oct. 28, 2025&lt;/strong&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;Under the new model, you pay the discounted rates upfront. Your dashboard will reflect a Net Cost of $50.35, compared to the $69.12 on-demand cost, clearly showing your &lt;strong style="vertical-align: baseline;"&gt;$18.77 in savings.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_nQjMUwd.max-1000x1000.png"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In addition, this release also includes &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-verify-discounts#example_cost_reports"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;an update to &lt;/span&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cost Reports&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to include “Savings Programs,” which accurately reflects your actual net savings ($18.77 in our example above), rather than gross credit. &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;When comparing pre- and post-migration data in Cost Reports, ensure you include both usage SKUs and commitment fee SKUs to capture the full scope of the commitment.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;3. What other capabilities are in the new CUD Analysis?&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond support for the new model, the new &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/billing/docs/how-to/analyze-cuds"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;CUD Analysis tool&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; offers deeper visibility into your CUD coverage and CUD utilization. You can now analyze your CUDs with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;hourly data granularity&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for up to 30 days. This is a major improvement for FinOps teams, as daily averages often hide underutilization spikes that occur during specific hours.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_HLosdOT.max-1000x1000.png"
        
          alt="4"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="rirdr"&gt;CUD Analysis: Compute Flexible CUD coverage analysis&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_9A7ZjUx.max-1000x1000.png"
        
          alt="5"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="rirdr"&gt;CUD Analysis: Per CUD purchase utilization visibility&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you want to use your own data analysis tools, we offer a new &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/billing/docs/how-to/export-data-bigquery-tables/cud-export"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;spend-based CUD metadata export&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;that lets you manage your spend-based CUDs programmatically. You can use this export to join with the Billing BigQuery Export datasets to run in-depth, programmatic analysis on all your commitment data. You can also export &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/billing/docs/how-to/analyze-cuds#download_your_report"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;a CSV from the CUD Analysis view&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to see the raw data for every resource and its price without needing the full BigQuery export.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;4. How much commitment should I buy? &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-recommender"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;CUD recommendations&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; are the primary tool for determining how much of a commitment to purchase. We recently enhanced our Compute Flexible CUD commitment recommendations to provide greater accuracy by including data from GKE, Cloud Run, Cloud Run Functions, and Compute Engine. Additionally, CUD scenario modeling allows you to adjust these suggestions in real-time. You can adjust coverage thresholds, filter out specific dates with irregular usage, or extend the lookback analysis window up to 180 days to identify the exact commitment level that aligns with your specific risk profile.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/6_MpUcC4f.max-1000x1000.png"
        
          alt="6"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="rirdr"&gt;CUD scenario modeling: experiment with multiple options to identify your ideal CUD strategy&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;5. Is there anything else I should know about Flex CUDs? &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the release of the new spend-based model, we’ve addressed the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;reporting limitation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; affecting customers who use a combination of &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/committed-use-discounts-overview#spend_based"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Flex CUDs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and GKE/Cloud Run CUDs. Previously, our analysis tools were unable to accurately identify the source of specific credits, leading to discrepancies in KPI metrics like savings, coverage, and utilization. Under the new spend-based CUD model, this limitation has been corrected, so your CUD analysis now provides an accurate, granular view of your savings per Google Cloud service.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To begin navigating the updated spend-based model, visit the Billing console. You can learn more in our documentation:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;a href="https://cloud.google.com/docs/cuds-multiprice"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Enhancements to the Spend-based CUD program &lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://cloud.google.com/docs/cuds-multiprice-datamodel"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Insights into the multi-price data model&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/docs/cuds-verify-discounts"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Verify your savings post-migration&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/compute/expanded-coverage-for-compute-flex-cuds/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;Save more with expanded coverage for Compute Flex CUDs&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;Compute Flexible Committed Use Discounts (Flex CUDs) now cover memory-optimized and HPC VM families and Cloud Run.&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Thu, 12 Feb 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/cost-management/a-finops-professionals-guide-to-updated-spend-based-cuds/</guid><category>Compute</category><category>Cost Management</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Simpler billing, clearer savings: A FinOps guide to updated spend-based CUDs</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/cost-management/a-finops-professionals-guide-to-updated-spend-based-cuds/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Alfonso Hernandez</name><title>Sr. Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Rahul Sharma</name><title>Sr. Product Manager</title><department></department><company></company></author></item><item><title>High-performance inference meets serverless compute with NVIDIA RTX PRO 6000 on Cloud Run</title><link>https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Running large-scale inference models can involve significant operational toil, including cluster management and manual VM maintenance. One solution is to leverage a serverless compute platform to abstract away the underlying infrastructure. Today, we’re bringing the serverless experience to high-end inference with support for &lt;/span&gt;&lt;a href="https://www.nvidia.com/en-us/data-center/rtx-pro-6000-blackwell-server-edition/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NVIDIA RTX PRO™ 6000 Blackwell Server Edition GPUs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; on Cloud Run. Now in preview, you can deploy massive models like Gemma 3 27B or Llama 3.1 70B with the 'deploy and forget' experience you’ve come to expect from Cloud Run. No reservations. No cluster management. Just code.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A powerful GPU platform&lt;/strong&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_qqUpivV.max-1000x1000.jpg"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The NVIDIA RTX PRO 6000 Blackwell GPU provides a huge leap in performance compared to the NVIDIA L4 GPU, bringing 96GB vGPU memory, 1.6 TB/s of bandwidth and support for FP4 and FP6. This means you can serve up to 70B+ parameter models without having to manage any underlying infrastructure. Cloud Run lets you attach a NVIDIA RTX PRO 6000 Blackwell GPU to your Cloud Run service, job, or worker pools, on demand, with no reservations required. Here are some ways you can use the NVIDIA RTX PRO 6000 Blackwell GPU to accelerate your business:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Generative AI and inference:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; With its FP4 precision support, the NVIDIA RTX PRO 6000 Blackwell GPU’s high-efficiency compute accelerates LLM fine-tuning and inference, letting you create real-time generative AI applications such as multi-modal and text-to-image creation models. By &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/services/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;running your model on Cloud Run services&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, you can also take advantage of rapid startup and scaling, going from zero instances to having a GPU with drivers installed under 5 seconds. When traffic eventually scales down zero and no more requests are being received, Cloud Run automatically scales your GPU instances down to zero.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;Fine-tuning and offline inference&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: NVIDIA RTX PRO 6000 Blackwell GPUs can be used with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/jobs/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run jobs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to fine-tune your model, and their fifth-generation NVIDIA Tensor Cores can work alongside AI models to help accelerate rendering pipelines and enhance content creation.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tailored scaling for specialized workloads&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Use &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/workerpools/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GPU-enabled worker pools&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to apply granular control over your GPU workers, whether you need to dynamically scale based on custom external metrics or manually provision "always-on" instances for complex, stateful processing.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We built Cloud Run to be the simplest way to run production-ready, GPU-accelerated tasks. Some highlights of Cloud Run include: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Managed GPUs with flexible compute: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Run pre-installs the necessary NVIDIA drivers so you can focus on your code. Cloud Run instances using NVIDIA RTX PRO 6000 Blackwell GPUs can configure up to 44 vCPU and 176GB of RAM.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Production-grade reliability:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By default, Cloud Run offers zonal redundancy, helping to ensure enough capacity for your service to be resilient to a zonal outage; this also applies to Cloud Run with GPUs. Alternatively, you can turn off zonal redundancy and benefit from a lower price for best-effort failover of your GPU workloads in case of a zonal outage.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tight integration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Cloud Run works natively with the rest of Google Cloud. You can load massive model weights by mounting Cloud Storage buckets as local volumes, or use &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/iap/docs/enabling-cloud-run"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Identity-Aware Proxy (IAP)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to secure traffic that’s bound for a Cloud Run service.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
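&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a minimal sketch of that last point, assuming the current gcloud volume flags, the following attaches a Cloud Storage bucket of model weights to an existing service; the service name, bucket name, volume name, and mount path are illustrative placeholders.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Mount a Cloud Storage bucket into a Cloud Run service as a read-only volume.
# "my-service" and "my-model-bucket" are placeholder names.
gcloud run services update my-service \
  --region us-central1 \
  --add-volume name=model-weights,type=cloud-storage,bucket=my-model-bucket,readonly=true \
  --add-volume-mount volume=model-weights,mount-path=/models&lt;/code&gt;&lt;/pre&gt;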
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The NVIDIA RTX PRO 6000 Blackwell GPU is available in preview on demand with availability in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;us-central1&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;europe-west4&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, and limited availability in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;asia-south2&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;asia-southeast1&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. You can deploy your first service using &lt;/span&gt;&lt;a href="https://ollama.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ollama&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, one of the easiest ways to run open models, on Cloud Run with NVIDIA RTX PRO 6000 GPUs enabled:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud beta run deploy my-service  \\\r\n--image ollama/ollama --port 11434 \\\r\n--cpu 20 --memory 80Gi \\\r\n--gpu-type nvidia-rtx-pro-6000 \\\r\n--no-gpu-zonal-redundancy \\\r\n--region us-central1&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7fa0a1b63070&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
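&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Two flags in this command are worth calling out: &lt;code&gt;--no-gpu-zonal-redundancy&lt;/code&gt; opts into the lower-priced, best-effort failover mode described above, and &lt;code&gt;--gpu-type&lt;/code&gt; selects the RTX PRO 6000. Depending on your container, you may also need to request a GPU count explicitly (for example, with the &lt;code&gt;--gpu&lt;/code&gt; flag).&lt;/span&gt;&lt;/p&gt;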
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more details, check out our updated &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/services/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/configuring/services/gpu-best-practices"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI inference best practices&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 02 Feb 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><category>Serverless</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>High-performance inference meets serverless compute with NVIDIA RTX PRO 6000 on Cloud Run</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>James Ma</name><title>Sr. Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Oded Shahar</name><title>Sr. Engineering Manager</title><department></department><company></company></author></item><item><title>Unlock 2x better price-performance with Axion-based N4A VMs, now generally available</title><link>https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;January 27, 2026: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The N4A is now generally available. You can get started by deploying &lt;/span&gt;&lt;a href="http://console.cloud.google.com/compute/instancesAdd"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;N4A from the Google Cloud console&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Decision makers and builders today face a constant challenge: managing rising cloud costs while delivering the performance their customers demand. As applications evolve to use scale-out microservices and handle ever-growing data volumes, organizations need maximum efficiency from their underlying infrastructure to support their growing general-purpose workloads.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image5_bCjzyyQ.max-1000x1000.png"
        
          alt="image5"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To meet this need, we’re excited to announce our latest Axion-based virtual machine series: N4A, available in preview on Compute Engine, Google Kubernetes Engine (GKE), Dataproc, and Batch, with support in Dataflow and other services coming soon. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A is the most cost-effective N-series VM to date, delivering &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;up to 2x better price-performance and 80% better performance-per-watt &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;than comparable current-generation x86-based VMs. This makes it easier for customers to further optimize the Total Cost of Ownership (TCO) for a broad range of general-purpose workloads. We see this with cloud-native businesses running scale-out web servers and microservices on GKE, enterprise teams managing backend application servers and mid-sized databases, and engineering organizations operating large CI/CD build farms. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud, we co-design our compute offerings with storage, networking and software at every layer of the stack, from orchestrators to runtimes, to deliver exceptional system-level performance and cost-efficiency. N4A’s breakthrough price-performance is powered by our latest-generation Google Axion Processors, built on the Arm® Neoverse® N3 compute core, Google &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/dynamic-resource-management"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Resource Management&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (DRM) technology, and &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Google Cloud’s custom-designed hardware and software system that offloads networking and storage processing to free up the CPU. Titanium is part of Google Cloud’s vertically integrated software stack — from the custom silicon in our servers to our planet-scale network traversing &lt;/span&gt;&lt;a href="https://cloud.google.com/about/locations"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;7.75 million kilometers of terrestrial and subsea fiber&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; across 42 regions — that is engineered to maximize efficiency and provide the ultra-low latency and high bandwidth to customers at global scale.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Redefining general-purpose compute and enabling AI inference&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A is engineered for versatility, with a feature set to support your general-purpose and CPU-based AI workloads. It comes in predefined and custom shapes, with up to 64 vCPUs and 512GB of DDR5 memory in high-cpu (2GB of memory per vCPU), standard (4GB per vCPU), and high-memory (8GB per vCPU) configurations, with instance networking bandwidth of up to 50 Gbps. N4A VMs support our latest-generation Hyperdisk storage options, including Hyperdisk Balanced, Hyperdisk Throughput, and Hyperdisk ML (coming later), providing up to 160K IOPS and 2.4 GB/s of throughput per instance. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A performs well across a range of industry-standard benchmarks that represent the key workloads our customers run every day. For example, relative to comparable current-generation x86-based VM offerings, N4A delivers up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;105%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; better price-performance for &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;compute-bound workloads&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;90%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; better price-performance for &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;scale-out web servers&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;85%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; better price-performance for &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Java applications&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, and up to&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; 20%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; better price-performance for general-purpose databases.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_q9MnCJ1.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="dxvss"&gt;Footnote: As of October 2025. Performance based on the estimated SPECrate®2017_int_base, estimated SPECjbb2015, MySQL Transactions/minute (RO), and Google internal Nginx Reverse Proxy benchmark scores run in production on comparable latest-generation generally-available VMs with general purpose storage types. Price-performance claims based on published and upcoming list prices for Google Cloud.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the real world, early adopters are seeing dramatic price-performance improvements from the new N4A instances.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_3I8oyl8.max-1000x1000.jpg"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="59dyk"&gt;&lt;i&gt;"At ZoomInfo, we operate a massive data intelligence platform where efficiency is paramount. Our core data processing pipelines, which are critical for delivering timely insights to our customers, run extensively on Dataflow and Java services in GKE. In our preview of the new N4A instances, we measured a 60% improvement in price-performance for these key workloads compared to their x86-based counterparts. This allows us to scale our platform more efficiently and deliver more value to our customers, faster."&lt;/i&gt; - &lt;b&gt;Sergei Koren, Chief Infrastructure Architect, ZoomInfo​&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_nDU2gjP.max-1000x1000.jpg"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="xulw1"&gt;&lt;i&gt;“Organizations today need performance, efficiency, flexibility, and scale to meet the computing demands of the AI era; this requires the close collaboration and co-design that is at the heart of our partnership with Google Cloud. As N4A redefines cost-efficiency, customers gain a new level of infrastructure optimization, enabling enterprises to choose the right infrastructure for their workload requirements with Arm and Google Cloud.”&lt;/i&gt; - &lt;b&gt;Bhumik Patel, Director, Server Ecosystem Development, Infrastructure Business, Arm&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Granular control with Custom Machine Types and Hyperdisk&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A key advantage of our N-series VMs has always been flexibility, and with N4A, we are bringing one of our most popular features to the Axion family for the first time: Custom Machine Types (&lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/instances/creating-instance-with-custom-machine-type"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;CMT&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;). Instead of fitting your workload into a predefined shape, CMTs on N4A let you independently configure the amount of vCPU and memory to meet your application's unique needs. This ability to right-size your instances means you pay only for the resources you use, minimizing waste and optimizing your total cost of ownership.&lt;/span&gt;&lt;/p&gt;
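&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a minimal sketch, assuming N4A custom shapes follow the same machine-type naming scheme as other N-series VMs (&lt;code&gt;FAMILY-custom-VCPUS-MEMORY_MB&lt;/code&gt;), creating a right-sized instance might look like this; the instance name, zone, and shape are illustrative placeholders.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create an N4A VM with a custom shape: 8 vCPUs and 32 GB (32768 MB) of memory.
# Instance name, zone, and shape are placeholders.
gcloud compute instances create my-n4a-vm \
  --zone us-central1-a \
  --machine-type n4a-custom-8-32768&lt;/code&gt;&lt;/pre&gt;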
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This same principle of matching resources to your specific workload applies to storage. N4A VMs feature support for our latest generation of &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/disks/hyperdisks"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperdisk&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, allowing you to select the right storage profile for your application's needs, as shown in the example after this list:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk Balanced:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Offers an optimal mix of performance and cost for the majority of general-purpose workloads, with up to 160K IOPs per N4A VM.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk Throughput:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Delivers up to 2.4GiBps of max throughput for bandwidth-intensive analytics workloads like Hadoop or Kafka, providing high-capacity storage at an excellent value.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk ML &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;(post GA)&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Purpose-built for AI/ML workloads, allows you to attach a single disk containing your model weights or datasets to up to 32 N4A instances simultaneously for large-scale inference or training tasks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk Storage Pools&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Instead of provisioning capacity and performance on a per-volume basis, allows you to provision performance and capacity in aggregate, &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/cost-saving-strategies-when-migrating-to-google-cloud-compute?e=48754805#:~:text=2.%20Optimize%20your%20block%20storage%20selections"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;further optimizing costs by up to 50%&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and simplifying management.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
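&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As an illustrative sketch of the list above, provisioning a Hyperdisk Balanced volume with explicit performance targets and attaching it to an N4A VM might look like the following; the disk and instance names, zone, size, and the IOPS/throughput values are placeholders, not recommendations.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create a Hyperdisk Balanced volume with decoupled, provisioned performance
# (placeholder values), then attach it to an existing N4A VM.
gcloud compute disks create my-hdb-disk \
  --zone us-central1-a \
  --type hyperdisk-balanced \
  --size 500GB \
  --provisioned-iops 10000 \
  --provisioned-throughput 600

gcloud compute instances attach-disk my-n4a-vm \
  --zone us-central1-a \
  --disk my-hdb-disk&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;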
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_ZB4gdHF.max-1000x1000.jpg"
        
          alt="4"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="7cqx3"&gt;&lt;i&gt;"At Vimeo, we have long relied on Custom Machine Types to efficiently manage our massive video transcoding platform. Our initial tests on the new Axion-based N4A instances have been very compelling, unlocking a new level of efficiency. We've observed a 30% improvement in performance for our core transcoding workload compared to comparable x86 VMs. This points to a clear path for improving our unit economics and scaling our services more profitably, without changing our operational model."&lt;/i&gt; - &lt;b&gt;Joe Peled, Sr. Director of Hosting &amp;amp; Delivery Ops, Vimeo&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A growing Arm-based Axion portfolio for customer choice&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C-series VMs are designed for workloads that require consistently high performance, e.g., medium-to-large-scale databases and in-memory caches. Alongside them, N-series VMs have been a key Compute Engine pillar, offering a balance of price-performance and flexibility, and lowering the cost of running workloads with variable resource needs such as scale-out Java/GKE workloads. We released our first Axion-based machine series, &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/try-c4a-the-first-google-axion-processor?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4A&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, in October 2024; N4A complements C4A, providing a range of Google Axion instances suited to your workloads’ precise needs. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;On top of that, GKE unlocks significant price-performance advantages by orchestrating Axion-based C4A and N4A machine types. GKE leverages &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-custom-compute-classes"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Custom Compute Classes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to provision and mix these machine types, matching workloads to the right hardware. This automated, heterogeneous cluster management allows teams to optimize their total cost of ownership across their entire application stack.&lt;/span&gt;&lt;/p&gt;
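&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A minimal sketch of such a class, assuming the GKE custom compute class API and an illustrative priority order (prefer C4A, fall back to N4A); the class name is a placeholder:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Define a custom compute class that prefers Axion C4A nodes and falls back to N4A.
cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: axion-first   # placeholder name
spec:
  priorities:
  - machineFamily: c4a
  - machineFamily: n4a
  whenUnsatisfiable: ScaleUpAnyway
EOF&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Workloads then opt in with a &lt;code&gt;cloud.google.com/compute-class: axion-first&lt;/code&gt; node selector, and GKE provisions whichever machine family in the priority list has capacity.&lt;/span&gt;&lt;/p&gt;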
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Also &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;joining the Axion family is C4A.metal&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Google Cloud’s first Axion bare metal instance that helps builders meet use cases that require access to the underlying physical server to run specialized applications in a non-virtualized environment, such as automotive systems development, workloads with strict licensing requirements, and Android software development. &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4A.metal will be available in preview soon&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Supported by the broad and mature Arm ecosystem, adopting Axion is easier than ever, and the combination of C4A and N4A can help you lower the total cost of running your business, without compromising on performance or workload-specific requirements&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;N4A for cost optimization and flexibility.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Deliberately engineered for general-purpose workloads that need a balance of price and performance, including scale-out web servers, microservices, containerized applications, open-source databases, batch, data analytics, development environments, data preparation and AI/ML experimentation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;C4A for consistently high performance, predictability, and control.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Powering workloads where every microsecond counts, such as medium- to large-scale databases, in-memory caches, cost-effective AI/ML inference, and high-traffic gaming servers. C4A delivers consistent performance, offering a controlled maintenance experience for mission-critical workloads, networking bandwidth up to 100 Gbps, and next-generation Titanium Local SSD storage. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_m4GINGe.max-1000x1000.jpg"
        
          alt="5"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="7cqx3"&gt;&lt;i&gt;"Migrating to Google Cloud's Axion portfolio gave us a critical competitive advantage. We slashed our compute consumption by 20% while maintaining low and stable latency with C4A instances, such as our Supply-Side Platform (SSP) backend service. Additionally, C4A enabled us to leverage Hyperdisk with precisely the IOPS we need for our stateful workloads, regardless of instance size. This flexibility gives us the best of both worlds - allowing us to win more ad auctions for our clients while significantly improving our margins. We're now testing the N4A family by running some of our key workloads that require the most flexibility, such as our API relay service. We are happy to share that several applications running in production are consuming 15% less CPU compared to our previous infrastructure, reducing our costs further, while ensuring that the right instance backs the workload characteristics required.”&lt;/i&gt; - &lt;b&gt;Or Ben Dahan, Cloud &amp;amp; Software Architect at Rise&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started with N4A today&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A is available in the following Google Cloud regions: &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;us-central1 (Iowa)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;us-east4 (Virginia)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, us-east1 (South Carolina), us-west1 (Oregon), asia-southeast1 (Singapore), europe-west1 (Belgium), europe-west2 (London), &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;europe-west3 (Frankfurt) &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;and &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;europe-west4 (Netherlands)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; with more regions to follow. Learn&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; more about N4A &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/general-purpose-machines#n4a_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here in documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;; deploy N4A &lt;/span&gt;&lt;a href="http://console.cloud.google.com/compute/instancesAdd"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here in the console&lt;/span&gt;&lt;/a&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 27 Jan 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview/</guid><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Unlock 2x better price-performance with Axion-based N4A VMs, now generally available</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nate Baum</name><title>Senior Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mo Farhat</name><title>Group Product Manager</title><department></department><company></company></author></item><item><title>Scaling WideEP Mixture-of-Experts inference with Google Cloud A4X (GB200) and NVIDIA Dynamo</title><link>https://cloud.google.com/blog/products/compute/scaling-moe-inference-with-nvidia-dynamo-on-google-cloud-a4x/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As organizations transition from standard LLMs to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;massive Mixture-of-Experts (MoE) &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;architectures like DeepSeek-R1, the primary constraint has shifted from raw compute density to communication latency and memory bandwidth. Today, we’re releasing two new validated recipes designed to help customers overcome the infrastructure bottlenecks of the agentic AI era. 
These new recipes provide clear steps to optimize both throughput and latency, built on the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;A4X machine series&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; powered by &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA GB200 NVL72&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA Dynamo&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, which extend the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/ai-inference-recipe-using-nvidia-dynamo-with-ai-hypercomputer?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;reference architecture&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; we published in September 2025 for disaggregated inference on A3 Ultra (NVIDIA H200) VMs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’re bringing the best of both worlds to AI infrastructure by combining the multi-layered scalability of Google Cloud’s AI infrastructure with the rack-scale acceleration of the A4X. These recipes are part of a broader collaboration between our organizations that includes investments in important inference infrastructure like &lt;/span&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Resource Allocation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (DRA) and &lt;/span&gt;&lt;a href="https://gateway-api-inference-extension.sigs.k8s.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Highlights of the updated reference architecture include: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Google Cloud’s A4X machine series, powered by NVIDIA GB200 NVL72, creating a single 72-GPU compute domain connected with fifth-generation NVIDIA NVLink.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Serving architecture:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; NVIDIA Dynamo functions as the distributed runtime, managing KV cache state and kernel scheduling across the rack-scale fabric.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Performance: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For 8K/1K input sequence length (ISL)/ output sequence length (OSL) , we achieved &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;over 6K total tokens/sec/GPU&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in throughput-optimized configurations and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;10ms inter-token latency (ITL)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in latency-optimized configurations.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span&gt;&lt;strong style="vertical-align: baseline;"&gt;Deployment:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Two new recipes are available today in &lt;/span&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/blob/main/inference/a4x/disaggregated-serving/dynamo/README.md" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;this repo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for deploying this stack on Google Cloud using Google Kubernetes Engine (GKE) for orchestration.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The modern inference stack&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To achieve exascale performance, inference cannot be treated as a monolithic workload. It requires a modular architecture where every layer is optimized for specific throughput and latency targets. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;The AI Hypercomputer inference stack consists of three distinct layers:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure layer:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The physical compute, networking, and storage fabric (e.g., A4X).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Serving layer:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The specific model architecture and the optimized execution kernels (e.g., NVIDIA Dynamo, NVIDIA TensorRT-LLM, Pax) and runtime environment managing request scheduling, KV cache state, and distributed coordination.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Orchestration layer:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The control plane for resource lifecycle management, scaling, and fault tolerance (e.g., Kubernetes).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the reference architecture detailed below, we focus on a specific, high-performance instantiation of this stack designed for the NVIDIA ecosystem. We combine the A4X at the infrastructure layer with NVIDIA Dynamo at the model serving layer, orchestrated by GKE.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Infrastructure layer: The A4X rack-scale architecture&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-a4x-vms-powered-by-nvidia-gb200-gpus?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A4X launch announcement&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in February 2025, we described how the A4X VM addresses bandwidth constraints by implementing the GB200 NVL72 architecture, which fundamentally alters the topology available to the scheduler.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Unlike previous generations where NVLink domains were bound by the server chassis (typically 8 GPUs), the A4X exposes a unified fabric, with:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;72 NVIDIA Blackwell GPUs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; interconnected via the NVLink Switch System that enables the 72 GPUs to operate as one giant GPU with unified shared memory&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;130TB/s aggregate bandwidth&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, enabling all-to-all communication with latency profiles comparable to on-board memory access (72 GPUs x 1.8 TB/s/GPU)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Native NVFP4 support:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Blackwell Tensor Cores support 4-bit floating point precision, effectively doubling throughput relative to FP8 for compatible model layers. We used &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;FP8 Precision Scaling&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for this benchmark to support configuration and comparison with previously published results.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Serving layer: NVIDIA Dynamo&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Hardware of this scale requires a runtime capable of managing distributed state without introducing synchronization overhead. NVIDIA Dynamo serves as this distributed inference runtime. It moves beyond simple model serving to coordinate the complex lifecycle of inference requests across the underlying infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The serving layer optimizes utilization on the A4X through these specific mechanisms:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Wide Expert Parallelism (WideEP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Traditional MoE serving shards experts within a single node (typically 8 GPUs), leading to load imbalances when specific experts become "hot." We use the A4X's unified fabric to distribute experts across the full 72-GPU rack. This WideEP configuration absorbs bursty expert activation patterns by balancing the load across a massive compute pool, helping to ensure that no single GPU becomes a straggler.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Deep Expert Parallelism (&lt;/strong&gt;&lt;a href="https://github.com/deepseek-ai/DeepEP" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;DeepEP&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: While WideEP distributes the experts, DeepEP optimizes the critical "dispatch" and "combine" communication phases. DeepEP accelerates the high-bandwidth all-to-all operations required to route tokens to their assigned experts. This approach minimizes the synchronization overhead that typically bottlenecks MoE inference at scale.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Disaggregated request processing:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Dynamo decouples the compute-bound prefill phase from the memory-bound decode phase. On the A4X, this allows the scheduler to allocate specific GPU groups within the rack to prefill (maximizing tensor core saturation) while other GPUs handle decode (maximizing memory bandwidth utilization), preventing resource contention.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Global KV cache management:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Dynamo maintains a global view of the KV cache state. Its routing logic directs requests to the specific GPU holding the relevant context, minimizing redundant computation and cache migration.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;JIT kernel optimization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The runtime leverages NVIDIA Blackwell-specific kernels, performing just-in-time fusion of operations to reduce memory-access overhead during the generation phase.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Orchestration layer: Mapping software to hardware&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While the A4X provides the physical fabric and Dynamo provides the runtime logic, the orchestration layer is responsible for mapping the software requirements to the hardware topology. For rack-scale architectures like the GB200 NVL72, container orchestration needs to evolve beyond standard scheduling. By making the orchestrator explicitly aware of the physical NVLink domains, we can fully unlock the platform’s performance and help ensure optimal workload placement.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;GKE enforces this hardware-software alignment through these specific mechanisms:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Rack-level atomic scheduling:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; With GB200 NVL72, the "unit of compute" is no longer a single GPU or a single node; the entire rack is the new fundamental building block of accelerated computing. We use GKE capacity reservations with specific affinity settings to target a reserved block of A4X infrastructure that guarantees dense deployment. By consuming this reservation, GKE helps ensure that all pods comprising a Dynamo instance land on the specific, physically contiguous rack hardware required to establish the NVLink domain, providing the hard topology guarantee needed for WideEP and DeepEP.&lt;/span&gt;&lt;/p&gt;
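&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A sketch of consuming such a reservation when creating a GKE node pool; the cluster, reservation, and machine type names here are illustrative assumptions rather than prescriptive values:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create an A4X node pool pinned to a specific capacity reservation so that all
# nodes land on the reserved, physically contiguous rack. Names are placeholders.
gcloud container node-pools create a4x-pool \
  --cluster my-inference-cluster \
  --region us-central1 \
  --machine-type a4x-highgpu-4g \
  --reservation-affinity specific \
  --reservation my-a4x-reservation&lt;/code&gt;&lt;/pre&gt;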
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Low-latency model loading via GCS FUSE: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Serving massive MoE models requires loading terabytes of weights into high-bandwidth memory (HBM). Traditional approaches that download weights to local disk incur unacceptable "cold start" latencies. We leverage the &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GCS FUSE CSI Driver&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to mount model weights directly from Google Cloud Storage as a local file system. This allows the Dynamo runtime to "lazy load" the model, streaming data chunks directly into GPU memory on demand. This approach eliminates the pre-download phase, significantly reducing the time-to-ready for new inference replicas and enabling faster auto-scaling in response to traffic bursts.&lt;/span&gt;&lt;/p&gt;
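&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For instance, a pod that streams weights from a bucket through the GCS FUSE CSI driver might be declared as follows; the pod name, image, and bucket are placeholders:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Mount a Cloud Storage bucket of model weights via the GCS FUSE CSI driver.
# The gke-gcsfuse/volumes annotation enables the driver's sidecar for this pod.
cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: dynamo-worker            # placeholder name
  annotations:
    gke-gcsfuse/volumes: "true"
spec:
  containers:
  - name: worker
    image: us-docker.pkg.dev/my-project/serving/worker:latest  # placeholder image
    volumeMounts:
    - name: model-weights
      mountPath: /models
      readOnly: true
  volumes:
  - name: model-weights
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: my-weights-bucket   # placeholder bucket
EOF&lt;/code&gt;&lt;/pre&gt;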
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Kernel-bypass networking (GPUDirect RDMA): &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;To maximize the aggregate 130 TB/s bandwidth of the A4X, the networking stack must minimize CPU and I/O involvement. We configure the GKE cluster to enable&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;GPUDirect RDMA&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;over the Titanium network adapter. By injecting specific NCCL topology configurations and enabling IPC_LOCK capabilities in the container, we allow the application to bypass the OS kernel and perform Direct Memory Access (DMA) operations between the GPU and the network interface. This configuration offloads the NVIDIA Grace CPU from data path management, so that networking I/O does not become a bottleneck during high-throughput token generation.&lt;/span&gt;&lt;/p&gt;
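&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The IPC_LOCK capability mentioned above is granted per container; a minimal sketch of adding it to an existing deployment (the deployment name is a placeholder):&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Grant IPC_LOCK so the NCCL/RDMA stack can pin (mlock) GPU communication
# buffers in memory for kernel-bypass DMA transfers. Names are placeholders.
kubectl patch deployment dynamo-worker --type json -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/securityContext",
   "value": {"capabilities": {"add": ["IPC_LOCK"]}}}
]'&lt;/code&gt;&lt;/pre&gt;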
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Performance validation&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We observed the following when assessing the scaling characteristics of an 8K/1K workload on DeepSeek-R1 (FP8) with SGLang for two distinct optimization targets. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Throughput-optimized configuration&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Setup:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; All 72 GPUs utilizing DeepEP. 10 prefill nodes with 5 workers (TP8) and 8 decode nodes with 1 worker (TP32).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Result:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We sustained over &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;6K total tokens/sec/GPU&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (1.5K output tokens/sec/GPU), which matches the performance published by InferenceMAX (&lt;/span&gt;&lt;a href="https://github.com/InferenceMAX/InferenceMAX/actions/runs/20356790608/job/58493812121" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;source&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
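&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For context, 6K total tokens/sec/GPU across the 72-GPU NVLink domain works out to roughly 432K total tokens per second for the rack as a whole.&lt;/span&gt;&lt;/p&gt;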
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Latency-optimized configuration&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Setup:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; 8 GPUs (two nodes) without DeepEP. 1 prefill node with 1 prefill worker (TP4) and 1 decode node with 1 decode worker (TP4). &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Result:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We sustained a median Inter-Token Latency (ITL) of &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;10ms&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; at a concurrency of 4, which matches the performance published by InferenceMAX (&lt;/span&gt;&lt;a href="https://github.com/InferenceMAX/InferenceMAX/actions/runs/20413316138/job/58653323053" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;source&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Looking ahead&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As models evolve from static chat interfaces to complex, multi-turn reasoning agents, the requirements for inference infrastructure will continue to shift. We are actively updating and releasing benchmarks and recipes as we invest across all three layers of the AI inference stack to meet these demands:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure layer&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/now-shipping-a4x-max-vertex-ai-training-and-more?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;recently released A4X Max&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is based on the NVIDIA GB300 NVL72 in a single 72 GPU rack configuration, bringing 1.5X more NVFP4 FLOPs, 1.5X more GPU memory, and 2X higher network bandwidth compared to A4X. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Serving layer&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: We are actively exploring deeper integrations with components of NVIDIA Dynamo, e.g., pairing KV Block Manager with Google Cloud remote storage, funneling Dynamo metrics into our Cloud Monitoring dashboards for enhanced observability, and leveraging GKE Custom Compute Classes (CCC) for better capacity and obtainability, as well as setting a new baseline with FP4 precision.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Orchestration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: We plan to incorporate additional optimizations into these tests, e.g. &lt;/span&gt;&lt;a href="https://gateway-api-inference-extension.sigs.k8s.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; as the intelligent inference scheduling component, following the design patterns established in the llm-d &lt;/span&gt;&lt;a href="https://llm-d.ai/docs/guide" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;well-lit paths&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. We aim to provide a centralized mechanism for sophisticated traffic orchestration — handling request prioritization, queuing, and multi-model routing before the workload ever reaches the serving-layer runtime.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Whether you are deploying massive MoE models or architecting the next generation of reasoning agents, this stack provides the exascale foundation required to turn frontier research into production reality. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Get started today&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud, we’re committed to providing the most open, flexible, and performant infrastructure for your AI workloads. With full support for the NVIDIA Dynamo suite — from intelligent routing and scaling to the latest NVIDIA AI infrastructure — we provide a complete, production-ready solution for serving LLMs at scale.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We updated our deployment repository with two specific recipes for the A4X machine class: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/blob/main/inference/a4x/disaggregated-serving/dynamo/README.md#32-sglang-deployment-with-deepep-72-gpus" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Recipe for throughput optimized&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; - 72 GPUs with DeepEP&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/blob/main/inference/a4x/disaggregated-serving/dynamo/README.md#sglang-wo-deepep" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Recipe for latency optimized&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; - 8 GPUs without DeepEP&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We look forward to seeing what you build!&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 22 Jan 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/scaling-moe-inference-with-nvidia-dynamo-on-google-cloud-a4x/</guid><category>AI &amp; Machine Learning</category><category>AI Hypercomputer</category><category>GKE</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Scaling WideEP Mixture-of-Experts inference with Google Cloud A4X (GB200) and NVIDIA Dynamo</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/scaling-moe-inference-with-nvidia-dynamo-on-google-cloud-a4x/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Sean Horgan</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ling Lin</name><title>Software Engineer</title><department></department><company></company></author></item><item><title>Simplify VM OS agent management at scale: Introducing VM Extensions Manager</title><link>https://cloud.google.com/blog/products/compute/introducing-vm-extensions-manager/</link><description>&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_d395npc.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you're an IT administrator, you know that managing Operating System (OS) agents (Google calls them extensions) across a large fleet of VM instances can be complex and frustrating. Indeed, this operational overhead can be a major barrier to adopting extension-based services on VM fleets, despite the fact that they unlock powerful application-level capabilities.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To solve this problem, we’re excited to announce the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;preview of&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;VM Extensions Manager&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, a new capability integrated directly into the Compute Engine API that simplifies installing and managing these Google-provided extensions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;VM Extensions Manager provides a centralized, policy-driven framework for managing the entire lifecycle of Google Cloud extensions on your VM instances. Instead of relying on manual scripts, startup scripts, or other bespoke solutions, you can now define a policy to ensure all your VM instances — both existing and new — conform to that state, reducing operational overhead from months to hours.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;How to get started with VM Extensions Manager&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;VM Extensions Manager is integrated directly into the &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;compute.googleapis.com&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; API, meaning there are no new APIs to discover or enable. You can get started in minutes.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Define your extension policy&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;First, define a policy that specifies the desired state of your extensions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For the preview, you can create &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;zonal policies&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; at the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Project level&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This policy targets VM instances within a single, specific zone. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Over the coming months, we’ll expand support to include &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;global policies&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, as well as policies at the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Organization&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Folder levels&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This will allow you to build a flexible hierarchy of policies (using priorities) to manage your extension on your enterprise fleet from a single control plane.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can create this policy directly from the Google Cloud console: &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_2Dllyl3.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Demo of Creating VM Extension policy using Cloud Console&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/3_Bayaqjl.gif"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Select your extensions&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;In the policy, you select the Google Cloud extensions you want to manage. For the preview, VM Extensions Manager supports several critical Google Cloud extensions, including:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/logging/docs/agent/ops-agent/agent-vmem-policies"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Ops Agent&lt;/strong&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;(ops-agent): The Ops Agent is the primary agent for collecting telemetry from your Compute Engine instances.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/workload-manager/docs/evaluate/set-up-agent-for-sap"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Agent for SAP&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (sap-extension): Google Cloud's Agent for SAP is provided by Google Cloud for the support and monitoring of SAP workloads running on Compute Engine instances and Bare Metal Solution servers.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/agent-for-compute-workloads"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Agent for Compute Workload&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (workload-extension): The Agent for Compute Workloads lets you monitor and evaluate workloads running on Compute Engine.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We'll be adding support for more extension-based services in the coming months.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can choose to pin a specific extension version or, keep it empty (the default) to get the latest extension installed. If you choose the default, VM Extensions Manager automatically handles the rollout of new versions as they are released — no more waiting to access new features and improvements.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Roll out global policy with more control&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;VM Extensions Manager gives you control over how global policy changes are deployed across many zones with rollout speeds. Zonal policies don't offer rollout speeds; they are enforced instantaneously when the VMs are online.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In coming weeks, we will expand support for global policy via gcloud first and update the documentation with relevant information. UI updates will follow in coming months. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At preview, however, global policy lets you select two distinct rollout speeds:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;SLOW&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt; (Recommended):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This is the default option, designed for safety. It orchestrates a zone-by-zone rollout (within the scope of the policy) with a built-in wait time between waves, minimizing the potential blast radius of a problematic change over a period of time, by default 5 days. This is perfect for standard maintenance and updates.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;FAST&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This option eliminates the wait time between waves, executing the change across the entire fleet across zones as quickly as possible. It is intended for urgent use cases, such as deploying a critical security patch in a "break-glass" emergency scenario across all VMs in all zones.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
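&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the three steps above concrete, here is a minimal command-line sketch of a policy definition. The command group, flag names, and version value below are illustrative assumptions rather than the documented VM Extensions Manager surface; consult the documentation for the real syntax.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Hypothetical sketch only: the command group and flags are assumptions.

# Steps 1-2: a zonal, project-level policy that manages the Ops Agent
# (pinned to an example version) and the Agent for Compute Workloads
# (version left empty, so the latest release is installed automatically).
gcloud compute extension-policies create my-fleet-policy \
    --project=my-project \
    --zone=us-central1-a \
    --extensions=ops-agent=2.54.0,workload-extension=

# Step 3 (global policies, once available): choose a rollout speed.
# SLOW staggers the change zone by zone; FAST applies it everywhere at once.
gcloud compute extension-policies update my-fleet-policy \
    --rollout-speed=SLOW
&lt;/pre&gt;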
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once you save the policy, VM Extensions Manager takes over. The underlying progressive rollout engine manages the complex orchestration, and you can monitor its progress.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A flexible system for standardization and control&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;VM Extensions Manager is designed to bring standardization and control to extensions on your VM fleets. You can start today by applying zonal policies to your projects to ensure extensions are correctly installed on VM instances in the correct zones.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get started defining Extension policies for your Compute Engine VM instances, read the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/vm-extensions/about-vm-extension-manager"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to create your first policy. We're excited to see how you use VM Extensions Manager to standardize, secure, and simplify the management of your VM fleet.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 05 Jan 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/introducing-vm-extensions-manager/</guid><category>Management Tools</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Simplify VM OS agent management at scale: Introducing VM Extensions Manager</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/introducing-vm-extensions-manager/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Omkar Suram</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mike Columbus</name><title>CE Director, Northam Platform Specialists</title><department></department><company></company></author></item><item><title>Automate AI and HPC clusters with Cluster Director, now generally available</title><link>https://cloud.google.com/blog/products/compute/cluster-director-is-now-generally-available/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The complexity of the infrastructure behind AI training and high performance computing (HPC) workloads can really slow teams down. At Google Cloud, where we work with some of the world’s largest AI research teams, we see it everywhere we go: researchers hampered by complex configuration files, platform teams struggling to manage GPUs with home-grown scripts, and operational leads battling the constant, unpredictable hardware failures that derail multi-week training runs. Access to raw compute isn't enough. To operate at the cutting edge, you need &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;reliability&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; that survives hardware failures, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;orchestration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; that respects topology, and a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;lifecycle&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; management strategy that adapts to evolving needs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are delivering on those requirements with the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;General Availability (GA) of &lt;/strong&gt;&lt;a href="https://cloud.google.com/products/cluster-director" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Director&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Preview&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; of Cluster Director support for Slurm on &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine?utm_source=google&amp;amp;utm_medium=cpc&amp;amp;utm_campaign=na-CA-all-en-dr-bkws-all-all-trial-e-dr-1710134&amp;amp;utm_content=text-ad-none-any-DEV_c-CRE_772382725406-ADGP_Hybrid+%7C+BKWS+-+EXA+%7C+Txt-AppMod-GKE-Kubernetes+Engine-KWID_335784956140-kwd-335784956140&amp;amp;utm_term=KW_kubernetes+google-ST_kubernetes+google&amp;amp;gclsrc=aw.ds&amp;amp;gad_source=1&amp;amp;gad_campaignid=22976548925&amp;amp;gclid=Cj0KCQiAgP_JBhD-ARIsANpEMxxNCV54Smw89kgAplcXoolCw8LdVBSA9buRDhHT_4QlTybV4LZoqKIaAqJcEALw_wcB&amp;amp;e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Kubernetes Engine (GKE)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director (GA) is a managed infrastructure service designed to meet the rigorous demands of modern supercomputing. It replaces fragile DIY tooling with a robust topology-aware control plane that handles the entire lifecycle of Slurm clusters, from the first deployment to the thousandth training run. &lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;We are expanding Cluster Director to support Slurm on GKE (Preview), designed to give you the best of both worlds: the familiar precision of high-performance scheduling and the automated scale of Kubernetes. It achieves this by treating GKE node pools as a direct compute resource for your Slurm cluster, allowing you to scale your workloads with Kubernetes' power without changing your existing Slurm workflows.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director, now GA&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director offers advanced capabilities at each phase of the cluster lifecycle, spanning preparation (Day 0), where infrastructure design and capacity are determined; deployment (Day 1), where the cluster is automatically deployed and configured; and monitoring (Day 2), where performance, health, and optimization are continuously tracked.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This holistic approach ensures that you get the benefits of fully configurable infrastructure while automating lower-level operations so your compute resources are always optimized, reliable, and available. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;So, what does all this cost? That’s the best part. There's no extra charge to use Cluster Director. You only pay for the underlying Google Cloud resources — your compute, storage, and networking.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;How Cluster Director supports each phase of deployment&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Day 0: Preparation &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Standing up a cluster typically involves weeks of planning, wrangling Terraform, and debugging the network. Cluster Director changes the ‘Day 0’ experience entirely, with tools for designing infrastructure topology that’s optimized for your workload requirements. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/1_gBjYYUA.gif"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To streamline your Day 0 setup, Cluster Director provides:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Reference architectures:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We’ve codified Google’s internal best practices into reusable cluster templates, enabling you to spin up standardized, validated clusters in minutes. This helps ensure that every team in your organization is using the same security standards for their deployments and deploying on infrastructure that is configured correctly by default — right down to the network topology and storage mounting. &lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Guided configuration:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We know that having too many options can lead to configuration paralysis. The Cluster Director control plane guides you through a&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;streamlined setup flow. You select your resources, and our system handles the complex backend mapping, ensuring that storage tiers, network fabrics, and compute shapes are compatible and optimized before you deploy.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Broad hardware support:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Cluster Director offers &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/cluster-director/docs/compute"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;full support&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for large-scale AI systems, including Google Cloud’s &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;A4X and A4X Max VMs powered by &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;NVIDIA GB200 and GB300 GPUs, and versatile CPUs such as &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;N2 VMs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for cost-effective login nodes and debugging partitions.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Flexible consumption options:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Cluster Director integrates with your preferred procurement strategy, with support for &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/reservations-overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Reservations&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for guaranteed capacity during critical training runs, &lt;/span&gt;&lt;a href="https://cloud.google.com/products/dws/pricing?e=48754805&amp;amp;hl=en"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Workload Scheduler&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; Flex-start &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;for dynamic scaling, or &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/spot-vms?e=48754805&amp;amp;hl=en"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Spot VMs&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for opportunistic low-cost runs.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
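&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director wires these consumption options up for you during guided configuration, but the same choices map onto familiar Compute Engine provisioning concepts. The stand-alone sketch below illustrates two of them with plain gcloud compute commands; the instance names, zone, machine types, and reservation name are placeholders, and Dynamic Workload Scheduler Flex-start is requested through its own provisioning flow rather than these flags.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Guaranteed capacity: target a specific reservation for a critical run.
gcloud compute instances create training-node-1 \
    --zone=us-central1-a \
    --machine-type=n2-standard-8 \
    --reservation-affinity=specific \
    --reservation=my-training-reservation

# Opportunistic low-cost capacity: Spot VMs for interruptible runs.
gcloud compute instances create dev-node-1 \
    --zone=us-central1-a \
    --machine-type=n2-standard-8 \
    --provisioning-model=SPOT
&lt;/pre&gt;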
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Google Cloud's Cluster Director is optimized for managing large-scale AI and HPC environments. It complements the power and performance of NVIDIA's accelerated computing platform. Together, we're providing customers with a simplified, powerful, and scalable solution to tackle the next generation of computing challenges.&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Dave Salvator, Director of Accelerated Computing Products, NVIDIA&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Day 1: Deployment &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Deploying hardware is one thing, but maximizing performance is another thing entirely. Day 1 is the execution phase, where your configuration transforms into a fully operational cluster. The good news is that Cluster Director doesn't just provision VMs, it validates that your software and hardware components are healthy, properly networked, and ready to accept the first workload.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_MyVTseY.gif"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To ensure a high-performance deployment, Cluster Director automates:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Getting a clean "bill of health":&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Before your job ever touches a GPU, Cluster Director runs a rigorous suite of health checks, including &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;DCGMI&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; diagnostics and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NCCL&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; performance validation, to verify the integrity of the network, storage, and accelerators.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Keeping accelerators fed with data:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Storage throughput is often the silent killer of training efficiency. That’s why Cluster Director fully supports Google Cloud Managed Lustre with selectable performance tiers, allowing you to attach high-throughput parallel storage directly to your compute nodes, so your GPUs are never starved for data.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Maximizing Interconnect Performance:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To achieve peak scaling, Cluster Director implements topology-aware scheduling and compact placement policies. By utilizing dense reservations on Google’s non-blocking fabric, the system ensures that your distributed workloads are placed on the shortest physical path possible, minimizing tail latency and maximizing collective communication (NCCL) speeds from the get-go.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
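&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For readers who want to reproduce a "bill of health" by hand, the checks Cluster Director automates correspond to well-known tools: DCGM's diagnostic suite and the NCCL performance tests. A minimal sketch, assuming DCGM and the nccl-tests binaries are installed on a GPU node:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# GPU diagnostics: run DCGM's extended (level 3) diagnostic on the node.
dcgmi diag -r 3

# Collective-communication validation: sweep all-reduce sizes from 8 B
# to 8 GB across all 8 local GPUs and report bus bandwidth (nccl-tests).
./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8
&lt;/pre&gt;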
&lt;p style="padding-left: 40px;"&gt;&lt;span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Cluster Director is an amazing product, which has enabled me to spin up a ready to use Nvidia GPU cluster with Slurm, including all networking, routing, and high performance network file-system for large-scale distributed model training within less than an hour. The cluster was immediately ready to run our containerizedAI training workloads with excellent throughput with only minimal customization effort."&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - Dr. Florian Eyben, Head of AI Foundation Models &amp;amp; Speech Technology, Agile Robots SE, Munich, Germany&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Day 2: Monitoring&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The reality of AI and HPC infrastructure is that hardware fails and requirements change. A rigid cluster is an inefficient cluster. As you move into the ongoing “Day 2” operational phase, you need to maintain cluster health, maximize utilization and performance. Cluster Director provides a control plane equipped for the complexities of long-term operations. Today we are introducing new &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;active cluster management&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; capabilities to handle the messy reality of Day 2 operations.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/3_VSuBKiw.gif"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;New active cluster management capabilities include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Topology-level visibility:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; You can’t orchestrate what you can’t see. Cluster Director’s observability graphs and topology grids let you visualize your entire fleet, spot thermal throttles or interconnect issues, and optimize job placement based on physical proximity.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;One-click remediation:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; When a node degrades, you shouldn't have to SSH in to debug it. Cluster Director allows you to replace faulty nodes with a single click directly from the Google Cloud console. The system handles the draining, teardown, and replacement, returning your cluster to full capacity in minutes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Adaptive infrastructure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; When your research needs change, so should your cluster. You can now modify active clusters, with activities such as adding or removing storage filesystems, on the fly, without tearing down the cluster or interrupting ongoing work.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director support for Slurm on GKE, now in preview&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Innovation thrives in the open. Google, the creator of Kubernetes, and SchedMD, the developers behind Slurm, have long championed the open-source technologies that power the world's most advanced computing. For years, NVIDIA and SchedMD have worked in lockstep to optimize GPU scheduling, introducing foundational features like the Generic Resource (GRES) framework and Multi-Instance GPU (MIG) support that are essential for modern AI. By acquiring SchedMD, NVIDIA is doubling down on its commitment to Slurm as a vendor-neutral standard, ensuring that the software powering the world's fastest supercomputers remains open, performant, and perfectly tuned for the future of accelerated computing.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on this foundation of accelerated computing, Google is deepening its collaboration with SchedMD to answer a fundamental industry challenge: how to bridge the gap between cloud-native orchestration and high-performance scheduling. We are excited to announce the Preview of Cluster Director support for Slurm on GKE, utilizing SchedMD’s Slinky offering.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This initiative brings together the two standards of the infrastructure world. By running a native Slurm cluster directly on top of GKE, we are amplifying the strengths of both communities:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Researchers &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;get the uncompromised Slurm interface and batch capabilities, such as &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sbatch&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;squeue&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, that have defined HPC for decades.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Platform teams&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; gain the operational velocity that GKE, with its auto-scaling, self-healing, and bin-packing, brings to the table.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
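&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The practical payoff is that existing batch scripts carry over unchanged. Below is a minimal example of the kind of job a researcher would submit with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sbatch&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;; the resource shape and training command are placeholders, not specific to Slurm on GKE:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --nodes=2                  # backed by GKE node pools under Slinky
#SBATCH --gpus-per-node=8
#SBATCH --time=04:00:00
#SBATCH --output=%x-%j.out

# One launcher task per node; srun places tasks across the allocation.
srun --ntasks-per-node=1 python train.py --config config.yaml
&lt;/pre&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Submit with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sbatch train.sh&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and monitor with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;squeue&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, exactly as on any other Slurm cluster.&lt;/span&gt;&lt;/p&gt;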
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Slurm on GKE is strengthened by our long-standing partnership with SchedMD, which helps create a unified, open, and powerful foundation for the next generation of AI and HPC workloads. &lt;/span&gt;&lt;a href="https://forms.gle/LaV116jNy2CvAnNV8" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Request preview access now&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Try Cluster Director today&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to start using Cluster Director for your AI and HPC cluster automation? &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Learn more about the end-to-end capabilities in &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/cluster-director/docs"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style="vertical-align: baseline;"&gt;Activate &lt;/span&gt;&lt;a href="http://console.cloud.google.com/cluster-director"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Director&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the console.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><pubDate>Wed, 17 Dec 2025 18:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/cluster-director-is-now-generally-available/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Automate AI and HPC clusters with Cluster Director, now generally available</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/cluster-director-is-now-generally-available/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ilias Katsardis</name><title>Sr. Product Manager, Cluster Director, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Jason Monden</name><title>Group Product Manager, AI Infrastructure, Google Cloud</title><department></department><company></company></author></item><item><title>Google named a Leader in The Forrester Wave™: AI Infrastructure Solutions, Q4 2025</title><link>https://cloud.google.com/blog/products/compute/forrester-wave-ai-infrastructure-solutions-q4-2025-leader/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For most organizations, the question is no longer &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;if&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; they will use AI, but &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;how&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to scale it from a promising prototype into a production-grade service that drives business outcomes. In this age of inference, competitive advantage is defined by your ability to serve useful information to users around the world at the lowest possible cost. As you move from demos to production deployments at scale, you need to simplify infrastructure operations with integrated systems that provide the latest AI software and accelerator hardware platforms, while keeping costs and architectural complexity low. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Yesterday, Forrester released &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The Forrester Wave™: AI Infrastructure Solutions, Q4 2025&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; report, evaluating 13 vendors, and we believe their findings validate our commitment to solving these core challenges. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google received the highest score of all vendors in the Current Offering category &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;and received the highest possible score in 16 out of 19 evaluation criteria, including, but not limited to: Vision, Architecture, Training, Inferencing, Efficiency, and Security.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/resources/content/2025-forrester-wave-ai-infrastructure"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Access the full report&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The Forrester Wave™: AI Infrastructure Solutions, Q4 2025&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Accelerating time-to-value with an integrated system&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Enterprises don’t run AI in a vacuum. They need to integrate it with a diverse range of applications and databases while adhering to stringent security protocols. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Forrester recognized Google Cloud’s strategy of co-design by giving us the highest possible score in the Efficiency and Scalability criteria:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Google pursues a strategy of silicon-infrastructure co-design. It develops TPUs to improve inference efficiency and NVIDIA GPUs for access to broader ecosystem compatibility. Google designs TPUs to integrate tightly with its networking fabric, giving customers high bandwidth and low latency for inference at scale.”&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For over two decades, we have operated some of the world's largest services, from Google Search and YouTube to Maps, where their unprecedented scale required us to solve problems that no one else had. We couldn't simply buy the platform and infrastructure we needed; we had to invent it. This led to a decade-long journey of deep, system-level co-design, building everything from our custom network fabric and specialized accelerators to frontier models, all under one roof. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The result was an integrated supercomputing system, AI Hypercomputer, which has paid significant dividends for our customers. It supports a wide range of AI-optimized hardware, allowing you to optimize for granular, workload-level objectives — whether that's higher throughput, lower latency, faster time-to-results, or lower TCO. That means you can use our custom &lt;/span&gt;&lt;a href="https://cloud.google.com/tpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Tensor Processing Units&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (TPUs), the latest &lt;/span&gt;&lt;a href="https://cloud.google.com/gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NVIDIA GPUs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, or both, backed by a system that tightly integrates accelerators with networking and storage for exceptional performance and efficiency. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;It’s&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; also why today, leading generative AI companies such as Anthropic, Lightricks, and LG AI Research trust Google Cloud to power their most demanding AI workloads.&lt;sup&gt;1&lt;/sup&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This system-level integration lays the foundation for speed, but operational complexity could still slow you down. To accelerate your time-to-market, we provide multiple ways to deploy and manage AI infrastructure, abstracting away the heavy lifting regardless of your preferred workflow. Google Kubernetes Engine (GKE) Autopilot automates management for containerized applications, helping customers like LiveX.AI reduce operational costs by 66%. Similarly, Cluster Director simplifies deployment for Slurm-based environments, enabling customers like LG AI Research to slash setup time from 10 days to under one day. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Managing AI cost and complexity&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Forrester gave Google Cloud the highest scores possible in the Pricing Flexibility and Transparency criterion. The price of compute is only one part of the AI infrastructure cost equation. A complete view should also account for development costs, downtime and inefficient resource utilization. We offer optionality at every layer of the stack to provide the flexibility businesses demand.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Flexible consumption:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Dynamic Workload Scheduler allows you to secure compute at up to 50% savings, by ensuring you only pay for the capacity you need, when you need it.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Load balancing&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: GKE Inference Gateway improves throughput by using AI-aware routing to balance requests across models, preventing bottlenecks and ensuring servers aren't sitting idle.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Eliminating data bottlenecks&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Anywhere Cache co-locates data with compute, reducing read latency by up to 96% and eliminating the "integration tax" of moving data. By using Anywhere Cache together with our unified data platform BigQuery, you can avoid latency and egress fees while keeping your accelerators fed with data. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Mitigating strategic risk through flexibility and choice&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are also committed to enabling customer choice across accelerators, frameworks and multicloud environments. This isn’t new for us. Our deep experience with Kubernetes, which we developed then open-sourced, taught us that open ecosystems are the fastest path to innovation and provide our customers with the most flexibility. We are bringing that same ethos to the AI era by actively contributing to the tools you already use.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Open source frameworks and hardware portability:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We continue to support open frameworks such as PyTorch, JAX, and Keras. We’ve also directly addressed concerns about workload portability on custom silicon by investing in TPU support for vLLM, allowing developers to easily switch between TPUs and GPUs (or use both) with only minimal configuration changes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hybrid and multicloud flexibility:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Our commitment to choice extends to where you run your applications. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Distributed Cloud&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; brings our services to on-premises, edge and cloud locations, while &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Cross-Cloud Network&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; securely connects applications and users with high-speed connectivity between your environments and other clouds. This powerful combination means you're no longer locked into a specific environment; you can easily migrate workloads and apply uniform management practices, streamlining operations, and mitigating the risk of lock-in.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Systems you can rely on&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When your entire business model depends on the availability of AI services, infrastructure uptime is critical. Google Cloud's global infrastructure is engineered for enterprise-grade reliability, an approach rooted in our history as the birthplace of Site Reliability Engineering (SRE).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We operate one of the world's largest private software-defined networks, handling approximately 25% of global internet egress traffic. Unlike providers that rely on the public internet, we keep your traffic on Google’s own fiber to improve speed, reliability, and latency. This global backbone is powered by our Jupiter data center fabric, which scales to 13 Petabits/sec of bandwidth, delivering 50x greater reliability than previous generations — to say nothing of other providers. Finally, to improve cluster-level fault tolerance, we employ capabilities like &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/elastic-training-and-optimized-checkpointing-improve-ml-goodput"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;elastic training and multi-tier checkpointing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which allow jobs to continue uninterrupted, by dynamically resizing the cluster around failed nodes while minimizing the time to recovery.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Building on a secure foundation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our approach is to secure AI from the ground up. In fact, Google Cloud maintains a leading track record for cloud security. Independent analysis from cloudvulndb.org (2024-2025) shows that our platform has up to 70% fewer critical and high vulnerabilities compared to the other two leading cloud providers. We were also the first in the industry to publish an AI/ML Privacy Commitment, which guarantees that we do not use your data to train our models. With those safeguards in place, security is integrated into the foundation of Google Cloud, based on the zero-trust principles that protect Google’s own services:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;A hardware root of trust:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Our custom Titan chips, as part of our Titanium architecture, create a verifiable hardware root of trust. We recently extended this with Titanium Intelligence Enclaves for &lt;/span&gt;&lt;a href="https://blog.google/technology/ai/google-private-ai-compute/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Private AI Compute&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, allowing you to process sensitive data in a hardened, isolated, and encrypted environment.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Built-in AI security:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/security-command-center"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Security Command Center (SCC)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; natively integrates with our infrastructure, providing &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/introducing-ai-protection-security-for-the-ai-era"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Protection&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; by automatically discovering assets, preventing security issues, detecting active threats with frontline &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/threat-intelligence"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Threat Intelligence&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and discovering known and unknown risks before attackers can exploit them.  &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Sovereign solutions:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We enable you to meet stringent data residency, operational control, and software sovereignty requirements through solutions like &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Data Boundary&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This is complemented by flexible options like partner-operated sovereign controls and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Distributed Cloud&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for air-gapped needs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Platform controls for AI and agent governance: &lt;/strong&gt;&lt;a href="https://cloud.google.com/vertex-ai"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides the essential governance layer for the enterprise builder to deploy models and agents at scale. This trust is anchored in Google Cloud’s secure-by-default infrastructure, utilizing platform controls like &lt;/span&gt;&lt;a href="https://cloud.google.com/security/vpc-service-controls"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;VPC Service Controls (VPC-SC)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kms/docs/cmek"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Customer-Managed Encryption Keys (CMEK)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to sandbox environments and protect sensitive data, and Agent Identity for granular IAM permissions. At the platform level, Vertex AI and &lt;/span&gt;&lt;a href="https://cloud.google.com/products/agent-builder"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Builder&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; integrate &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/model-armor"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Model Armor&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to provide runtime protection against emergent agentic threats, such as prompt injection and data exfiltration. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
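&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the CMEK control above concrete, here is a minimal sketch using the Vertex AI Python SDK. The project, region, key ring, key, and bucket names are placeholder assumptions; resources created after the init call inherit the customer-managed key.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Minimal sketch: applying a customer-managed encryption key (CMEK) to
# Vertex AI resources. Project, key, and bucket names are placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    # Resources created in this session are encrypted with this key.
    encryption_spec_key_name=(
        "projects/my-project/locations/us-central1/"
        "keyRings/my-ring/cryptoKeys/my-key"
    ),
)

# Example: a dataset created now inherits the key automatically.
dataset = aiplatform.TabularDataset.create(
    display_name="governed-dataset",
    gcs_source="gs://my-bucket/data.csv",
)
&lt;/pre&gt;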
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Delivering continuous AI innovation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are honored to be recognized as a Leader in The Forrester Wave™ report, which we believe validates decades of R&amp;amp;D and our approach to building ultra-scale AI infrastructure. Look to us to continue on this path of system-level innovation as we help you convert the promise of AI into a reality.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Access the full report:&lt;/strong&gt; &lt;a href="https://cloud.google.com/resources/content/2025-forrester-wave-ai-infrastructure"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;The Forrester Wave™: AI Infrastructure Solutions, Q4 2025&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;1. IDC Business Value Snapshot, Sponsored by Google Cloud, The Business Value of Google Cloud AI Hypercomputer, US53855425, October 2025&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 17 Dec 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/forrester-wave-ai-infrastructure-solutions-q4-2025-leader/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Google named a Leader in The Forrester Wave™: AI Infrastructure Solutions, Q4 2025</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/forrester-wave-ai-infrastructure-solutions-q4-2025-leader/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Lohmeyer</name><title>VP and GM, AI and Computing Infrastructure</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Saurabh Tiwary</name><title>VP &amp; GM, Cloud AI</title><department></department><company></company></author></item><item><title>AI agents are here. Is your infrastructure ready?</title><link>https://cloud.google.com/blog/products/compute/idc-on-the-ai-efficiency-gap/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;&lt;strong&gt;Editor’s note&lt;/strong&gt;: Today we hear from Dave McCarthy of IDC about a total cost of ownership crisis for AI infrastructure — and what you can do about it. Read on for his insights.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The AI landscape is undergoing a seismic shift. For the past few years, the industry has been focused on the massive, resource-intensive process of training generative AI models. But the focus is now rapidly pivoting to a new, even larger challenge: inference.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Inference — the process of using a trained model to make real-time predictions — is no longer just one part of the AI lifecycle; it is quickly becoming the dominant workload. In a recent IDC global survey of over 1,300 AI decision-makers, inference was already cited as the largest AI workload segment, accounting for 47% of all AI operations.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This dominance is driven by the sheer volume of real-world applications. While a model is trained periodically, it is used for inference non-stop, with every user query, API call, and recommendation. It is also critical to recognize that this inference surge will be distributed across hybrid environments. According to IDC survey respondents, 63% of workloads will reside in the cloud, which remains the standard for scalable applications like content creation and chatbots. In contrast, 37% will be deployed on on-premises infrastructure, usually related to use cases such as robotics and other systems that interact directly with the physical world.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, a new factor is set to multiply this demand: the rise of autonomous and semi-autonomous AI agents.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These "agentic workflows" represent the next logical step in AI, where models don't just respond to a single prompt but execute complex, multi-step tasks. An AI agent might be asked to "plan a trip to Paris," requiring it to perform dozens of interconnected operations: browsing for flights, checking hotel availability, comparing reviews, and mapping locations. Each of these steps is an inference operation, creating a cascade of requests that must be orchestrated across different systems.&lt;/span&gt;&lt;/p&gt;
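&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The multiplication effect is easy to see in miniature. The following is a deliberately toy sketch, not a real agent framework: call_model stands in for any inference endpoint, and the step and phase names simply mirror the trip-planning example above.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Hypothetical sketch: one agentic request fans out into many inference calls.
# call_model() stands in for any hosted-model endpoint; names are illustrative.

def call_model(prompt):
    """Placeholder for a single inference operation (one model call)."""
    return f"model output for: {prompt}"

def plan_trip(destination):
    steps = [
        f"browse flights to {destination}",
        f"check hotel availability in {destination}",
        f"compare reviews for hotels in {destination}",
        f"map locations near {destination}",
    ]
    results = []
    inference_calls = 0
    for step in steps:
        # Each sub-task may itself need several calls (tool choice,
        # extraction, summarization), so the fan-out compounds.
        for phase in ("decide tool", "extract results", "summarize"):
            results.append(call_model(f"{phase}: {step}"))
            inference_calls += 1
    print(f"1 user request became {inference_calls} inference operations")
    return results

plan_trip("Paris")  # prints: 1 user request became 12 inference operations
&lt;/pre&gt;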
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This surge in demand is exposing a critical vulnerability for many organizations: the AI efficiency gap.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The TCO crisis in an age of agents&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The AI efficiency gap is the difference between the theoretical performance of an AI stack and the actual, real-world performance achieved. This gap is the source of a Total Cost of Ownership (TCO) crisis, and it’s driven by system-wide inefficiencies.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our research shows that more than half (54.3%) of organizations use multiple AI frameworks and hardware platforms. While this flexibility seems beneficial, it has a staggering downside: 92% of these organizations report a negative effect on efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This fragmented "patchwork" approach, stitched together from disparate and non-optimized services, creates a ripple effect of problems:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;41.6% reported increased compute costs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Redundant processes and poor utilization drive up spending.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;40.4% reported increased engineering complexity&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Teams spend more time managing the fragmented stack than delivering value.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;40.0% reported increased latency&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Bottlenecks in one part of the system (like storage or networking) degrade the overall performance of an application.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The core problem is that organizations are paying for expensive, high-performance accelerators, but are failing to keep them busy. Our data shows that 29% of all AI budget waste is tied to inference. This waste is a direct result of idle GPU time (cited by 29.4% of respondents) and inefficient use of resources (22.3%).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When an expensive accelerator is idle, it’s often waiting for data from a slow storage system or for the application server to prepare the next request. This is a system-level failure, not a component failure.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This failure is often compounded by significant hurdles in data management, which serves as the fuel for these AI engines. Survey respondents highlighted three primary challenges contributing to this gap: 47.7% struggle with ensuring data quality and governance, 45.6% grapple with data storage management and related costs, and 44.1% cite the complexity and time required for data cleaning and preparation. When data pipelines cannot keep pace with high-speed accelerators, the entire infrastructure becomes inefficient.&lt;/span&gt;&lt;/p&gt;
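&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A back-of-the-envelope calculation shows why the idle time described above dominates unit economics. The hourly rate and throughput figures below are illustrative assumptions, not survey data; the point is how quickly cost per token grows as the busy fraction falls.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Back-of-the-envelope sketch of how idle time inflates inference unit cost.
# The hourly rate and utilization figures below are illustrative assumptions.

hourly_rate = 40.0          # assumed cost of one accelerator VM, $/hour
tokens_per_second = 5000    # assumed serving throughput when busy

def cost_per_million_tokens(utilization):
    """Effective $ per 1M tokens at a given busy fraction (0..1)."""
    busy_seconds_per_hour = 3600 * utilization
    tokens_per_hour = tokens_per_second * busy_seconds_per_hour
    return hourly_rate / (tokens_per_hour / 1_000_000)

for util in (0.9, 0.5, 0.3):
    print(f"{util:.0%} utilized: ${cost_per_million_tokens(util):.3f} per 1M tokens")

# At 30% utilization the same hardware costs 3x more per token than at 90%,
# which is the system-level waste the survey respondents describe.
&lt;/pre&gt;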
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Closing the gap: From fragmented stacks to integrated systems&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To scale cost-effectively in the age of AI agents, we must stop thinking about individual components and start focusing on system-level design.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;An agentic workflow, for example, requires tight coordination between two distinct types of compute:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;General-purpose compute&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This is the operational backbone. It runs the application servers, orchestrates the workflow, pre-processes data, and handles all the logic around the model.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Specialized accelerators&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This is the high-performance engine that runs the AI model itself.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a fragmented environment, these two sides are inefficiently connected, and latency skyrockets. The path forward is an optimized architecture where the software, networking, storage, and compute — both general-purpose and specialized — are designed to work as a single, cohesive system.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This holistic approach is the only sustainable way to manage the TCO of AI. It redefines the goal away from simply buying faster accelerators and toward improving the overall "price-performance" and "unit economics" of the entire end-to-end workflow. By eliminating bottlenecks and maximizing the utilization of every resource, organizations can finally close the efficiency gap.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Organizations are actively shifting strategies to capture this value. Our survey indicates that 28.9% of respondents are prioritizing model optimization techniques, while 26.3% are partnering with AI service providers to navigate this complexity. Additionally, 25% are investing in training to upskill their teams, ensuring they can increase the value of their AI investments.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The age of inference is here, and the age of agents is right behind it. This next wave of innovation will be won not by the organizations with the most powerful accelerators, but by those who build the most efficient, integrated, and cost-effective systems to power them.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A message from Google Cloud&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;We sponsored this IDC research to help IT leaders navigate the critical shift to the "Age of Inference." We recognize that the "efficiency gap" identified here — driven by fragmented stacks and idle resources — is the primary barrier to sustainable ROI. That is why we created AI Hypercomputer: an integrated supercomputer system designed to deliver exceptional performance and efficiency for demanding AI workloads. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;IDC surveyed 1,300 global IT leaders to uncover how they are designing their stack for maximum efficiency and ROI. Get your free copy of the whitepaper to learn more: &lt;/span&gt;&lt;a href="https://cloud.google.com/resources/content/ai-efficiency-gap"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;The AI Efficiency Gap: From TCO Crisis to Optimized Cost and Performance&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 11 Dec 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/idc-on-the-ai-efficiency-gap/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>AI agents are here. Is your infrastructure ready?</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/idc-on-the-ai-efficiency-gap/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Dave McCarthy</name><title>Research Vice President, Cloud and Edge Infrastructure Services, IDC</title><department></department><company></company></author></item><item><title>Nutanix NC2 is now officially supported on Google Cloud</title><link>https://cloud.google.com/blog/topics/partners/nutanix-nc2-generally-available-google-cloud/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are thrilled to announce Nutanix Cloud Clusters (NC2) is generally available on Google Cloud.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;NC2 on Google Cloud is designed to help you migrate and modernize specialized, regulated, and mission-critical applications without refactoring your workloads or compromising on performance. This partnership brings the power of Google Cloud’s infrastructure and advanced AI models to your hybrid cloud while preserving data residency, connectivity, and operational consistency. You can now run your Nutanix Hybrid Cloud directly on &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/bare-metal-instances"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Compute Engine&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"The General Availability of Nutanix Cloud Clusters (NC2) on Google Cloud is a significant milestone empowering our joint customers to become AI-ready. We are excited to extend the simplicity and resilience of Nutanix NC2 onto Google Cloud's high-performance workload-optimized compute. Nutanix on Google Cloud enables our customers to migrate and modernize their critical workloads while unlocking the full power of Google’s industry-leading data and AI capabilities." &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- Saveen Pakala, VP, Product Management, Hybrid Cloud, Nutanix&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Nutanix and Google Cloud allow you to maximize agility and minimize disruption for your critical applications. By combining NC2’s enterprise flexibility with Google Cloud’s power, you gain three core advantages. First, your workloads run on Compute Engine’s dynamically scalable, workload-optimized infrastructure. Nutanix NC2 supports Compute Engine bare metal instances in the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/storage-optimized-machines#z3_machine_types"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Z3&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/general-purpose-machines#c4_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; families. These are powered by the &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium offload system&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and leverage &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/disks/local-ssd"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium SSDs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for low-latency, high-throughput storage performance, hosted in Google Cloud with global reach, enterprise-grade security, and a commitment to sustainability. Second, you accelerate AI innovation by co-locating your data with machine learning services like Gemini Enterprise and Vertex AI. Finally, you can save costs by dynamically scaling capacity and utilizing committed use discounts (CUDs) and Flex CUDs.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Key use cases to accelerate your cloud journey&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The integration of NC2 on Google Cloud offers flexible, strategic options for hybrid cloud operations. Beyond consolidation and cost control, these capabilities set the stage for true modernization:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Seamless workload migration: Move entire applications between your on-premises Nutanix environment and Google Cloud without re-factoring or re-architecting. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;This capability saves significant time during data center consolidation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Consistent operations: Maintain the &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;same management plane, security policies, and automation&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; across your private data center and Google Cloud, which dramatically reduces operational complexity and training costs.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Disaster recovery (DR): Leverage Google Cloud as a robust and cost-efficient recovery target. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Usage of a minimal “pilot light” cluster reduces compute costs, so you scale up only when a disaster event occurs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Capacity bursting: Instantly add capacity in the cloud to handle seasonal demands, VDI workloads, development/test &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;cycles&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;, or requirements from mergers and acquisitions (M&amp;amp;A).&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;License portability: Protect your software investments by easily moving your existing &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Nutanix software licenses&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; to Google Cloud as your business needs evolve.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Like many others, we are always on a journey to modernize and shift to achieve the best outcomes for our customers. Nutanix Cloud Clusters (NC2) on Google Cloud brings us a solid platform to continue our hybrid cloud expansion. Our ability to seamlessly run workloads on-premises and on NC2 on Google Cloud without having to re-factor is increasingly valuable as we continue our modernization journey. We look forward to continuing our strong partnership with Google Cloud and Nutanix.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- VP of IT at a global oil &amp;amp; gas company based in Oklahoma&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The architecture &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;NC2 on Compute Engine simplifies building a hybrid cloud by deploying the Nutanix Cloud Infrastructure (NCI) software stack, including the Acropolis Hypervisor (AHV), directly onto high-performance Compute Engine infrastructure.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_dJgDPX1.max-1000x1000.png"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The key components of the solution include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Compute Engine instances:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; NC2 runs on &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/bare-metal-instances"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Compute Engine bare metal instances&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the recently introduced C4 and Z3 machine families.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;These powerful instances provide the foundation with high-density compute, memory, local NVMe storage, and high network bandwidth.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div align="center"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table&gt;&lt;colgroup&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Machine Family &lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;GCE Machine Type&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;vCPUs&lt;/strong&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;Memory&lt;/strong&gt;  &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;Storage&lt;/strong&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;Processor&lt;/strong&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Z3, Storage Optimized &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;z3-highmem-192-highlssd-metal&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;192&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;1,536 GB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;72 TB NVMe Local SSD&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Intel Sapphire Rapids&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4, General Purpose &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;c4-highmem-288-lssd-metal&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;288&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;1,080 GB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;18 TB NVMe Local SSD&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Intel Granite Rapids&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4, General Purpose &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;c4-standard-288-lssd-metal&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;288&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;2,232 GB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;18 TB NVMe Local SSD&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Intel Granite Rapids&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Simplified networking :&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; NC2 runs entirely within your existing Google Cloud Virtual Private Cloud (&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;VPC&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;). Built-in Nutanix Flow Virtual Networking for overlay is integrated to reduce hybrid cloud complexity. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Unified management:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The entire environment, both on-premises and in Google Cloud, is managed through the familiar&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Prism Central&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;console, simplifying day-to-day operations and skill requirements for your IT teams.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Easy procurement:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Later this month, you’ll be able to purchase Nutanix NC2 licensing directly from &lt;/span&gt;&lt;a href="https://cloud.google.com/marketplace?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Marketplace&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; . This offers a single, unified billing experience for both your Google Cloud infrastructure and Nutanix NC2, in one simple process. A key benefit is the ability to use your existing Google Cloud spend commitments for Nutanix NC2 software. This helps you maximize your investment and streamline your financial operations, providing more value from your cloud budget.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
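&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The shapes in the machine-family table above are ordinary Compute Engine machine types. NC2 clusters themselves are deployed through Nutanix tooling, but purely to illustrate the machine-type naming, the sketch below requests a Z3 bare metal instance with the Compute Engine Python client library. The project, zone, network, and boot image are placeholder assumptions; an actual NC2 node runs the Nutanix software stack, not a stock OS image.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Illustrative sketch only: requesting one of the bare metal shapes from the
# table with the Compute Engine Python client. Project, zone, network, and
# image are placeholders; actual NC2 nodes are provisioned by Nutanix tooling.
from google.cloud import compute_v1

project, zone = "my-project", "us-central1-a"

instance = compute_v1.Instance()
instance.name = "bare-metal-demo"
instance.machine_type = (
    f"zones/{zone}/machineTypes/z3-highmem-192-highlssd-metal"
)

boot_disk = compute_v1.AttachedDisk(
    boot=True,
    auto_delete=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_image="projects/debian-cloud/global/images/family/debian-12",
    ),
)
instance.disks = [boot_disk]
instance.network_interfaces = [
    compute_v1.NetworkInterface(network="global/networks/default")
]

operation = compute_v1.InstancesClient().insert(
    project=project, zone=zone, instance_resource=instance
)
operation.result()  # waits for the provisioning operation to finish
&lt;/pre&gt;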
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Connect your data to Google Cloud AI and analytics&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A significant modernization opportunity comes from connecting your stable, trusted Nutanix workloads with Google Cloud's powerful data and AI tools. Your applications running on NC2 can tap directly into services like &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;BigQuery&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with low latency, enabling you to:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Derive deeper business value:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Easily send application log data, transactional records, and other operational data from your Nutanix VMs to BigQuery for real-time, scalable data warehousing and complex analysis.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Build custom machine learning models:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Use Vertex&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;AI to create, deploy, and manage custom ML models that analyze data generated by your core applications (e.g., predictive maintenance or fraud detection).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Use conversational AI:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Quickly build and deploy conversational agents using technologies like Dialogflow that interact directly with the application data residing on your NC2 cluster.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
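&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As referenced above, here is a minimal sketch of shipping operational rows from an NC2-hosted application to BigQuery with the Python client library. The project, dataset, table, and field names are placeholder assumptions; the target table must already exist with a matching schema.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Minimal sketch: streaming operational rows from an NC2-hosted app into
# BigQuery. Project, dataset, table, and field names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # placeholder project ID
table_id = "my-project.app_logs.transactions"    # placeholder table

rows = [
    {"order_id": "A-1001", "amount_usd": 42.50, "status": "settled"},
    {"order_id": "A-1002", "amount_usd": 17.25, "status": "pending"},
]

# Streaming insert; the table must already exist with a matching schema.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("insert failed:", errors)
else:
    print("rows are queryable in BigQuery immediately")
&lt;/pre&gt;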
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Ready to simplify your cloud operations?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;NC2 on Google Cloud is currently available across 17 Google Cloud regions, with a planned expansion continuing through 2026. For precise details on regional and zonal availability, please check the official &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/bare-metal-instances#regions_zones"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Compute Engine bare metal regional availability&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; documentation, and reference the &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/all-pricing?e=48754805&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Compute Engine pricing page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for infrastructure costs. To learn more about the solution, try taking a &lt;/span&gt;&lt;a href="https://cloud.nutanixtestdrive.com/login?type=nc2gcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;test drive&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or visit the &lt;/span&gt;&lt;a href="https://cloud.google.com/find-a-partner/partner/nutanix-inc"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Nutanix partner page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Later this month, you will also be able to explore NC2 on Google Cloud licensing through the Google Cloud Marketplace.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 09 Dec 2025 14:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/partners/nutanix-nc2-generally-available-google-cloud/</guid><category>Compute</category><category>Infrastructure</category><category>Partners</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Nutanix NC2 is now officially supported on Google Cloud</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/partners/nutanix-nc2-generally-available-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yarden Halperin</name><title>Product Manager, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ziv Kalmanovich</name><title>Group Product Manager, Google Cloud</title><department></department><company></company></author></item><item><title>Running high-scale reinforcement learning (RL) for LLMs on GKE</title><link>https://cloud.google.com/blog/products/compute/run-high-scale-rl-for-llms-on-gke/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As Large Language Models (LLMs) evolve, Reinforcement Learning (RL) is becoming the crucial technique for aligning powerful models with human preferences and complex task objectives.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;However, enterprises that need to implement and scale RL for LLMs are facing infrastructure challenges. The primary hurdles include memory contention from concurrently hosting multiple large models (such as the actor, critic, reward, and reference models) and the iterative switching between high-latency inference generation and high-throughput training phases.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This blog details Google Cloud's full-stack, integrated approach, from custom TPU hardware to the GKE orchestration layer — and shares how you can solve the hybrid, high-stakes demands of RL at scale.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A quick primer: Reinforcement Learning (RL) for LLMs&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;RL is a continuous feedback loop that combines elements of both training and inference. At a high level, the RL loop for LLMs functions as follows:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The LLM generates a response to a given prompt.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;A "reward model" (often trained on human preferences) assigns a quantitative score, or reward, to the output.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;An RL algorithm (e.g., DPO, GRPO) uses this reward signal to update the LLM's parameters, adjusting its policy to generate higher-rewarding outputs in subsequent interactions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This cycle of generation, evaluation, and optimization continually improves the LLM's performance based on predefined objectives.&lt;/span&gt;&lt;/p&gt;
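&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The loop is easiest to see in miniature. The following toy sketch is a rough REINFORCE-style stand-in rather than DPO or GRPO: it replaces the LLM with a one-parameter Gaussian sampler and the reward model with a hand-written preference. Every name in it is illustrative.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Toy sketch of the loop described above: generate, score, update.
# The "policy" is one scalar (the mean of a Gaussian sampler) and the reward
# model is a hand-written preference; illustrative stand-ins, not DPO/GRPO.
import random

policy_mean = 0.0      # stands in for the LLM's parameters
TARGET = 3.0           # outputs near this value are "preferred"

def generate():
    # Step 1: the model samples a response.
    return random.gauss(policy_mean, 1.0)

def reward(response):
    # Step 2: a reward model scores the output (closer to TARGET is better).
    return -abs(response - TARGET)

baseline = -TARGET     # running average reward, used as a simple baseline

for step in range(2000):
    response = generate()
    r = reward(response)
    baseline = 0.99 * baseline + 0.01 * r
    advantage = r - baseline
    # Step 3: REINFORCE-style update for a Gaussian policy mean:
    # move toward samples that scored above the running baseline.
    policy_mean += 0.02 * advantage * (response - policy_mean)

print(f"policy mean after training: {policy_mean:.2f}  (target was {TARGET})")
&lt;/pre&gt;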
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;RL workloads are hybrid and cyclical. The main goal of RL is not to minimize error (as in training) or to produce fast predictions (as in inference), but to maximize reward through iterative interaction. The primary constraint for the RL workload is not just computational power, but also system-wide efficiency: specifically, minimizing aggregate sampler latency and maximizing the speed of weight copying for efficient end-to-end step time.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud's full-stack approach to RL&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Solving these system-wide challenges requires an integrated approach. You can't just have fast hardware or a good orchestrator; you need every layer of the stack to work together. Here is how our full-stack approach is built to solve the specific demands of RL:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Flexible, high-performance compute (TPUs and GPUs):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Instead of locking customers into one path, we provide two high-performance options. Our &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;TPU stack&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is a vertically integrated, JAX-native solution where our custom hardware (excelling at matrix operations) is co-designed with our post-training libraries (MaxText and Tunix). In parallel, we fully support the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;NVIDIA GPU ecosystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, partnering with NVIDIA on optimized NeMo RL recipes so customers can leverage their existing expertise directly on GKE.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Holistic, full-stack optimization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We integrate optimization from the bare metal up. This includes our custom TPU accelerators, high-throughput storage (Managed Lustre, Google Cloud Storage), and — critically — the orchestration and scheduling that GKE provides. By optimizing the entire stack, we can attack the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;system-wide&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; latencies that bottleneck hybrid RL workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Leadership in open-source:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; RL infrastructure is complex and built on a wide range of tools. Our leadership starts with open-sourcing Kubernetes and extends to active partnerships with orchestrators like Ray. We contribute to key projects like vLLM, develop open-source solutions like llm-d for cost-effective serving, and open-source our own high-performance MaxText and Tunix libraries. This helps ensure you can integrate the best tools for the job, not just the ones from a single vendor.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;4. Proven, mega-scale orchestration:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Post-training RL can require compute resources that rival pre-training. This requires an orchestration layer that can manage massive, distributed jobs as a single unit. GKE AI mega-clusters support up to 65,000 nodes today, and we are heavily investing in multi-cluster solutions like MultiKueue to scale RL workloads beyond the limits of a single cluster.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Running RL workloads on GKE&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Existing GKE infrastructure is well-suited for demanding RL workloads and provides several infrastructure-level efficiencies. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The image below outlines the architecture and key recommendations for implementing RL at scale. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_HnbQkXW.max-1000x1000.png"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="drc60"&gt;Figure : GKE infrastructure for running RL&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At the base, the infrastructure layer provides the foundational hardware, including supported compute types (CPUs, GPUs, and TPUs). You can use the Run:ai model streamer to accelerate the model streaming for all three compute types. High performance storage (Managed Lustre, Cloud Storage) can be used for storage needs for RL. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The middle layer is the managed Kubernetes layer powered by GKE, which handles resource orchestration, resource obtainability (using Spot VMs or Dynamic Workload Scheduler), autoscaling, placement, job queuing, job scheduling, and more at mega scale.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Finally, the open frameworks layer runs on top of GKE, providing the application and execution environment. This includes managed support for open-source tools such as KubeRay and Slurm, as well as the gVisor sandbox for secure, isolated task execution.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Building RL workflow&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before creating an RL workload, you must first identify a clear use case. With that objective defined, you then architect the core components: selecting the algorithm (e.g., DPO, GRPO), the model server (like vLLM or SGLang), the target GPU/TPU hardware, and other critical configurations.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Next, you can provision a GKE cluster configured with Workload Identity, Cloud Storage FUSE, and DCGM metrics. For robust batch processing, install the Kueue and JobSet APIs. We recommend deploying Ray as the orchestrator on top of this GKE stack. From there, you can launch the NeMo RL container, configure it for your GRPO job, and begin monitoring its execution; a minimal sketch of handing a queued Job to Kueue follows below. For the detailed implementation steps and source code, please refer to this &lt;/span&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/RL/a4/recipes/qwen2.5-1.5b/nemoRL" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
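&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here is that sketch, using the Kubernetes Python client on a cluster with Kueue installed. The queue name, container image, command, and GPU count are placeholder assumptions; an actual run would use the NeMo RL container and configuration from the repository linked above. The kueue.x-k8s.io/queue-name label is what routes the Job through Kueue's admission and quota machinery.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Minimal sketch: submitting a queued training Job on a GKE cluster that has
# Kueue installed. Queue name, image, command, and resources are placeholders.
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl credentials for the GKE cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(
        name="grpo-demo",
        # This label hands the Job to Kueue for admission and scheduling.
        labels={"kueue.x-k8s.io/queue-name": "rl-queue"},
    ),
    spec=client.V1JobSpec(
        # Kueue unsuspends the Job once quota in the queue is available.
        suspend=True,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="us-docker.pkg.dev/my-project/rl/nemo-rl:latest",
                        command=["python", "train_grpo.py"],
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "8"},
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
&lt;/pre&gt;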
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Getting started with RL&lt;/strong&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Run RL on GPUs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Try the RL recipe on TPUs using &lt;/span&gt;&lt;a href="https://maxtext.readthedocs.io/en/latest/tutorials/grpo_with_pathways.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText and Pathways&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for GRPO algorithm, or if you use GPUs, try the &lt;/span&gt;&lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/RL/a4/recipes" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NemoRL recipes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Partner with the open-source ecosystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Our leadership in AI is built on open standards like Kubernetes, llm-d, Ray, MaxText or Tunix. We invite you to partner with us to build the future of AI together. Come contribute to llm-d! Join the &lt;/span&gt;&lt;a href="https://llm-d.ai/docs/community" rel="noopener" style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, 'Open Sans', 'Helvetica Neue', sans-serif;" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;llm-d community&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, check out the repository on GitHub, and help us define the future of open-source LLM serving.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;</description><pubDate>Mon, 10 Nov 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/run-high-scale-rl-for-llms-on-gke/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Running high-scale reinforcement learning (RL) for LLMs on GKE</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/run-high-scale-rl-for-llms-on-gke/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Poonam Lamba</name><title>Senior Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Bogdan Berce</name><title>Software Engineer</title><department></department><company></company></author></item><item><title>N4D now GA: Gain up to 3.5x price-performance for scale-out workloads</title><link>https://cloud.google.com/blog/products/compute/n4d-vms-based-on-amd-turin-now-ga/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In today's competitive environment, IT leaders are faced with supporting application scale, rolling out more features, and enabling high-bar customer experiences. This creates a direct and complex challenge: finding the right balance between performance and total cost of ownership (TCO) for the general-purpose workloads that power everyday business operations.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are announcing the general availability of the N4D machine series, the latest addition to Google Compute Engine’s cost-optimized, general-purpose portfolio. Addressing a wide range of workloads, such as web and application servers, data analytics platforms, and containerized microservices, N4D provides a flexible and price-performant solution.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The N4D machine series combines Google's &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; infrastructure with 5th Gen &lt;/span&gt;&lt;a href="https://www.amd.com/en/products/processors/server/epyc/9005-series.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AMD EPYC™ “Turin” processors&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, delivering up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;3.5x the throughput for web-serving workloads&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; vs. the previous-generation N2D. N4D offers predefined shapes of up to 96 vCPUs and 768 GB of DDR5 memory, up to 50 Gbps of networking bandwidth, and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/disks/hyperdisks"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperdisk&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; Balanced and Throughput storage. To deliver blended cost savings, N4D allows you to move beyond rigid instance sizing for both compute and storage, with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/creating-instance-with-custom-machine-type"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Custom Machine Types&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to independently configure the exact number of vCPUs and amount of memory, complemented with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/disks/hyperdisks"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperdisk&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for tuning disk storage performance and capacity. For the most demanding general purpose workloads, pair N4D together with the consistently high performance of &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/c4d-vms-unparalleled-performance-for-business-workloads?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4D&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
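&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To illustrate the independent sizing described above, here is a minimal sketch with the Compute Engine Python client. It assumes the standard custom machine-type URI pattern (FAMILY-custom-VCPUS-MEMORY_MB) applies to N4D; the project, zone, image, and exact shape are placeholder assumptions.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Sketch: requesting an N4D shape sized independently for vCPU and memory,
# assuming the standard custom machine-type URI pattern applies to N4D.
# Project, zone, and boot image are placeholders.
from google.cloud import compute_v1

project, zone = "my-project", "us-central1-a"

instance = compute_v1.Instance()
instance.name = "webtier-1"
# 10 vCPUs and 20 GB of memory, rather than the nearest predefined shape:
instance.machine_type = f"zones/{zone}/machineTypes/n4d-custom-10-20480"

boot = compute_v1.AttachedDisk(
    boot=True,
    auto_delete=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_image="projects/debian-cloud/global/images/family/debian-12",
        # N4D pairs with Hyperdisk; Balanced is the general-purpose choice.
        disk_type=f"zones/{zone}/diskTypes/hyperdisk-balanced",
    ),
)
instance.disks = [boot]
instance.network_interfaces = [
    compute_v1.NetworkInterface(network="global/networks/default")
]

compute_v1.InstancesClient().insert(
    project=project, zone=zone, instance_resource=instance
)
&lt;/pre&gt;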
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud provides workload-optimized infrastructure to ensure the right resources are available for every task. Titanium in particular, with its multi-tier offloads and security capabilities, is foundational to that infrastructure. Titanium offloads networking and storage processing to free up the CPU, and its dedicated SmartNIC manages all I/O, ensuring the AMD EPYC cores are reserved exclusively for your application. Titanium is part of Google Cloud’s vertically integrated stack — from the custom silicon in our servers to our &lt;/span&gt;&lt;a href="https://cloud.google.com/about/locations"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;planet-scale network&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; traversing 7.75 million kilometers of terrestrial and subsea fiber across 42 regions — that is engineered to maximize efficiency and provide the ultra-low latency and high bandwidth to customers at global scale.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A new standard for price-performance&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The N4D machine series doesn’t just inch past the previous N2D generation; it sprints, delivering up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;50% higher price-performance&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for general computing workloads and up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;70% better price-performance&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for Java workloads. For web-serving workloads, N4D leverages Titanium and AMD’s Turin processors to drive incredible throughput. This results in up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;3.5x the price-performance&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; vs. N2D, driving faster response times and a better overall experience for your end users.&lt;/span&gt;&lt;/p&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_2hTLTQA.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="4x0iy"&gt;As of October 2025. Performance based on the estimated SPECrate®2017_int_base, estimated SPECjbb2015, and Google internal Nginx Reverse Proxy benchmark scores run in production. Price-performance claims based on published and estimated list prices for Google Cloud.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Chronosphere.max-1000x1000.jpg"
        
          alt="Chronosphere"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="f72bn"&gt;&lt;i&gt;“Our edge proxy fleet and internal data pipelines observed a&lt;/i&gt; &lt;b&gt;&lt;i&gt;3-4x performance improvemen&lt;/i&gt;&lt;/b&gt;&lt;i&gt;t on Google Cloud's N4D instances compared to N2D. Our benchmarks also show N4D processes the same workload with significantly greater consistency while using just a fraction of the CPU. This leap in price-performance allows us to efficiently scale our general-purpose workloads, and fits neatly in our fleet alongside more specific Google compute products we leverage.”&lt;/i&gt; - Matt Schallert, Member of Technical Staff, Chronosphere&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/MediaGo.max-1000x1000.jpg"
        
          alt="MediaGo"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="f72bn"&gt;&lt;i&gt;“A&lt;/i&gt; &lt;b&gt;&lt;i&gt;10% increase in throughput while cutting costs by up to 50%&lt;/i&gt;&lt;/b&gt;&lt;i&gt; is a massive win for TCO optimization. That's what we achieved on Google Cloud's N4D machine series. For MediaGo, this efficiency is critical. It allows our AI-driven advertising platform to scale more cost-effectively, directly supporting our mission to maximize ROI for our global partners.”&lt;/i&gt; - MediaGo&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/phoronix.max-1000x1000.jpg"
        
          alt="phoronix"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="f72bn"&gt;&lt;i&gt;"The move from N2D to N4D is a significant generational leap. This&lt;/i&gt;&lt;b&gt;&lt;i&gt; 144.14% performance uplift over 152 tests&lt;/i&gt;&lt;/b&gt;&lt;i&gt; is a testament to Google's Titanium, unlocking the full potential of the new AMD EPYC 'Turin' processors. For those looking for the best possible price-performance in Google Cloud, the N4D instances are a clear winner."&lt;/i&gt; - Michael Larabel, Founder and Principal Author, Phoronix (Read the full study &lt;a href="https://www.phoronix.com/review/google-cloud-n4d-amd-epyc-turin"&gt;here&lt;/a&gt;.)&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/amd_LIvoHWP.max-1000x1000.jpg"
        
          alt="amd"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="f72bn"&gt;&lt;i&gt;"With the launch of the new N4D instances, Google Cloud now offers&lt;/i&gt; &lt;b&gt;&lt;i&gt;the most comprehensive portfolio based on our 5th Gen AMD EPYC processors&lt;/i&gt;&lt;/b&gt;&lt;i&gt;, marking a significant milestone in our strategic partnership. N4D machine series combines the leading performance of AMD CPUs with the uniqueness of Google's Custom Machine Types to deliver a remarkable uplift in price-performance, flexibility, and cost-optimization for everyday workloads. Our benchmark tests confirm this, showing measured performance gains of up to 75% over the previous generation N2D machine series for media encode and transcode workloads."&lt;/i&gt; – Ryan Rodman, Sr Director, Cloud Business Group, AMD&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Complementing C4D machine series&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Earlier this year, we introduced our general-purpose &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/c4d-vms-unparalleled-performance-for-business-workloads"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4D machine series&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; built on the same underlying processor as N4D. Its consistently high performance and enterprise features like advanced maintenance support, larger shapes, and our next-gen Titanium Local SSDs, make C4D a great fit for critical workloads. In fact, customers such as &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/c4d-vms-unparalleled-performance-for-business-workloads?e=48754805#:~:text=%E2%80%9CSilk%20has%20tested,D%20Officer%2C%20Silk"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Silk&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/c4d-vms-unparalleled-performance-for-business-workloads?e=48754805#:~:text=%22We%20are%20constantly,Engineer%2C%20Chess.com"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Chess.com&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; report greater than 40% improvement in performance with C4D over prior generations. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But critical applications are only part of the story. A modern cloud architecture must also run countless general-purpose workloads where flexibility and price-performance are key. That’s why we designed N4D — as a complement to C4D. By leveraging C4D and N4D in tandem, you unlock the full spectrum of enterprise features, performance, flexibility, and cost-optimization, choosing:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;C4D for consistent performance:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This is your solution for the most demanding, latency-sensitive applications. With up to 200 Gbps networking, Local SSD support along with larger shapes up to 384 vCPUs and bare metal options, C4D delivers predictable, high-end performance for large databases, high-traffic ad and game servers, and demanding AI/ML inference workloads.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;N4D for flexible cost-optimization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This is the engine for the vast majority of your general-purpose workloads. N4D’s leading price-performance, low cost, and flexibility allow you to slash TCO for applications like web servers, microservices, and development environments.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This approach is already delivering real-world results, allowing customers like Verve to optimize their business from both ends.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/verve.max-1000x1000.jpg"
        
          alt="verve"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="f72bn"&gt;"&lt;i&gt;With Google's Gen4 AMD portfolio, we can optimize for both revenue and cost simultaneously.&lt;/i&gt; &lt;b&gt;&lt;i&gt;C4D provides the consistent peak performance we need for our core ad servers&lt;/i&gt;&lt;/b&gt; &lt;i&gt;— 81% faster than C3D — which directly translates to more revenue from higher fill-rates (successful bid/ask matching). Meanwhile,&lt;/i&gt; &lt;b&gt;&lt;i&gt;N4D delivers an incredible 2x performance and price-performance over N2D for everyday workloads&lt;/i&gt;&lt;/b&gt;&lt;i&gt;, including scale-out microservices with GKE, enabling us to grow while slashing our overall TCO. This 'Better Together' strategy allows us to use the consistently peak performance of C4D for our mission-critical services and the flexible, cost-efficient N4D to aggressively reduce TCO everywhere else — a level of optimization that simply isn't possible with a single VM type elsewhere.” -&lt;/i&gt; Pablo Loschi, Principal Systems Engineer at Verve&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The Custom Machine Type and Hyperdisk advantage&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Custom Machine Types are a key differentiator for Google Cloud, letting you go beyond predefined "T-shirt sizes". Instead of forcing your workload into a box, you can tailor the infrastructure to fit your workload's needs, saving on cost. For instance, a memory-intensive workload requiring 16 vCPUs and 70 GB of RAM might typically be placed on a predefined N4D-highmem-16 shape, forcing you to pay for unused resources. With CMTs, you provision the exact 16 vCPU and 70 GB configuration, eliminating that waste and achieving up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;17% cost savings&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With shapes of up to 96 vCPUs and 768 GB of DDR5 memory, the combination of Custom Machine Types and N4D lets you dial in the exact resources you need with flexible vCPU-to-memory ratios along with extended memory support. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/symbotic.max-1000x1000.jpg"
        
          alt="symbotic"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="f72bn"&gt;&lt;i&gt;“At Symbotic, our vision is to revolutionize the global supply chain with an AI-powered robotics platform built for scale and efficiency. This demands an infrastructure that is both powerful and scalable. Google Cloud's N4D VMs, powered by AMD's latest EPYC processors, delivered exactly that. We observed a&lt;/i&gt; &lt;b&gt;&lt;i&gt;significant 40% performance uplift&lt;/i&gt;&lt;/b&gt; &lt;i&gt;compared to the previous N2D generation, allowing us to cut&lt;/i&gt;&lt;b&gt;&lt;i&gt; our CPU footprint in half&lt;/i&gt;&lt;/b&gt; &lt;i&gt;with no change in simulation speed or fidelity.&lt;/i&gt; &lt;i&gt;The ability to pair these gains with Custom Machine Types&lt;/i&gt; &lt;i&gt;— a capability unique to Google Cloud — is a game-changer. It allows us to&lt;/i&gt; &lt;b&gt;&lt;i&gt;precisely sculpt our infrastructure to our workloads&lt;/i&gt;&lt;/b&gt;&lt;i&gt; and gain a significant TCO advantage versus other cloud offerings.”&lt;/i&gt; - Dan Inbar, Chief Information Officer, Symbotic&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This granular control and TCO advantage extends beyond compute to your storage. Just as Custom Machine Types let you break free from fixed vCPU-to-memory ratios, &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/disks/hyperdisks"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperdisk&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; unbundles storage performance from capacity, letting you independently tune capacity and performance to precisely match your workload’s block storage requirements.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is further enhanced by &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/hyperdisk-storage-pools-is-now-generally-available"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperdisk Storage Pools&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for Hyperdisk Balanced volumes, which let you provision performance and capacity in aggregate, rather than managing each volume individually. The result is simpler management, higher efficiency, an easier path for modernizing SAN workloads — all this while helping you lower your storage TCO by as much as &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/hyperdisk-storage-pools-is-now-generally-available?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;30-50%&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started with N4D today&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Adopting the latest N4D VM series is easy, particularly if you use &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Kubernetes Engine (GKE)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, where our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/adopt-new-vm-series-with-gke-compute-classes-flexible-cuds?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;custom compute classes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; remove the operational hurdles of migrating workloads to new hardware. Just add N4D to your prioritized list of VM types to ensure your workloads have the performance and flexibility they need to scale.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4D is now available in us-central1 (Iowa), us-east1 (South Carolina), us-west1 (Oregon), us-west4 (Las Vegas), europe-west1 (Belgium), and europe-west4 (Netherlands). &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Check for the latest availability on our&lt;/span&gt; &lt;a href="https://cloud.google.com/compute/docs/regions-zones#available"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Regions and Zones page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and deploy your first instance today in the &lt;/span&gt;&lt;a href="https://console.cloud.google.com/"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud console&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or with GKE. Learn more about N4D details here in &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/general-purpose-machines#n4d_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;1. 9xx5C-044 - Testing by AMD Performance Labs as of 10/21/2025. N4D-standard-16 score comparison to N2D-standard-16 running FFmpeg v6.1.1 benchmark (average of 2x encode and 2x transcode) on Ubuntu24.04LTS OS with 6.8.0-1021-gcp kernel, SMT On.&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Performance uplift (normalized to N2D):&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Ffmpeg_raw_vp9&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;                  1.76&lt;br/&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Ffmpeg_h264_vp9&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;               1.76&lt;br/&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Ffmpeg_raw_h264&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;               1.71&lt;br/&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Ffmpeg_vp9_h264&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;               1.76&lt;br/&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;FFmpeg average&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;                  1.75&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud performance results presented are based on the test date in the configuration. Results may vary due to changes to the underlying configuration, and other conditions such as the placement of the VM and its resources, optimizations by the cloud service provider, accessed cloud regions, co-tenants, and the types of other workloads exercised at the same time on the system&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 10 Nov 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/n4d-vms-based-on-amd-turin-now-ga/</guid><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>N4D now GA: Gain up to 3.5x price-performance for scale-out workloads</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/n4d-vms-based-on-amd-turin-now-ga/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Sarthak Sharma</name><title>Product Manager</title><department></department><company></company></author></item><item><title>Announcing Ironwood TPUs General Availability and new Axion VMs to power the age of inference</title><link>https://cloud.google.com/blog/products/compute/ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today’s frontier models, including Google’s Gemini, Veo, Imagen, and Anthropic’s Claude train &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;and serve o&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;n Tensor Processing Units (TPUs). For many organizations, the focus is shifting from training these models to powering useful, responsive interactions with them. Constantly shifting model architectures, the rise of agentic workflows, plus near-exponential growth in demand for compute, define this new &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;age of inference&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. In particular, agentic workflows that require orchestration and tight coordination between general-purpose compute and ML acceleration are creating new opportunities for custom silicon and vertically co-optimized system architectures. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We have been preparing for this transition for some time and today, we are announcing the availability of three new products built on custom silicon that deliver exceptional performance, lower costs, and enable new capabilities for inference and agentic workloads:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Ironwood&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;our seventh generation TPU, will be generally available in the coming weeks&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. Ironwood is purpose-built for the most demanding workloads: from large-scale model training and complex reinforcement learning (RL) to high-volume, low-latency AI inference and model serving. It offers a 10X peak performance improvement over &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;TPU v5p and &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;more than 4X better performance per chip for both training and inference workloads compared to TPU v6e (Trillium), making Ironwood our most powerful and energy-efficient custom silicon to date.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;New Arm&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;®&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;-based Axion instances. N4A&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, our most cost-effective N series virtual machine to date, is &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;now in preview&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. N4A offers up to 2x better price-performance than comparable current-generation x86-based VMs. We are also pleased to announce &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;C4A metal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;our first Arm-based bare metal instance&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;will be &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;coming soon in preview.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=aQxcomQDHcw"
      data-glue-modal-trigger="uni-modal-aQxcomQDHcw-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        

        &lt;div class="article-video__aspect-image"
          style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/Ironwood.max-1000x1000.jpg);"&gt;
          &lt;span class="h-u-visually-hidden"&gt;youtube video&lt;/span&gt;
        &lt;/div&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal-aQxcomQDHcw-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="aQxcomQDHcw"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=aQxcomQDHcw"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ironwood and these new Axion instances are just the latest in a long history of custom silicon innovation at Google, including TPUs, Video Coding Units (VCU) for YouTube, and five generations of Tensor chips for mobile. In each case, we build these processors to enable breakthroughs in performance that are only possible through deep, system-level co-design, with model research, software, and hardware development under one roof. This is how we built the first TPU ten years ago, which in turn unlocked the invention of the Transformer eight years ago — the very architecture that powers most of modern AI. It has also influenced more recent advancements like our &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; architecture, and advanced &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/enabling-1-mw-it-racks-and-liquid-cooling-at-ocp-emea-summit?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;liquid cooling&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; that we’ve deployed at GigaWatt scale with fleet-wide uptime of ~99.999% since 2020.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_E4cJ2SM.max-1000x1000.png"
        
          alt="1"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="wdacc"&gt;Pictured: An Ironwood board showing three Ironwood TPUs connected to liquid cooling.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_BWW5xwl.max-1000x1000.jpg"
        
          alt="2"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="wdacc"&gt;Pictured: Third-generation Cooling Distribution Units, providing liquid cooling to an Ironwood superpod.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Ironwood: The fastest path from model training to planet-scale inference&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The early response to Ironwood is &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;overwhelmingly enthusiastic. Anthropic is compelled by the impressive price-performance gains that accelerate their path from training massive Claude models to serving them to millions of users. In fact, Anthropic &lt;/span&gt;&lt;a href="https://www.googlecloudpresscorner.com/2025-10-23-Anthropic-to-Expand-Use-of-Google-Cloud-TPUs-and-Services" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;plans to access up to 1 million TPUs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Anthropic.max-1000x1000.jpg"
        
          alt="Anthropic"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="fembv"&gt;&lt;i&gt;"Our customers, from Fortune 500 companies to startups, depend on Claude for their most critical work. As demand continues to grow exponentially, we're increasing our compute resources as we push the boundaries of AI research and product development. Ironwood’s improvements in both inference performance and training scalability will help us scale efficiently while maintaining the speed and reliability our customers expect."&lt;/i&gt; – &lt;b&gt;James Bradbury, Head of Compute, Anthropic&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ironwood is being used by organizations of all sizes and across industries:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/lightricks.max-1000x1000.jpg"
        
          alt="lightricks"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="fembv"&gt;&lt;i&gt;“Our mission at Lightricks is to define the cutting edge of open creativity, and that demands AI infrastructure that eliminates friction and cost at scale. We relied on Google Cloud TPUs and its massive ICI domain to achieve our breakthrough training efficiency for LTX-2, our leading open-source multimodal generative model. Now, as we enter the age of inference, our early testing makes us highly enthusiastic about Ironwood. We believe that Ironwood will enable us to create more nuanced, precise, and higher-fidelity image and video generation for our millions of global customers."&lt;/i&gt; - &lt;b&gt;Yoav HaCohen, PhD, Director of Foundational Generative AI Research, Lightricks&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/essential_ai.max-1000x1000.jpg"
        
          alt="essential ai"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="fembv"&gt;&lt;i&gt;“At Essential AI, our mission is to build powerful, open frontier models. We need massive, efficient scale, and Google Cloud's Ironwood TPUs deliver exactly that. The platform was incredibly easy to onboard, allowing our engineers to immediately leverage its power and focus on accelerating AI breakthroughs."&lt;/i&gt; - &lt;b&gt;Philip Monk, Infrastructure Lead, Essential AI&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;System-level design maximizes inference performance, reliability, and cost &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;TPUs are a key component of &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/ai-hypercomputer"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, our integrated supercomputing system that brings together compute, networking, storage, and software to improve system-level performance and efficiency. At the macro level, according to a recent IDC report, AI Hypercomputer customers achieved on average 353% three-year ROI, 28% lower IT costs, and 55% more efficient IT teams.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ironwood TPUs will help customers push the limits of scale and efficiency even further. When you deploy TPUs, the system connects each individual chip to each other, creating a pod — allowing the interconnected TPUs to work as a single unit. With Ironwood, we can scale up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;9,216 chips in a superpod&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; linked with breakthrough Inter-Chip Interconnect (ICI) networking at 9.6 Tb/s. This massive connectivity allows thousands of chips to quickly communicate with each other and access a staggering 1.77 Petabytes of shared High Bandwidth Memory (HBM), overcoming data bottlenecks for even the most demanding models.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_WZEo7he.max-1000x1000.png"
        
          alt="TPU"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="wdacc"&gt;Pictured: Part of an Ironwood superpod, directly connecting 9,216 Ironwood TPUs in a single domain.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At that scale, services demand uninterrupted availability. That’s why our Optical Circuit Switching (OCS) technology acts as a dynamic, reconfigurable fabric, instantly routing around interruptions to restore the workload while your services keep running. And when you need more power, Ironwood scales across pods into clusters of hundreds of thousands of TPUs.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_fFI906U.max-1000x1000.png"
        
          alt="4"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="wdacc"&gt;Pictured: Jupiter data center network enables the connection of multiple Ironwood superpods into clusters of hundreds of thousands of TPUs.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The AI Hypercomputer advantage: Hardware and software co-designed for faster, more efficient outcomes&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;On top of this hardware is a co-designed software layer, where our goal is to maximize Ironwood’s massive processing power and memory, and make it easy to use throughout the AI lifecycle. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;To improve fleet efficiency and operations, we’re excited to announce that TPU customers can now benefit from &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Cluster Director capabilities&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in Google Kubernetes Engine. This includes advanced maintenance and topology awareness for intelligent scheduling and highly resilient clusters.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;For pre-training and post-training, we’re also sharing&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; new enhancements to &lt;/strong&gt;&lt;a href="https://maxtext.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;a high-performance, open source LLM framework, to make it easier to implement training and reinforcement learning optimization techniques, such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;For inference, we recently announced enhanced support for TPUs in &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/in-q3-2025-ai-hypercomputer-adds-vllm-tpu-and-more"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;vLLM&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, allowing developers to switch between GPUs and TPUs, or run both, with only a few minor configuration changes, and &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GKE Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which intelligently load balances across TPU servers to reduce time-to-first-token (TTFT) latency by up to 96% and serving costs by up to 30%.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
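&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the vLLM point above concrete, here is a minimal, hypothetical sketch of offline batch inference with vLLM's Python API. The same script runs whether the backend is GPU or TPU, since vLLM detects the platform at startup; the model name and sampling values are illustrative assumptions.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Sketch: portable offline inference with vLLM. The backend (GPU or TPU)
# is detected automatically, so the code is unchanged across platforms.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain optical circuit switching in one paragraph."], params
)
for output in outputs:
    print(output.outputs[0].text)
&lt;/pre&gt;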
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our software layer is what enables AI Hypercomputer’s high performance and reliability for training, tuning, and serving demanding AI workloads at scale. Thanks to deep integrations across the stack — from data-center-wide hardware optimizations to open software and managed services— Ironwood TPUs are our most powerful and energy-efficient TPUs to date. Learn more about our approach to hardware and software co-design &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.  &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Axion: Redefining general-purpose compute &lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building and serving modern applications requires both highly specialized accelerators and powerful, efficient general-purpose compute. This was our vision for Axion, our custom Arm Neoverse®-based CPUs, which we designed to deliver compelling performance, cost and energy efficiency for everyday workloads. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are expanding our Axion portfolio with:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;N4A&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;preview&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;), our second general-purpose Axion VM, which is ideal for microservices, containerized applications, open-source databases, batch, data analytics, development environments, experimentation, data preparation and web serving jobs that make AI applications possible. Learn more about N4A &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;C4A metal (in preview soon), &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;our first Arm-based bare-metal instance, which provides dedicated physical servers for specialized workloads such Android development, automotive in-car systems, software with strict licensing requirements, scale test farms, or running complex simulations. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Learn more about C4A metal &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_nH8lIVk.max-1000x1000.png"
        
          alt="5"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With today's announcements, the Axion portfolio now includes three powerful options, N4A, C4A and C4A metal. Together, the C and N series allow you to lower the total cost of running your business without compromising on performance or workload-specific requirements.&lt;br/&gt;&lt;br/&gt;&lt;/span&gt;&lt;/p&gt;
&lt;div align="center"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table style="width: 100%;"&gt;&lt;colgroup&gt;&lt;col style="width: 23.7818%;"/&gt;&lt;col style="width: 21.0548%;"/&gt;&lt;col style="width: 55.1634%;"/&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Axion-based Instance&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Optimized for&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Key Features&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;N4A (preview)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Price-performance and flexibility&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Up to 64 vCPUs, 512GB of DDR5 Memory, and 50 Gbps networking, with support for Custom Machine Types, Hyperdisk Balanced and Throughput storage.&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4A Metal (in preview soon) &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Specialized workloads, such as Hypervisors and native Arm development&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Up to 96 vCPUs, 768GB of DDR5 Memory, Hyperdisk storage and up to 100Gbps of networking &lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4A&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Consistently high performance&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Up to 72 vCPUs, 576GB of DDR5 Memory, 100Gbps of Tier 1 networking, Titanium SSD with up to 6TB of local capacity, advanced maintenance controls and support for Hyperdisk Balanced, Throughput, and Extreme.&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Axion’s inherent efficiency also makes it a valuable option for modern AI workflows. While specialized accelerators like Ironwood handle the complex task of model serving, Axion excels at the operational backbone: supporting high-volume data preparation, ingestion, and running application servers that host your intelligent applications. Axion is already translating into customer impact:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_ZB4gdHF.max-1000x1000.jpg"
        
          alt="4"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="fembv"&gt;&lt;i&gt;"At Vimeo, we have long relied on Custom Machine Types to efficiently manage our massive video transcoding platform. Our initial tests on the new Axion-based N4A instances have been very compelling, unlocking a new level of efficiency. We've observed a 30% improvement in performance for our core transcoding workload compared to comparable x86 VMs. This points to a clear path for improving our unit economics and scaling our services more profitably, without changing our operational model."&lt;/i&gt; - &lt;b&gt;Joe Peled, Sr. Director of Hosting &amp;amp; Delivery Ops, Vimeo&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_3I8oyl8.max-1000x1000.jpg"
        
          alt="2"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="fembv"&gt;&lt;i&gt;"At ZoomInfo, we operate a massive data intelligence platform where efficiency is paramount. Our core data processing pipelines, which are critical for delivering timely insights to our customers, run extensively on Dataflow and Java services in GKE. In our preview of the new N4A instances, we measured a 60% improvement in price-performance for these key workloads compared to their x86-based counterparts. This allows us to scale our platform more efficiently and deliver more value to our customers, faster." -&lt;/i&gt; &lt;b&gt;Sergei Koren, Chief Infrastructure Architect, ZoomInfo&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_m4GINGe.max-1000x1000.jpg"
        
          alt="5"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="fembv"&gt;&lt;i&gt;"Migrating to Google Cloud's Axion portfolio gave us a critical competitive advantage. We slashed our compute consumption by 20% while maintaining low and stable latency with C4A instances, such as our Supply-Side Platform (SSP) backend service. Additionally, C4A enabled us to leverage Hyperdisk with precisely the IOPS we need for our stateful workloads, regardless of instance size. This flexibility gives us the best of both worlds - allowing us to win more ad auctions for our clients while significantly improving our margins. We're now testing the N4A family by running some of our key workloads that require the most flexibility, such as our API relay service. We are happy to share that several applications running in production are consuming 15% less CPU compared to our previous infrastructure, reducing our costs further, while ensuring that the right instance backs the workload characteristics required.”&lt;/i&gt; - &lt;b&gt;Or Ben Dahan, Cloud &amp;amp; Software Architect, Rise&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;A powerful combination for AI and everyday computing&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To thrive in an era with constantly shifting model architectures, software, and techniques, you need a combination of &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;purpose-built AI accelerators&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for model training and serving, alongside &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;efficient, general-purpose CPUs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for the everyday workloads, including the workloads that support those AI applications. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ultimately, whether you use Ironwood and Axion together or mix and match them with the other &lt;/span&gt;&lt;a href="https://cloud.google.com/products/compute"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;compute options&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; available on AI Hypercomputer, this system-level approach gives you the ultimate flexibility and capability for the most demanding workloads. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Sign up to test &lt;/strong&gt;&lt;a href="https://cloud.google.com/resources/ironwood-tpu-interest?utm_source=cgc-blog&amp;amp;utm_medium=blog&amp;amp;utm_campaign=FY25-Q2-global-ENT33820-website-cs-ironwood-tpu-interest&amp;amp;utm_content=ironwood_announcement_blog&amp;amp;utm_term=ironwood"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Ironwood&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;, &lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;Axion &lt;/strong&gt;&lt;a href="https://forms.gle/HYY5FWRKewYuDMB27" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;N4A&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;, or &lt;/strong&gt;&lt;a href="https://forms.gle/tzYAWwMBBhkkR4yHA" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;C4A metal&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; today.&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 06 Nov 2025 13:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_WZEo7he.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Announcing Ironwood TPUs General Availability and new Axion VMs to power the age of inference</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/3_WZEo7he.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amin Vahdat</name><title>SVP and Chief Technologist, AI and Infrastructure</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Lohmeyer</name><title>VP and GM, AI and Computing Infrastructure</title><department></department><company></company></author></item><item><title>Announcing Axion C4A metal: Arm-based Axion instances for specialized use cases</title><link>https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are thrilled to announce C4A metal, our first bare metal instance running on Google Axion processors, available in preview soon. C4A metal is designed for specialized workloads that require direct hardware access and Arm®-native compatibility. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Organizations running environments such as Android development, automotive simulation, CI/CD pipelines, security workloads, and custom hypervisors can now bring those workloads to Google Cloud without the performance overhead and complexity of nested virtualization.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4A metal instances, like other Axion instances, are built on the standard Arm architecture, so your applications and operating systems compiled for Arm remain portable across your cloud, on-premises, and edge environments, protecting your development investment. C4A metal offers 96 vCPUs, 768 GB of DDR5 memory, and up to 100 Gbps of networking bandwidth, with full support for Google Cloud Hyperdisk, including the Hyperdisk Balanced, Extreme, Throughput, and ML block storage options.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud provides workload-optimized infrastructure to ensure the right resources are available for every task. C4A metal, like the &lt;/span&gt;&lt;a href="https://cloud.google.com/products/axion?e=48754805&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Axion virtual machine family&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, is powered by &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a key component for multi-tier offloads and security that is foundational to our infrastructure. Titanium's custom-designed silicon offloads networking and storage processing to free up the CPU, and its dedicated SmartNIC manages all I/O, ensuring that Axion cores are reserved exclusively for your application's performance. Titanium is part of Google Cloud’s vertically integrated software stack — from the custom silicon in our servers to our planet-scale network traversing &lt;/span&gt;&lt;a href="https://cloud.google.com/about/locations"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;7.75 million kilometers of terrestrial and subsea fiber&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; across 42 regions — that is engineered to maximize efficiency and provide the ultra-low latency and high bandwidth to customers at global scale.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Architectural parity for automotive workloads&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Automotive customers can benefit from the Arm architecture’s performance, efficiency, and flexible design for in-vehicle systems such as infotainment and Advanced Driver Assistance Systems (ADAS). Axion C4A metal instances enable architectural parity between test environments and production silicon, allowing automotive technology providers to validate their software on the same Arm Neoverse instruction set architecture (ISA) used in production electronic control units (ECUs). This significantly reduces the risk of late-stage integration failures. For performance-sensitive tasks, these customers can execute demanding virtual hardware-in-the-loop (vHIL) simulations with the consistent, low-latency performance of physical hardware, ensuring test results are reliable and accurate. Finally, C4A metal lets providers move beyond the constraints of a physical lab, by dynamically scaling entire test farms and transforming them from fixed capital expenses into flexible operational ones.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_nDU2gjP.max-1000x1000.jpg"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="clg9v"&gt;&lt;i&gt;“In the era of AI-defined vehicles, the accelerating pace and complexity of technology are pushing us to rethink traditional linear approaches to software development. Google Cloud’s introduction of Axion C4A metal is a major step forward in this journey. By offering full architectural parity on Arm between test environments and physical silicon, customers can benefit from accelerated development cycles, enabling continuous integration and compliance for a variety of specialized use cases."&lt;/i&gt; - &lt;b&gt;Dipti Vachani, Senior Vice President and General Manager, Automotive Business, Arm&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/qnx.max-1000x1000.jpg"
        
          alt="qnx"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="clg9v"&gt;&lt;i&gt;“Our partners and customers rely on QNX to deliver the safety, security, reliability, and real-time performance required for their most mission-critical systems — from advanced driver assistance to digital cockpits. As the Software-Defined Vehicle era continues to gain momentum, decoupling software development from physical hardware is no longer optional — it’s essential for innovation at scale. The launch of Google Cloud’s C4A-metal instances on Axion introduces a powerful ARM-based bare metal platform that we are eager to test and support as this will enable transformative cloud infrastructure benefits for our automotive ecosystem.” -&lt;/i&gt; &lt;b&gt;Grant Courville, Senior Vice President, Products and Strategy, QNX&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/qualcomm.max-1000x1000.jpg"
        
          alt="qualcomm"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="clg9v"&gt;&lt;i&gt;“The future of automotive mobility demands unprecedented speed and precision in practice and development. For automakers and suppliers leveraging the Snapdragon Digital Chassis platform, aligning their cloud development and testing environments to ensure parity with the Snapdragon SoCs in the vehicle is absolutely crucial for efficiency and quality. We are excited about Google Cloud’s commitment to this segment — offering C4A-metal instances with Axion is a massive leap forward, giving the automotive ecosystem a true 1:1 physical to virtual environment in the cloud. This breakthrough significantly reduces integration challenges, slashes validation time, and allows our partners to unleash AI-driven features to market faster at scale.”&lt;/i&gt; - &lt;b&gt;Laxmi Rayapudi, VP, Product Management, Qualcomm Technologies, Inc.&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Align test and production for Android development&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Android platform was built for Arm-based processors, the standard for virtually all mobile devices. By running development and testing pipelines on the bare-metal instances of Axion processors with C4A metal, Android developers can benefit from native performance, eliminating emulation overhead such as slow instruction-by-instruction translation layers. In addition, they can significantly reduce latency for Android build toolchains and automated test systems, leading to faster feedback cycles. C4A metal also solves the performance challenges of nested virtualization, making it a great platform for scalable Cuttlefish (Cloud Android) environments. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once available, developers can deploy scalable Cuttlefish environment farms on top of C4A metal instances with an &lt;/span&gt;&lt;a href="https://github.com/googlecloudplatform/horizon-sdv" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;upcoming release of Horizon&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or by directly leveraging &lt;/span&gt;&lt;a href="https://github.com/google/cloud-android-orchestration/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Android Orchestration&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. C4A metal allows these virtual devices to run directly on the physical hardware, providing the performance needed to build and manage large, high-fidelity test farms for true continuous testing.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Bare metal access without compromise&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a cloud offering, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;C4A metal enables a lower total cost of ownership by replacing the entire lifecycle of physical hardware procurement and management with a predictable operational expense. This eliminates the direct capital expenditures of purchasing servers, along with the associated operational costs of hardware maintenance contracts, power, cooling, and physical data center space. You can programmatically provision and de-provision instances to match your exact testing demands, ensuring you are not paying for an over-provisioned fleet of servers sitting idle waiting for peak development cycles.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Operating as standard compute resources within your Virtual Private Cloud (VPC), C4A metal instances inherit and leverage the same security policies, audit logging, and network controls as virtual machines. Instances are designed to appear as physical servers to your toolchain and support common monitoring and security agents, allowing for straightforward integration with your existing Google Cloud environments. This integration extends to storage, where network-attached Hyperdisk allows you to manage persistent disks using the same snapshot and resizing tools your teams already use for your virtual machine fleet.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/chainguard.max-1000x1000.jpg"
        
          alt="chainguard"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="clg9v"&gt;&lt;i&gt;“For our build system, true isolation is paramount. Running on Google Cloud’s new C4A metal instance on Axion enables us to isolate our package builds with a strong hypervisor security boundary without compromising on build performance."&lt;/i&gt; - &lt;b&gt;Matthew Moore, Founder and CTO, Chainguard, Inc&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Better together: the Axion C and N series&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The addition of C4A metal to the Arm-based Axion portfolio allows customers to lower TCO by matching the right infrastructure to every workload. While Axion &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/general-purpose-machines#c4a_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C4A virtual machines&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; optimize for consistently high performance and &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/axion-based-n4a-vms-now-in-preview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;N4A virtual machines&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (now in preview) optimize for price-performance and flexibility, C4A metal addresses the critical need for direct hardware access by specialized applications that require a non-virtualized Arm environment.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For example, an Android development company could create a highly efficient CI/CD pipeline by using C4A virtual machines for the build farm. For large-scale testing, they could use C4A metal to run Cuttlefish virtual devices directly on the physical hardware, eliminating nested virtualization overhead. To enable even higher fidelity, they can run Cuttlefish hybrid devices on C4A metal, reusing the system images from their physical hardware. Concurrently, supporting infrastructure such as CI/CD orchestrators and artifact repositories could run on cost-effective N4A instances, using Custom Machine Types to right-size resources and minimize operational expenses.&lt;/span&gt;&lt;/p&gt;
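&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a rough illustration of what provisioning such a pipeline programmatically could look like, here is a hedged sketch using the google-cloud-compute Python client. The project, zone, machine type, and image names are placeholder assumptions (C4A metal shape names have not been published yet), so verify them against current documentation.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from google.cloud import compute_v1

# Hedged sketch: provision one Arm-based build node for the CI/CD
# farm described above. All names below are illustrative assumptions.
PROJECT, ZONE = "my-project", "us-central1-a"

instance = compute_v1.Instance(
    name="android-build-node-1",
    machine_type=f"zones/{ZONE}/machineTypes/c4a-standard-16",
    disks=[
        compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                # An Arm64 image, to match the Axion CPU.
                source_image="projects/debian-cloud/global/images/family/debian-12-arm64",
            ),
        )
    ],
    network_interfaces=[
        compute_v1.NetworkInterface(network="global/networks/default")
    ],
)

client = compute_v1.InstancesClient()
operation = client.insert(project=PROJECT, zone=ZONE,
                          instance_resource=instance)
operation.result()  # block until the instance is created
&lt;/code&gt;&lt;/pre&gt;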
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Coming soon to preview&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4A metal is scheduled for preview soon. Please fill &lt;/span&gt;&lt;a href="https://docs.google.com/forms/d/1iPfHMoGBHVDs_5zXohLCXjJWyEVASEjA2BZLqd3mtsI/edit#responses" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;this form&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to sign up for early access and additional updates. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 06 Nov 2025 13:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm/</guid><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Announcing Axion C4A metal: Arm-based Axion instances for specialized use cases</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/new-axion-c4a-metal-offers-bare-metal-performance-on-arm/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yarden Halperin</name><title>Product Manager, Google Cloud</title><department></department><company></company></author></item><item><title>From silicon to softmax: Inside the Ironwood AI stack</title><link>https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As machine learning models continue to scale, a specialized, co-designed hardware and software stack is no longer optional, it’s critical. &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ironwood&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, our latest generation Tensor Processing Unit (TPU), is the cutting-edge hardware behind advanced models like Gemini and Nano Banana, from massive-scale training to high-throughput, low-latency inference. This blog details the core components of Google's AI software stack that are woven into Ironwood, demonstrating how this deep co-design unlocks performance, efficiency, and scale. We cover the JAX and PyTorch ecosystems, the XLA compiler, and the high-level frameworks that make this power accessible.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;1. The co-designed foundation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Foundation models today have trillions of parameters that require computation at ultra-large scale. We designed the Ironwood stack from the silicon up to meet this challenge.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The core philosophy behind the Ironwood stack is system-level co-design, treating the entire TPU pod not as a collection of discrete accelerators, but as a single, cohesive supercomputer. This architecture is built on a custom interconnect that enables massive-scale Remote Direct Memory Access (RDMA), allowing thousands of chips to exchange data directly at high bandwidth and low latency, bypassing the host CPU. A full Ironwood superpod offers a total of 1.77 PB of directly accessible HBM capacity; each chip carries eight stacks of HBM3E, providing 192 GiB of capacity and a peak HBM bandwidth of 7.4 TB/s.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Unlike general-purpose parallel processors, TPUs are Application-Specific Integrated Circuits (ASICs) built for one purpose: accelerating large-scale AI workloads. The deep integration of compute, memory, and networking is the foundation of their performance. At a high level, the TPU consists of two parts:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hardware core&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The TPU core is centered around a dense&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; Matrix Multiply Unit (MXU)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for matrix operations, complemented by a powerful &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Vector Processing Unit (VPU)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for element-wise operations (activations, normalizations) and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;SparseCores&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for scalable embedding lookups. This specialized hardware design is what delivers Ironwood's 42.5 Exaflops of FP8 compute.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Software target&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This hardware design is explicitly targeted by the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Accelerated Linear Algebra (XLA) compiler&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, using a software co-design philosophy that &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;combines the broad benefits of whole-program optimization with the precision of hand-crafted custom kernels. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;XLA's compiler-centric approach provides a powerful performance baseline by fusing operations into optimized kernels that saturate the MXU and VPU. This approach delivers good "out of the box" performance with broad framework and model support. This general-purpose optimization is then complemented by custom kernels &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;(detailed below in the Pallas section)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to achieve peak performance on specific model-hardware combinations. This dual-pronged strategy is a fundamental tenet of the co-design.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The figure below shows the layout of the Ironwood chip:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Z5xATZ3.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This specialized design extends to the connectivity between TPU chips for massive scale-up and scale-out for a total of 88473.6 Tbps (11059.2TB/s) for a complete Ironwood superpod. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The building block: Cubes and ICI.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Each physical Ironwood host has four TPU chips. A single rack of these hosts has 64 Ironwood chips and forms a “cube”. Within this cube, every chip is connected via multiple high-speed &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Inter-Chip Interconnect (ICI)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; links that form a direct 3D Torus topology. This creates an extremely dense, all-to-all network fabric, enabling massive bandwidth and low latency for distributed operations within the cube.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Scaling with OCS: Pods and Superpods&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To scale beyond a single cube, multiple cubes are connected using an &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Optical Circuit Switch (OCS) &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;network.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;This is&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;a dynamic, reconfigurable optical network that connects entire cubes, allowing the system to scale from a small "pod" (e.g., a 256-chip Ironwood pod with four cubes) to a massive "superpod" (e.g., a 9,216-chip system with 144 cubes). This OCS-based topology is key to fault tolerance. If a cube or link fails, the OCS fabric manager instructs the OCS to optically bypass the unhealthy unit and establish new, complete optical circuits connecting only the healthy cubes, swapping in a designated spare. This dynamic reconfigurability allows for both resilient operation and the provisioning of efficient "slices" of any size. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;For the largest-scale systems, into the hundreds of thousands of chips, multiple superpods can then be connected via a standard Data-Center Network (DCN).&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Chips can be configured in different “slices” with different OCS topologies as shown below.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_VdZkL7j.max-1000x1000.jpg"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Each chip is connected to 6 other chips in the 3D torus and provides 3 distinct axes for parallelism. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_KvozMKZ.max-1000x1000.png"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ironwood delivers this performance while focusing on power efficiency, allowing AI workloads to run more cost-effectively. Ironwood perf/watt is 2x relative to Trillium, our previous-generation TPU. Our advanced liquid cooling solutions and optimized chip design can reliably sustain up to twice the performance of standard air cooling even under continuous, heavy AI workloads. Ironwood is nearly 30x more power efficient than our first Cloud TPU from 2018 and is our most power-efficient chip to date. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_UxXCPJg.max-1000x1000.jpg"
        
          alt="4"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It’s the software stack's job to translate high-level code into optimized instructions that leverage the full power of the hardware. The stack supports two primary frameworks: the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;JAX&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; ecosystem, which offers maximum performance and flexibility, as well as &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;PyTorch&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; on TPUs, which provides a native experience for the PyTorch community.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Optimizing the entire AI lifecycle&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We use the principle of a co-designed Ironwood hardware and software stack to deliver maximum performance and efficiency across every phase of model development, with specific hardware and software capabilities tuned for each stage.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Pre-training&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This phase demands sustained, massive-scale computation. A &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;full 9,216-chip Ironwood superpod&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; leverages the OCS and ICI fabric to operate as a single, massive parallel processor, achieving maximum sustained FLOPS utilization through different data formats. Running a job of this magnitude also requires resilience, which is managed by high-level software frameworks like &lt;/span&gt;&lt;a href="https://maxtext.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, detailed in Section 3.3, that handle fault tolerance and checkpointing transparently.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Post-training (Fine-tuning and alignment)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This stage includes diverse, FLOPS-intensive tasks like supervised fine-tuning (SFT) and Reinforcement Learning (RL), all requiring rapid iteration. RL, in particular, introduces complex, heterogeneous compute patterns. This stage often requires two distinct types of jobs to run concurrently: &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;high-throughput, inference-like sampling&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to generate new data (often called 'actor rollouts'), and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;compute-intensive, training-like 'learner' steps&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; that perform the gradient-based updates. Ironwood’s high-throughput, low-latency network and flexible OCS-based slicing are ideal for this type of rapid experimentation, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;efficiently managing the different hardware demands of both sampling and gradient-based updates&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. In Section 3.3, we discuss how we provide optimized software on Ironwood — including reference implementations and libraries — to make these complex fine-tuning and alignment workflows easier to manage and execute efficiently.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Inference (serving)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: In production, models must deliver low-latency predictions with high throughput and cost-efficiency. Ironwood is specifically engineered for this, with its large on-chip memory and compute power optimized for both the large-batch "prefill" phase and the memory-bandwidth-intensive "decode" phase of large generative models. To make this power easily accessible, we’ve optimized state-of-the-art serving engines. At launch, we’ve enabled &lt;/span&gt;&lt;a href="https://blog.vllm.ai/2025/10/16/vllm-tpu.html" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;vLLM&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, detailed in Section 3.3, providing the community with a top-tier, open-source solution that maximizes inference throughput on Ironwood.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;3. The software ecosystem for TPUs&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The TPU stack, and Ironwood’s stack in particular, is designed to be modular, allowing developers to operate at the level of abstraction they need. In this section, we focus on the compiler/runtime, framework, and AI stack libraries.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3.1 The JAX path: Performance and composability&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;JAX is a high-performance numerical computing system co-designed with the TPU architecture. It provides a familiar NumPy-like API backed by powerful function transformations:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;jit&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;(Just-in-Time compilation)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Uses the XLA compiler to fuse operations into a single, optimized kernel for efficient TPU execution.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;grad&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;(automatic differentiation)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Automatically computes gradients of Python functions, the fundamental mechanism for model training.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;shard_map&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;(parallelism)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The primitive for expressing distributed computations, allowing explicit control over how functions and data are sharded across a mesh of TPU devices, directly mapping to the ICI/OCS topology.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This compositional approach allows developers to write clean, Pythonic code that JAX and XLA transform into highly parallelized programs optimized for TPU hardware. JAX is what Google DeepMind and other Google teams use to build, train, and serve a wide variety of models. &lt;/span&gt;&lt;/p&gt;
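&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a minimal sketch of how these transformations compose (the toy loss function and the 'batch' axis name are ours, not from a production codebase):&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

# A toy mean-squared-error loss for a linear model.
def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

# grad derives the gradient function; jit compiles it via XLA into
# fused kernels.
grad_fn = jax.jit(jax.grad(loss))
g = grad_fn(jnp.zeros(3), jnp.ones((5, 3)), jnp.ones(5))

# shard_map runs a function per shard across a device mesh, mapping
# the logical 'batch' axis onto the physical device topology.
mesh = Mesh(np.array(jax.devices()), axis_names=('batch',))

def local_sum(x):
    # Each device reduces its own shard; psum combines the shards.
    return jax.lax.psum(jnp.sum(x), axis_name='batch')

sharded_sum = shard_map(local_sum, mesh=mesh,
                        in_specs=P('batch'), out_specs=P())
total = sharded_sum(jnp.arange(4.0 * jax.device_count()))
&lt;/code&gt;&lt;/pre&gt;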
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For most developers, these primitives are abstracted by high-level frameworks, like &lt;/span&gt;&lt;a href="https://maxtext.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, built upon a foundation of composable, production-proven libraries:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://optax.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Optax&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: A flexible gradient processing and optimization library (e.g., AdamW)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://orbax.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Orbax&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: A library for asynchronous checkpointing of distributed arrays across large TPU slices&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://qwix.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Qwix&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: A JAX quantization library supporting Quantization Aware Training (QAT) and Post-Training Quantization (PTQ)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://metrax.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Metrax&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: A library for collecting and processing evaluation metrics in a distributed setting&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://github.com/google/tunix" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Tunix&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: A high-level library for orchestrating post-training jobs&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://github.com/AI-Hypercomputer/ml-goodput-measurement" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Goodput&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: A library for measuring and monitoring real-time ML training efficiency, providing a detailed breakdown of badput (e.g., initialization, data loading, checkpointing)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
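&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a hedged sketch of how the first of these composes with JAX, here is a single Optax-based training step; the tiny model, data shapes, and learning rate are placeholders:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import jax
import jax.numpy as jnp
import optax

# Placeholder parameters for a tiny linear model.
params = {'w': jnp.zeros((8, 1)), 'b': jnp.zeros(1)}

def loss(p, x, y):
    return jnp.mean((x @ p['w'] + p['b'] - y) ** 2)

tx = optax.adamw(learning_rate=1e-3)  # the AdamW optimizer noted above
opt_state = tx.init(params)

@jax.jit
def train_step(params, opt_state, x, y):
    grads = jax.grad(loss)(params, x, y)
    updates, opt_state = tx.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state
&lt;/code&gt;&lt;/pre&gt;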
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3.2 The PyTorch path: A native eager experience&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To bring Ironwood's power to the PyTorch community, we are developing a new, native PyTorch experience complete with support for a “native eager mode”, which executes operations immediately as they are called. Our goal is to provide a more natural and developer-friendly way to access Ironwood's scale, minimizing the code changes and level of effort required to adapt models for TPUs. This approach is designed to make the transition from local experimentation to large-scale training more straightforward.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This new framework is built on three core principles to ensure a truly PyTorch-native environment:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Full eager mode:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Enables the rapid prototyping, debugging, and research workflows that developers expect from PyTorch.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Standard distributed APIs:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Leverages the familiar &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;torch.distributed&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; API, built on &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;DTensor&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, for scaling training workloads across TPU slices.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Idiomatic compilation:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Uses &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;torch.compile&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; as the single, unified path to JIT compilation, utilizing XLA as its backend to trace the graph and compile it into efficient TPU machine code.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This ensures the transition from local experimentation to large-scale distributed training is a natural extension of the standard PyTorch workflow. &lt;/span&gt;&lt;/p&gt;
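&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A rough sketch of that workflow, shown with stock PyTorch: the model is a placeholder, and TPU backend selection is deliberately omitted here since that wiring is still evolving.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch

# A placeholder model; in eager mode, each op runs immediately.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.GELU(),
    torch.nn.Linear(256, 10),
)

# torch.compile is the single JIT entry point; per the design above,
# XLA serves as the compilation backend on TPU.
compiled = torch.compile(model)

x = torch.randn(32, 128)
out = compiled(x)  # first call compiles; later calls reuse the cache
&lt;/code&gt;&lt;/pre&gt;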
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3.3 Frameworks: MaxText, PyTorch on TPU, and vLLM&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While JAX and PyTorch provide the computational primitives, scaling to thousands of chips is a supercomputer management problem. High-level frameworks handle the complexities of resilience, fault tolerance, and infrastructure orchestration.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://maxtext.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; (JAX)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: MaxText is an open-source, high-performance LLM pre-training and post-training solution written in pure Python and JAX. MaxText demonstrates optimized training on its library of popular OSS models like DeepSeek, Qwen, gpt-oss, Gemma, and more. Whether users are pre-training large Mixture-of-Experts (MoE) models from scratch, or leveraging the latest Reinforcement Learning (RL) techniques on an OSS model, MaxText provides tutorials and APIs to make things easy. For scalability and resiliency, MaxText leverages &lt;/span&gt;&lt;a href="https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Pathways&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which was originally developed by Google DeepMind and now provides TPU users with differentiated capabilities like elastic training and multi-host inference during RL. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;PyTorch on TPU&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: We recently shared our proposal about our PyTorch native experience on TPUs at &lt;/span&gt;&lt;a href="https://events.linuxfoundation.org/pytorch-conference/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Pytorch Conference 2025&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, including an early preview of training on TPU with minimal code changes. In addition to the framework itself, we are working with the community (&lt;/span&gt;&lt;a href="http://goo.gle/torch-xla-rfc" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;RFC&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;), investing in reproducible recipes, reference implementations, and migration tools to enable PyTorch users to use their favorite frameworks on TPUs. Expect further updates as this work matures.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;vLLM TPU (Serving): &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;vLLM TPU is now powered by &lt;/span&gt;&lt;a href="https://github.com/vllm-project/tpu-inference" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;tpu-inference&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, an expressive and powerful new hardware plugin that unifies JAX and PyTorch under a single lowering path – meaning both frameworks are translated to optimized TPU code through one common, shared backend. This new unified backend is not only faster than the previous generation of vLLM TPU but also offers broader model coverage. This integration provides more flexibility to JAX and PyTorch users, running PyTorch models performantly with no code changes while also extending native JAX support, all while retaining the standard vLLM user experience and interface.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
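&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For reference, here is the standard vLLM offline-inference interface that vLLM TPU retains; the model name and prompt are examples only:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from vllm import LLM, SamplingParams

# Standard vLLM API; per the post, the same interface applies on TPU.
llm = LLM(model="google/gemma-2b")  # example model
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Summarize the benefits of OCS slicing."], params)
for out in outputs:
    print(out.outputs[0].text)
&lt;/code&gt;&lt;/pre&gt;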
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3.4 Extreme performance: Custom kernels via Pallas&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While XLA is powerful, cutting-edge research often requires novel algorithms (e.g., new attention mechanisms, custom padding to handle dynamic ragged tensors, and other optimizations for custom MoE models) that the XLA compiler cannot yet optimize.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The JAX ecosystem solves this with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Pallas&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, a JAX-native kernel programming language embedded directly in Python. Pallas presents a unified, Python-first experience, dramatically reducing cognitive load and accelerating the iteration cycle. Other platforms lack this unified, in-Python approach, forcing developers to fragment their workflow. To optimize these operations, they must drop into a disparate ecosystem of lower-level tools—from DSLs like Triton and CuTe to raw CUDA C++ and PTX. This introduces significant mental overhead by forcing developers to manually manage memory, streams, and kernel launches, pulling them out of their Python-based environment.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is a clear example of co-design. Developers use Pallas to explicitly manage the accelerator's memory hierarchy, defining how "tiles" of data are staged from HBM into the extremely fast on-chip SRAM to be operated on by the MXUs. Pallas has two main parts: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Pallas:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The developer defines the high-level algorithmic structure and memory logistics in Python.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Mosaic:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This compiler backend translates the Pallas definition into optimized TPU machine code. It handles operator fusion, determines optimal tiling strategies, and generates software pipelines to perfectly overlap data transfers (HBM-to-SRAM) with computation (on the MXUs), with the sole objective of saturating the compute units.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Because Pallas kernels are JAX-traceable, they are fully compatible with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;jit&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;vmap&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;grad&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. This stack provides Python-native extensibility for both JAX and PyTorch, as PyTorch users can consume Pallas-optimized kernels without ever leaving the native PyTorch API. Pallas kernels for PyTorch and JAX models, on both TPU and GPU, are available via &lt;/span&gt;&lt;a href="https://github.com/openxla/tokamax" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Tokamax&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, the ML ecosystem’s first multi-framework, multi-hardware kernel library.&lt;/span&gt;&lt;/p&gt;
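&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make this concrete, here is a minimal Pallas kernel in the canonical element-wise-add pattern; real kernels additionally specify grids and block specs to control the HBM-to-SRAM tiling described above:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

# Kernel body: reads and writes Refs staged into fast on-chip memory.
def add_kernel(x_ref, y_ref, o_ref):
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit  # Pallas kernels are traceable, so they compose with jit/vmap/grad.
def add(x, y):
    # On a machine without an accelerator, pass interpret=True to pallas_call.
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

print(add(jnp.arange(8.0), jnp.arange(8.0)))
&lt;/code&gt;&lt;/pre&gt;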
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3.5 Performance engineering: Observability and debugging&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Ironwood stack includes a full suite of tools for performance analysis, bottleneck detection, and debugging, allowing developers to fully optimize their workloads and operate large-scale clusters reliably:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud TPU metrics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Exposes key system-level counters (FLOPS, HBM bandwidth, ICI traffic) to Google Cloud Monitoring that can then be exported to popular monitoring tools like Prometheus. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;TensorBoard&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Visualizes training metrics (loss, accuracy) and hosts the XProf profiler UI.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;XProf (OpenXLA Profiler)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The essential toolset for deep performance analysis. It captures detailed execution data from both the host-CPU and all TPU devices, providing:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li style="list-style-type: none;"&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Trace Viewer&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A microsecond-level timeline of all operations, showing execution, collectives, and "bubbles" (idle time).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Input Pipeline Analyzer&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Diagnoses host-bound vs. compute-bound bottlenecks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Op Profile:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Ranks all XLA/HLO operations by execution time to identify expensive kernels.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Memory Profiler&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Visualizes HBM usage over time to debug peak memory and fragmentation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Debugging Tools:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li style="list-style-type: none;"&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;JAX Debugger (&lt;/strong&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;jax.debug&lt;/strong&gt;&lt;/code&gt;&lt;strong style="vertical-align: baseline;"&gt;):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Enables &lt;/span&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;print&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and breakpoints from within &lt;/span&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;jit&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;-compiled functions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;TPU Monitoring Library:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A real-time diagnostic dashboard (analogous to &lt;/span&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;nvidia-smi&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;) for live debugging of HBM utilization, MXU activity, and running processes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
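&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A brief sketch that exercises two of these tools: printing from inside a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;jit&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;-compiled function with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;jax.debug&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, and capturing a trace for the XProf UI in TensorBoard (the trace directory is an arbitrary example):&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import jax
import jax.numpy as jnp

@jax.jit
def step(x):
    # jax.debug.print works from inside jit-compiled functions.
    jax.debug.print("max abs value = {}", jnp.max(jnp.abs(x)))
    return x @ x

# Capture a trace viewable in TensorBoard's XProf profiler UI.
with jax.profiler.trace("/tmp/tpu-trace"):
    step(jnp.ones((2048, 2048))).block_until_ready()
&lt;/code&gt;&lt;/pre&gt;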
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond performance optimization, developers and infra admins can view fleet efficiency and goodput metrics at various levels (e.g., job, reservation) to ensure maximum utilization of their TPU infrastructure.  &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;4. Conclusion&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Ironwood stack is a complete, system-level co-design, from the silicon to the software. It delivers performance through a dual-pronged strategy: the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;XLA compiler&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; provides broad, "out-of-the-box" optimization, while the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Pallas and Mosaic stack&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; enables hand-tuned kernel performance.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This entire co-designed platform is accessible to all developers, providing first-class, native support for both the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;JAX &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;and the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;PyTorch ecosystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. Whether you are pre-training a massive model, running complex RL alignment, or serving at scale, Ironwood provides a direct, resilient, and high-performance path from idea to supercomputer.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Get started today with &lt;/span&gt;&lt;a href="https://docs.vllm.ai/projects/tpu/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;vLLM on TPU&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for inference and &lt;/span&gt;&lt;a href="https://maxtext.readthedocs.io/en/latest/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for pre-training and post-training.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 06 Nov 2025 13:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>From silicon to softmax: Inside the Ironwood AI stack</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Diwakar Gupta</name><title>Distinguished Engineer, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Manoj Krishnan</name><title>Principal Engineer</title><department></department><company></company></author></item><item><title>Expanding our NVIDIA partnership: Now shipping A4X Max, Vertex AI Training, and more</title><link>https://cloud.google.com/blog/products/compute/now-shipping-a4x-max-vertex-ai-training-and-more/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today's AI models are moving from billions to trillions of parameters, and are capable of complex, multi-modal reasoning. This leap in sophistication demands a new class of purpose-built infrastructure and software to handle the immense computational and memory requirements of these next-generation models.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud, we’re committed to empowering developers and organizations to build and deploy what's next in AI. Today, we are excited to deepen our partnership with NVIDIA with a suite of new capabilities that strengthens our platform for the entire AI lifecycle:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;New A4X Max instances&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; powered by NVIDIA’s GB300 NVL72, purpose-built for multimodal AI reasoning tasks&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Google Kubernetes Engine (GKE), now supporting &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dynamic Resource Allocation Kubernetes Network Driver &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;(&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;DRANET)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;boosting bandwidth in distributed AI/ML workloads&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE Inference Gateway&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;now integrating with&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;NVIDIA NeMo Guardrails &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI Model Garden&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to feature NVIDIA Nemotron models&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/new-capabilities-in-vertex-ai-training-for-large-scale-training?e=48754805"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Training&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; recipes on top of the NVIDIA NeMo Framework and NeMo-RL&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let’s take a closer look at these developments.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A4X Max with NVIDIA GB300 GPUs&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A4X Max is now shipping in production.  These new instances, powered by NVIDIA GB300 NVL72, are optimized for the most demanding, multimodal AI reasoning workloads.  A4X Max includes 72 Blackwell Ultra GPUs and 36 NVIDIA Grace CPUs connected with NVIDIA’s fifth-generation high-speed GPU interconnect NVIDIA NVLink to function as a single, unified compute platform with shared memory and high-bandwidth communication. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Together with Google's Titanium ML adapter and Google Cloud's Jupiter network fabric, A4X Max is purpose-built to scale to tens of&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; thousands of GPUs in non-blocking, rail-optimized clusters. Compared to A4X powered by NVIDIA GB200 NVL72, A4X Max delivers 2x the network bandwidth on each system.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A4X Max leverages Google Cloud’s &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/managed-slurm-and-other-cluster-director-enhancements"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Director&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, letting you combine optimized compute, networking, and Google’s storage offerings into a cohesive, performant, and easily managed environment. Cluster Director manages the complete lifecycle of A4X Max clusters — from provisioning and topology-aware placement across the NVL72 domains, to providing powerful observability and resiliency capabilities. It integrates with optimized storage solutions like Managed Lustre, while a managed pre-configured Slurm environment offers fault-tolerant scalable job scheduling for A4X Max. Cluster Director also provides deep observability into job and system performance across the GPUs, NVLink and DC networking fabrics. To maximize throughput, Cluster Director helps ensure high reliability with features like automatic straggler detection and in-job recovery. Cluster Director capabilities like topology aware scheduling, maintenance management, and faulty node reporting are also available transparently through Google Kubernetes Engine (GKE), enabling customers to stay in the GKE environment while running A4X Max.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;What all this this means for your workloads:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Optimized reasoning and inference: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;With its 72-GPU NVLink domain, delivering 1.5x FP4 FLOPs, 1.5x HBM memory capacity, and 2x the network bandwidth compared to A4X, A4X Max is specifically designed for low-latency inference, especially for the largest reasoning models. When integrated with GKE Inference Gateway, you benefit from prefix-aware load balancing, improving Time to First Token latency for prefix-heavy workloads. Disaggregated serving can also be enabled to further optimize performance. This is achieved by leveraging Inference Gateway, llm-d, and vLLM together, resulting in significant throughput improvements.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Enhanced training and serving performance:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; With more than 1.4 exaflops per GB300 NVL72 system, A4X Max offers a 4x increase in LLM training and serving performance compared to A3 VMs powered by NVIDIA H100 GPUs. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Maximum scalability and parallelization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Based on RDMA over Converged Ethernet (RoCE), A4X Max’s networking fabric delivers low-latency high-performance GPU-to-GPU collectives for distributed training and disaggregated serving workloads. By leveraging a new data-center-scaling design, A4X Max clusters can be 2x larger compared to A4X clusters. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
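&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The request sketch below illustrates the prefix-heavy traffic pattern that prefix-aware load balancing is designed for. It is a sketch, not a reference client: the gateway address and model name are hypothetical placeholders, and it assumes the backing vLLM servers expose their usual OpenAI-compatible API through the gateway.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Sketch of prefix-heavy traffic that benefits from prefix-aware routing.
# The gateway address and model name are hypothetical placeholders; assumes
# the backing vLLM servers expose an OpenAI-compatible API via the gateway.
import requests

GATEWAY = "http://inference-gateway.example.internal/v1/chat/completions"
SYSTEM_PROMPT = "You are a support agent for ExampleCo. Policies: ..."

for question in ["How do I reset my password?", "What is the refund policy?"]:
    resp = requests.post(GATEWAY, json={
        "model": "my-served-model",
        "messages": [
            # Every request shares the same long system prompt. A prefix-aware
            # balancer can route it to a replica whose KV cache already holds
            # that prefix, improving Time to First Token.
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    })
    print(resp.json()["choices"][0]["message"]["content"])
&lt;/pre&gt;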
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The preview of A4X Max instances comes on the heels of our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/g4-vms-powered-by-nvidia-rtx-6000-blackwell-gpus-are-ga"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;new G4 VMs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, and support for NVIDIA Omniverse libraries. Taken together, these offerings underscore our commitment to delivering an end-to-end platform for every AI workload, while our deepening partnership with NVIDIA provides you with a powerful, comprehensive ecosystem to build what's next in AI.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Increased RDMA performance with GKE DRANET&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we’re deploying managed DRANET into production, starting with A4X Max. By enabling topology-aware scheduling of GPUs and RDMA network interface cards, DRANET &lt;/span&gt;&lt;a href="https://github.com/google/dranet/blob/main/site/static/docs/kubernetes_network_driver_model_dranet_paper.pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;boosts bus bandwidth&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for all-gather and all-reduce operations in distributed AI/ML workloads. This translates to improved cost efficiency due to better VM utilization. It does this by scheduling GKE Pods on nodes where the RDMA device and the GPU have the best possible connectivity. DRANET also simplifies RDMA management by making RDMA devices first-class, native resources within GKE. Learn more about DRANET for GKE &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/networking/introducing-managed-dranet-in-google-kubernetes-engine"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE and NVIDIA NeMo Guardrails&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As organizations deploy their AI models into production, they must ensure their safety, security, and responsible behavior. Today, we are announcing the integration of NVIDIA NeMo Guardrails with &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GKE Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, an extension to GKE Gateway for serving generative AI applications.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;GKE Inference Gateway optimizes model serving with features like model-aware routing and autoscaling, while NeMo Guardrails add a critical layer of safety, preventing models from engaging in undesirable topics or responding to malicious prompts. Together, they offer a secure, scalable, and manageable inference solution to speed up your AI initiatives.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI Model Garden to feature NVIDIA Nemotron models&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To give developers greater choice and performance,&lt;/span&gt;&lt;a href="https://cloud.google.com/model-garden"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Model Garden&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; will soon have support for NVIDIA’s Nemotron family of open models as NVIDIA NIM microservices. This integration — starting with the upcoming availability of the NVIDIA Llama Nemotron Super v1.5 model — will give developers and organizations access to the NVIDIA’s latest open-weight models directly within Vertex AI. With a Vertex AI managed deployment, you can rapidly develop and deploy custom AI agents powered by Nemotron models, all while maintaining control over performance, cost, and compliance. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Models deployed through Vertex AI offer the following benefits :&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Granular control&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; over your deployments, with the ability to optimize for performance or cost by selecting from a wide range of machine types and Google Cloud regions. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Robust &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;security&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; by deploying models entirely within your own VPC and adhering to your VPC-SC policies. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Incredible &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;ease of use &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;— you can discover, license, and deploy these cutting-edge models in just a few clicks. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI Training with NVIDIA NeMo Integration&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/new-capabilities-in-vertex-ai-training-for-large-scale-training?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Training&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides the essential control and flexibility enterprises need to adapt foundation models to their proprietary data. To accelerate the creation of highly accurate, proprietary models, we are announcing expanded capabilities in Vertex AI Training that simplify and accelerate the path to developing large-scale models.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Customers benefit from a fully managed and resilient Slurm environment that simplifies large-scale training. Automated resiliency features improve cluster uptime. Our comprehensive data-science tooling removes much of the guesswork from complex model development. Finally, curated and optimized pre-training and post-training recipes built on top of standardized frameworks like NVIDIA NeMo and NeMo-RL empower builders to move from a novel idea to a production-ready, domain-specialized model with greater speed and efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Take the next steps&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These updates enhance the capabilities and flexibility of our Google Cloud platform for running AI workloads. You can choose between the flexibility and control of infrastructure as a service (IaaS) with Google Compute Engine or GKE with Cluster Director; or the fully managed, end-to-end experience of Vertex AI, which provides a secure, scalable, and simplified workflow to train, tune, and manage models.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Together, t&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;hese infrastructure innovations represent a significant step forward in our mission to provide a complete platform for AI development and deployment. The combination of Google Cloud’s infrastructure and NVIDIA’s latest technology provides a solid foundation for building the next generation of AI applications.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get started with the A4X Max preview, please contact your Google Cloud sales representative. Vertex AI Training, meanwhile, has everything you need to transform your models into proprietary assets that define your business advantage. To deploy and manage AI models at scale with enterprise-grade security and efficiency, &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;learn how GKE Inference Gateway can help you serve inference workloads&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. We are excited to see what you will build.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 28 Oct 2025 18:30:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/now-shipping-a4x-max-vertex-ai-training-and-more/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Expanding our NVIDIA partnership: Now shipping A4X Max, Vertex AI Training, and more</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/now-shipping-a4x-max-vertex-ai-training-and-more/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Lohmeyer</name><title>VP and GM, AI and Computing Infrastructure</title><department></department><company></company></author></item></channel></rss>