<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Systems</title><link>https://cloud.google.com/blog/topics/systems/</link><description>Systems</description><atom:link href="https://cloudblog.withgoogle.com/blog/topics/systems/rss/" rel="self"></atom:link><language>en</language><lastBuildDate>Wed, 22 Apr 2026 13:24:46 +0000</lastBuildDate><image><url>https://cloud.google.com/blog/topics/systems/static/blog/images/google.a51985becaa6.png</url><title>Systems</title><link>https://cloud.google.com/blog/topics/systems/</link></image><item><title>Introducing Virgo Network, Google’s scale-out AI data center fabric</title><link>https://cloud.google.com/blog/products/networking/introducing-virgo-megascale-data-center-fabric/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The AI era requires a fundamental rethink of physical cloud architecture — networking, in particular. With foundational model parameters growing exponentially, traditional general-purpose networks are reaching their breaking points. To fuel the next decade of machine learning, Google designed Virgo Network, a new megascale AI data center fabric that embraces a "campus-as-a-computer" philosophy, and that underpins our &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/ai-hypercomputer?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Legacy network designs simply cannot handle some of the constraints of modern AI:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Massive scale:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Training demands now exceed the power and space of a single data center, requiring unified, multi-data-center domains.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Explosive bandwidth growth:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Because foundational model training is heavily network-bound, the required bandwidth per accelerator has surged significantly over the last few years, creating throughput and congestion bottlenecks for older architectures.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Synchronized bursts:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Intense, millisecond-level traffic spikes (figure 1) put immense pressure on network buffers. The outcome is that even a single "straggler" node can throttle the entire cluster’s performance.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;Low latency: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;ML serving requires fast, consistent response times to deliver real-time inference, making strict latency control a critical architectural constraint.&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Sub-millisecond_line-rate_bursts_of_an_A.max-1000x1000.png"
        
          alt="1 Sub-millisecond line-rate bursts of an AI training workload"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="s0df9"&gt;Figure 1: Sub-millisecond line-rate bursts of an AI training workload&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Reimagining the data center network&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Meeting the demands of the AI era requires a fundamental shift away from general-purpose network design towards a specialized flat, low-latency network architecture. To address the unique scale and latency constraints, we leverage our proven &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/networking/speed-scale-reliability-25-years-of-data-center-networking?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Jupiter&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; network for north-south traffic and are introducing a new fabric for east-west communication. The resulting architecture consists of three distinct and specialized layers that operate as one unified compute domain:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Scale-up domain:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A high-bandwidth, low-latency interconnect fabric designed for tightly coupled communication between accelerators within a single pod. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Scale-out accelerator fabric (east-west):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A dedicated accelerator-to-accelerator remote direct memory access (RDMA) fabric optimized for massive horizontal scale across pods. This layer is engineered for deterministic latency and maximum resilience, to provide high “&lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;goodput&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;” for the ML workload.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Jupiter front-end network (north-south):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A high-capacity fabric that provides fast, reliable access to distributed storage and general-purpose compute resources. It ensures that data access does not become a bottleneck for training and serving workloads, and is also used to scale-across multiple sites for very large training runs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This architectural decoupling provides key strategic advantages:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Independent evolution:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We can evolve and upgrade each network domain independently, preventing system-wide disruptions while accelerating the innovation cycle. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Dedicated scale-out bandwidth:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A non-blocking network delivers massive bisectional bandwidth to accelerators for critical training tasks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;ML and network co-design:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The network is built in lockstep with each new generation of ML accelerators, helping ensure the fabric is matched to the hardware it supports.&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_Data_center_network_architecture.max-1000x1000.png"
        
          alt="2 Data center network architecture"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="s0df9"&gt;Figure 2: Data center network architecture&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Introducing Virgo Network: Megascale data center fabric&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Virgo Network is a scale-out fabric designed for the extreme requirements of modern AI workloads. Built on high-radix switches that reduce network layers by allowing more ports per switch, it employs a flat, two-layer non-blocking topology. Compared with traditional datacenter networks, this significantly reduces latency by minimizing network tiers. It features a multi-planar design with independent control domains to connect accelerators (figure 3). The accelerator racks also connect with the Jupiter north-south fabric to access compute and storage services. Together, this streamlined architecture delivers the massive bisection bandwidth and deterministic low latency necessary for both distributed training and serving workloads.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_Megascale_data_center_fabric_Virgo_Netwo.max-1000x1000.png"
        
          alt="3 Megascale data center fabric (Virgo Network)"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="s0df9"&gt;Figure 3: Megascale data center fabric (Virgo Network)&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Virgo Network is the foundation of our next-generation accelerator designs and delivers the following advantages:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Massive fabric scale&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Virgo Network can link 134,000 chips (TPU 8t) with up to 47 petabits/sec of non-blocking bi-sectional bandwidth in a single fabric.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Generational performance leap&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: With up to 4x the bandwidth per accelerator (TPU 8t) over the previous generation, Virgo Network delivers the bandwidth you need to get the full power of every chip. &lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Predictable low latency&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Virgo Network delivers 40% lower unloaded fabric latency for TPUs compared to previous generation leading to more predictable performance for latency sensitive AI workloads.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Improving reliability at scale&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a system supporting hundreds of thousands of chips, hardware failures are inevitable. Because a single faulty component can disrupt a synchronized training job, reliability at scale is a primary focus. To maximize workload goodput, we designed the Virgo Network architecture around fault isolation, deep observability, and the rapid mitigation of hangs and stragglers.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At this scale, system-wide resilience requires a solid network foundation. Virgo Network integrates independent switching planes that provide robust fault isolation, protecting cluster-wide goodput from being degraded by localized hardware failures.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_How_fail-stop_and_fail-slow_impact_MTTR.max-1000x1000.png"
        
          alt="4 How fail-stop and fail-slow impact MTTR"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="s0df9"&gt;Figure 4: How fail-stop and fail-slow impact MTTR&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on this foundation, we optimize the software and orchestration stack to maximize mean-time between interruptions (MTBI) and minimize mean-time to recovery (MTTR) through two primary areas:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Observability:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Reliability at scale requires high-fidelity visibility. We use sub-millisecond telemetry to monitor network systems. This deep visibility allows us to detect transient congestion, optimize buffer management, and pinpoint the root causes of slowdowns across the hardware and software stack.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Identifying stragglers and hangs:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Proactive monitoring is critical for identifying nodes that are experiencing performance degradation (stragglers) or that have stopped responding completely (hangs). By rapidly localizing these bottlenecks, with automated &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/stragglers-in-ai-a-guide-to-automated-straggler-detection?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;straggler&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and newly added hang detection, we accelerate the training job and protect it from localized slowdowns.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
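&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The short sketch below illustrates one simple way to flag stragglers and hangs from per-node step times. The thresholds, telemetry source, and node names are illustrative assumptions for this example only, not the detection logic actually deployed in our fleet.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import statistics

def find_stragglers(step_times_ms, slow_factor=1.2, hang_timeout_ms=60_000):
    """Flag nodes whose latest step is unusually slow (straggler) or missing (hang).

    step_times_ms maps node id to the most recent step duration in milliseconds;
    None means no step completed in the reporting window. Thresholds are
    illustrative, not production values."""
    completed = [t for t in step_times_ms.values() if t is not None]
    baseline = statistics.median(completed)
    report = {}
    for node, t in step_times_ms.items():
        if t is None or t &gt;= hang_timeout_ms:
            report[node] = "hang"
        elif t &gt; baseline * slow_factor:
            report[node] = "straggler"
    return report

print(find_stragglers({"node-0": 810, "node-1": 805, "node-2": 1150, "node-3": None}))
# {'node-2': 'straggler', 'node-3': 'hang'}
&lt;/pre&gt;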
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The foundation of the AI Hypercomputer&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Virgo Network is a reimagined scale-out data center network custom-built for the stringent demands of modern AI workloads. This flat, multi-planar architecture unifies accelerators across pods into a single compute domain, addressing the bandwidth and scale limitations of traditional networks. By providing robust fault isolation directly at the hardware level, Virgo Network serves as the foundation for system-wide resilience, protecting synchronized workloads from localized hardware faults. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ultimately, Virgo Network delivers the scale, predictable latency, and reliability necessary to accelerate the agentic AI era. To learn more about how we are building infrastructure for the future of AI, visit our&lt;/span&gt;&lt;a href="https://cloud.google.com/ai-infrastructure"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; AI infrastructure solutions page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, explore the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/architecture/ai-ml"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;technical documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, or attend the dedicated breakout &lt;/span&gt;&lt;a href="https://www.googlecloudevents.com/next-vegas/session-library?session_id=3913087&amp;amp;name=how-google&amp;amp;" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;session&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; at Google Cloud Next.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 22 Apr 2026 12:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/networking/introducing-virgo-megascale-data-center-fabric/</guid><category>Infrastructure</category><category>AI &amp; Machine Learning</category><category>Google Cloud Next</category><category>Systems</category><category>Networking</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_2_Dark.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Introducing Virgo Network, Google’s scale-out AI data center fabric</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_2_Dark.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/networking/introducing-virgo-megascale-data-center-fabric/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Benny Siman-Tov</name><title>Senior Director Product Management</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Arjun Singh</name><title>Engineering Fellow</title><department></department><company></company></author></item><item><title>AI infrastructure efficiency: Ironwood TPUs deliver 3.7x carbon efficiency gains</title><link>https://cloud.google.com/blog/topics/systems/ironwood-tpus-deliver-37x-carbon-efficiency-gains/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;At Google, we are committed to being &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/sustainability/tpus-improved-carbon-efficiency-of-ai-workloads-by-3x?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;transparent about the environmental impact of our AI infrastructure&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, publishing metrics on the lifetime emissions of our chips — from manufacturing to powering these chips in the data center. 
Today, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;we are updating these metrics for our seventh-generation TPU, Ironwood, which demonstrates an approximately 3.7x improvement in Compute Carbon Intensity (CCI) compared to TPU v5p&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;the previous generation of performance-optimized TPUs&lt;/span&gt;.&lt;/span&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In other words, despite the fact that AI is driving demand for additional compute resources, our ongoing work to optimize AI hardware is helping to improve the energy consumption and emissions of AI workloads.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Measuring AI accelerator efficiency: Compute Carbon Intensity (CCI)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To help manage the environmental impact of AI workloads, we monitor the Compute Carbon Intensity (CCI) of our AI accelerator hardware. CCI is defined in &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=11097303" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;An Introduction to Life-Cycle Emissions of Artificial Intelligence Hardware&lt;/span&gt;&lt;/a&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;sup&gt; &lt;/sup&gt;as the estimated amount of CO2 equivalent emitted for every utilized floating-point operation (CO2e/FLOP). This metric provides a holistic, chip-level view by including both the embodied emissions associated with manufacturing, transportation, and data center construction (Scope 3), as well as the operational emissions associated with running these chips in data centers (Scope 1 and 2).&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Ironwood advantage: high performance, low footprint&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google’s TPU CCI continues to improve with each chip generation. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Drawing from empirical data measured in January 2026, Ironwood demonstrates a remarkable 3.7x &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;improvement&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; in CCI relative to TPU v5p. This accelerates efficiency gains from the 1.2x CCI improvement of TPU v5p relative to TPU v4, and demonstrates continued carbon efficiency optimization of Google’s performance-optimized TPU architecture.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These efficiency gains are driven by outsized compute performance increases between TPU generations relative to growth in machine energy consumption and manufacturing emissions.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; In fact, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;fleetwide measurements demonstrate a 5x improvement in utilized FLOPs across generations, from TPU v5p to Ironwood.&lt;/span&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt; Because the performance denominator in our CCI equation (CO2e/FLOP) is scaling faster than emissions, the net carbon cost per operation drops significantly with every new chip.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Oan2vLj.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: center;"&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Figure 1: Ironwood’s accelerating CCI improvement measured on Google’s performance-optimized TPU cohort, considering January 2026 workloads.&lt;/span&gt;&lt;/span&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Operating Google’s TPU fleet more efficiently&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Updated TPU CCI metrics also offer a direct comparison to the measurement we published in 2025. Specifically, from October 2024 to January 2026, Google’s versatile TPU cohort ran more efficiently than what we reported previously:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;TPU v5e achieved a 43% reduction in total CCI over 15 months, dropping to 228 gCO2e/EFLOP. This was driven by a 72% increase in average utilization.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium, the sixth-generation TPU, saw a 20% reduction in total CCI over the same time period, bringing its emissions intensity down to 125 gCO2e/EFLOP.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
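&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a rough sanity check on the utilization effect above: if total lifecycle emissions for a deployed fleet stay roughly unchanged, CCI scales inversely with utilization, so a 72% utilization increase alone accounts for most of the reported reduction. This is a simplification for illustration, not the published methodology.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Simplified check: with emissions held roughly constant, CCI ~ 1 / utilization.
utilization_gain = 1.72                  # 72% higher average utilization
expected_reduction = 1.0 - 1.0 / utilization_gain
print(f"{expected_reduction:.0%}")       # ~42%, close to the reported 43% for TPU v5e
&lt;/pre&gt;&lt;/div&gt;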
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_HRjRsFh.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: center;"&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Figure 2: Google’s versatile TPU cohort demonstrates deployment efficiency gains for the same TPU generations between October 2024 and January 2026.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;5&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span&gt;&lt;span style="vertical-align: baseline;"&gt;These results demonstrate that Google continues to improve the carbon-efficiency of our AI infrastructure. While the massive scale of AI demand requires a significant and growing amount of power, our innovations allow us to deliver substantially more compute performance for every unit of energy consumed.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Decoupling energy and emissions from performance&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To what can we attribute these improvements? Beyond Ironwood’s raw hardware capabilities, these CCI gains are further enabled by deep software and system-level optimizations across our infrastructure:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Software efficiency (MoE):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The widespread adoption of sparse architectures, such as Mixture of Experts (MoE), routes computation only to necessary parameters. This drastically reduces the active FLOPs required per inference or training step without sacrificing model capacity or quality.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Lower precision math (FP8):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By heavily leveraging 8-bit floating-point (FP8) formats, we effectively double compute throughput and halve memory bandwidth requirements compared to 16-bit formats. This shows that we can maintain output quality while exponentially decreasing the energy cost per mathematical operation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Workload mix and intelligent scheduling:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Advanced fleet orchestration continuously balances the workload mix across our infrastructure. By intelligently scheduling tasks, we ensure high continuous utilization rates, optimize duty cycles, and minimize the carbon penalty of idle power draw.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
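&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The toy example below illustrates the Mixture of Experts point; the expert counts and parameter sizes are hypothetical and do not describe any Google model.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Toy Mixture-of-Experts illustration: only the routed experts are active per token.
total_experts     = 64
active_experts    = 2        # top-k routing
params_per_expert = 1.0e9
shared_params     = 2.0e9    # attention, embeddings, router, etc.

total_params  = shared_params + total_experts * params_per_expert
active_params = shared_params + active_experts * params_per_expert

# Forward-pass FLOPs per token are roughly 2 * active parameters.
print(f"active fraction per token: {active_params / total_params:.1%}")   # ~6.1%
# Model capacity reflects the full parameter count, while per-token compute
# (and energy) tracks the much smaller active set.
&lt;/pre&gt;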
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Scale sustainably with Google Cloud&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI’s trajectory requires infrastructure that can scale exponentially without an equivalent surge in carbon emissions. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;The 3.7x carbon efficiency improvement from TPU v5p to Ironwood demonstrates that we can achieve greater compute density while minimizing the growth of our energy and environmental footprint through deliberate hardware and software codesign.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; To learn more and get started with Ironwood, register your interest with &lt;/span&gt;&lt;a href="https://cloud.google.com/resources/ironwood-tpu-interest?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;this form&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sub&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;1. Following the methodology published in an &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=11097303" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;August 2025 technical report&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;, we quantified the full lifecycle emissions of TPU hardware as a point-in-time snapshot across Google’s generations of TPUs as of January 2026. The functional unit for this study is one AI computer deployed in the data center, which includes one or more accelerator trays (containing TPUs) connected to one host tray (i.e., a computing server). Peripheral components beyond the tray (e.g., rack, shelf, and network equipment) and auxiliary computing and storage resources are excluded from the calculation of embodied and operational emissions. We include the electricity used in data center cooling in operational emissions. To estimate operational emissions from electricity consumption of running workloads, we used a one-month sample of observed machine power data from our entire TPU fleet, applying Google’s 2024 average fleetwide carbon intensity. To estimate embodied emissions from manufacturing, transportation, and retirement, we performed a life-cycle assessment of the hardware. Data center construction emissions were estimated based on Google’s disclosed 2024 carbon footprint. These findings do not represent model-level emissions, nor are they a complete quantification of Google’s AI emissions. Based on the TPU location of a specific workload, CCI results of specific workloads may vary.&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;2. The authors would like to thank and acknowledge the co-authors of this paper for their important contributions to enable these results: Ian Schneider, Hui Xu, Stephan Benecke, Parthasarathy Ranganathan, and Cooper Elsworth.&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;3. This comparison considers the utilized FLOPS (BF16) between deployed TPU v5p and Ironwood chips in Google’s fleet in January 2026. This trend is consistent with the improvement in peak FLOPS (BF16) between v5p (459 TFLOPS) and Ironwood (2,307 TFLOPS).&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;4. The GHG protocol offers two accounting standards for operational emissions. Results presented here consider market-based emissions, which includes the impact of carbon-free energy purchases. Location-based accounting, which excludes carbon-free energy purchases, would raise operational CCI to 793, 712, and 195 gCO2e/EFLOP, respectively. The ratio of CCI improvements would be at a similar level, and Ironwood’s embodied CCI would drop from 23% to 8% of its total CCI.&lt;br/&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;5. 
To ensure a fair comparison across varying TPU utilizations, this analysis replicates the propensity score weighting methodology from the &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/iel8/40/11236092/11097303.pdf" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;August 2025 technical report&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; and compares January 2026 results to the results published in 2025. This statistical technique adjusts for duty cycle variations to balance the comparison of TPUs during a given time period. This empirical methodology results in small variations in calculated CCI between temporal periods, reflecting fluctuations in real-world energy consumption and hardware utilization across the global infrastructure. &lt;/span&gt;&lt;/sub&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 06 Apr 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/ironwood-tpus-deliver-37x-carbon-efficiency-gains/</guid><category>Compute</category><category>Sustainability</category><category>TPUs</category><category>Systems</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>AI infrastructure efficiency: Ironwood TPUs deliver 3.7x carbon efficiency gains</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/ironwood-tpus-deliver-37x-carbon-efficiency-gains/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Keguo (Tim) Huang</name><title>Senior Data Scientist, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>David Patterson</name><title>Google Distinguished Engineer, Google</title><department></department><company></company></author></item><item><title>Firefly: Illuminating the path to nanosecond-level clock sync in the data center</title><link>https://cloud.google.com/blog/products/networking/understanding-the-firefly-clock-synchronization-protocol/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;From the high-frequency trading floors of Wall Street to orchestrating cloud data centers, the ability to synchronize events with nanosecond accuracy is critical. Yet, achieving this level of temporal precision across thousands of interconnected devices in a modern data center is fraught with challenges like clock drift, network jitter, and path asymmetries. And doing so on cloud-hosted infrastructure has traditionally been impossible, preventing a certain class of applications from running there. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is where Firefly, a clock synchronization system developed by researchers and engineers at Google, comes in. Firefly isn't just a clock synchronization protocol; it's a software-driven approach that combines theoretical insights and practical engineering to deliver ultra-accurate, scalable, and cost-effective time synchronization on commodity hardware within a demanding data center environment.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The nanosecond race: Why precise timing matters&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Precise clock synchronization is the foundation of distributed systems. It is non-negotiable in financial exchanges, where regulatory requirements mandate sub-100µs external synchronization to Coordinated Universal Time, or UTC, and fairness demands sub-10ns internal clock synchronization. In high-frequency trading, a minuscule timing advantage can translate to significant financial gains, making accurate timestamping critical for market integrity. Beyond finance, numerous data center operations, including database consistency, distributed logging, virtual machine management, and network telemetry, rely on accurate temporal ordering of events. And as data centers scale, the need for a robust, scalable synchronization solution becomes even more important.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But achieving nanosecond-level synchronization in a dynamic data center environment is difficult. Several factors conspire to undermine precision:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Clock drift:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Crystal oscillators, which are fundamental to all clocks, have inherent imperfections that cause them to gradually deviate over time. Although these deviations were considered minor previously, they are substantial when targeting sub-10ns.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Jitter:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Network components such as switches and network interface cards (NICs) introduce unpredictable delays. These delays, often stemming from queuing in network buffers or the intricate processing of packets, can manifest as jitter, disrupting the timing of synchronization messages.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Asymmetry:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The network path between two devices is rarely symmetrical. Differences in cable lengths, the number of hops, or the internal workings of network equipment can cause signals to take different amounts of time to travel in opposite directions. This asymmetry can introduce significant errors when estimating one-way delays and clock offsets.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Scalability:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; As data centers expand to house tens of thousands of servers, any synchronization solution must be able to scale efficiently without becoming a bottleneck or requiring disproportionate resources.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Fault tolerance:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; In a distributed system, failures are inevitable. A synchronization protocol must be resilient to the loss or misbehavior of individual nodes or network links, so that the overall synchronization accuracy is not compromised.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
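&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To see how path asymmetry corrupts offset estimates, consider the classic two-way timestamp exchange used by protocols such as NTP and PTP. This is standard background math with made-up numbers, not Firefly’s specific probe format.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Classic two-way exchange: A sends at t1, B receives at t2 and replies at t3,
# and A receives at t4. All values are nanoseconds on each device's local clock.
def estimate(t1, t2, t3, t4):
    offset = ((t2 - t1) + (t3 - t4)) / 2    # estimate of B's clock minus A's clock
    rtt = (t4 - t1) - (t3 - t2)
    return offset, rtt

# Symmetric paths (500 ns each way), true offset +100 ns: the estimate is exact.
print(estimate(0, 600, 1600, 2000))    # (100.0, 1000)

# Asymmetric paths (700 ns out, 300 ns back), same true offset: the estimate is
# biased by half the asymmetry, i.e. it reports +300 ns instead of +100 ns.
print(estimate(0, 800, 1800, 2000))    # (300.0, 1000)
&lt;/pre&gt;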
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Firefly: Bridging software and theory&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Firefly uses a multi-faceted strategy to tackle these challenges, distinguishing itself from prior synchronization protocols. Its core innovations lie in its architectural design and its theoretical underpinnings.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/1-architecture_v1.jpg"
        
          alt="1-architecture"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;1. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Layered synchronization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Firefly employs a novel layered synchronization technique. Instead of relying on a central clock, which can be a single point of failure or introduce delays, it first establishes tight internal synchronization amongst NICs within the data center. Each NIC in the network constantly communicates with a set of its peers, comparing times and making adjustments. From this "swarm" of devices emerges a highly stable and accurate consensus time that the entire group agrees upon. This internal synchronization is rapid and robust, effectively shielding it from external timing disturbances. Concurrently, Firefly synchronizes the entire swarm to UTC. Decoupling of these two processes is crucial, as it prevents external factors like time-server jitter or drift from directly impacting internal synchronization.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;2. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Distributed consensus over Random graphs:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Unlike traditional hierarchical approaches that can be brittle and susceptible to single points of failure, Firefly uses a distributed consensus algorithm built on a d-regular random graph. This means each NIC communicates with a randomly selected set of 'd' peers. Theoretical analysis, as presented in &lt;/span&gt;&lt;a href="https://dl.acm.org/doi/10.1145/3718958.3750502" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;the Firefly research paper&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, demonstrates that such random graphs offer significant advantages:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Faster convergence: Random graphs promote a more rapid dissemination of clock information across the network, leading to quicker synchronization.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Scalability: The theoretical bounds show that random graphs can maintain synchronization accuracy even as the size of the network grows, provided the number of peers ('d') scales logarithmically with the total number of nodes.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Resilience to asymmetry: The diverse probing paths inherent in random graphs help to average out and mitigate the impact of path asymmetries.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
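&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The toy simulation below illustrates the peer structure described above: random peer selection plus repeated averaging toward neighbors. It is a deliberately simplified illustration with made-up parameters, not the Firefly algorithm itself, which uses a d-regular random graph, filtered one-way measurements, and careful drift modeling.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import random

# Toy consensus over a random peer graph: each node picks d random peers and
# repeatedly nudges its clock toward the mean of its peers' clocks.
def simulate(n=1000, d=8, rounds=30):
    random.seed(0)
    peers = {i: random.sample([j for j in range(n) if j != i], d) for i in range(n)}
    clocks = [random.uniform(-500.0, 500.0) for _ in range(n)]   # initial offsets, ns
    for _ in range(rounds):
        clocks = [0.5 * clocks[i] + 0.5 * sum(clocks[p] for p in peers[i]) / d
                  for i in range(n)]
    return max(clocks) - min(clocks)

print(f"spread across nodes after averaging: {simulate():.3f} ns")
&lt;/pre&gt;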
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;3. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Mitigating jitter and asymmetry in practice: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond the theoretical advantages of random graphs, Firefly incorporates practical techniques to further refine accuracy:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;RTT filtering: By analyzing round-trip time (RTT) measurements, Firefly can identify and discard probe samples that are likely affected by queuing jitter, thereby improving the accuracy of delay estimations.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Path profiling: Firefly actively probes network paths to identify and favor those with minimal asymmetry. This proactive approach helps to select the most reliable paths for synchronization.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Leveraging hardware: Where available, Firefly can utilize features like &lt;/span&gt;&lt;a href="https://docs.commscope.com/bundle/fastiron-10010-managementguide/page/GUID-A2A87D89-1224-4694-817A-D91F70D5F850.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Transparent Clock (TC)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in network switches to accurately account for in-switch delays, further reducing measurement error.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
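&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A minimal sketch of the RTT-filtering idea appears below; the 5% margin and the sample values are arbitrary illustrative choices, not Firefly parameters.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Keep only probe samples whose RTT is close to the minimum RTT seen on a peer
# path; larger RTTs likely include queuing jitter and yield noisier offsets.
def filter_probes(samples, margin=0.05):
    """samples: list of (rtt_ns, offset_estimate_ns) tuples for one peer path."""
    best_rtt = min(rtt for rtt, _ in samples)
    cutoff = best_rtt * (1.0 + margin)
    return [(rtt, off) for rtt, off in samples if rtt &lt;= cutoff]

probes = [(1000, 12), (1010, 11), (1400, 35), (1030, 13), (2500, 90)]
print(filter_probes(probes))    # [(1000, 12), (1010, 11), (1030, 13)]
&lt;/pre&gt;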
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;4. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Robustness and fault tolerance:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Firefly’s use of distributed consensus, combined with its averaging mechanisms, makes it inherently resilient to failures. By not relying on a single time server or a fixed hierarchical structure, the system can gracefully handle the loss or misbehavior of individual nodes.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Performance in the real world&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The results discussed in our &lt;/span&gt;&lt;a href="https://dl.acm.org/doi/10.1145/3718958.3750502" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Firefly research paper&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; are compelling:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Internal synchronization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Firefly consistently achieves sub-10ns NIC-to-NIC synchronization when used in conjunction with Google's latest data center fabric technology. This can be used to determine order of events like packets, logs, remote procedure calls (RPCs) across machines.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;External synchronization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The system also delivers significantly better synchronization to UTC than the 100µs regulatory requirement for financial exchanges.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2-graph_h5KX17K.max-1000x1000.jpg"
        
          alt="2-graph"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="ry130"&gt;The offset between a pair of clocks that are six hops away in a Firefly-synced network, measured by an oscilloscope via 1 pulse per second.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The accompanying video illustrates the accuracy of NIC-to-NIC synchronization, as quantified by an oscilloscope utilizing a one-pulse-per-second (1PPS) signal from the NICs. Each row corresponds to a NIC clock, with the rising edge indicating the precise moment the NIC clock attains an integer second. The oscilloscope observations confirm that all measured NICs exhibit close synchronization, maintaining alignment within a few nanoseconds.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=KB3z34OO9QU"
      data-glue-modal-trigger="uni-modal-KB3z34OO9QU-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        

        &lt;div class="article-video__aspect-image"
          style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/maxresdefault_GLx4Roj.max-1000x1000.jpg);"&gt;
          &lt;span class="h-u-visually-hidden"&gt;Firefly: Sub-10ns NIC-to-NIC clock synchronization for datacenters&lt;/span&gt;
        &lt;/div&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal-KB3z34OO9QU-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="KB3z34OO9QU"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=KB3z34OO9QU"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These results are particularly impressive given that Firefly operates purely in software on commodity hardware, avoiding the need for expensive, specialized synchronization equipment. This makes ultra-accurate time synchronization accessible to a broader range of data center applications.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A foundation for future applications&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Firefly's success in delivering nanosecond-level accuracy in a scalable and cost-effective manner has far-reaching implications:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Democratizing high-precision timing: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Firefly allows cloud-hosted financial services that traditionally rely on expensive dedicated hardware, to achieve the required precision using standard cloud infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Enabling new applications:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The availability of precise, synchronized clocks across data center devices can unlock new possibilities in areas like fine-grained network telemetry and congestion control, time-coordinated distributed systems, and deterministic fabric for ML workloads.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Transforming data center operations:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By creating a tightly integrated and precisely timed computing entity, Firefly can enhance data centers’ overall efficiency, reliability, and performance.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In conclusion, Firefly represents a significant advancement in the field of clock synchronization. By ingeniously combining theoretical insights into graph theory and consensus algorithms with practical network engineering techniques, it overcomes the long-standing challenges of achieving nanosecond-level precision in complex, distributed environments. As data centers continue to evolve, systems like Firefly will be instrumental in building the high-performance, reliable, and fair infrastructure of the future.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;2026 AI Agent Trends in Financial Services&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f9fdc2ea2e0&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;Read it now.&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;https://cloud.google.com/resources/content/ai-agent-trends-financial-services-2026&amp;#x27;), (&amp;#x27;image&amp;#x27;, &amp;lt;GAEImage: FSI_Confirmation email_500x450&amp;gt;)])]&amp;gt;&lt;/dd&gt;
</description><pubDate>Mon, 23 Feb 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/networking/understanding-the-firefly-clock-synchronization-protocol/</guid><category>Infrastructure</category><category>Systems</category><category>Networking</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Firefly: Illuminating the path to nanosecond-level clock sync in the data center</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/networking/understanding-the-firefly-clock-synchronization-protocol/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Rohit Dalal</name><title>Product Manager, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yuliang Li</name><title>Software Engineer</title><department></department><company></company></author></item><item><title>At Google, the future is multiarch; AI and automation are helping us get there</title><link>https://cloud.google.com/blog/topics/systems/using-ai-and-automation-to-migrate-between-instruction-sets/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google Axion processors, our first custom Arm®-based CPUs, mark a major step in delivering both performance and energy efficiency for Google Cloud customers and our first-party services, providing up to 65% better price-performance and up to 60% better energy efficiency than comparable instances on Google Cloud. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We put Axion processors to the test: running Google production services. Now that our clusters contain both x86 and Axion Arm-based machines, Google's production services are able to run tasks simultaneously on multiple instruction-set architectures (ISAs). Today, this means&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; most binaries that compile for x86 now need to compile to both x86 and Arm at the same time &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;— no small thing when you consider that the Google environment includes over 100,000 applications! &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We recently published a preprint of a paper called "&lt;/span&gt;&lt;a href="https://arxiv.org/abs/2510.14928" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Instruction Set Migration at Warehouse Scale&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;" about our migration process, in which we analyze 38,156 commits we made to Google's giant &lt;/span&gt;&lt;a href="https://research.google/pubs/why-google-stores-billions-of-lines-of-code-in-a-single-repository/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;monorepo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Google3. To make a long story short, the paper describes the combination of hard work, automation, and AI we used to get to where we are today. We currently serve Google services in production on Arm and x86 simultaneously, including YouTube, Gmail, and BigQuery, and we have migrated more than &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;30,000 applications to Arm&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, with Arm hardware fully subscribed and more servers deployed each month.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let's take a brief look at two steps on our journey to make Google multi-architecture, or ‘multiarch’: an analysis of migration patterns, and the use of AI to port the code. For more, be sure to read the entire paper. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Migrating all of Google's services to multiarch&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Going into a migration from x86-only to Arm &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;and&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; x86, both the multiarch team and the application owners assumed that we would be spending time on architectural differences such as floating-point drift, concurrency, platform-specific intrinsics, and performance.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At first, we migrated some of our top jobs like F1, Spanner, and Bigtable using typical software practices, complete with weekly meetings and dedicated engineers. In this early period, we found evidence of the above issues, but not nearly as many as we expected. It turns out modern compilers and tools like sanitizers have shaken out most of the surprises. Instead, we spent the majority of our time working on issues like:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;fixing tests that broke because they overfit to our existing x86 servers (see the sketch after this list)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;updating intricate build and release systems, usually for our oldest and highest-traffic services&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;resolving rollout issues in production configurations&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;taking care to avoid destabilizing critical systems &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
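&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a purely illustrative example of the first item above (a test that overfits to one architecture) and its portable rewrite, here is a minimal sketch; the directory layout and helper function are hypothetical, not Google's actual build system:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Illustrative only: a test that hard-codes the x86 answer breaks as soon as
# the same test runs on an Arm machine; the portable version asserts the
# invariant that actually matters.
import platform
import unittest

def variant_dir():
    """Per-architecture output directory (hypothetical layout)."""
    arch = platform.machine()
    if arch in ("x86_64", "AMD64"):
        return "bin/x86_64"
    if arch in ("aarch64", "arm64"):
        return "bin/aarch64"
    raise RuntimeError("unsupported architecture: " + arch)

class BuildLayoutTest(unittest.TestCase):
    def test_overfit_to_x86(self):
        # Brittle: assumes every machine running the test is x86.
        self.assertEqual(variant_dir(), "bin/x86_64")

    def test_portable(self):
        # Robust: passes on x86 and Arm alike.
        self.assertIn(variant_dir(), ("bin/x86_64", "bin/aarch64"))
&lt;/code&gt;&lt;/pre&gt;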
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Moving a dozen applications to Arm this way absolutely worked, and we were proud to get things running on &lt;/span&gt;&lt;a href="https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Borg&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, our cluster management system. As one engineer remarked, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"Everyone fixated on the totally different toolchain, and [assumed] surely everything would break.  The majority of the difficulty was configs and boring stuff." &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;And yet, it's not sufficient to migrate a few big jobs and be done. Although &lt;/span&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44271.pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;~60% of our running compute is in our top 50 applications&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, the curve of usage across the remaining applications in Google's monorepo is relatively flat. The more jobs that can run on multiple architectures, the easier it is for Borg to fit them efficiently into cells. For good utilization of our Arm servers, then, we needed to address this long list of the remaining 100,000+ applications. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The multiarch team could not effectively reach out to so many application owners; just setting up the meetings would have been cost-prohibitive! Instead, we have relied on automation, helping to minimize involvement from the application teams themselves.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Automation tools&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We had many sources of automation to help us, some of which we already used widely at Google before we started the multiarch migration. These include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://abseil.io/resources/swe-book/html/ch22.html" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Rosie&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;which lets us programmatically generate large numbers of commits and shepherd them through the code review process. For example, the commit could be one line to enable Arm in a job's Blueprint: "&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;arm_variant_mode = ::blueprint::VariantMode::VARIANT_MODE_RELEASE&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;"&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://github.com/google/sanitizers" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Sanitizers&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; and fuzzers, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;which catch common differences in execution between x86 and Arm (e.g., data races that are hidden by x86's TSO memory model). Catching these kinds of issues ahead of time avoids non-deterministic, hard-to-debug behavior when recompiling to a new ISA.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Continuous Health Monitoring Platform (CHAMP), &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;which is a new automated framework for rolling out and monitoring multiarch jobs. It automatically evicts jobs that cause issues on Arm, such as crash-looping or exhibiting very slow throughput, for later offline tuning and debugging (a rough sketch of this policy follows the list).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
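&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;CHAMP itself is an internal system, but the eviction policy described in the last item might look roughly like the following; every interface, name, and threshold here is a hypothetical stand-in:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Rough, assumption-laden sketch of a CHAMP-style check: jobs rolled out to
# Arm are watched, and any job that crash-loops or runs far slower than its
# x86 baseline is evicted and queued for offline debugging.
from dataclasses import dataclass

@dataclass
class JobHealth:
    name: str
    restarts_last_hour: int
    throughput_ratio: float   # Arm throughput divided by the x86 baseline

MAX_RESTARTS = 5              # illustrative thresholds
MIN_THROUGHPUT_RATIO = 0.5

def should_evict(h: JobHealth) -&gt; bool:
    return h.restarts_last_hour &gt; MAX_RESTARTS or h.throughput_ratio &lt; MIN_THROUGHPUT_RATIO

def monitor(jobs, evict, flag_for_debugging):
    """One monitoring pass over jobs recently moved to Arm."""
    for h in jobs:
        if should_evict(h):
            evict(h.name)                # give the job back to x86 capacity
            flag_for_debugging(h.name)   # owner tunes and debugs offline
&lt;/code&gt;&lt;/pre&gt;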
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We also began using an AI-based migration tool called CogniPort — more on that below. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Analysis&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The 38,156 commits to our code monorepo constituted most of the commits across the entire ISA migration project, from huge jobs like Bigtable to myriad tiny ones. To analyze these commits, we passed the commit messages and code diffs, in groups of 100, into the 1M-token context window of the Gemini Flash LLM, generating 16 categories of commits in four overarching groups.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
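&lt;div class="block-paragraph"&gt;&lt;p&gt;As a rough illustration of this first analysis pass (the real pipeline and model client are internal; generate_text below is a hypothetical stand-in for a Gemini Flash call):&lt;/p&gt;&lt;pre&gt;&lt;code&gt;# Illustrative sketch: batch commits 100 at a time and ask the model to
# propose candidate categories of changes. Commit dicts with "message" and
# "diff" keys are assumed; generate_text() is a placeholder for an LLM call.
def batches(items, size=100):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def propose_categories(commits, generate_text):
    proposed = set()
    for group in batches(commits, 100):
        prompt = (
            "Here are up to 100 commits (message and diff) from an x86-to-Arm "
            "migration. Propose short category names describing the kind of "
            "change each commit makes, one category per line.\n\n"
        )
        prompt += "\n---\n".join(c["message"] + "\n" + c["diff"] for c in group)
        for line in generate_text(prompt).splitlines():
            if line.strip():
                proposed.add(line.strip())
    return sorted(proposed)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;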
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_MLZW4Y1.max-1000x1000.jpg" alt="image3"&gt;
      &lt;figcaption class="article-image__caption"&gt;&lt;p data-block-key="c1b1y"&gt;Figure 1: Commits fall into four overarching groups.&lt;/p&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p data-block-key="pfifc"&gt;Once we had a final list, we ran the commits through the model again and had it assign one of these 16 categories to each of them (as well as an additional "Uncategorized" category, which improved the stability of the categorization by catching outliers).&lt;/p&gt;&lt;/div&gt;
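&lt;div class="block-paragraph"&gt;&lt;p&gt;A minimal sketch of that second, labeling pass under the same assumptions as above; the category names shown are placeholders, not the actual 16-entry taxonomy:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;# Label each commit with exactly one category; anything off-list falls back
# to "Uncategorized", which keeps the classification stable against outliers.
CATEGORIES = ["Test adaptation", "Code adaptation", "Build and release",
              "Configuration"]  # placeholder names; the real taxonomy has 16

def classify_commit(commit, generate_text):
    prompt = (
        "Assign exactly one of these categories to the commit below: "
        + ", ".join(CATEGORIES)
        + "\n\nCommit message:\n" + commit["message"]
        + "\n\nDiff:\n" + commit["diff"]
        + "\n\nAnswer with the category name only."
    )
    answer = generate_text(prompt).strip()
    return answer if answer in CATEGORIES else "Uncategorized"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;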
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_DDGyjo7.max-1000x1000.jpg" alt="Screenshot 2025-10-21 at 11.19.29 AM"&gt;
      &lt;figcaption class="article-image__caption"&gt;&lt;p data-block-key="vfpaf"&gt;Figure 2: Code examples in the first two categories. More examples are available in the &lt;a href="https://arxiv.org/abs/2510.14928"&gt;paper&lt;/a&gt;.&lt;/p&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p data-block-key="cgy3a"&gt;Altogether, this analysis covered about 700K changed lines of code. We plotted the timeline of our ISA migration as the number of lines of code changed per day or per month, normalized, over time.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image2_go6bg5V.max-1000x1000.png" alt="image2"&gt;
      &lt;figcaption class="article-image__caption"&gt;&lt;p data-block-key="x2sqj"&gt;Figure 3: CLs by category by time, normalized.&lt;/p&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p data-block-key="cgy3a"&gt;As you can see, as we started building our multiarch toolchain, the largest set of commits was in tooling and test adaptation. Over time, a larger fraction of commits went to code adaptation, aligned with the first few large applications that we migrated. During this phase, the focus was on updating code in shared dependencies and addressing common issues in code and tests as we prepared for scale. In the final phase of the process, almost all commits were to configuration files and supporting processes. We also saw that, in this later phase, the number of merged commits rapidly increased, capturing the scale-up of the migration to the whole repository.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_Commits_by_category_over_time_1200.max-1000x1000.png" alt="4 Commits by category over time 1200"&gt;
      &lt;figcaption class="article-image__caption"&gt;&lt;p data-block-key="2nodr"&gt;Figure 4: CLs by category by time, in raw counts.&lt;/p&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p data-block-key="cgy3a"&gt;It’s worth noting that, overall, most migration-related commits are small. The largest commits usually touch very large lists or configuration files, rather than reflecting inherently complex or intricate changes to individual files.&lt;/p&gt;&lt;h3 data-block-key="9or0c"&gt;&lt;b&gt;Automating ISA migrations with AI&lt;/b&gt;&lt;/h3&gt;&lt;p data-block-key="57hi7"&gt;Modern generative AI techniques represent an opportunity to automate the remainder of the ISA migration process. We built an agent called &lt;b&gt;CogniPort&lt;/b&gt; that aims to close this gap. CogniPort operates on build and test errors: if, at any point in the process, an Arm library, binary, or test fails to build, or a test fails with an error, the agent steps in and aims to fix the problem automatically. As a first step, we have already used CogniPort's Blueprint editing mode to generate migration commits that do not lend themselves to simple changes.&lt;/p&gt;&lt;p data-block-key="3ehn3"&gt;The agent consists of three nested agentic loops, shown below. Each loop invokes an LLM to produce one step of reasoning and a tool invocation; the tool is then executed and its outputs are attached to the agent's context.&lt;/p&gt;&lt;/div&gt;
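&lt;div class="block-paragraph"&gt;&lt;p&gt;As a minimal sketch of one such loop (the LLM step, tool names, and step budget here are hypothetical stand-ins, not CogniPort's actual implementation):&lt;/p&gt;&lt;pre&gt;&lt;code&gt;# Minimal agentic loop: each iteration asks the model for one reasoning step
# plus a tool invocation, runs the tool, and appends the output to the context.
def agent_loop(goal, llm_step, tools, max_steps=20):
    context = ["Goal: " + goal]
    for _ in range(max_steps):
        reasoning, tool_name, args = llm_step(context)
        context.append("Reasoning: " + reasoning)
        if tool_name == "done":
            return True, context            # agent believes the goal is met
        if tool_name == "give_up":
            return False, context           # agent concluded it cannot fix this
        output = tools[tool_name](**args)   # e.g. build, run_test, edit_file
        context.append(tool_name + " output: " + str(output))
    return False, context                   # step budget exhausted
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;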
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_cCFsU5D.max-1000x1000.png" alt="image1"&gt;
      &lt;figcaption class="article-image__caption"&gt;&lt;p data-block-key="2nodr"&gt;Figure 5: CogniPort&lt;/p&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p data-block-key="cgy3a"&gt;The outermost agent loop is an orchestrator that repeatedly calls the two other agents, the build-fixer agent and the test-fixer agent. The build-fixer agent tries to build a particular target and makes modifications to files until the target builds successfully or the agent gives up. The test-fixer agent tries to run a particular test and makes modifications until the test succeeds or the agent gives up (and in the process, it may use the build-fixer agent to address build failures in the test).&lt;/p&gt;&lt;h3 data-block-key="9793h"&gt;Testing CogniPort&lt;/h3&gt;&lt;p data-block-key="etovg"&gt;While we only recently scaled up CogniPort usage to high levels, we had the opportunity to more formally test its behavior by taking historical commits from the dataset above that were created without AI assistance. Focusing on Code &amp;amp; Test Adaptation (categories 1-8) commits that we could cleanly roll back (not all of the other categories were suitable for this approach), we generated a benchmark set of &lt;b&gt;245 commits&lt;/b&gt;. We then rolled the commits back and evaluated whether the agent was able to fix them.&lt;/p&gt;&lt;/div&gt;
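&lt;div class="block-paragraph"&gt;&lt;p&gt;A simple way to picture that evaluation, assuming hypothetical helpers for reverting commits, running the agent, and re-running tests (none of these are the real harness):&lt;/p&gt;&lt;pre&gt;&lt;code&gt;# Sketch of the rollback-and-retry benchmark: reintroduce each historical Arm
# failure, let the agent try to fix it, and score by whether the test passes.
def evaluate(benchmark_commits, revert, run_agent, tests_pass, restore):
    fixed = 0
    for commit in benchmark_commits:
        revert(commit)                       # bring back the original failure
        run_agent(commit.affected_targets)   # orchestrator: build- and test-fixer
        if tests_pass(commit.affected_tests):
            fixed += 1
        restore(commit)                      # reset the tree for the next case
    return fixed / len(benchmark_commits)    # about 0.30 on the 245-commit set
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;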
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/6_Cogniport.max-1000x1000.png" alt="6 Cogniport"&gt;
      &lt;figcaption class="article-image__caption"&gt;&lt;p data-block-key="2nodr"&gt;Figure 6: CogniPort results&lt;/p&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Despite using no special prompts or other optimizations, the early results were very encouraging: CogniPort successfully fixed previously failing tests 30% of the time. It was particularly effective at test fixes, platform-specific conditionals, and data representation fixes. We're confident that as we invest in further optimizations of this approach, we will be even more successful.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A multiarch future&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;From here, we still have tens of thousands more applications to address with automation. To cover future code growth, all new applications are designed to be multiarch by default. We will continue to use CogniPort to fix tests and configurations, and we will also work with application owners on trickier changes. (One lesson of this project is how well owners tend to know their code!)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Yet, we’re increasingly confident that we can drive Google's monorepo towards &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;architecture neutrality&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; for production services, for a variety of reasons:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;All of the code used for production services is visible in a vast monorepo (&lt;/span&gt;&lt;a href="https://research.google/pubs/why-google-stores-billions-of-lines-of-code-in-a-single-repository/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;still&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Most of the structural changes we need to build, run, and debug multiarch applications are done.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Existing automation like Rosie and the recently developed CHAMP allows us to keep expanding release and rollout targets without much intervention on our part.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Last but not least, LLM-based automation will allow us to address much of the remaining long tail of applications for a multi-ISA Google fleet. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To read even more about what we learned, don't miss the &lt;/span&gt;&lt;a href="https://arxiv.org/abs/2510.14928" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;paper itself&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. And to learn about our chip designs and how we’re operating a more sustainable cloud, you can read about Axion at &lt;/span&gt;&lt;a href="http://g.co/cloud/axion" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;g.co/cloud/axion&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sub&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;This blog post and the associated paper represent the work of a very large team. The paper authors are Eric Christopher, Kevin Crossan, Wolff Dobson, Chris Kennelly, Drew Lewis, Kun Lin, Martin Maas, Parthasarathy Ranganathan, Emma Rapati, and Brian Yang, in collaboration with &lt;/span&gt;&lt;a href="https://arxiv.org/html/2510.14928v1#S8" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;dozens of other Googlers&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; working on our Arm porting efforts.&lt;/span&gt;&lt;/sub&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 21 Oct 2025 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/using-ai-and-automation-to-migrate-between-instruction-sets/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><category>Systems</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>At Google, the future is multiarch; AI and automation are helping us get there</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/using-ai-and-automation-to-migrate-between-instruction-sets/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Parthasarathy Ranganathan</name><title>VP, Engineering Fellow</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Wolff Dobson</name><title>Developer Relations Engineer</title><department></department><company></company></author></item><item><title>Agile AI architectures: A fungible data center for the intelligent era</title><link>https://cloud.google.com/blog/topics/systems/agile-data-centers-and-systems-to-enable-ai-innovations/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It’s not hyperbole to say that AI is transforming all aspects of our lives: human health, software engineering, education, productivity, creativity, entertainment… Consider just a few of the developments from Google this past year: &lt;/span&gt;&lt;a href="https://store.google.com/intl/en/ideas/articles/magic-cue/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Magic Cue&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; on the Pixel 10 for more personal, proactive, and contextually-relevant assistance; our viral &lt;/span&gt;&lt;a href="https://aistudio.google.com/models/gemini-2-5-flash-image" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Nano Banana Gemini 2.5 Flash&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; image generation; Code Assist for developer productivity; and AlphaFold, which won its creators the Nobel prize for chemistry. We like to joke that the past year in AI has been an amazing decade! &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Underpinning all these advances in AI are equally amazing advances in the computing infrastructure powering AI. If AI researchers are like space explorers discovering new worlds, then systems and infrastructure designers are the &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/document/10315012" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ones building the rockets&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. But keeping up with the demands of AI services will require even more from us. At Google I/O earlier this year, we announced &lt;/span&gt;&lt;a href="https://blog.google/technology/ai/io-2025-keynote/#google-beam" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;nearly 50X annual growth&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the monthly tokens processed by Gemini models, hitting 480 trillion tokens per month. Since then we have seen an additional 2X growth, hitting nearly a quadrillion monthly tokens. Other statistics paint a similar picture: AI accelerator consumption has grown by 15X in the last 24 months; our Hyperdisk ML data has grown 37X since GA; and we’re seeing more than 5 billion AI-powered retail search queries per month. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;With great AI comes great computing&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This kind of growth brings with it new challenges. When planning for data centers and systems, we are accustomed to long lead times, paralleling the long time to build out hardware. However, AI demand projections are now changing dynamically and dramatically, creating a significant divergence in supply and demand. This mismatch requires new architectures and system design approaches that can respond to extreme volatility and growth.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Rapid technology innovations are essential, but must be carefully managed across the stack. For example, each generation of AI hardware (like TPUs and GPUs) has introduced new features and functionality, but also new power, rack, networking, and cooling requirements. The rate of introduction of these new generations is also on the rise, making it hard to build a coherent end-to-end system that can accommodate such a vast rate of change. Further, changes in form factors, board densities, networking topologies, power architectures, liquid cooling solutions, etc., all incrementally compound heterogeneity, so that when taken together, there is a &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;combinatorial&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; increase in the complexity of designing, deploying, and maintaining systems and data centers. In addition, we need to design for a spectrum of data center facilities — beyond traditional hyperscaler- or cloud-optimized offerings to “neoclouds” and industry-standard colocation providers — across multiple geographical regions. This adds yet another layer of diversity and dynamism, further constraining data center design for the new AI era. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We can address these two challenges — dealing with dynamic growth and compounding heterogeneity — if we design data centers with &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;fungibility &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;and&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; agility &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;as first-class considerations. Architectures need to be modular, so that components can be designed and deployed independently. They should be interoperable across different vendors or generations. Equally important, they should support the ability to late-bind the facility and systems to handle dynamically changing requirements (for example, reusing infrastructure designed for one generation with the next). Data centers should also be built on agreed-upon standard interfaces, so data center investments can be reused across multiple customer segments. And finally, these principles need to be applied holistically across &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;all&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; components of the data center – power delivery, cooling, server hall design, compute, storage, and networking. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;With great computing comes great power (and cooling and systems)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To achieve agility and fungibility in power, we must standardize power delivery and management to build a resilient end-to-end power ecosystem, including common interfaces at the rack power level. Partnering with other members of the &lt;/span&gt;&lt;a href="https://www.opencompute.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Open Compute Project&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (OCP), we &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/enabling-1-mw-it-racks-and-liquid-cooling-at-ocp-emea-summit?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;introduced&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; new technologies around +/-400Vdc designs and an approach for transitioning from monolithic to disaggregated solutions using side-car power, a.k.a. &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/ocp-specification-diablo-400-v0p5p2-2025-05-30-pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Mt. Diablo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Promising new technologies, like low-voltage DC power combined with solid state transformers, will enable these systems to transition to future fully integrated data center solutions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are also evaluating solutions for data centers to &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;become suppliers to the grid, not just consumers&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; from it, with corresponding standardization around battery-operated storage and microgrids. We already used such solutions to manage the challenges around the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/mitigating-power-and-thermal-fluctuations-in-ml-infrastructure?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;“spikiness” of AI training workloads&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and are also applying them for additional savings around power efficiency and grid power usage. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Data center cooling, meanwhile, is also being reimagined for the AI era. Earlier this year, we announced &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/enabling-1-mw-it-racks-and-liquid-cooling-at-ocp-emea-summit"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Project Deschutes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a state-of-the-art liquid cooling solution that we contributed to the Open Compute community, and have since published the &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/ocp-specification-deschutes-final-2025-09-05-pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;specification&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and design &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/projectdeschutescduv0p75-20250812-1-zip" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;collateral&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. The community is responding enthusiastically, with liquid cooling suppliers like Boyd, CoolerMaster, Delta, Envicool, Nidec, nVent, and Vertiv showcasing demos at major events this year, including the OCP Global Summit and SuperComputing 2025. But we have more opportunities to collaborate on: industry-standard cooling interfaces, new components like rear-door-heat exchangers, reliability, etc. One particularly important area is standardizing layouts and fit-out scopes across colos and third-party data centers, so we as an industry can enable more fungibility. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Finally, we need to bring together compute, networking, and storage in the server hall, including physical attributes of the data center design such as rack height, width, and depth (and more recently, weight); aisle widths and layouts; as well as rack and network interfaces. We also need standards for telemetry and mechatronics to build and maintain these future data centers. With our fellow OCP partners, we are standardizing telemetry integration for third-party data centers, including establishing best practices, developing common naming and implementations, and creating standard security protocols. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond physical infrastructure, we are collaborating with our partners to deliver open standards for more scalable and secure systems. A few highlights include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Resilience:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We’ve expanded our multi-year effort on &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/ocp-gpu-accelerator-management-interfaces-v1-pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;manageability&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/ocp-gpu-and-accelerators-ras-requirements-1-0-final-pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;reliability and serviceability&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; from GPUs to include CPU &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/hyperscale-cpu-impactless-firmware-updates-requirements-specification-v0-7-9-29-2025-pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;firmware updates&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/hyperscale-cpu-ras-and-debug-requirements-specification-v0-7-09-29-2025-pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;debuggability&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Security:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/caliptra-2-0-pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Caliptra 2.0&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, the open-source hardware root of trust, now defends against future threats with post-quantum cryptography, while &lt;/span&gt;&lt;a href="https://www.opencompute.org/sp/about-ocp-safe" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OCP S.A.F.E.&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; makes security audits routine and cost-effective.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Storage:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/ocp-lock-specification-v1-0-rc2-pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OCP L.O.C.K.&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; builds on Caliptra’s foundation to provide a robust, open-source key management solution for any storage device.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Networking:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://www.youtube.com/watch?v=GCM3NjfY9Zo" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Congestion Signaling (CSIG)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; has been standardized and is delivering measured improvements in load balancing. Alongside continued advancements in &lt;/span&gt;&lt;a href="https://sonicfoundation.dev/event/sonic-workshop-and-sonic-booth-at-ocp-global-summit/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;SONiC&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a new effort is underway to standardize &lt;/span&gt;&lt;a href="https://www.opencompute.org/blog/the-open-compute-project-announces-new-optical-circuit-switching-ocs-project" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Optical Circuit Switching&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Sustainability is embedded in our work. To provide insight into the environmental impact of AI, we developed a &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/infrastructure/measuring-the-environmental-impact-of-ai-inference?e=48754805?utm_source%3Dlinkedin"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;new methodology&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for measuring the energy, emissions, and water impact of emerging AI workloads, demonstrating that the median Gemini Apps text prompt consumes less than five drops of water and has the energy impact of watching TV for under nine seconds. We apply this type of data-driven approach to other collaborations across the OCP community: on an embodied carbon disclosure specification, green concrete, clean backup power, and reduced manufacturing emissions.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A call to action: community-driven innovation and AI-for-AI&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google has a long history of working with open ecosystems that have demonstrated the compounding power of community collaboration, and we have the opportunity to repeat that success as we design agile and fungible data centers for the AI era. Join us in the new &lt;/span&gt;&lt;a href="https://www.opencompute.org/about/a-call-for-collaboration-on-ai-data-center-infrastructure-standards" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OCP Open Data Center for AI Strategic Initiative&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; on common standards and optimizations for agile and fungible data centers. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As we look ahead to the next waves of growth in AI, and the amazing advances they will unlock, we will need to leverage these AI advances in our own work, to amplify our productivity and innovation. An early example is &lt;/span&gt;&lt;a href="https://deepmind.google/discover/blog/how-alphachip-transformed-computer-chip-design/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Deepmind AlphaChip&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which uses AI to accelerate and optimize chip design. We are seeing more promising uses of AI for systems:&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;across hardware, firmware, software, and testing; for performance, agility, reliability, and sustainability; and across design, deployment, maintenance, and security. These AI-enhanced optimizations and workflows are what will bring the next order-of-magnitude improvements to the data center. We look forward to the innovations ahead, and to your continued collaboration in driving them forward.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 13 Oct 2025 23:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/agile-data-centers-and-systems-to-enable-ai-innovations/</guid><category>AI &amp; Machine Learning</category><category>Infrastructure</category><category>Systems</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Agile AI architectures: A fungible data center for the intelligent era</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/agile-data-centers-and-systems-to-enable-ai-innovations/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Parthasarathy Ranganathan</name><title>VP, Engineering Fellow</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amin Vahdat</name><title>SVP and Chief Technologist, AI and Infrastructure</title><department></department><company></company></author></item><item><title>AI infrastructure is hot. New power distribution and liquid cooling infrastructure can help</title><link>https://cloud.google.com/blog/topics/systems/enabling-1-mw-it-racks-and-liquid-cooling-at-ocp-emea-summit/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI is fundamentally transforming the compute landscape, demanding unprecedented advances in data center infrastructure. At Google, we believe that physical infrastructure — the power, cooling, and mechanical systems that underpin everything — isn’t just important, but critical to AI’s continued scaling. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We have a long-standing partnership with the Open Compute Project (OCP) that has been instrumental in driving industry collaboration and open innovation in infrastructure. At the &lt;/span&gt;&lt;a href="https://www.opencompute.org/summit/emea-summit" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;2025 OCP EMEA Summit&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; today, we discussed the power delivery transformation from 48 volts direct current (VDC) to the new +/-400 VDC, which will enable IT racks to scale from 100 kilowatts up to 1 megawatt. We also shared that we’ll contribute our fifth-generation cooling distribution unit, Project Deschutes, to OCP, helping to accelerate adoption of liquid cooling industry-wide.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Transforming power delivery with 1 MW per IT rack&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google has a long history of advancing data center power delivery. Almost 10 years ago, &lt;/span&gt;&lt;a href="https://www.youtube.com/watch?v=x_U4FyTabpg" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;we championed&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; the adoption of 48 VDC inside the IT rack to significantly increase the power distribution efficiency and reduce losses compared to what typical 12 VDC solutions delivered. The industry responded to our call to action to collaborate on this technology, and the resulting &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/google-open-rack-v2-flatbed-tray-48v-to-12v-payload-adapter" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;architecture&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; has worked well, scaling from 10 kilowatts to 100 kilowatts IT racks. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The AI era requires even greater power delivery capabilities for two distinct reasons. The first is simply that ML will require more than 500 kW per IT rack before 2030. The second is the densification of each IT rack, where every millimeter of space in the IT rack is used for tightly interconnected “xPUs” (e.g. GPUs, TPUs, CPUs). This requires a much higher voltage DC power distribution solution, where power components and battery backup are outside of the IT rack.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are excited to introduce +/-400 VDC power delivery that can support up to 1 MW per rack. This is about much more than simply increasing power delivery capacity — selecting 400 VDC as the nominal voltage allows us to leverage the supply chain established by electric vehicles (EVs), for greater economies of scale, more efficient manufacturing, and improved quality, to name a few benefits. As part of the &lt;/span&gt;&lt;a href="https://www.datacenterdynamics.com/en/news/microsoft-and-meta-reveal-open-ai-rack-design-with-separate-power-and-compute-cabinets/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Mt Diablo project&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, we are collaborating with Meta and Microsoft at OCP to standardize the electrical and mechanical interfaces, and the 0.5 specification draft will be available for industry feedback in May. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The first embodiment of this work is an AC-to-DC sidecar power rack that disaggregates power components from the IT rack. This solution improves the end-to-end efficiency by ~ 3% while enabling the entire IT rack to be used for xPUs. Longer term, we are exploring directly distributing higher-voltage DC power within the data center and to the rack, for even greater power density and efficiency. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/1_-_400_VDC_power_delivery.gif" alt="1 - 400 VDC power delivery"&gt;
      &lt;figcaption class="article-image__caption"&gt;&lt;p data-block-key="bzqkm"&gt;+/-400 VDC power delivery: AC-to-DC sidecar power rack&lt;/p&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The liquid cooling imperative&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The dramatic increase in chip power consumption — from 100W chips to accelerators exceeding 1000W — has made advanced thermal management essential. Packing more powerful chips into racks also creates significant challenges for cooling density. Liquid cooling has emerged as the clear solution, given its superior thermal and hydraulic properties. Water can transport approximately 4000 times more heat per unit volume than air for a given temperature change, while the thermal conductivity of water is roughly 30 times greater than that of air. &lt;/span&gt;&lt;/p&gt;
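&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a quick back-of-the-envelope check of that volumetric figure, using typical room-temperature property values (approximate, and dependent on conditions):&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Heat carried per cubic metre of coolant per kelvin of temperature rise,
# using rough textbook values for water and air near room temperature.
water_density = 997.0    # kg/m^3
water_cp      = 4180.0   # J/(kg*K)
air_density   = 1.2      # kg/m^3
air_cp        = 1005.0   # J/(kg*K)

water_volumetric = water_density * water_cp   # ~4.2e6 J/(m^3*K)
air_volumetric   = air_density * air_cp       # ~1.2e3 J/(m^3*K)

print(round(water_volumetric / air_volumetric))  # ~3500, the same order as the ~4000x quoted
&lt;/code&gt;&lt;/pre&gt;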
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google, we’ve deployed liquid cooling at GigaWatt scale across more than 2000 TPU Pods in the past seven years with remarkable uptime — consistently at about 99.999%. Google first used liquid cooling in TPU v3 that was deployed in 2018. Liquid-cooled ML servers have nearly half the geometrical volume of their air-cooled counterparts because they replace bulky heatsinks with compact cold plates. This allowed us to double chip density and quadruple the size of our liquid-cooled TPU v3 supercomputer compared to the air-cooled TPU v2 generation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’ve continued to refine this technology generation over generation, from TPU v3 and TPU v4, through TPU v5, and most recently, &lt;/span&gt;&lt;a href="https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ironwood&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Our implementation utilizes in-row coolant distribution units (CDUs) with redundant components and uninterruptible power supplies (UPS) for high availability. These CDUs isolate the rack's liquid loop from the facility loop, providing a controlled, high-performance cooling system delivered via manifolds, flexible hoses, and cold plates that are directly attached to the high-power chips. In our CDU architecture, named Project Deschutes, the pump and heat exchanger unit is redundant, which is what has enabled us to consistently achieve the above-mentioned fleet-wide CDU availability of ~99.999% since 2020.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We will contribute the fifth-generation Project Deschutes CDU, currently in development, to OCP later this year. This contribution, including system details, specifications, and best practices, is intended to help accelerate the industry's adoption of liquid cooling at scale. Our insights are drawn from nearly a decade of designing and deploying liquid cooling across four generations of TPUs, and encompass:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Design for high cooling performance&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Manufacturing quality&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Reliability and uptime&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Deployment velocity&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Serviceability and operational excellence&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Supply ecosystem advancements&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;&lt;div class="h-c-grid"&gt;
    &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_-_Project_Deschutes.max-1000x1000.jpg" alt="2 - Project Deschutes"&gt;
      &lt;figcaption class="article-image__caption"&gt;&lt;p data-block-key="bzqkm"&gt;Project Deschutes CDU: 4th gen in deployment, 5th gen in concept&lt;/p&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
  &lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get ready for the next generation of AI&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We're encouraged by the significant strides the industry has made in power delivery and liquid cooling. However, with the accelerating pace of AI hardware development, it's clear that we must collectively quicken our pace to prepare data centers for what’s next. We're particularly excited about the potential for rapid industry adoption of +/-400 VDC, facilitated by the upcoming Mt Diablo specification. We also strongly encourage the industry to adopt the Project Deschutes CDU design and leverage our extensive liquid cooling learnings. Together, by embracing these advancements and fostering deeper collaboration, we believe the most impactful innovations are still ahead.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 29 Apr 2025 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/enabling-1-mw-it-racks-and-liquid-cooling-at-ocp-emea-summit/</guid><category>Systems</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>AI infrastructure is hot. New power distribution and liquid cooling infrastructure can help</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/enabling-1-mw-it-racks-and-liquid-cooling-at-ocp-emea-summit/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Madhusudan Iyengar</name><title>Principal Engineer</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amber Huffman</name><title>Principal Engineer, Google</title><department></department><company></company></author></item><item><title>How we got to 100 million cells in our global Li-ion rack battery fleet</title><link>https://cloud.google.com/blog/topics/systems/100-million-li-ion-cells-in-google-data-centers/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When it comes to data center power systems, batteries play an important role. The applications that run in our data centers require nearly continuous uptime. And while utility power is highly reliable, power outages are unavoidable. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When an outage happens, batteries can supply short-duration power, allowing servers to operate continuously when the facility switches between AC power sources, or to ride through transient power disturbances. Or, if a facility loses both primary and alternate power sources for an extended period of time, batteries can supply sufficient power to allow machines to execute a clean shutdown procedure. This is helpful in expediting machine restarts after the power outage. More importantly, it helps ensure that critical user data is safely stored to disk and not lost in the power disruption. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;quot;Ensure Your Data&amp;#x27;s Safety and Uptime with Google Cloud for free&amp;quot;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f9fdcfd6e20&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;Get started for free&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;https://console.cloud.google.com/freetrial?redirectPath=/welcome&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google, we rely on a 48Vdc rack power system with integrated battery backup units (BBUs), and in 2015, we became one of the first hyperscale data center providers to deploy Lithium-ion BBUs. These Li-ion batteries had twice the life, twice the power and half the volume of previous-generation lead-acid batteries. Switching from lead-acid batteries to Li-ion means we deploy only one-quarter the number of batteries, greatly reducing the battery waste generated by our data centers. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We recently reached an important milestone: Google has more than 100 million cells deployed in battery packs across our global data center fleet. This is remarkable, and only possible thanks to the safety-first approach we take to deploy Li-ion batteries at scale&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The main safety risk associated with Li-ion batteries is the battery going into thermal runaway if it’s accidentally mishandled or exposed to excessive temperatures or overcharging. While a rare event, the resulting fire is extremely difficult to extinguish due to the large amount of heat generated, driving a thermal runaway chain reaction to nearby cells. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To deploy this large fleet of Li-ion cells, we have had to make safety a core principle of our battery design. Specifically, as an early adopter of the &lt;/span&gt;&lt;a href="https://www.ul.com/services/ul-9540a-test-method" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;UL9540A thermal runaway test method&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, we subject our Li-ion BBU designs to rigorous flame safety testing that demonstrates their ability to limit thermal runaway. As a result, Google has successfully been granted permits to deploy BBUs in some of the world’s most stringent jurisdictions, in the APAC region. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In addition, our Li-ion BBUs benefit from our distributed UPS architecture that offers significant availability and TCO benefits compared to traditional monolithic UPS systems. The distributed UPS architecture improves machine availability by: 1) reducing the failure-domain blast radius to a single rack, and 2) locating the batteries in the rack to eliminate intermediate points of failure between the UPS and machines. This architecture also provides TCO benefits by scaling the UPS with the deployment, i.e., reducing day-1 UPS cost. Additionally, locating the batteries in the rack on the same DC bus as the machines eliminates intermediate AC/DC power conversion steps that cause efficiency losses. In 2016 &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/google-joins-open-compute-project-to-drive-standards-in-it-infrastructure"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;we shared the 48V rack power system spec with the Open Compute Project&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, including specs for the Li-ion BBUs. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Li-ion batteries have been crucial to ensuring the uninterrupted operation of Google Cloud data centers. By transitioning from lead-acid to Li-ion BBUs, we’ve significantly improved power availability, efficiency, and lifespan, even as we simultaneously address their critical safety risks. Our commitment to rigorous safety testing and adherence to standards and test methods like UL9540A has enabled us to deploy millions of Li-ion BBUs globally, providing our customers with the high level of reliability they expect from Google Cloud. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Getting to 100 million Li-ion batteries is just one of many examples of how we are building a reliable cloud and power-efficient AI. As data center power systems evolve to include new technologies including large battery energy storage systems (BESS) and new workload requirements (AI workloads), we remain dedicated to exploring and implementing innovative solutions to build the most efficient and safest cloud data centers.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;The authors would like to acknowledge Vijay Boovaragavan, Matt Tamashiro, Sandeep Sebastian, Thibault Pelloux-Gervais, Ken Wong, Mike Meakins, Stanley Fung, and Scott Sharp for their contributions.&lt;/span&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 25 Feb 2025 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/100-million-li-ion-cells-in-google-data-centers/</guid><category>Infrastructure</category><category>Systems</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/100_million_Li-ion_cells_in_Google_data_cent.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>How we got to 100 million cells in our global Li-ion rack battery fleet</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/100_million_Li-ion_cells_in_Google_data_cent.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/100-million-li-ion-cells-in-google-data-centers/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Christina Peabody</name><title>Cloud Rack Power Team TLM</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Gregory Sizikov</name><title>Cloud Power and Compliance Manager</title><department></department><company></company></author></item><item><title>Balance of power: A full-stack approach to power and thermal fluctuations in ML infrastructure</title><link>https://cloud.google.com/blog/topics/systems/mitigating-power-and-thermal-fluctuations-in-ml-infrastructure/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The recent explosion of machine learning (ML) applications has created unprecedented demand for power delivery in the data center infrastructure that underpins those applications. Unlike server clusters in the traditional data center, where tens of thousands of workloads coexist with uncorrelated power profiles, large-scale batch-synchronized ML training workloads exhibit substantially different power usage patterns. Under these new usage conditions, it is increasingly challenging to ensure the reliability and availability of the ML infrastructure, as well as to improve data-center &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;goodput&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and energy efficiency. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google has been at the forefront of data center infrastructure design for several decades, with &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/document/10551740" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;a long list of innovations&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to our name. In this blog post, we highlight one of the key innovations that allowed us to manage unprecedented power and thermal fluctuations in our ML infrastructure. This innovation underscores the power of full codesign across the stack — from ASIC chip to data center, across both hardware and software. We also discuss the implications of this approach and propose a call to action for the broader industry. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;New ML workloads lead to new ML power challenges&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today’s ML workloads require synchronized computation across tens of thousands of accelerator chips, together with their hosts, storage, and networking systems; these workloads often occupy one entire data-center cluster — or even multiples of them. The peak power utilization of these workloads could approach the rated power of all the underlying IT equipment, making power overscription much more difficult. Furthermore, power consumption rises and falls between idle and peak utilization levels much more steeply, due to the fact that the entire cluster’s power usage is now dominated by no more than a few large ML workloads. You can observe these power fluctuations when a workload launches or finishes, or when it is halted, then resumed or rescheduled. You may also observe a similar pattern when the workload is running normally, mostly attributable to alternating compute- and networking-intensive phases of the workload within a training step. Depending on the workload’s characteristics, these inter- and intra-job power fluctuations can occur very frequently. This can result in multiple unintended consequences on the functionality, performance, and reliability of the data center infrastructure.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_f9EbAew.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="9vul9"&gt;Fig. 1. Large power fluctuations observed on cluster level with large-scale synchronized ML workloads&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In fact, in our latest batch-synchronous ML workloads running on dedicated ML clusters, we observed power fluctuations in the tens of megawatts (MW), as shown in Fig.1. And compared to a traditional load variation profile, the ramp speed could be almost instantaneous, repeat as frequently as every few seconds, and last for weeks… or even months! &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Fluctuations of this kind pose the following risks:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Functionality and long-term reliability issues with rack and data center equipment, resulting in hardware-induced outages, reduced energy efficiency and increased operational/maintenance costs, including but not limited to rectifiers, transformers, generators, cables and busways&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Damage, outage, or throttling at the upstream utility, including violation of contractual commitments to the utility on power usage profiles, and corresponding financial costs&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Unintended and frequent triggering of the uninterrupted power supply (UPS) system from large power fluctuations, resulting in shortened lifetime of the UPS system&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Large power fluctuations may also impact hardware reliability at a much smaller per-chip or per-system scale. Although the maximum temperature is well under control, power fluctuations may still translate into large and frequent temperature fluctuations, triggering various forms of interactions including warpage, changes to thermal interface material property, and electromigration.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Try Google Cloud for free&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f9fdd65d8e0&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;Get started for free&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;https://console.cloud.google.com/freetrial?redirectPath=/welcome&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A full-stack approach to proactive power shaping&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Due to the high complexity and large scale of our data-center infrastructure, we posited that &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;proactively shaping a workload’s power profile&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; could be more efficient than simply adapting to it. Google’s full codesign across the stack — from chip to data center, from hardware to software, and from instruction set to realistic workload — provides us with all the knobs we need to implement highly efficient end-to-end power management features to regulate our workloads’ power profiles and mitigate detrimental fluctuations. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Specifically, we installed instrumentation in the TPU compiler to check on signatures in the workload that are linked with power fluctuations, such as sync flags. We then dynamically balance the activities of major compute blocks of the TPU around these flags to smooth out their utilization over time. This achieves our goal of mitigating power and thermal fluctuations with negligible performance overhead. In the future, we may also apply a similar approach to the workload’s starting and completion phases, resulting in a gradual, rather than abrupt, change in power levels. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’ve now implemented this compiler-based approach to shaping the power profile and applied it on realistic workloads. We measured the system’s total power consumption and a single chip’s hotspot temperature with, and without, the mitigation, as plotted in Fig. 2 and Fig. 3, respectively. In the test case, the magnitude of power fluctuations dropped by nearly 50% from the baseline case to the mitigation case. The magnitude of temperature fluctuations also dropped from ~20 C in the baseline case to ~10 C in the mitigation case. We measured the cost of the mitigation by the increase in average power consumption and the length of the training step. With proper tuning of the mitigation parameters, we can achieve the benefits of our design with small increases in average power with &amp;lt;1% performance impact.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_x9eRU4h.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="9vul9"&gt;Fig. 2. Power fluctuation with and without the compiler-based mitigation&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_lWze6j1.max-1000x1000.jpg"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="9vul9"&gt;Fig. 3. Chip temperature fluctuation with and without the compiler-based mitigation&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A call to action &lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;ML infrastructure is growing rapidly and expected to surpass traditional server infrastructure in terms of total power demand in the coming years. At the same time, ML infrastructure’s power and temperature fluctuations are unique and tightly coupled with the ML workload’s characteristics. Mitigating these fluctuations is just one example of many innovations we need to ensure reliable and high-performance infrastructure. In addition to the method described above, we’ve been investing in an array of innovative techniques to take on ever-increasing power and thermal challenges, including data center water cooling, vertical power delivery, power-aware workload allocation, and many more. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But these challenges aren’t unique to Google. Power and temperature fluctuations in ML infrastructure are becoming a common issue for many hyperscalers and cloud providers as well as infrastructure providers. We need partners at all levels of the system to help: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Utility providers to set forth a standardized definition of acceptable power quality metrics — especially in scenarios where multiple data centers with large power fluctuations co-exist within a same grid and interact with one another&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Power and cooling equipment suppliers to offer quality and reliability enhancements for electronics components, particularly for use-conditions with large and frequent power and thermal fluctuations&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Hardware suppliers and data center designers to create a standardized suite of solutions such as rack-level capacitor banks (RLCB) or on-chip features, to help establish an efficient supplier base and ecosystem&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;ML model developers to consider the energy consumption characteristics of the model, and consider adding low-level software mitigations to help address energy fluctuations&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google has been leading and advocating for industry-wide collaboration on these issues through forums such as Open Compute Project (OCP) to benefit the data center infrastructure industry as a whole. We look forward to continuing to share our learnings and collaborating on innovative new solutions together.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;A special thanks to Denis Vnukov, Victor Cai, Jianqiao Liu, Ibrahim Ahmed, Venkata Chivukula, Jianing Fan, Gaurav Gandhi, Vivek Sharma, Keith Kleiner, Mudasir Ahmad, Binz Roy, Krishnanjan Gubba Ravikumar, Ashish Upreti and Chee Chung from Google Cloud for their contributions.&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 11 Feb 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/mitigating-power-and-thermal-fluctuations-in-ml-infrastructure/</guid><category>Systems</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Balance of power: A full-stack approach to power and thermal fluctuations in ML infrastructure</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/mitigating-power-and-thermal-fluctuations-in-ml-infrastructure/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Houle Gan</name><title>Technical Lead Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Parthasarathy Ranganathan</name><title>VP, Engineering Fellow</title><department></department><company></company></author></item><item><title>Designing sustainable AI: A deep dive into TPU efficiency and lifecycle emissions</title><link>https://cloud.google.com/blog/topics/sustainability/tpus-improved-carbon-efficiency-of-ai-workloads-by-3x/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As AI continues to unlock new opportunities for business growth and societal benefits, we’re working to reduce the carbon intensity of AI systems — including by optimizing software, improving hardware efficiency, and powering AI models with carbon-free energy.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today we’re releasing a &lt;/span&gt;&lt;a href="https://arxiv.org/abs/2502.01671" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;first-of-its-kind study&lt;/span&gt;&lt;/a&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt; on the lifetime emissions of our Tensor Processing Unit (TPU) hardware. Over two generations — from TPU v4 to Trillium — more efficient TPU hardware design has led to a 3x improvement in the carbon-efficiency of AI workloads.&lt;sup&gt;2&lt;/sup&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our life-cycle assessment (LCA) provides the first detailed estimate of emissions from an AI accelerator, using observational data from raw material extraction and manufacturing, to energy consumption during operation. These measurements provide a snapshot of the average, chip-level carbon intensity of Google’s TPU hardware, and enable us to compare efficiency across generations. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Introducing Compute Carbon Intensity (CCI)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our study examined five models of TPUs to estimate their full life-cycle emissions and understand how hardware design decisions have impacted their carbon-efficiency. To measure emissions relative to computational performance and enable apples-to-apples comparisons between chips, we developed a new metric — Compute Carbon Intensity (CCI) — that we believe can enable greater transparency and innovation across the industry.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;CCI quantifies an AI accelerator chip’s carbon emissions per unit of computation (measured in grams of &lt;span style="vertical-align: baseline;"&gt;CO&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: sub;"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;e&lt;/span&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; per Exa-FLOP).&lt;sup&gt;3&lt;/sup&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; Lower CCI scores mean lower emissions from the AI hardware platform for a given AI workload — for example training an AI model. We've used CCI to track the progress we've made in increasing the carbon-efficiency of our TPUs, and we’re excited to share the results. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Key takeaways&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Google’s TPUs have become significantly more carbon-efficient.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Our study found a 3x improvement in the CCI of our TPU chips over 4 years, from TPU v4 to Trillium. By choosing newer generations of TPUs — &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/introducing-trillium-6th-gen-tpus"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;like our 6th-generation TPU, Trillium&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; — our customers not only get cutting-edge performance, but also generate fewer carbon emissions for the same AI workload. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Operational electricity emissions are key.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Today, operational electricity emissions comprise the vast majority (70%+) of a Google TPU’s lifetime emissions. This underscores the importance of improving the energy efficiency of AI chips and reducing the carbon intensity of the electricity that powers them. Google’s efforts to&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://www.google.com/about/datacenters/cleanenergy/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;run on 24/7 carbon-free energy (CFE) on every grid where we operate by 2030&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; aims directly at reducing the largest contributor to TPU emissions — operational electricity consumption. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Manufacturing matters.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; While operational emissions dominate an AI chip's lifetime emissions, emissions associated with chip manufacturing are still notable — and their share of total emissions will increase as we reduce operational emissions with carbon-free energy. The study’s detailed manufacturing LCA helps us target our manufacturing decarbonization efforts towards the highest-impact initiatives. We're &lt;/span&gt;&lt;a href="https://www.gstatic.com/gumdrop/sustainability/google-2024-environmental-report.pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;actively working&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with our supply chain partners to reduce these emissions through more sustainable manufacturing processes and materials. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our significant improvements in AI hardware carbon-efficiency in this paper complement rapid advancements in AI model and algorithm design. Outside of this study, continued optimization of AI models is reducing the number of computations required for a given model performance. Some models that once required a supercomputer to run can now be run on a laptop, and at Google we’re using techniques like &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/accurate-quantized-training-aqt-for-tpu-v5e"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Accurate Quantized Training&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://research.google/blog/looking-back-at-speculative-decoding/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;speculative decoding&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to further increase model efficiency. We expect model advancements to continue unlocking carbon-efficiency gains, and are working to quantify the impact of software design on carbon-efficiency in future studies. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;$300 in free credit to try Google Cloud TPU API&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f9fdcfe2490&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;Start building for free&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;http://console.cloud.google.com/freetrial?redirectPath=/marketplace/product/google/tpu.googleapis.com&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Partnering for a sustainable AI future&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The detailed approach we’ve taken here allows us to target our efforts to continue increasing the carbon-efficiency of our TPUs. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This life-cycle analysis of AI hardware is an important first step in quantifying and sharing the carbon-efficiency of our AI systems, but it's just the beginning. We will continue to analyze other aspects of AI’s emissions footprint — for example AI model emissions and software efficiency gains — and share our insights with customers and the broader industry. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Together, we can harness the &lt;/span&gt;&lt;a href="https://ai.google/advancing-ai/why-ai/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;transformative power of AI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; while minimizing its impact on the planet.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Explore our latest &lt;/strong&gt;&lt;a href="https://cloud.google.com/tpu"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;TPU offerings&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; and learn more about how customers can &lt;/strong&gt;&lt;a href="https://cloud.google.com/sustainability"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;unlock sustainable growth&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; with Google Cloud.&lt;/strong&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;1. &lt;span style="vertical-align: baseline;"&gt;The authors would like to thank and acknowledge the co-authors for their important contributions: Ian Schneider, Hui Xu, Stephan Benecke, Tim Huang, and Cooper Elsworth.&lt;br/&gt;2. &lt;span style="vertical-align: baseline;"&gt;A February 2025 Google case study quantified the full lifecycle emissions of TPU hardware as a point-in-time snapshot across Google’s generations of TPUs. To estimate operational emissions from electricity consumption of running workloads, we used a one month sample of observed machine power data from our entire TPU fleet, applying Google’s 2023 average fleetwide carbon intensity. To estimate embodied emissions from manufacturing, transportation, and retirement, we performed a life-cycle assessment of the hardware. Data center construction emissions were estimated based on Google’s disclosed 2023 carbon footprint. These findings do not represent model-level emissions, nor are they a complete quantification of Google’s AI emissions. Based on the TPU location of a specific workload, CCI results of specific workloads may vary.&lt;/span&gt;&lt;br/&gt;3. &lt;span style="vertical-align: baseline;"&gt;CCI includes both estimates of lifetime embodied and operational emissions in order to understand the impact of improved chip design on our TPUs. In this study, we hold the impact of carbon-free energy on carbon intensity constant across generations, by using Google's 2023 average fleetwide carbon intensity. We did this purposefully to remove the impact of deployment location on the results.&lt;/span&gt;&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 05 Feb 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/sustainability/tpus-improved-carbon-efficiency-of-ai-workloads-by-3x/</guid><category>Compute</category><category>AI &amp; Machine Learning</category><category>Systems</category><category>Sustainability</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Designing sustainable AI: A deep dive into TPU efficiency and lifecycle emissions</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/sustainability/tpus-improved-carbon-efficiency-of-ai-workloads-by-3x/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>David Patterson</name><title>Google Distinguished Engineer, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Parthasarathy Ranganathan</name><title>VP, Engineering Fellow</title><department></department><company></company></author></item><item><title>Speed, scale and reliability: 25 years of Google data-center networking evolution</title><link>https://cloud.google.com/blog/products/networking/speed-scale-reliability-25-years-of-data-center-networking/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Rome wasn’t built in a day, and neither was Google’s network. But 25 years in, we’ve built out network infrastructure with scale and technical sophistication that’s nothing short of remarkable.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It’s all the more impressive because in the beginning, Google’s network infrastructure was relatively simple. But as our user base and the demand for our services grew exponentially, we realized that we needed a network that could handle an unprecedented scale of data and traffic, and that could adapt to dynamic traffic patterns as our workloads changed over time. This ignited a 25-year journey marked by numerous engineering innovations and milestones, ultimately leading to our current fifth-generation &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/the-evolution-of-googles-jupiter-data-center-network?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Jupiter data center network&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; architecture, which now scales to 13 Petabits/sec of bisectional bandwidth. To put this data rate in perspective, this network could support a video call (@1.5 Mb/s) for all 8 billion people on Earth! &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we have hundreds of Jupiter fabrics deployed around the world, simultaneously supporting hundreds of services, billions of active daily users, all of our Google Cloud customers, and some of the largest ML training and serving infrastructures in the world. I would like to share more about our journey as we look ahead to the next generation of data center network infrastructure.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;$300 to try Google Cloud networking&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f9fdd353640&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;Start building for free&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;http://console.cloud.google.com/freetrial?redirectpath=/products?#networking&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Guiding principles&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our network evolution has been guided by a few key principles:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Anything, anywhere: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Our data center networks support efficiency and simplicity by allowing large-scale jobs to be placed anywhere among 100k+ servers within the same network fabric, with high-speed access to needed storage and support services. This scale improves application performance for internal and external workloads and eliminates internal fragmentation. &lt;/span&gt;&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Predictable, low latency: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We prioritize consistent performance and minimizing tail latency by provisioning bandwidth headroom, maintaining 99.999% network availability, and proactively managing congestion through end-host and fabric cooperation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Software-defined and systems-centric:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Leveraging software-defined networking (SDN) for flexibility and agility, we qualify and globally release dozens of new features every two weeks across our global network.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Incremental evolution and dynamic topology: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Incremental evolution helps us to refresh the network granularly (rather than bringing it down wholesale), while dynamic topology helps us to continuously adapt to changing workload demands. The combination of optical circuit switching and SDN supports in-place physical upgrades and an ever-evolving, heterogeneous network that supports multiple hardware generations in a single fabric.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Traffic engineering and application-centric QoS:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Optimizing traffic flows and ensuring Quality of Service helps us tailor the network to each application's needs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Integrating across the above principles is the foundation for our work. The network is the foundation of reliability for all other compute services, from storage to AI. As such, the network must fail last and fail least. To support this foundational responsibility, we rigorously define and monitor every &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;bad minute&lt;sup&gt;1&lt;/sup&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; across hundreds of clusters and millions of ports across our global network. Our progress on reliability is such that our in-house, software-defined Jupiter networks deliver a factor of &lt;/span&gt;&lt;a href="https://research.google/pubs/orion-googles-software-defined-networking-control-plane/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;50x more reliability&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; than prior versions of our data center networks. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;2015 - Jupiter, the first Petabit network &lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a seminal paper, we&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://research.google/pubs/jupiter-rising-a-decade-of-clos-topologies-and-centralized-control-in-googles-datacenter-network/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;showed&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;that&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Jupiter data center networks scaled to 1.3 Pb/s of aggregate bandwidth by leveraging merchant switch silicon, Clos topologies and Software Defined Networking (SDN). This generation of Jupiter was the culmination of five generations of data center networks developed in house by the Google networking team. At that time, this data rate — in &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;one&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; Google data center — was more than the estimated aggregate IP traffic data rate for the global internet. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;2022 - Enabling 6 Petabit per second&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In 2022 we&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/the-evolution-of-googles-jupiter-data-center-network"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;announced&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;that our Jupiter networks scaled to over 6 Pb/s, with deep integration of optical circuit switching (OCS), wave division multiplexing (WDM), and a highly scalable &lt;/span&gt;&lt;a href="https://www.usenix.org/conference/nsdi21/presentation/ferguson" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Orion&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; SDN controller. These technologies unlocked a range of advancements, including incremental network builds, enhanced performance, reduced costs, lower power consumption, dynamic traffic management, and seamless upgrades.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;2023 - 13 Petabit per second network&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We have further enhanced Jupiter to support native 400 Gb/s link speeds in the network core. The fundamental building block of Jupiter networks (called the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;aggregation block&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;) now consists of 512 ports of 400 Gb/s of connectivity both to end hosts and to the rest of the data center, for an aggregate of 204.8 Tb/s of bidirectional non-blocking bandwidth per block. We support 64 such blocks for a total bisection bandwidth of 64*204.8 Tb/s = 13.1 Pb/s. This technology has been powering Google's production data centers for over a year, fueling the rapid advancement of artificial intelligence, machine learning, web search, and other data-intensive applications.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;2024 and beyond - Extreme networking in the age of AI&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While celebrating over two decades of innovation in data center networking, we’re already charting the course for the next generation of network infrastructure to support the age of AI. For example, our teams are busy working on networking infrastructure needs for our upcoming &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/trillium-sixth-generation-tpu-is-in-preview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A3 Ultra VMs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, that feature NVIDIA ConnectX-7 networking, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; supports non-blocking 3.2 Tbps per server of GPU-to-GPU traffic over RoCE (RDMA over converged ethernet) and our future offerings based on &lt;/span&gt;&lt;a href="https://www.nvidia.com/en-us/data-center/gb200-nvl72/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NVIDIA GB200 NVL72&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Over the next few years, we will deliver significant advances in network scale and bandwidth, both per-port and network-wide. We will continue to push the boundaries of end-host integration, including the transport and congestion control stack, and streamline network stages to achieve even lower latency with tighter tails. Real-time topology engineering, deeper integration with the compute and storage stacks, and continued refinements to host-based load balancing techniques will further enhance network reliability and latency. With these innovations, our network will remain a cornerstone for the transformative applications and services that enrich the lives of our users throughout the world while simultaneously supporting the groundbreaking AI capabilities that power both our internal services and Google Cloud products.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are excited to take on these challenges and opportunities to see what the next 25 years hold for Google networking!&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Further resources&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network, SIGCOMM ‘15 [&lt;/span&gt;&lt;a href="https://research.google/pubs/jupiter-rising-a-decade-of-clos-topologies-and-centralized-control-in-googles-datacenter-network/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;paper&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;]&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Journey of the first Jupiter datacenter network leveraging merchant switch silicon, Clos topologies and Software Defined Networking (SDN).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;First deployed in production in 2012.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale, &lt;/span&gt;&lt;a href="http://arxiv.org" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;arxiv.org&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, 2022 [&lt;/span&gt;&lt;a href="https://arxiv.org/abs/2208.10041" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;paper&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;]&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;First deployed in production in 2013.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Orion: Google's Software-Defined Networking Control Plane. NSDI ‘21 [&lt;/span&gt;&lt;a href="https://research.google/pubs/orion-googles-software-defined-networking-control-plane/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;paper&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;]&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Google's high-performance, scalable, intent-based distributed SDN platform used in both datacenter and wide area networks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;First deployed in production in 2016.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking, SIGCOMM ’22 [&lt;/span&gt;&lt;a href="https://research.google/pubs/jupiter-evolving-transforming-googles-datacenter-network-via-optical-circuit-switches-and-software-defined-networking/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;paper&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;]&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Enabling technologies: OCS (2013), Orion SDN (2016), 200Gbps networking (2020), direct-connect topology (2017), dynamic traffic engineering (2018), dynamic topology engineering (2021).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Swift: Delay is Simple and Effective for Congestion Control in the Datacenter, SIGCOMM ‘20 [&lt;/span&gt;&lt;a href="https://research.google/pubs/swift-delay-is-simple-and-effective-for-congestion-control-in-the-datacenter/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;paper&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;]&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Swift, a congestion control protocol using hardware timestamps and AIMD control with a delay target, delivers excellent performance in Google datacenters with low flow completion times for short RPCs and high throughput for long RPCs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;First deployed in production in 2017&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;PLB: Congestion Signals are Simple and Effective for Network Load Balancing, SIGCOMM ‘22 [&lt;/span&gt;&lt;a href="https://research.google/pubs/plb-congestion-signals-are-simple-and-effective-for-network-load-balancing/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;paper&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;]&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Protective Load Balancing (PLB) is a simple, effective host-based load balancing design that reduces network congestion and improves performance by randomly changing paths for congested connections, preferring to repath after idle periods to minimize packet reordering.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;First deployed in production in 2020&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;1. Any minute where a statistically significant number of network flows in the data center network experience a total or partial outage above a defined threshold.&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 30 Oct 2024 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/networking/speed-scale-reliability-25-years-of-data-center-networking/</guid><category>Systems</category><category>Networking</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/25_years.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Speed, scale and reliability: 25 years of Google data-center networking evolution</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/25_years.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/networking/speed-scale-reliability-25-years-of-data-center-networking/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amin Vahdat</name><title>VP/GM, Machine Learning, Systems, and Cloud AI, Google Cloud</title><department></department><company></company></author></item><item><title>Sustainable silicon to intelligent clouds: collaborating for the future of computing</title><link>https://cloud.google.com/blog/topics/systems/2024-ocp-global-summit-keynote/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;&lt;strong&gt;Editor’s note&lt;/strong&gt;: Today, we hear from Parthasarathy Ranganathan, Google VP and Technical Fellow and Amber Huffman, Principal Engineer. Partha delivered a keynote address today at the &lt;/span&gt;&lt;a href="https://www.opencompute.org/summit/global-summit" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;2024 OCP Global Summit&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;, an annual conference for leaders, researchers, and pioneers in the open hardware industry. Amber is on the board of directors at the &lt;/span&gt;&lt;a href="http://www.opencompute.org" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;Open Compute Project&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; (OCP). Read on to hear about the past and future of hyperscale computing, and an overview of all of our activities in the OCP community.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are in an exciting era of hyperscale computing, one where a new wave of innovations is building the foundation for AI/ML computing in the cloud. Building on Google’s rich 25-year history in hyperscale computing, we look ahead to how &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;co-design&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;collaboration&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; — across the hardware-software stack, disciplines, and communities — will be key to this exciting new future. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;From scrappy beginnings to societal infrastructure&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When Google was founded in 1998, it was clear that successful web search would require enormous amounts of computing power and storage. This led to the design of the very first hyperscale computers specialized for search. These early makeshift systems included creative cost-reduction approaches like &lt;/span&gt;&lt;a href="https://collection.sciencemuseumgroup.org.uk/objects/co8358083/google-cork-board-server-1999" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;corkboard servers&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and off-the-shelf fans from Walmart, and they set the stage for the hardware-software co-design and workload-specific specialization principles that we follow to this day. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on these first systems, over the subsequent decade, Google laid the groundwork for modern hyperscale computing, pioneering custom servers, custom networking, and custom data centers, and expanding our services beyond search to include Gmail, YouTube, and Android. All of this presaged the modern multi-workload cloud. During this period, we also developed essential systems software like &lt;/span&gt;&lt;a href="https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Borg&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Colossus&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://research.google/pubs/mapreduce-simplified-data-processing-on-large-clusters/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;MapReduce&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://research.google/pubs/bigtable-a-distributed-storage-system-for-structured-data/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Bigtable&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. In the following years, we focused on scaling these systems, while also prioritizing security, reliability, and power efficiency. The &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Open_Compute_Project" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;formation of the Open Compute Project&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (OCP) in 2011 marked the transition of hyperscale computing from niche discipline to more mainstream offering. In the current decade, hyperscale computing is characterized by innovations to counter the slowing of Moore’s law: specialized hardware to support machine learning and video processing as well as software-defined servers to manage heterogeneity. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, hyperscale computing has truly come into its own, evolving into the crucial societal infrastructure that drives cloud and AI workloads.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Cross-disciplinary co-design: the heart of innovation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Across all these Google innovations over the past 25 years, one theme has remained constant: a strong commitment to cross-disciplinary systems innovation and co-design. Looking ahead to the AI era, we continue to take a holistic approach: from “mud to cloud” — starting at the very ground on which we build our data centers up to broader cloud computing services; and from “chip to ship” — designing hardware that we then deploy and use in production. This philosophy has driven some incredible efficiency gains, delivering orders-of-magnitude improvements across multiple generations of systems.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Take our Tensor Processing Units (TPUs). Multiple generations of these purpose-built AI accelerators (including our latest &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/introducing-trillium-6th-gen-tpus"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Trillium TPU&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) have driven significant advances in machine learning, including large-language models like &lt;/span&gt;&lt;a href="https://blog.google/technology/ai/google-gemini-ai/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://www.nature.com/articles/d41586-024-03214-7" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Nobel-prize-winning&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; scientific breakthroughs like &lt;/span&gt;&lt;a href="https://blog.google/technology/ai/google-deepmind-isomorphic-alphafold-3-ai-model/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AlphaFold&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. However, we’ve gone beyond just chip design to considering the entire system that surrounds them. We've &lt;/span&gt;&lt;a href="https://hc2023.hotchips.org/assets/program/conference/day2/ML%20training/HC2023.Session5.ML_Training.Google.Norm_Jouppi.Andy_Swing.Final_2023-08-25.pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;coupled TPUs with innovations&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; like liquid cooling, advanced networking systems featuring cutting-edge optics and topology awareness, and a commitment to sustainable power, all in the service of creating a truly amazing AI platform. We've then layered open software frameworks like JAX, TensorFlow, OpenXLA, and Kubernetes on top of this hardware foundation, creating what we call the &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/ai-hypercomputer"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This hypercomputer is further enhanced by integrating with model gardens and applications, creating a vertically integrated ecosystem that's optimized for AI workloads.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_w3Mkyhx.max-1000x1000.png"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Cross-industry collaboration: from ideas to impact&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But there’s also another aspect of holistic co-design that has served us well: cross-industry collaborations, i.e., building standards and ecosystems. Our partnership with OCP is an important example of this. Since formally joining OCP in 2016, we’ve continued to grow our contributions year after year. Looking ahead, we want to highlight progress and opportunities in four key areas.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Sustainability&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Last year, Google, along with fellow hyperscalers, &lt;/span&gt;&lt;a href="https://imasons.org/press-releases/greener-concrete-for-digital-infrastructure-an-open-letter-and-call-to-action/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;rallied&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; the industry to reduce carbon emissions with an ambitious roadmap towards greener concrete. We have since made good progress, collaborating to develop new metrics and benchmarks, identifying streamlined data center designs that minimize concrete use, and even using AI to research new materials. At a recent &lt;/span&gt;&lt;a href="https://www.opencompute.org/blog/leading-data-center-companies-partner-with-open-compute-project-foundation-and-wje-to-trial-green-concrete" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;event&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, we &lt;/span&gt;&lt;a href="https://vimeo.com/1003646073" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;demonstrated&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; proof-of-concept concrete mixtures that can reduce carbon emissions by 20% to 40%. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As we work towards net-zero emissions by 2030 across our operations and value chain, there’s a lot more we can do. At OCP this year, we are discussing how to develop &lt;/span&gt;&lt;a href="https://www.environdec.com/product-category-rules-pcr/the-pcr" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;product category rules&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (PCRs) to accurately measure hardware emissions across the lifecycle, make more high-quality carbon data available, and develop clean reliable power backup for our data centers. Further, we’re continuing to look holistically at all aspects of our energy consumption, carbon footprint, and water usage. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Trusted silicon&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Trusted silicon is a foundational element of hyperscaler systems. Over the past three years, we have collaborated on &lt;/span&gt;&lt;a href="https://chipsalliance.github.io/Caliptra/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Caliptra&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a re-usable IP block for root-of-trust management, and delivered an open-source implementation of Caliptra 1.0 that is being integrated by companies across the ecosystem. Google's future TPUs and ARM SoCs will also include Caliptra. Leveraging Caliptra, the &lt;/span&gt;&lt;a href="https://www.youtube.com/watch?v=ImeRgORWgOo&amp;amp;list=PLAG-eekRQBSgIzncibp47d3Yry4rIduQS&amp;amp;index=5&amp;amp;pp=iAQB" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OCP L.O.C.K.&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; project will provide layered open-source cryptographic key management for storage devices, improving both trust and sustainability. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the area of silicon reliability, we are continuing our industry-academia collaborations around a systems approach to addressing silicon faults and silent data errors, including &lt;/span&gt;&lt;a href="https://www.opencompute.org/blog/ocps-server-resilience-initiative-sdc-academic-research-awards-announced" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;funding six leading academic institutions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for novel research. The &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/external-ver-1-0-open-compute-specification-server-component-resilience-sdc-workstream-docx-pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Server Component Resilience (SDC) Specification&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; discusses the opportunities ahead with standardized information exchange, test metrics, and open frameworks for detecting and mitigating errors. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;AI accelerators&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;AI represents a fundamental platform shift requiring us to innovate across hardware and software. Google has played an active role in driving standardization efforts for AI accelerators, particularly in areas like low-precision data formats (e.g., &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OCP FP8 and MX&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;), software frameworks (e.g., &lt;/span&gt;&lt;a href="https://openxla.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OpenXLA&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, JAX, TensorFlow), and networking (&lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/introducing-falcon-a-reliable-low-latency-hardware-transport"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Falcon&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://ultraethernet.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ultra Ethernet&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://www.businesswire.com/news/home/20240530653602/en/AMD-Broadcom-Cisco-Google-Hewlett-Packard-Enterprise-Intel-Meta-and-Microsoft-Form-Ultra-Accelerator-Link-UALink-Promoter-Group-to-Drive-Data-Center-AI-Connectivity" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ultra Accelerator Link&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;). Working with other hyperscalers and GPU suppliers, we have also aligned on common specifications for &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/ocp-gpu-fw-update-specification-v0-9-pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;firmware updates&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/ocp-gpu-accelerator-management-interfaces-v0-9-pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;management interfaces&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/ocp-gpu-and-accelerators-ras-requirements-v0-9-pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;RAS&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (reliability, availability, serviceability). &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But as AI continues to drive exponential demands on computing, we can do more. As part of the &lt;/span&gt;&lt;a href="https://www.opencompute.org/projects/open-systems-for-ai" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OCP AI Strategic initiative&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, we are sharing learnings from deploying over 1 GW of liquid cooled infrastructure to help the industry scale this capability. We are also identifying new power-delivery solutions, from chips to racks to data centers. Notably, akin to how Google &lt;/span&gt;&lt;a href="https://www.datacenterfrontier.com/cloud/article/11431310/google-unveils-48v-data-center-rack-joins-open-compute" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;led the industry with 48V racks&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, at OCP Summit this year, we are proposing 400V DC distribution and rack solutions that can significantly improve data center density and efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Systems infrastructure&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Finally, we continue to make great progress on foundational systems infrastructure. Google's contributions this past year span NVM Express for the data center (e.g., security enhancements, open test repositories), servers (e.g., &lt;/span&gt;&lt;a href="https://opentitan.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OpenTitan&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; platform root of trust), and networking (Falcon, SONiC advancements in telemetry and simulation, advanced PCIe enclosure compatible form factor), as well as new efforts such as open-source random shock and vibration testing. At the same time, we’ve gone beyond technical contributions to form and co-chair the &lt;/span&gt;&lt;a href="https://www.opencompute.org/about/advisory-board" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OCP Advisory Board&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; as well as guide the formation of the OCP AI Strategic Initiative.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Looking ahead, we will continue to keep innovating in this space, particularly to meet the next level of scale required by AI infrastructure. Notably, at the OCP Summit this year, we are discussing the adoption of robotics and automation for data centers. Across a range of activities (material movement, monitoring/inspection, servicing/repair, media management), robotics enable data center operations to scale safely and sustainably, and present a fundamental shift in how we build these facilities. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Innovating for the new intelligence revolution&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We have a lot to be proud of over the past 25 years of hyperscale computing, but the best is yet to come. With AI, we are at an exciting inflection point in computing: the beginning of the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;new intelligence revolution&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. Akin to prior shifts — the industrial revolution for manufacturing or the information revolution with the mobile internet — this revolution will have a profound impact on both technology and society, and holistic system innovations will be key to enabling it. We look forward to collaborating with all of you on this exciting journey. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 15 Oct 2024 18:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/2024-ocp-global-summit-keynote/</guid><category>Systems</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/OCP24_blog_hero_1.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Sustainable silicon to intelligent clouds: collaborating for the future of computing</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/OCP24_blog_hero_1.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/2024-ocp-global-summit-keynote/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Parthasarathy Ranganathan</name><title>VP, Engineering Fellow</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amber Huffman</name><title>Principal Engineer, Google</title><department></department><company></company></author></item><item><title>Advancing systems research: Synthesized Google storage I/O traces now available to the community</title><link>https://cloud.google.com/blog/topics/systems/synthesized-google-storage-io-traces-now-available-as-open-source/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Designing large-scale distributed storage systems is a complex challenge, requiring deep insights into how storage hardware and software interact under real-world conditions. To empower researchers in this field, we recently released a collection of synthesized Google I/O traces for storage servers and disks. This release accompanies our paper, "&lt;/span&gt;&lt;a href="https://dl.acm.org/doi/10.1145/3620666.3651337" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Thesios: Synthesizing Accurate Counterfactual I/O Traces from I/O Samples&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;," published at ASPLOS 2024.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;What are I/O traces and why do they matter?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;I/O traces are records of the input/output operations happening on storage devices and servers, and are crucial for understanding real-world storage behavior and performance. Representative I/O traces that capture the diverse patterns and demands of exascale data centers (such as Google’s) are especially valuable. By studying these traces, researchers can:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Gain deeper insights into storage system performance and bottlenecks&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Build more accurate models and simulate realistic workloads&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Develop targeted optimizations for more efficient and reliable storage systems&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But obtaining high-quality I/O traces is challenging due to storage-system heterogeneity and the need to capture details while minimizing overhead. To address these issues, we developed a novel methodology called Thesios.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Introducing Thesios: a methodology for I/O trace synthesis&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We developed the Thesios&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; methodology to create accurate and representative I/O traces. Thesios achieves this by combining down-sampled I/O traces (which are routinely collected in Google's data centers) from multiple disks across multiple storage servers.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
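&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a rough illustration of the core idea (not the production pipeline), the sketch below merges down-sampled I/O records from several similar disks into a single full-resolution trace; the field names and the 1/N-sampling assumption are ours, not the released schema.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import heapq
import operator

def synthesize_disk_trace(sampled_traces, sampling_rate):
    """Rough sketch of the core Thesios idea: combine down-sampled I/O
    records from several similar disks into one full-resolution trace for a
    single hypothetical disk. Each per-disk list is assumed to be sorted by
    timestamp; with a 1/N sampling rate, merging samples from N similar
    disks approximates the full load of one disk."""
    disks_needed = round(1.0 / sampling_rate)
    if operator.lt(len(sampled_traces), disks_needed):
        raise ValueError("need samples from at least 1/sampling_rate similar disks")
    chosen = sampled_traces[:disks_needed]
    merged = heapq.merge(*chosen, key=lambda record: record["timestamp"])
    return list(merged)
&lt;/pre&gt;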
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_RyWYHjV.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="qj2cn"&gt;Thesios synthesizes a full-resolution I/O trace for a single disk by combining I/O samples from multiple independent disks.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2-synthesis-method.jpg"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="qj2cn"&gt;Thesios requires (1) a sampling service that collects I/O samples, (2) an entity identifier that identifies similar disks, (3) a trace synthesizer that generates server-level traces by combining samples, and (4) a trace reorganizer that adjusts request ordering and latency to produce a disk-level trace.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The challenge? Storage systems are internally heterogeneous, so naively combining samples collected from disks varying in model, size, utilization, and other aspects will not result in a representative trace. Thesios intelligently accounts for this diversity, helping to ensure that the synthesized traces accurately reflect real-world conditions. Our results show remarkable accuracy relative to actual aggregated statistics that we’ve collected:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;95-99.5%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; accuracy in read/write request numbers&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;90-97%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; accuracy in utilization&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;80-99.8%&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; accuracy in read latency&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3-reads-breakdown.max-1000x1000.jpg"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="qj2cn"&gt;Total number of read operations and breakdown by latency-sensitive (L), throughput-oriented (TP) and other (O) requests of synthesized traces vs. the actual statistics. The traces synthesized by Thesios faithfully capture the fluctuation across days of the week, and hours of the day.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A unique capability of Thesios is the ability to synthesize counterfactual I/O traces for conducting data-driven “what-if'' studies. In our paper, we demonstrate how Thesios enables diverse counterfactual I/O-trace synthesis and analyses of hypothetical policy, hardware, and server changes via four example case studies:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Synthesizing I/O traces for disks with hypothetical capacities, utilization, and fullness&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Experimenting with data segregation to form hot and cold disks by using different workload filtering criteria and analyzing the data segregation’s impacts on power consumption&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Evaluating the impact on energy consumption and latency of deploying a low rotations-per-minute (RPM) disk&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Estimating the impact on cache hits of increasing buffer cache size on a server&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
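&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a small, hypothetical illustration of the second case study, a counterfactual trace can be produced by filtering a synthesized trace with a workload predicate; the field names below are assumptions, not the released trace schema.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
def split_hot_cold(trace, hot_filter):
    """Illustrative 'what-if' transformation: split one synthesized disk
    trace into hypothetical hot and cold traces using a caller-supplied
    predicate, e.g., latency-sensitive vs. throughput-oriented requests."""
    hot, cold = [], []
    for record in trace:
        (hot if hot_filter(record) else cold).append(record)
    return hot, cold

# Example (hypothetical field name): treat latency-sensitive requests as hot.
# hot, cold = split_hot_cold(trace, lambda r: r.get("io_class") == "latency_sensitive")
&lt;/pre&gt;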
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Why open-source these traces?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We have released two-month-long synthesized representative traces from three different Google storage clusters, containing approximately 2.5 billion I/O records. These traces include I/O operations from both user-facing and internal applications. Our goal is to fuel storage-systems research by sharing realistic workloads that we encountered in our large-scale data centers. We hope these traces will:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Inspire new optimizations and innovations in storage technology&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Enable more accurate simulations and modeling of large-scale storage systems&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Serve as a model for how industry can securely share production traces with academia, fostering collaboration and progress&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We invite systems researchers to explore our Google I/O traces. We believe these traces offer a unique opportunity to delve into the complex world of large-scale storage and drive meaningful advancements. Download the &lt;/span&gt;&lt;a href="https://github.com/google-research-datasets/thesios" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;traces&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and start your research today!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For a deeper dive into our methodology and the technical details, we encourage you to read our ASPLOS paper: &lt;/span&gt;&lt;a href="https://dl.acm.org/doi/10.1145/3620666.3651337" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Thesios: Synthesizing Accurate Counterfactual I/O Traces from I/O Samples&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;The research in this post describes joint work with our colleagues Soroush Ghodrati, Selene Moon, and Martin Maas. We also extend special thanks to Larry Greenfield, Mustafa Uysal, Arif Merchant, Seth Pollen, and Partha Ranganathan for their help and feedback on the trace release.&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 25 Jun 2024 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/synthesized-google-storage-io-traces-now-available-as-open-source/</guid><category>Storage &amp; Data Transfer</category><category>Systems</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Advancing systems research: Synthesized Google storage I/O traces now available to the community</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/synthesized-google-storage-io-traces-now-available-as-open-source/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Phitchaya Mangpo Phothilimthana</name><title>Staff Research Scientist, Google DeepMind</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Saurabh Kadekodi</name><title>Senior Research Scientist, Google</title><department></department><company></company></author></item><item><title>Announcing Trillium, the sixth generation of Google Cloud TPU</title><link>https://cloud.google.com/blog/products/compute/introducing-trillium-6th-gen-tpus/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Generative AI is transforming how we interact with technology while simultaneously opening tremendous efficiency &lt;/span&gt;&lt;a href="https://cloud.google.com/transform/101-real-world-generative-ai-use-cases-from-industry-leaders"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;opportunities for business impact&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. But these advances require ever greater compute, memory, and communication to train and fine tune the most capable models and to serve them interactively to a global user population. For more than a decade, we at Google have been developing custom AI-specific hardware, Tensor Processing Units, or TPUs, to push forward the frontier of what is possible in scale and efficiency. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This hardware supported a number of the innovations we announced today at Google I/O, including new models like &lt;/span&gt;&lt;a href="https://blog.google/technology/ai/google-gemini-update-flash-ai-assistant-io-2024" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini 1.5 Flash&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://blog.google/technology/ai/google-generative-ai-veo-imagen-3/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Imagen 3&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://developers.googleblog.com/en/gemma-family-and-toolkit-expansion-io-2024/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemma 2&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;; all of these models have been trained on and are served using TPUs. To deliver the next frontier of models and enable you to do the same, we’re excited to announce Trillium, our sixth-generation TPU, the most performant and most energy-efficient TPU to date.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; TPUs achieve an impressive 4.7X increase in peak compute performance per chip compared to TPU v5e. We doubled the High Bandwidth Memory (HBM) capacity and bandwidth, and also doubled the Interchip Interconnect (ICI) bandwidth over TPU v5e. Additionally, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; is equipped with third-generation &lt;/span&gt;&lt;a href="https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#sparsecore"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;SparseCore&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a specialized accelerator for processing ultra-large embeddings common in advanced ranking and recommendation workloads. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; TPUs make it possible to train the next wave of foundation models faster and serve those models with reduced latency and lower cost. Critically, our sixth-generation TPUs are also our most sustainable: Trillium TPUs are over 67% more energy-efficient than TPU v5e.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium can scale up to 256 TPUs in a single high-bandwidth, low-latency pod. Beyond this pod-level scalability, with &lt;/span&gt;&lt;a href="https://cloud.google.com/tpu/docs/multislice-introduction"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;multislice technology&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium Intelligence Processing Units (IPUs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;), Trillium TPUs can scale to hundreds of pods, connecting tens of thousands of chips in a building-scale supercomputer interconnected by a &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/the-evolution-of-googles-jupiter-data-center-network"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;multi-petabit-per-second datacenter network&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
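&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a quick back-of-the-envelope check on that scale (illustrative arithmetic only, with an assumed pod count), a few hundred 256-chip pods do indeed reach tens of thousands of chips:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
CHIPS_PER_POD = 256      # stated Trillium pod size
pods_in_cluster = 100    # "hundreds of pods"; assumed example value

total_chips = CHIPS_PER_POD * pods_in_cluster
print(f"{pods_in_cluster} pods x {CHIPS_PER_POD} chips/pod = {total_chips} chips")
# 100 pods x 256 chips/pod = 25600 chips, i.e., tens of thousands of chips.
&lt;/pre&gt;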
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The next phase of AI innovation with Trillium&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;More than a decade ago&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Google recognized the need for a first-of-its-kind chip for machine learning. In 2013, we began work on the world’s first purpose-built AI accelerator, TPU v1, followed by the first Cloud TPU in 2017. Without TPUs, many of Google’s most popular services — such as real-time voice search, photo object recognition, and interactive language translation, along with the state-of-the-art foundation models such as Gemini, Imagen, and Gemma — would not be possible. In fact, the scale and efficiency of TPUs enabled foundational work on &lt;/span&gt;&lt;a href="https://research.google/blog/transformer-a-novel-neural-network-architecture-for-language-understanding/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Transformers&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in Google Research, the algorithmic underpinnings of modern generative AI. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;4.7X increase in compute performance per Trillium chip&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;TPUs were designed from the ground up for neural networks, and we’re always working to improve training and serving times for AI workloads. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; achieves 4.7X peak compute per chip compared to TPU v5e. To achieve this level of performance, we’ve expanded the size of &lt;/span&gt;&lt;a href="https://cloud.google.com/tpu/docs/system-architecture-tpu-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;matrix multiply units (MXUs)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and increased the clock speed. Additionally, SparseCores accelerate embedding-heavy workloads by strategically offloading random and fine-grained access from TensorCores. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;2X ICI and High Bandwidth Memory (HBM) capacity and bandwidth&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Doubling the HBM capacity and bandwidth allows Trillium to work with larger models with more weights and larger key-value caches. Next-generation HBM enables higher memory bandwidth, improved power efficiency, and a flexible channel architecture to increase memory throughput. This improves training time and serving latency for large models. That’s twice the model weights and key-value caches, accessed faster and with more compute capacity for accelerating ML workloads. Doubling the ICI bandwidth enables training and inference jobs to scale to tens of thousands of chips powered by a strategic combination of custom optical ICI interconnects with 256 chips in a pod and &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/the-evolution-of-googles-jupiter-data-center-network?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Jupiter Networking&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; that extends scalability to hundreds of pods in a cluster.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; will power the next generation of AI models &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium TPUs will power the next wave of AI models and agents, and we’re looking forward to helping enable our customers with these advanced capabilities. For example, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Essential AI’s&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; mission is to deepen the partnership between humans and computers, and the company looks forward to using Trillium to reinvent how businesses operate. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Nuro&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is dedicated to creating a better everyday life through robotics by training their models with Cloud TPUs; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Deep Genomics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is powering the future of drug discovery with AI and looking forward to how their next foundational model, powered by Trillium, will change the lives of patients; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Deloitte&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, Google Cloud Partner of the Year for AI, will offer Trillium to transform businesses with generative AI. Support for training and serving of long-context, multimodal models on Trillium TPUs will also enable &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google DeepMind&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to train and serve future generations of Gemini models faster, more efficiently, and with lower latency than ever before.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_carousel"&gt;


&lt;div class="h-c-page article-module"&gt;
  &lt;div class="article-module glue-pagination h-c-carousel h-c-carousel--simple h-c-carousel--dark ng-cloak" data-glue-pagination-config="{cyclical: true}"&gt;

    &lt;div class="h-c-carousel__wrap"&gt;
      &lt;ul class="glue-carousel ng-cloak" data-glue-carousel-options="{pointerTypes: ['touch', 'mouse'], jump: true}"&gt;

        
          &lt;li class="h-c-carousel__item article-carousel__slide"&gt;
            &lt;figure&gt;
              
                
                  
                  &lt;div class="article-carousel__slide-img" style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/1_jeff_dean.max-2000x2000.png);"&gt;&lt;span class="h-u-visually-hidden"&gt;Jeff Dean&lt;/span&gt;&lt;/div&gt;
                
              

              
            &lt;/figure&gt;
          &lt;/li&gt;
        
          &lt;li class="h-c-carousel__item article-carousel__slide"&gt;
            &lt;figure&gt;
              
                
                  
                  &lt;div class="article-carousel__slide-img" style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/Blog-post-5.max-2000x2000.png);"&gt;&lt;span class="h-u-visually-hidden"&gt;Blog-post-5&lt;/span&gt;&lt;/div&gt;
                
              

              
            &lt;/figure&gt;
          &lt;/li&gt;
        
          &lt;li class="h-c-carousel__item article-carousel__slide"&gt;
            &lt;figure&gt;
              
                
                  
                  &lt;div class="article-carousel__slide-img" style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/2_Andrew_Clare.max-2000x2000.png);"&gt;&lt;span class="h-u-visually-hidden"&gt;Andrew Clare&lt;/span&gt;&lt;/div&gt;
                
              

              
            &lt;/figure&gt;
          &lt;/li&gt;
        
          &lt;li class="h-c-carousel__item article-carousel__slide"&gt;
            &lt;figure&gt;
              
                
                  
                  &lt;div class="article-carousel__slide-img" style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/3_Brendan_Frey.max-2000x2000.png);"&gt;&lt;span class="h-u-visually-hidden"&gt;Brendan Frey&lt;/span&gt;&lt;/div&gt;
                
              

              
            &lt;/figure&gt;
          &lt;/li&gt;
        
          &lt;li class="h-c-carousel__item article-carousel__slide"&gt;
            &lt;figure&gt;
              
                
                  
                  &lt;div class="article-carousel__slide-img" style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/4_Matt_Lacey.max-2000x2000.png);"&gt;&lt;span class="h-u-visually-hidden"&gt;Matt Lacey&lt;/span&gt;&lt;/div&gt;
                
              

              
            &lt;/figure&gt;
          &lt;/li&gt;
        

      &lt;/ul&gt;

      &lt;div class="h-c-carousel__paginate glue-pagination-previous" data-glue-pagination-label="Previous" data-glue-pagination-update-model="false"&gt;
        &lt;div class="h-c-carousel__paginate-wrap"&gt;
          &lt;svg role="img" class="h-c-icon h-c-icon--keyboard-arrow-left"&gt;
            &lt;use xlink:href="#mi-keyboard-arrow-right"&gt;&lt;/use&gt;
          &lt;/svg&gt;
        &lt;/div&gt;
      &lt;/div&gt;

      &lt;div class="h-c-carousel__paginate glue-pagination-next" data-glue-pagination-label="Next" data-glue-pagination-update-model="false"&gt;
        &lt;div class="h-c-carousel__paginate-wrap"&gt;
          &lt;svg role="img" class="h-c-icon h-c-icon--keyboard-arrow-right"&gt;
            &lt;use xlink:href="#mi-keyboard-arrow-right"&gt;&lt;/use&gt;
          &lt;/svg&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;div class="h-c-carousel__navigation"&gt;
      &lt;div class="glue-pagination-page-list"&gt;&lt;/div&gt;
    &lt;/div&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and AI Hypercomputer &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium TPUs are a part of Google Cloud's &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/ai-hypercomputer"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a groundbreaking supercomputing architecture designed specifically for cutting-edge AI workloads. It integrates performance-optimized infrastructure (including Trillium TPUs), open-source software frameworks, and flexible consumption models. Our commitment to open-source libraries like JAX, PyTorch/XLA, and Keras 3 empowers developers. Support for JAX and XLA means that declarative model description written for any previous generation of TPUs maps directly to the new hardware and network capabilities of Trillium TPUs. We've also partnered with Hugging Face on Optimum-TPU for streamlined model training and serving.&lt;/span&gt;&lt;/p&gt;
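&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To illustrate what a hardware-portable, declarative model description looks like in practice, here is a minimal JAX example (our own toy function, not a Google-provided sample); XLA compiles the same code for whichever backend is attached, whether CPU, GPU, or any TPU generation.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import jax
import jax.numpy as jnp

@jax.jit  # XLA compiles this declarative description for the attached backend.
def predict(params, x):
    hidden = jnp.tanh(x @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]

key = jax.random.PRNGKey(0)
params = {
    "w1": jax.random.normal(key, (8, 16)),
    "b1": jnp.zeros(16),
    "w2": jax.random.normal(key, (16, 4)),
    "b2": jnp.zeros(4),
}
x = jnp.ones((2, 8))
print(predict(params, x).shape)  # (2, 4) on CPU, GPU, or TPU alike
&lt;/pre&gt;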
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Our partnership with Google Cloud makes it easier for Hugging Face users to fine-tune and run open models on Google Cloud’s AI infrastructure, including TPUs. We are excited to further accelerate open source AI with the upcoming &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;sixth-generation&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; Trillium TPUs, and we expect open models to continue to deliver optimal performance thanks to the 4.7X increase in performance per chip compared to the previous generation. We will make the performance of Trillium easily available to all AI builders through our new Optimum-TPU library!" &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- Jeff Boudier, Head of Product, Hugging Face&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://sada.com/" rel="noopener" target="_blank"&gt;SADA&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (An Insight Company)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; has been Partner of the Year each year since 2017 and delivers Google Cloud Services for maximum impact. &lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;As a proud Google Cloud Premier Partner, SADA has a 20-year long history with the world’s established AI pioneer. We are rapidly integrating AI for thousands of diverse customers. With our depth of experience and the AI Hypercomputer architecture, we can't wait to help our customers unlock the value of this next frontier of generative AI models with Trillium. -&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; Miles Ward, CTO, SADA&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI Hypercomputer also offers the flexible consumption models required for AI/ML workloads. Dynamic Workload Scheduler (DWS) makes it easier to access AI/ML resources and helps customers optimize their spend. Flex start mode can improve the experience of bursty workloads such as training, fine-tuning, or batch jobs, by scheduling all the accelerators needed simultaneously, regardless of your entry point: Vertex AI Training, Google Kubernetes Engine (GKE) or Google Cloud &lt;span style="vertical-align: baseline;"&gt;Compute &lt;/span&gt;Engine.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Lightricks is excited about the performance increases and efficiency gains it is seeing from AI Hypercomputer. &lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="vertical-align: baseline;"&gt;“&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;We’ve been using TPUs for our text-to-image and text-to-video models since Cloud TPU v4. With TPU v5p and AI Hypercomputer efficiencies, we achieved a whopping 2.5X increase in training speed! The 6th generation of Trillium TPUs are incredible with a 4.7X increased compute performance per chip and 2X HBM Capacity and Bandwidth improvement over the previous generation. This came just in time for us as we scale our text-to-video models. We’re also looking forward to using Dynamic Workload Scheduler’s flex start mode to manage our batch inference jobs and to manage our future TPU reservations.” -&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; Yoav HaCohen, PhD, Core Generative AI Research Team Lead, Lightricks&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Learn more about Google Cloud Trillium TPUs &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud TPUs are the cutting-edge of AI acceleration, custom-designed and optimized to empower large-scale artificial intelligence models. Exclusively available through Google Cloud, TPUs deliver unparalleled performance and cost-efficiency for training and serving AI solutions. Whether it's the complex intricacies of large language models or the creative potential of image generation, TPUs help enable developers and researchers to push the boundaries of what's possible in the world of artificial intelligence. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The sixth-generation Trillium TPUs are a culmination of over a decade of research and innovation and will be available later this year. To learn more about &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Trillium&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; TPUs and AI Hypercomputer, &lt;/span&gt;&lt;a href="https://inthecloud.withgoogle.com/content-promotion-ai-infra-contact/dl-cd.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;please complete this form&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and our sales team will be in touch.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 14 May 2024 18:05:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/introducing-trillium-6th-gen-tpus/</guid><category>AI &amp; Machine Learning</category><category>Systems</category><category>Google I/O</category><category>Compute</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Most-Advanced-TPU_1.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Announcing Trillium, the sixth generation of Google Cloud TPU</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Most-Advanced-TPU_1.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/introducing-trillium-6th-gen-tpus/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amin Vahdat</name><title>SVP and Chief Technologist, AI and Infrastructure</title><department></department><company></company></author></item><item><title>Caliptra: Building trust, one chip at a time</title><link>https://cloud.google.com/blog/topics/systems/google-security-innovation-at-the-ocp-regional-summit/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google, we build &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/google-systems-innovations-at-ocp-global-summit"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;sustainable, secure, and scalable hardware and software&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to enable services that support billions of users. We have &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/announcing-open-innovations-for-a-new-era-of-systems-design"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;embraced open innovation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; as a core tenet to deliver these experiences. Our society’s AI-driven future includes many types of system-on-chips (SoCs) acting in concert with each other — from CPUs to GPUs to TPUs to NICs to SSDs and more. To deliver secure solutions at scale, there must be trust and transparency for the firmware that runs on all of these chips.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Welcoming Caliptra 1.0 &lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/announcing-open-innovations-for-a-new-era-of-systems-design?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;partnered&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with AMD, Microsoft, and NVIDIA to develop &lt;/span&gt;&lt;a href="https://www.opencompute.org/blog/cloud-security-integrating-trust-into-every-chip" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Caliptra&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a standard at the &lt;/span&gt;&lt;a href="http://www.opencompute.org" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Open Compute Project&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (OCP) to raise the bar on security for chips. Caliptra is a hardware root-of-trust (RoT) that provides verifiable cryptographic assurances to help ensure that only recognized and trusted firmware is allowed to run production workloads. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Caliptra’s initial focus is on hardware implementations used in confidential computing, and, over time, will extend to all chips. To address the increasingly sophisticated nature of cyberattacks, the team went beyond a written specification to deliver an open-source implementation at the &lt;/span&gt;&lt;a href="https://www.chipsalliance.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;CHIPS Alliance&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. The result is a silicon-level intellectual property (IP) block for integration into future chips, including CPUs, GPUs, and SSDs. The Caliptra source code also covers the block’s ROM and firmware.&lt;/span&gt;&lt;/p&gt;
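&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the idea of gating boot on recognized firmware concrete, here is a minimal, illustrative Python sketch. It is not Caliptra’s ROM or firmware logic; the helper names and the hard-coded allowlist are assumptions chosen only to show the measure-then-verify pattern, and a real silicon root of trust anchors trust in hardware and verifies cryptographic signatures rather than comparing against a fixed list.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import hashlib

def measure(firmware_image):
    """Hash the firmware image to produce its measurement (SHA-384 digest)."""
    return hashlib.sha384(firmware_image).hexdigest()

# Illustrative allowlist of approved firmware measurements. This stands in for
# the verifiable cryptographic checks a hardware root of trust performs.
TRUSTED_IMAGE = b"known-good firmware payload"
APPROVED_MEASUREMENTS = {measure(TRUSTED_IMAGE)}

def allow_boot(firmware_image):
    """Permit the SoC to run this firmware only if its measurement is recognized."""
    return measure(firmware_image) in APPROVED_MEASUREMENTS

print(allow_boot(TRUSTED_IMAGE))         # True: recognized firmware may run
print(allow_boot(b"tampered firmware"))  # False: unrecognized firmware is rejected
&lt;/pre&gt;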
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are pleased to announce that the Caliptra specification and open-source hardware and software implementation is complete, reaching the revision 1.0 milestone. The Caliptra community continues to grow and now includes 9elements, AMI, Antmicro, &lt;span style="vertical-align: baseline;"&gt;ASPEED&lt;/span&gt;, Axiado, Lubis EDA, ScaleFlux, Marvell and Nuvoton, who together have significant domain expertise across SoC design automation, firmware, and verification.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Caliptra IP block is currently being integrated by companies across the ecosystem into chips that will start to appear in the market in 2026. In less than two years, we have gone from project inception to a complete specification and open-source implementation of the hardware and software.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_VMCd7c5.max-1000x1000.png"
        
          alt="image1"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The team is already working on the next iteration with Caliptra 2.0, which will tackle quantum cryptography to comply with NIST’s recommendations for &lt;/span&gt;&lt;a href="https://csrc.nist.gov/pubs/fips/204/ipd" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;module-lattice-based digital signatures&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://csrc.nist.gov/pubs/sp/800/208/final" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;stateful hash-based signature schemes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Download the &lt;/span&gt;&lt;a href="https://www.opencompute.org/documents/ocp-caliptra-1-0-20240418-noheaders-pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Caliptra 1.0 specification&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and access the open source repositories at &lt;/span&gt;&lt;a href="http://caliptra.io" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;caliptra.io&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;OCP S.A.F.E.&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google, Microsoft, and OCP are also engaged in a complementary effort to raise the bar on security assessments: &lt;/span&gt;&lt;a href="https://drive.google.com/file/d/1Yyt8jGCbLf6nARPXTWGaFWhZOGqBm6Vn/view?usp=drive_link" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OCP Security Appraisal Framework for Enablement&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (OCP S.A.F.E.). This program provides security conformance assurance to consumers of devices such as SSDs. The program has certified a list of approved OCP Security Review Providers (SRPs) who conduct security conformance reviews that verify the provenance, code quality, and software supply chain of firmware releases and patches for devices, while protecting the intellectual property of the device vendors. You can learn more about OCP’s &lt;/span&gt;&lt;a href="https://www.opencompute.org/projects/ocp-safe-program" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;S.A.F.E. program here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;What’s to come&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Already, Caliptra has emerged as a high-quality specification and implementation that addresses a complex security problem. And we’re following up on it with a new initiative called &lt;/span&gt;&lt;a href="https://drive.google.com/file/d/1_WWIyTUzWtub_IdePcfOvwjAXKbOXGd2/view?usp=drive_link" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OCP Layered Open-source Cryptographic Key-management&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (OCP L.O.C.K.). Established by Google, Microsoft, Samsung, Solidigm, and KIOXIA, OCP L.O.C.K. defines and implements a standard NVM Express (NVMe) key-management block to protect customer data even if a physical drive is stolen from a data center. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It’s energizing to unite with industry leaders to deliver technology that will make society’s infrastructure more trustworthy and secure, using open source as a mechanism to help the hardware, firmware, and software achieve the standard’s objectives in a transparent and auditable manner. You can learn more about Caliptra, OCP S.A.F.E., and OCP L.O.C.K. at the &lt;/span&gt;&lt;a href="https://2024ocpregional.fnvirtual.app/a/schedule/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OCP Regional Summit&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; this week in Lisbon, Portugal. We are looking forward to discussing these technologies and inventing the future together.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 23 Apr 2024 20:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/google-security-innovation-at-the-ocp-regional-summit/</guid><category>Security &amp; Identity</category><category>Systems</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Caliptra: Building trust, one chip at a time</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/google-security-innovation-at-the-ocp-regional-summit/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Andrés Lagar-Cavilla</name><title>Distinguished Engineer, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amber Huffman</name><title>Principal Engineer, Google</title><department></department><company></company></author></item><item><title>What’s new with Google Cloud’s AI Hypercomputer architecture</title><link>https://cloud.google.com/blog/products/compute/whats-new-with-google-clouds-ai-hypercomputer-architecture/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Advancements in AI are unlocking use-cases previously thought impossible. Larger and more complex AI models are enabling powerful capabilities across a full range of applications involving text, code, images, videos, voice, music, and more. As a result, leveraging AI has become an innovation imperative for businesses and organizations around the world, with the potential to boost human potential and productivity.  &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;However, the AI workloads powering these exciting use-cases place incredible demands on the underlying&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; compute, networking, and storage infrastructure&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. And that’s only one aspect of the architecture: customers also face the challenge of integrating open-source software, frameworks, and data platforms, while optimizing for resource consumption to harness the power of AI cost-effectively. Historically, this has required manually combining component-level enhancements, which can lead to inefficiencies and bottlenecks. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That’s why today we’re pleased to announce significant enhancements at every layer of our &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/ai-hypercomputer"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; architecture. This systems-level approach combines performance-optimized hardware, open software and frameworks, and flexible consumption models to enable developers and businesses to be more productive, because the overall system runs with higher performance and effectiveness, and the models generated are served more efficiently. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In fact, just last month, Forrester Research recognized Google as a Leader in &lt;/span&gt;&lt;a href="https://inthecloud.withgoogle.com/forrester-2024-ai-infra-wave/dl-cd.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;The Forrester Wave™: AI Infrastructure Solutions&lt;/span&gt;&lt;/a&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt;, Q1 2024, with the highest scores of any vendor evaluated in both the Current Offering and Strategy categories in this report.  &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The announcements we’re making today span every layer of the AI Hypercomputer architecture:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Performance-optimized hardware enhancements &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;including the general availability of Cloud TPU v5p, and A3 Mega VMs powered by NVIDIA H100 Tensor Core GPUs, with higher performance for large-scale training with enhanced networking capabilities&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Storage portfolio optimizations for AI workloads &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;including Hyperdisk ML, a new block storage service optimized for AI inference/serving workloads, and new caching capabilities in Cloud Storage FUSE and Parallelstore, which improve training and inferencing throughput and latency &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Open software advancements&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; including the introduction of JetStream — a throughput- and memory-optimized inference engine for large language models (LLMs) that offers higher performance per dollar on open models like Gemma 7B, and JAX and PyTorch/XLA releases that improve performance on both Cloud TPUs and NVIDIA GPUs&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;New flexible consumption options with Dynamic Workload Scheduler, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;including&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;calendar mode for start time assurance, and flex start mode for optimized economics&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Learn more about AI Hypercomputer with a rare look inside one of our data centers:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=XIyTs2Rr0sE"
      data-glue-modal-trigger="uni-modal-XIyTs2Rr0sE-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        

        &lt;div class="article-video__aspect-image"
          style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/240404_Ai-Infra_Thumb_v1.max-1000x1000.png);"&gt;
          &lt;span class="h-u-visually-hidden"&gt;Learn more about AI Hypercomputer with a rare look inside one of our data centers&lt;/span&gt;
        &lt;/div&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal-XIyTs2Rr0sE-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="XIyTs2Rr0sE"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=XIyTs2Rr0sE"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/AI-Hypercomputer-Architecture.max-1000x1000.jpg"
        
          alt="Screenshot 2024-04-05 at 6.26.35 PM"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Advances in performance-optimized hardware&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud TPU v5p GA&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We’re thrilled to announce the general availability of Cloud TPU v5p, our most powerful and scalable TPU to date. TPU v5p is a next-generation accelerator that is purpose-built to train some of the largest and most demanding generative AI models. A single TPU v5p pod contains 8,960 chips that run in unison — over 2x the chips in a TPU v4 pod. Beyond the larger scale, TPU v5p also delivers over 2x higher FLOPS and 3x more high-bandwidth memory on a per chip basis.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;It also delivers near-linear improvement in throughput as customers use larger slices, achieving 11.97X throughput for a 12x increase in slice size (from 512 to 6144 chips).&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
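&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a quick back-of-the-envelope check on the near-linear scaling claim, the short Python sketch below turns the figures quoted above into a scaling-efficiency number; the numbers come from this post, and the small helper function is illustrative.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# TPU v5p figures quoted above: growing a slice from 512 to 6,144 chips (12x)
# delivers 11.97x the throughput.
def scaling_efficiency(speedup, scale_factor):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / scale_factor

chips_small, chips_large = 512, 6144
speedup = 11.97
efficiency = scaling_efficiency(speedup, chips_large / chips_small)
print(f"{chips_large // chips_small}x more chips, {efficiency:.1%} of ideal linear scaling")
# Prints roughly 99.8% of ideal linear scaling.
&lt;/pre&gt;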
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Comprehensive GKE support for TPU v5p&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;To enable training and serving the largest AI models on GKE across large-scale TPU clusters, today we’re also announcing the general availability of both Google Kubernetes Engine (GKE) support for Cloud TPU v5p and TPU multi-host serving on GKE. TPU multi-host serving on GKE allows customers to manage a group of model servers deployed over multiple hosts as a single logical unit, so they can be managed and monitored centrally.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="vertical-align: baseline;"&gt;“&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;By leveraging Google Cloud’s TPU v5p on Google Kubernetes Engine (GKE), Lightricks has achieved a remarkable 2.5X speed-up in training our text-to-image and text-to-video models compared to TPU v4. &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;GKE ensures that we are able to smoothly leverage TPU v5p for the specific training jobs that need the performance boost&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- Yoav HaCohen, PhD, Core Generative AI Research Team Lead, Lightricks&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Expanded NVIDIA H100 GPU capabilities with A3 Mega GA and Confidential Compute&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We’re also expanding our NVIDIA GPU capabilities with additions to the A3 VM family, which now includes A3 Mega. A3 Mega, powered by NVIDIA H100 GPUs, will be generally available next month and offers double the GPU-to-GPU networking bandwidth of A3. Confidential Computing will also be coming to the A3 VM family, in preview later this year. Enabling confidential VMs on the A3 machine series protects the confidentiality and integrity of sensitive data and AI workloads and mitigates threats from unauthorized access. Enabling Confidential Computing on the A3 VM family encrypts the data transfers between the Intel TDX-enabled CPU and NVIDIA H100 GPU via protected PCIe, and requires no code changes.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Bringing NVIDIA Blackwell GPUs to Google Cloud&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We also recently announced that we will be bringing NVIDIA’s newest Blackwell platform to our AI Hypercomputer architecture in two configurations. Google Cloud customers will have access to VMs powered by both the NVIDIA HGX B200 and GB200 NVL72 GPUs. The new VMs with HGX B200 GPUs &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;are designed for the most demanding AI, data analytics, and HPC workloads&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, while the upcoming VMs powered by the liquid-cooled GB200 NVL72 GPU will enable &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;a new era of computing with real-time LLM inference and massive-scale training performance for trillion-parameter scale models.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Customers leveraging both Google Cloud TPU and GPU-based services&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Character.AI is a powerful, direct-to-consumer AI computing platform where users can easily create and interact with a variety of characters. Character.AI is using Google Cloud’s AI Hypercomputer architecture across GPU- and TPU-based infrastructure to meet the needs of its rapidly growing community. &lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Character.AI is using Google Cloud's Tensor Processor Units (TPUs) and A3 VMs running on NVIDIA H100 Tensor Core GPUs to train and infer LLMs faster and more efficiently. The optionality of GPUs and TPUs running on the powerful AI-first infrastructure makes Google Cloud our obvious choice as we scale to deliver new features and capabilities to millions of users. It’s exciting to see the innovation of next-generation accelerators in the overall AI landscape, including Google Cloud TPU v5e and A3 VMs with H100 GPUs. We expect both of these platforms to offer more than 2X more cost-efficient performance than their respective previous generations.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Noam Shazeer&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, CEO, Character AI&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Storage optimized for AI/ML workloads&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To improve AI training, fine-tuning, and inference performance, we've added a number of enhancements to our storage products, including caching, which keeps the data closer to your compute instances, so you can train much faster. Each of these improvements also maximizes GPU and TPU utilization, leading to higher energy efficiency and cost optimization. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/storage/docs/gcs-fuse"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Storage FUSE&lt;/span&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;(generally available) is a file-based interface for Google Cloud Storage that harnesses Cloud Storage capabilities for more complex AI/ML apps by providing file access to our high-performance, low-cost cloud storage solutions. Today we announced that new caching capabilities are generally available. Cloud Storage FUSE caching improves training throughput by 2.9X and improves serving performance for one of our own foundation models by 2.2X.&lt;/span&gt;&lt;/p&gt;
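&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Because Cloud Storage FUSE exposes a bucket as an ordinary directory, training and serving code can read objects with standard file I/O. The Python sketch below only illustrates that file-based access pattern; the mount point and shard name are hypothetical, and how the bucket is mounted and how its cache is configured are outside the scope of this snippet.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import os

# Hypothetical path where a Cloud Storage bucket has been mounted with
# Cloud Storage FUSE; the mount point and shard name are illustrative only.
MOUNT_POINT = "/mnt/gcs-bucket"

def read_shard(relative_name):
    """Read one training shard through the FUSE mount using plain file I/O."""
    path = os.path.join(MOUNT_POINT, relative_name)
    with open(path, "rb") as f:
        return f.read()

if __name__ == "__main__":
    data = read_shard("training-data/shard-00000")
    print("read", len(data), "bytes")
&lt;/pre&gt;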
&lt;p&gt;&lt;a href="https://cloud.google.com/parallelstore?e=48754805&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Parallelstore&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; now also includes caching (in preview). Parallelstore is a high-performance parallel filesystem optimized for AI/ML and HPC workloads. New caching capabilities enable up to 3.9X faster training times and up to 3.7X higher training throughput, compared to native ML framework data loaders&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/filestore?e=48754805&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Filestore&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (generally available) is optimized for AI/ML models that require low latency, file-based data access. The network file system-based approach allows all GPUs and TPUs within a cluster to simultaneously access the same data, which&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; improves training times by up to 56%, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;optimizing the performance of your AI workloads and boosting your most demanding AI projects.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’re also pleased to introduce Hyperdisk ML in preview&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;our next-generation block storage service optimized for AI inference/serving workloads. Hyperdisk ML &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;accelerates model load times up to 12X compared to common alternatives, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;and offers cost efficiency through read-only, multi-attach, and thin provisioning. It enables up to 2,500 instances to access the same volume and delivers up to 1.2 TiB/s of aggregate throughput per volume — &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;over 100X greater performance than Microsoft Azure Ultra SSD&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and Amazon EBS io2 BlockExpress.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Advancements in our open software&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Starting from frameworks and spanning the full software stack, we’re introducing open-source enhancements that enable customers to improve time-to-value for AI workloads by simplifying the developer experience while improving performance and cost efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;JAX and high-performance reference implementations&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We’re pleased to introduce &lt;/span&gt;&lt;a href="https://github.com/google/maxdiffusion" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MaxDiffusion&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a new high-performance and scalable reference implementation for diffusion models. We’re also introducing new LLM models in &lt;/span&gt;&lt;a href="https://github.com/google/maxtext" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MaxText&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, including&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; Gemma, GPT3, LLAMA2 and Mistral&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; across both Cloud TPUs and NVIDIA GPUs. Customers can jump-start their AI model development with these open-source implementations and then customize them further based on their needs. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;MaxText and MaxDiffusion models are built on JAX, a cutting-edge framework for high-performance numerical computing and large-scale machine learning. JAX in turn is integrated with the OpenXLA compiler, which optimizes numerical functions and delivers excellent performance at scale, allowing model builders to focus on the math and let the software drive the most effective implementation. We’ve heavily optimized JAX and OpenXLA performance on Cloud TPU and also partnered closely with NVIDIA to optimize OpenXLA performance on large Cloud GPU clusters.&lt;/span&gt;&lt;/p&gt;
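&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a small illustration of that division of labor, the JAX sketch below (not taken from MaxText or MaxDiffusion) writes a toy model as plain numerical code and hands it to the XLA compiler with jax.jit; the network shape and inputs are arbitrary placeholders.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import jax
import jax.numpy as jnp

@jax.jit  # XLA compiles and optimizes this function for the available backend (CPU, GPU, or TPU)
def predict(params, x):
    """A toy two-layer network: the author writes the math, XLA chooses the implementation."""
    w1, b1, w2, b2 = params
    hidden = jax.nn.relu(x @ w1 + b1)
    return hidden @ w2 + b2

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
params = (
    jax.random.normal(k1, (128, 256)), jnp.zeros(256),
    jax.random.normal(k2, (256, 10)), jnp.zeros(10),
)
x = jnp.ones((32, 128))
print(predict(params, x).shape)  # (32, 10)
&lt;/pre&gt;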
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Advancing PyTorch support&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;As part of our commitment to PyTorch,&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;support for PyTorch/XLA 2.3 will follow the upstream release later this month.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;PyTorch/XLA enables tens of thousands of PyTorch developers to get the best performance from XLA devices such as TPUs and GPUs, without having to learn a new framework. The new release brings features such as single program, multiple data (SPMD) auto-sharding, and asynchronous distributed checkpointing, making running a distributed training job much easier and more scalable.&lt;/span&gt;&lt;/p&gt;
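&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For PyTorch users the pattern is similar: existing training code largely carries over, with the XLA device substituted in. The fragment below is a minimal, single-device sketch using the torch_xla package; the model, data, and hyperparameters are placeholders, and features such as SPMD auto-sharding and distributed checkpointing require additional setup not shown here.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # a TPU (or other XLA device) behaves like any other torch device

# Placeholder model, optimizer, and synthetic data, purely for illustration.
model = torch.nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(3):
    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    xm.mark_step()  # materializes the lazily built XLA graph for this step
&lt;/pre&gt;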
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;And for PyTorch users in the Hugging Face community, we worked with Hugging Face to launch &lt;/span&gt;&lt;a href="https://huggingface.co/docs/optimum-tpu/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Optimum-TPU&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a performance-optimized package that will help developers easily train and serve Hugging Face models on TPUs. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;JetStream: New LLM inference engine&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We’re introducing &lt;/span&gt;&lt;a href="https://github.com/google/JetStream" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;JetStream&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, an open-source, throughput- and memory-optimized LLM inference engine for XLA devices, starting with TPUs, that offers up to 3x higher inferences per dollar on Gemma 7B and other open models. As customers bring their AI workloads to production, there’s an increasing demand for a cost-efficient inference stack that delivers high performance. JetStream supports models trained with both JAX and PyTorch/XLA, and includes optimizations for popular open models such as Llama 2 and Gemma. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Open community models in collaboration with NVIDIA&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Additionally, as part of the NVIDIA and Google collaboration with open community models, Google models will be available as NVIDIA NIM inference microservices to provide developers with an open, flexible platform to train and deploy using their preferred tools and frameworks.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;New Dynamic Workload Scheduler modes&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/compute/introducing-dynamic-workload-scheduler"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Workload Scheduler&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is a resource management and job scheduling service that’s designed for AI workloads. Dynamic Workload Scheduler improves access to AI computing capacity and helps you optimize your spend for AI workloads by scheduling all the accelerators needed simultaneously, and for a guaranteed duration. Dynamic Workload Scheduler offers two modes: flex start mode (in preview) for enhanced obtainability with optimized economics, and calendar mode (in preview) for predictable job start times and durations.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Flex start jobs are queued to run as soon as possible, based on resource availability, making it easier to obtain TPU and GPU resources for jobs that have a flexible start time. Flex start mode is now integrated across Compute Engine &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/instance-groups/create-resize-requests-mig"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Managed Instance Groups&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://cloud.google.com/batch/docs/get-started"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Batch&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs/training-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Custom Training&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, in addition to &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/provisioningrequest"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Kubernetes Engine&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (GKE). With flex start, you can now run thousands of AI/ML jobs with increased obtainability across the various TPU and GPU types offered in Google Cloud.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Calendar mode offers short-term reserved access to AI-optimized computing capacity. You can reserve collocated GPUs for up to 14 days, purchasable up to 8 weeks in advance. This new mode extends Compute Engine &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/using-compute-engine-future-reservations-for-capacity-planning"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;future reservation capabilities&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Your reservations are confirmed based on availability, and the capacity is delivered to your project on your requested start date. You can then simply create VMs targeting the capacity block for the entire duration of the reservation.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Dynamic Workload Scheduler improved on-demand GPU obtainability by 80%, accelerating experiment iteration for our researchers. Leveraging the built-in Kueue and GKE integration, we were able to take advantage of new GPU capacity in Dynamic Workload Scheduler quickly and save months of development work.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - Alex Hays, Software Engineer, Two Sigma&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;AI anywhere with Google Distributed Cloud&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The acceleration of AI adoption by enterprises has highlighted the need for flexible deployment options to process or securely analyze data closer to where it is generated. &lt;/span&gt;&lt;a href="https://cloud.google.com/distributed-cloud"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Distributed Cloud&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (GDC) brings the power of Google's cloud services wherever you need them — in your own data center or at the edge. Today we introduced several enhancements to GDC, including a generative AI search package solution powered by &lt;/span&gt;&lt;a href="https://ai.google.dev/gemma" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemma&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, an expanded ecosystem of partner solutions, new compliance certifications and more. Learn more about &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/infrastructure-modernization/unlock-ai-anywhere-with-google-distributed-cloud"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;how to use GDC to run AI anywhere&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Our growing momentum with Google AI infrastructure&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Next this week we’re launching incredible AI innovation across everything from AI platforms and models to AI assistance with Gemini for Google Cloud — all underpinned by a foundation of AI-optimized infrastructure. All of this innovation is driving incredible momentum for our customers. In fact, nearly 90% of generative AI unicorns and more than 60% of funded gen AI startups are Google Cloud customers. &lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Runway’s text-to-video platform is powered by AI Hypercomputer. At the base, A3 VMs, powered by NVIDIA H100 GPUs gave our training a significant performance boost over A2 VMs, enabling large-scale training and inference for our Gen-2 model. Using GKE to orchestrate our training jobs enables us to scale to thousands of H100s in a single fabric to meet our customers’ growing demand.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Anastasis Germanidis, CTO and Co-Founder, Runway&lt;/strong&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"By moving to Google Cloud and leveraging the AI Hypercomputer architecture with G2 VMs powered by NVIDIA L4 GPUs and Triton Inference Server, we saw a significant boost in our model inference performance while lowering our hosting costs by 15% using novel techniques enabled by the flexibility that Google Cloud offers.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;-&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Ashwin Kannan, Sr. Staff Machine Learning Engineer, Palo Alto Networks&lt;/strong&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"Writer's platform is powered by Google Cloud A3 and G2 VMs powered by NVIDIA H100 and L4 GPUs. With GKE we're able to efficiently train and inference over 17 large language models (LLMs) that scale up to over 70B parameters. We leverage Nvidia NeMo Framework to build our industrial strength models which generate 990,000 words a second with over a trillion API calls per month. We're delivering the highest quality inferencing models that exceed those from companies with larger teams and bigger budgets and all of that is possible with the Google and Nvidia partnership.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Waseem Alshikh Cofounder and CTO, Writer&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Learn more about &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/ai-hypercomputer"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; at the Next sessions below, and ask your &lt;/span&gt;&lt;a href="https://cloud.google.com/contact"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;sales representative&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; about how you can apply these capabilities within your own organization. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;SPTL205&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; - &lt;/span&gt;&lt;a href="https://cloud.withgoogle.com/next/session-library?filters=session-type-spotlight&amp;amp;session=SPTL205#all" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Workload-optimized and AI-powered Infrastructure&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;ARC108 &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;- &lt;/span&gt;&lt;a href="https://cloud.withgoogle.com/next/session-library?filters=session-type-spotlight&amp;amp;session=ARC108#all" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Take large scale AI from research to production with Google Cloud's AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;IHLT303&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; - &lt;/span&gt;&lt;a href="https://cloud.withgoogle.com/next/session-library?filters=session-type-spotlight&amp;amp;session=IHLT303#all" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;How Lightricks is powering generative image models with Cloud TPUs and AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;1. Forrester Research, The Forrester Wave™: AI Infrastructure Solutions, Q1 2024, Mike Gualtieri, Sudha Maheshwari, Sarah Morana, Jen Barton, March 17, 2024&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester’s call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave™. Information is based on the best available resources. Opinions reflect judgment at the time and are subject to change. &lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 09 Apr 2024 12:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/whats-new-with-google-clouds-ai-hypercomputer-architecture/</guid><category>AI &amp; Machine Learning</category><category>Google Cloud Next</category><category>Systems</category><category>AI Hypercomputer</category><category>Compute</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Next24_Blog_Images_6-02.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>What’s new with Google Cloud’s AI Hypercomputer architecture</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Next24_Blog_Images_6-02.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/whats-new-with-google-clouds-ai-hypercomputer-architecture/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Lohmeyer</name><title>VP and GM, AI and Computing Infrastructure</title><department></department><company></company></author></item><item><title>Coming of age in the fifth epoch of distributed computing, accelerated by machine learning</title><link>https://cloud.google.com/blog/topics/systems/the-fifth-epoch-of-distributed-computing/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;Editor’s note:&lt;/strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; Today, we hear from Google Fellow Amin Vahdat, who is the VP &amp;amp; GM for ML, Systems, and Cloud AI at Google. Amin originally delivered this as a &lt;/span&gt;&lt;a href="https://youtu.be/9lBbqH_1KS4?si=a5pIFwdhD86PglMp" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;keynote in 2023 at the University of Washington for The Allen School's Distinguished Lecture Series&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;. This post captures Amin’s reflections on the history of distributed computing, where we are today, and what we can expect for the next generation of computing services.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Over the past fifty years, computing and communication have transformed society with sustained exponential growth in capacity, efficiency, and capability. Over that time, we have, as a community, delivered a &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;50-million-fold&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; increase in transistor count per CPU and grown the Internet from 4 nodes to &lt;/span&gt;&lt;a href="https://www.internetworldstats.com/stats.htm" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;5.39 billion&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While these advances are impressive, the human capabilities that result from these advances are even more compelling, sometimes bordering on what was previously the domain of science fiction. We now have near-instantaneous access to the evolving state of human knowledge, limited only by our ability to make sense of it. We can now perform real-time language translation, breaking down fundamental barriers to human communication. Commensurate improvements in sensing and network speeds are delivering real-time &lt;/span&gt;&lt;a href="https://blog.google/technology/research/project-starline/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;holographic projections&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; that will begin to support meaningful interaction at a distance. This explosion in computing capability is also powering next-generation AI systems that are solving some of the hardest scientific and engineering challenges of our time, for example, &lt;/span&gt;&lt;a href="https://www.nature.com/articles/s41586-021-03819-2" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;predicting the 3D structure of a protein&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, almost instantly, down to atomic accuracy, or unlocking &lt;/span&gt;&lt;a href="https://deepmind.google/technologies/imagen-2/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;advanced text-to-image diffusion technology&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, delivering high-quality, photorealistic outputs that are consistent with a user’s prompt.  &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Maintaining the pace of underlying technological progress has not been easy. Every 10-15 years, we encounter fundamental challenges that require foundational inventions and breakthroughs to sustain the exponential growth of the efficiency and scale of our infrastructure, which in turn power entirely new categories of services. It is as if every factor of a thousand exposes some new fundamental, progressively more challenging limit that must be overcome and creates &lt;/span&gt;&lt;a href="https://ieeexplore.ieee.org/document/4785818" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;some transformative opportunity&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. We are in one of those watershed moments, a once-in-a-generation challenge and opportunity to maintain and accelerate the awe-inspiring rate of progress at a time when the underlying, seemingly insatiable demand for computing is only accelerating.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A look back on the brief history of computing suggests that we have worked through four such major transitions, each defining an ‘epoch’ of computing. We offer a historical taxonomy that points to a manifest need to define and to drive a&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; fifth epoch&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; of computing, one that is data-centric, declarative, outcome-oriented, software-defined, and centered on &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;proactively&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; bringing insights to people. While each previous epoch made the previously unimaginable routine, this fifth epoch will bring about the largest transformation thus far, promising to democratize access to knowledge and opportunity. But at the same time, it will require overcoming some of the most intrinsically difficult, and cross-stack challenges in computing.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We begin our look back at &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Epoch 0. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Purists will correctly argue that we could look back thousands of years further, but we choose to start with some truly landmark and foundational developments in computer science that took place between 1947-1969, laying the basis for modern computing and communication.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;strong style="vertical-align: baseline;"&gt;1947:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Bardeen&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, Brattain and Shockley&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; invent the first working transistor.&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;1948:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Shannon introduces Information Theory, the basis for all network communication.&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;1949:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Stored programs in computers become operational.&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;1956:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; High-level programming languages are invented.&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;1964: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Instruction Set Architectures, common across different hardware generations, emerge.&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;1965: &lt;/strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Moore%27s_law" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Moore’s Law&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; introduced, positing that transistor count per integrated circuit will double every 18-24 months.&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;1967:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Multi-user operating systems provide protected sharing of resources.&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;1969: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Introduction of the ARPANet, the basis for the modern Internet.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These breakthroughs became the basis for modern computing at the end of Epoch 0: four computers based on integrated circuits running stable instruction set architectures and a multi-user, time-shared operating system connected to a packet-switched internet. This seemingly humble baseline laid the foundation for exponential progress in subsequent epochs.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Epoch_1.max-1000x1000.png"
        
          alt="Epoch_1"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the first Epoch, computer networks were largely used in an asynchronous manner: transfer data across the network (e.g., via FTP), operate on it, and then transfer results back. &lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Notable developments: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;SQL&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;FTP, email, and Telnet&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Interaction time among computers:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; 100 milliseconds &lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Characteristics:&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;• Low-bandwidth, high-latency networks&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Rare pairwise interactions between expensive computers&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Character keystroke interactions with humans&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• The emergence of open source software &lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Breakthrough: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Personal computers&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Epoch_2.max-1000x1000.png"
        
          alt="Epoch_2"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Aided by increasing network speeds, prevalence of personal computers/workstations, and widespread, interoperable protocols (IP, TCP, NFS, HTTP), synchronous, transparent computation and communication became widespread in Epoch 2.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Notable developments:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Remote Procedure Call, client/server computing, LANs, leader election and consensus &lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Interaction time among computers:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; 10 milliseconds&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Characteristics:&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;• 10 Mbps networks&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Internet Architecture scales globally thanks to TCP/IP&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Full 32-bit CPU fits on a chip&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Shared resources between multiple computers&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Breakthrough:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The World Wide Web&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Epoch_3.max-1000x1000.png"
        
          alt="Epoch_3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In Epoch 3, the true breakthrough of HTTP and the World Wide Web brought network computing to the masses, breaking the confines of personal computing. To keep pace with continued exponential growth in the Internet and the needs of a global user population, many of the design patterns of modern computing were established during this period. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One of the drivers of Epoch 3 was the end of &lt;/span&gt;&lt;a href="https://www.rambus.com/blogs/understanding-dennard-scaling-2/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dennard scaling&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which essentially limited the maximum clock frequency of a single CPU core. This limitation led the industry to adopt multi-core architectures, necessitating a move toward asynchronous, multi-threaded, and concurrent development environments.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Notable developments: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;HTTP, three-tier services, massive clusters, web search &lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Interaction time among computers:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; 1 millisecond&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Characteristics:&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;• 100 Mbps–1Gbs networks&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Autonomous Systems / BGP &lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Complex apps no longer fit on a single server; scaling to many servers&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Web indexing and search, population-scale email&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Breakthrough: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster-based Internet services, mobile-first design, multithreading and instruction-level parallelism &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Epoch_4.max-1000x1000.png"
        
          alt="Epoch_4"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Epoch 4 established planetary-scale services available to billions of people through ubiquitous cellular devices. In parallel, a renaissance in machine learning drove more real-time control and insights. All of this was powered by warehouse-scale clusters of commodity computers interconnected by high-speed networks, which together processed vast datasets in real-time.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Notable developments:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Global cellular data coverage, planet-scale services, ubiquitous video&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Interaction time among computers:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; 100 microseconds&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Characteristics:&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;• 10-100 Gbps networks, flash&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Multiple cores per CPU socket&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Infrastructure that scales out across LANs (e.g., GFS, MapReduce, Hadoop)&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Mobile apps, global cellular data coverage&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Breakthroughs: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Mainstream machine learning, readily available specialized compute hardware, cloud computing.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Epoch_5.max-1000x1000.png"
        
          alt="Epoch_5"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we have transitioned to the fifth Epoch, which is marked by a superposition of two opposing trends. First, while transistor count per ASIC continues to increase at exponential rates, clock rates are flat and the cost of each transistor is now nearly flat, both limited by the increasing complexity and investment required to achieve smaller feature sizes. The implication is that performance normalized to cost improvements, or performance efficiency, of all of compute, DRAM, storage, and network infrastructure, is flattening. At the same time, ubiquitous network coverage, broadly deployed sensors, and data-hungry machine learning applications are accelerating the demand for raw computing infrastructure exponentially.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Notable developments:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Machine learning, generative AI, privacy, sustainability, societal infrastructure &lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Interaction time among computers:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; 10 microseconds&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Featuring&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• 200Gbps–1+Tb/s networks&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Ubiquitous, power-efficient, and high-speed wireless network coverage&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Increasingly specialized accelerators: TPUs, GPUs, Smart NICs&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Socket-level fabrics, optics, federated architectures&lt;br/&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;• Connected spaces, vehicles, appliances, wearables, etc…&lt;br/&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Breakthroughs: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Many coming...&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Without fundamental breakthroughs in computing design and organization, our ability as a community to meet societal demands for computing infrastructure will falter. Coming up with new architectures to overcome these limitations, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;new &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;hardware and increasingly, software architectures, will define the fifth epoch of computing.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While we cannot predict the breakthroughs that will be delivered in this fifth epoch of computing, we do know that each previous epoch has been characterized by a factor of 100x improvement in scale, efficiency, and cost-performance, all while improving security and reliability. The demand for scale and capability is only increasing, so delivering such gains without the tailwinds of Moore’s Law and Dennard scaling at our backs will be daunting. We imagine, however, the broad strokes will involve:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Declarative programming models:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Von_Neumann_architecture" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Von Neumann model&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; of sequential code execution on a dedicated processor has been incredibly useful for developers for decades. However, the rise of distributed and multi-threaded computing has broken the abstraction to the point where much of modern imperative code focuses on defensive, and often inefficient, constructs to manage asynchrony, heterogeneity, tail latency, optimistic concurrency, and failures. Complexity will only increase in the years ahead, essentially requiring new declarative programming models focused on intent, the user, and business logic. At the same time, managing execution flow and responding to shifting deployment conditions will need to be delegated to increasingly sophisticated compilers and &lt;/span&gt;&lt;a href="https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ML-powered runtimes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hardware segmentation:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; In earlier epochs, a general-purpose server architecture with a system balance of CPU, memory, storage, and networking could efficiently meet workload needs throughout the data center. However, when designing for specialized computing needs, ML training, inference, video processing, the conflicting requirements for storage, memory capacity, latency, bandwidth and communication is causing a proliferation of heterogeneous designs. When general-purpose compute performance was improving at 1.5x/year, pursuing even a 5x improvement for 10% of workloads did not make sense given the complexity. Today, such improvements can no longer be ignored. Addressing this gap will require new approaches to designing, verifying, qualifying, and deploying composable hardware ASICs and memory units in months, not years.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Software-defined infrastructure:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; As underlying infrastructure has become more complex and more distributed, multiple layers of virtualization from memory to CPU have maintained the single server abstraction for individual applications. This trend will continue in the coming epoch as infrastructure continues to scale out and become more heterogeneous. The corollary of hardware segmentation, declarative programming models and distributed computing environments comprised of thousands of servers, will stretch virtualization beyond the confines of individual servers to include distributed computing on a single server, multiple servers, storage/memory arrays, and clusters — in some cases bringing resources across an entire campus together to efficiently deliver end results.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Provably secure computation:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; In the last epoch, the need to sustain compute efficiency inadvertently came at the cost of &lt;/span&gt;&lt;a href="https://dl.acm.org/doi/abs/10.1145/3399742" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;security&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;reliability&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. However, as our lives move increasingly online, the need for privacy and confidentiality increases exponentially for individuals, for business, and governments. Data sovereignty, or the need to restrict the physical location of data, even derived, will become increasingly important to adhere to government policies, but also to transparently show the lineage of increasingly ML-generated content. Despite some cost in baseline performance, these needs must be first-class requirements and constraints. &lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Sustainability:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The first three epochs of computing delivered exponential improvements in performance for fixed power. With the end of Dennard scaling in the fourth epoch, global power consumption associated with computing has grown quickly, partially offset by the move to cloud-hosted infrastructure, which is 2-3x more power-efficient relative to earlier, on-premises designs. Further, cloud providers have made broad commitments to move to first carbon-neutral and then carbon-free power sources. However, the demand for data and compute will continue to grow and even likely accelerate in the fifth epoch. This will turn power-efficiency and carbon emissions into primary systems-evaluation metrics. Of particular note, &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Embedded_emissions" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;embodied carbon&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; over the entire lifecycle of infrastructure build and delivery will require both improved visibility and optimization. &lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Algorithmic innovation:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The tailwinds of exponentially increasing performance have allowed software efficiency improvements to often go neglected. As improvement in underlying hardware components slows, the focus will turn to software and algorithmic opportunities. &lt;/span&gt;&lt;a href="https://www.science.org/doi/10.1126/science.aam9744" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Studies&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; indicate that opportunities for 2-10x improvement in software optimization abound in systems code. Efficiently identifying these software optimization opportunities and developing techniques to gracefully and reliably deliver these benefits to production systems at scale will be a critical opportunity. Leveraging recent breakthroughs in coding LLMs to partially automate this work would be a significant accelerant in the fifth epoch. &lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
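&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the first item above concrete, here is a minimal, hypothetical Python sketch; it is illustrative only and does not reflect any existing Google API. The imperative version hand-codes timeouts, retries, and backoff to defend against stragglers and failures, while the declarative version states only the intent and its constraints and leaves placement, hedging, and retry policy to a compiler or runtime.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Illustrative sketch only; fetch, Intent, deadline_ms, and consistency are
# hypothetical names, not an existing API.
import time


def fetch(replica, key, timeout_s):
    # Placeholder for a network call to one replica of a service.
    return "%s:%s" % (replica, key)


def call_with_retries(replicas, key, timeout_s=0.05, max_attempts=3):
    # Imperative style: the application owns timeouts, retries, and backoff,
    # i.e., the defensive boilerplate described in the list above.
    for attempt in range(max_attempts):
        replica = replicas[attempt % len(replicas)]
        try:
            return fetch(replica, key, timeout_s)
        except TimeoutError:
            time.sleep(0.001 * (2 ** attempt))  # exponential backoff
    raise TimeoutError("all attempts timed out")


class Intent:
    # Declarative style: state what is needed (operation, deadline,
    # consistency); a runtime decides how and where to execute it.
    def __init__(self, operation, deadline_ms, consistency):
        self.operation = operation
        self.deadline_ms = deadline_ms
        self.consistency = consistency


lookup = Intent("lookup(key)", deadline_ms=10, consistency="bounded_staleness")
print(call_with_retries(["replica-a", "replica-b"], "key"))
&lt;/pre&gt;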
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Integrating across the above, the fifth epoch will be ruled by measures of overall &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;user-system&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; efficiency (useful answers per second) rather than lower-level per-component measures such as cost per MIPS, cost per GB of DRAM, cost per Gb/s, etc. Further, the units of efficiency will not be simply measured in performance-per-unit-cost but will &lt;/span&gt;&lt;a href="https://www.youtube.com/watch?v=EFe7-WZMMhc" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;explicitly account for power consumption and carbon emissions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and will take security and privacy as primary metrics, all while enforcing reliability requirements for the infrastructure on which society increasingly depends. Taken together, there are many untapped opportunities to deliver the next generation of infrastructure: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;A greater than 10x opportunity in scale-out efficiency of our distributed infrastructure across hardware and software.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Another 10x opportunity in matching &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;application balance points&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; — that is, the ratio between different system resources such as compute, accelerators, memory, storage, and network — through software-defined infrastructure. &lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;A more than 10x opportunity in next-generation accelerators and segment-specific hardware components relative to traditional one-size-fits-all, general-purpose computing architectures.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Finally, there is a hard-to-quantify but absolutely critical opportunity to improve developer productivity while simultaneously delivering substantially improved reliability and security.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Combining these trends, we are on the cusp of yet another dramatic 1000x efficiency gain over the next epoch that will define the next generation of infrastructure services and enable the next generation of computing services, likely centering around breakthroughs in multimodal models and generative AI. The opportunity to define, design, and deploy what computing means for the next generation does not come along very often, and the tectonic shifts in this fifth epoch promise perhaps the biggest technical transformations and challenges to date, requiring a level of responsibility, collaboration and vision perhaps not seen since the earliest days of computing. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_Epoch_Infographic.max-1000x1000.jpg"
        
          alt="5_Epoch_Infographic"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;</description><pubDate>Thu, 15 Feb 2024 15:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/the-fifth-epoch-of-distributed-computing/</guid><category>Compute</category><category>Infrastructure Modernization</category><category>Systems</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Distributed-Computing-HeroBanner-2436x1200.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Coming of age in the fifth epoch of distributed computing, accelerated by machine learning</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Distributed-Computing-HeroBanner-2436x1200.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/the-fifth-epoch-of-distributed-computing/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amin Vahdat</name><title>VP/GM, Machine Learning, Systems, and Cloud AI, Google Cloud</title><department></department><company></company></author></item><item><title>Enabling next-generation AI workloads: Announcing TPU v5p and AI Hypercomputer</title><link>https://cloud.google.com/blog/products/ai-machine-learning/introducing-cloud-tpu-v5p-and-ai-hypercomputer/</link><description>&lt;div class="block-paragraph"&gt;&lt;p data-block-key="5z6n0"&gt;Generative AI (gen AI) models are rapidly evolving, offering unparalleled sophistication and capability. This advancement empowers enterprises and developers across various industries to solve complex problems and unlock new opportunities. However, the growth in gen AI models — &lt;a href="https://cloud.google.com/blog/products/compute/announcing-cloud-tpu-v5e-and-a3-gpus-in-ga"&gt;with a tenfold increase in parameters annually over the past five years&lt;/a&gt; — brings heightened requirements for training, tuning, and inference. Today's larger models, featuring hundreds of billions or even trillions of parameters, require extensive training periods, sometimes spanning months, even on the most specialized systems. Additionally, efficient AI workload management necessitates a coherently integrated AI stack consisting of optimized compute, storage, networking, software and development frameworks.&lt;/p&gt;&lt;p data-block-key="5hv55"&gt;Today, to address these challenges, we are excited to announce Cloud TPU v5p, our most powerful, scalable, and flexible AI accelerator thus far. TPUs have long been the basis for training and serving AI-powered products like YouTube, Gmail, Google Maps, Google Play, and Android. In fact, Gemini, Google’s most capable and general AI model &lt;a href="https://blog.google/technology/ai/google-gemini-ai" target="_blank"&gt;announced today&lt;/a&gt;, was trained on, and is served, using TPUs.&lt;/p&gt;&lt;p data-block-key="dlap3"&gt;In addition, we are also announcing AI Hypercomputer from Google Cloud, a groundbreaking supercomputer architecture that employs an integrated system of performance-optimized hardware, open software, leading ML frameworks, and flexible consumption models. Traditional methods often tackle demanding AI workloads through piecemeal, component-level enhancements, which can lead to inefficiencies and bottlenecks. In contrast, AI Hypercomputer employs systems-level codesign to boost efficiency and productivity across AI training, tuning, and serving.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=hszd5UqnfLk"
      data-glue-modal-trigger="uni-modal-hszd5UqnfLk-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        

        &lt;div class="article-video__aspect-image"
          style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/Thumbnail_-_AI_Infra_Launch_v1.max-1000x1000.png);"&gt;
          &lt;span class="h-u-visually-hidden"&gt;Introducing AI Hypercomputer with Cloud TPU v5p&lt;/span&gt;
        &lt;/div&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal-hszd5UqnfLk-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="hszd5UqnfLk"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=hszd5UqnfLk"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h3 data-block-key="5z6n0"&gt;&lt;b&gt;Inside Cloud TPU v5p, our most powerful and scalable TPU accelerator to date&lt;/b&gt;&lt;/h3&gt;&lt;p data-block-key="ackie"&gt;Earlier this year, &lt;a href="https://cloud.google.com/blog/products/compute/announcing-cloud-tpu-v5e-in-ga"&gt;we announced&lt;/a&gt; the general availability of Cloud TPU v5e. With 2.3X price performance improvements over the previous generation TPU v4&lt;sup&gt;1&lt;/sup&gt;, it is our most &lt;i&gt;cost-efficient&lt;/i&gt; TPU to date. By contrast, Cloud TPU v5p, is our most &lt;i&gt;powerful&lt;/i&gt; TPU thus far. Each TPU v5p pod &lt;b&gt;composes together 8,960 chips&lt;/b&gt; over our &lt;b&gt;highest-bandwidth inter-chip interconnect (ICI) at 4,800 Gbps/chip in a 3D torus topology&lt;/b&gt;. Compared to TPU v4, TPU v5p features more than&lt;b&gt; 2X greater FLOPS and 3X more high-bandwidth memory (HBM)&lt;/b&gt;.&lt;/p&gt;&lt;p data-block-key="67r9p"&gt;Designed for performance, flexibility, and scale, TPU v5p can &lt;b&gt;train large LLM models 2.8X faster&lt;/b&gt; than the previous-generation TPU v4. Moreover, with second-generation &lt;a href="https://cloud.google.com/tpu"&gt;SparseCores&lt;/a&gt;, TPU v5p can &lt;b&gt;train embedding-dense models 1.9X faster&lt;/b&gt; than TPU v4&lt;sup&gt;2&lt;/sup&gt;.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_next-generation_AI_workloads.max-1000x1000.jpg"
        
          alt="1 next-generation AI workloads"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="z6r3m"&gt;Source: TPU v5p and v4 are based on Google Internal Data. As of November, 2023: All numbers normalized per chip seq-len=2048 for GPT-3 175 billion parameter model.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_next-generation_AI_workloads.max-1000x1000.jpg"
        
          alt="2 next-generation AI workloads"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="z6r3m"&gt;Source: TPU v5e data is from MLPerf™ 3.1 Training Closed results for v5e. TPU v5p and v4 are based on Google internal training runs. As of November, 2023: All numbers normalized per chip seq-len=2048 for GPT-3 175 billion parameter model. It shows relative performance per dollar using the public list price of TPU v4 ($3.22/chip/hour), TPU v5e ( $1.2/chip/hour) and TPU v5p ($4.2/chip/hour).&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p data-block-key="5z6n0"&gt;In addition to performance improvements, &lt;b&gt;TPU v5p is also 4X more scalable than TPU v4 in terms of total available FLOPs per pod.&lt;/b&gt; Doubling the floating-point operations per second (FLOPS) over TPU v4 and doubling the number of chips in a single pod provides considerable improvement in relative performance in training speed.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/3_next-generation_AI_workloads_v1.jpg"
        
          alt="3 next-generation AI workloads"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h3 data-block-key="5z6n0"&gt;&lt;b&gt;Google AI Hypercomputer delivers peak performance and efficiency at large scale&lt;/b&gt;&lt;/h3&gt;&lt;p data-block-key="7d6ot"&gt;Achieving both scale and speed is necessary, but not sufficient to meet the needs of modern AI/ML applications and services. The hardware and software components must come together into an integrated, easy-to-use, secure, and reliable computing system. At Google, we’ve done decades of research and development on this very problem, culminating in AI Hypercomputer, a system of technologies optimized to work in concert to enable modern AI workloads.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_next-generation_AI_workloads.max-1000x1000.png"
        
          alt="4 next-generation AI workloads"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;ul&gt;&lt;li data-block-key="5z6n0"&gt;&lt;b&gt;Performance-optimized hardware:&lt;/b&gt; AI Hypercomputer features performance-optimized compute, storage, and networking built over an ultrascale data center infrastructure, leveraging a high-density footprint, liquid cooling, and our &lt;a href="https://cloud.google.com/blog/topics/systems/the-evolution-of-googles-jupiter-data-center-network"&gt;Jupiter data center network&lt;/a&gt; technology. All of this is predicated on technologies that are built with &lt;a href="https://www.google.com/about/datacenters/efficiency/" target="_blank"&gt;efficiency&lt;/a&gt; at their core; leveraging &lt;a href="https://cloud.google.com/blog/topics/sustainability/a-smarter-way-to-buy-clean-energy"&gt;clean energy&lt;/a&gt; and &lt;a href="https://blog.google/outreach-initiatives/sustainability/replenishing-water/?_ga=2.140272307.1460901017.1631498684-1474825438.1628277680" target="_blank"&gt;a deep commitment to water stewardship&lt;/a&gt;, and that are &lt;a href="https://blog.google/outreach-initiatives/sustainability/our-third-decade-climate-action-realizing-carbon-free-future/" target="_blank"&gt;helping us move toward a carbon-free future&lt;/a&gt;.&lt;/li&gt;&lt;li data-block-key="905s7"&gt;&lt;b&gt;Open software:&lt;/b&gt; AI Hypercomputer enables developers to access our performance-optimized hardware through the use of open software to tune, manage, and dynamically orchestrate AI training and inference workloads on top of performance-optimized AI hardware.&lt;ul&gt;&lt;li data-block-key="4clbe"&gt;Extensive support for popular ML frameworks such as JAX, TensorFlow, and PyTorch are available right out of the box. Both JAX and PyTorch are powered by &lt;a href="https://github.com/openxla/xla" target="_blank"&gt;OpenXLA&lt;/a&gt; compiler for building sophisticated LLMs. XLA serves as a foundational backbone, enabling the creation of complex multi-layered models (&lt;a href="https://pytorch.org/blog/high-performance-llama-2/" target="_blank"&gt;Llama 2 training and inference on Cloud TPUs with PyTorch/XLA&lt;/a&gt;). It optimizes distributed architectures across a wide range of hardware platforms, ensuring easy-to-use and efficient model development for diverse AI use cases (&lt;a href="https://cloud.google.com/blog/products/compute/assemblyai-on-cloud-tpu-v5e-price-performance"&gt;AssemblyAI leverages JAX/XLA and Cloud TPUs for large-scale AI speech&lt;/a&gt;).&lt;/li&gt;&lt;li data-block-key="98e3i"&gt;Open and unique &lt;a href="https://cloud.google.com/blog/products/compute/using-cloud-tpu-multislice-to-scale-ai-workloads"&gt;Multislice Training&lt;/a&gt; and &lt;a href="https://cloud.google.com/tpu/docs/v5e-inference"&gt;Multihost Inferencing&lt;/a&gt; software, respectively, make scaling, training, and serving workloads smooth and easy. 
Developers can scale to tens of thousands of chips to support demanding AI workloads.&lt;/li&gt;&lt;li data-block-key="asdik"&gt;Deep integration with &lt;a href="https://cloud.google.com/kubernetes-engine?hl=en"&gt;Google Kubernetes Engine (GKE)&lt;/a&gt; and &lt;a href="https://cloud.google.com/compute?hl=en"&gt;Google Compute Engine&lt;/a&gt;, to deliver efficient resource management, consistent ops environments, autoscaling, node-pool auto-provisioning, auto-checkpointing, auto-resumption, and timely failure recovery.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li data-block-key="6vp9k"&gt;&lt;b&gt;Flexible consumption&lt;/b&gt;: AI Hypercomputer offers a wide range of flexible and dynamic consumption choices. In addition to classic options, such as Committed Use Discounts (CUD), on-demand pricing, and spot pricing, AI Hypercomputer provides consumption models tailored for AI workloads via &lt;a href="https://cloud.google.com/blog/products/compute/introducing-dynamic-workload-scheduler"&gt;Dynamic Workload Scheduler.&lt;/a&gt; Dynamic Workload Scheduler introduces two models: Flex Start mode for higher resource obtainability and optimized economics, as well as Calendar mode, which targets workloads with higher predictability on job-start times.&lt;/li&gt;&lt;/ul&gt;&lt;h3 data-block-key="d6pt0"&gt;&lt;b&gt;Leveraging Google’s deep experience to help power the future of AI&lt;/b&gt;&lt;/h3&gt;&lt;p data-block-key="1p5hf"&gt;Customers like Salesforce and Lightricks are already training and serving large AI models with Google Cloud’s TPU v5p AI Hypercomputer — and already seeing a difference:&lt;/p&gt;&lt;p data-block-key="bmfqe"&gt;&lt;i&gt;“We’ve been leveraging Google Cloud TPU v5p for pre-training Salesforce’s foundational models that will serve as the core engine for specialized production use cases, and we’re seeing considerable improvements in our training speed. In fact, Cloud TPU v5p compute outperforms the previous generation TPU v4 by as much as 2X. We also love how seamless and easy the transition has been from Cloud TPU v4 to v5p using JAX. We’re excited to take these speed gains even further by leveraging the native support for INT8 precision format via the Accurate Quantized Training (AQT) library to optimize our models.” -&lt;/i&gt; Erik Nijkamp, Senior Research Scientist, Salesforce&lt;/p&gt;&lt;p data-block-key="5rp5"&gt;&lt;i&gt;“Leveraging the remarkable performance and ample memory capacity of Google Cloud TPU v5p, we successfully trained our generative text-to-video model without splitting it into separate processes. This optimal hardware utilization significantly accelerates each training cycle, allowing us to swiftly conduct a series of experiments. The ability to train our model quickly in each experiment facilitates rapid iteration, which is an invaluable advantage for our research team in this competitive field of generative AI.”&lt;/i&gt; - Yoav HaCohen, PhD, Core Generative AI Research Team Lead, Lightricks&lt;/p&gt;&lt;p data-block-key="kdp6"&gt;&lt;i&gt;“In our early-stage usage, Google DeepMind and Google Research have observed 2X speedups for LLM training workloads using TPU v5p chips compared to the performance on our TPU v4 generation. The robust support for ML Frameworks (JAX, PyTorch, TensorFlow) and orchestration tools enables us to scale even more efficiently on v5p. With the 2nd generation of SparseCores we also see significant improvement in the performance of embeddings-heavy workloads. 
TPUs are vital to enabling our largest-scale research and engineering efforts on cutting edge models like Gemini.” -&lt;/i&gt; Jeff Dean, Chief Scientist, Google DeepMind and Google Research&lt;/p&gt;&lt;p data-block-key="1i8qc"&gt;At Google, we’ve long believed in the power of AI to help solve challenging problems. Until very recently, training large foundation models and serving them at scale was too complicated and expensive for many organizations. Today, with Cloud TPU v5p and AI Hypercomputer, we’re excited to extend the result of decades of research in AI and systems design with our customers, so they can innovate with AI faster, more efficiently, and more cost effectively.&lt;/p&gt;&lt;p data-block-key="d91lv"&gt;To request access to Cloud TPU v5p and AI Hypercomputer, please reach out to your &lt;a href="https://cloud.google.com/contact/"&gt;Google Cloud account manager&lt;/a&gt;.&lt;/p&gt;&lt;hr/&gt;&lt;p data-block-key="ek6l1"&gt;&lt;i&gt;&lt;sup&gt;1: MLPerf™ v3.1 Training Closed, multiple benchmarks as shown. Retrieved November 8th, 2023 from&lt;/sup&gt;&lt;/i&gt; &lt;a href="http://mlcommons.org/" target="_blank"&gt;&lt;i&gt;&lt;sup&gt;mlcommons.org&lt;/sup&gt;&lt;/i&gt;&lt;/a&gt;&lt;i&gt;&lt;sup&gt;. Results 3.1-2004. Performance per dollar is not an MLPerf metric. TPU v4 results are unverified: not verified by MLCommons Association. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See&lt;/sup&gt;&lt;/i&gt; &lt;a href="http://www.mlcommons.org/" target="_blank"&gt;&lt;i&gt;&lt;sup&gt;www.mlcommons.org&lt;/sup&gt;&lt;/i&gt;&lt;/a&gt; &lt;i&gt;&lt;sup&gt;for more information.&lt;br/&gt;2: Google Internal Data for TPU v5p as of November, 2023: E2E steptime, SearchAds pCTR, batch size per TPU core 16,384, 125 vp5 chips&lt;/sup&gt;&lt;/i&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 06 Dec 2023 15:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/ai-machine-learning/introducing-cloud-tpu-v5p-and-ai-hypercomputer/</guid><category>Compute</category><category>Infrastructure Modernization</category><category>Systems</category><category>AI &amp; Machine Learning</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Blog_Project_Ariel_J_Templates_3-03.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Enabling next-generation AI workloads: Announcing TPU v5p and AI Hypercomputer</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Blog_Project_Ariel_J_Templates_3-03.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/ai-machine-learning/introducing-cloud-tpu-v5p-and-ai-hypercomputer/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amin Vahdat</name><title>VP/GM, Machine Learning, Systems, and Cloud AI, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mark Lohmeyer</name><title>VP and GM, AI and Computing Infrastructure</title><department></department><company></company></author></item><item><title>How we’ll build sustainable, scalable, secure infrastructure for an AI-driven future</title><link>https://cloud.google.com/blog/topics/systems/google-systems-innovations-at-ocp-global-summit/</link><description>&lt;div class="block-paragraph"&gt;&lt;p 
data-block-key="99vp2"&gt;&lt;b&gt;&lt;i&gt;Editor’s note:&lt;/i&gt;&lt;/b&gt;&lt;i&gt; Today, we hear from Parthasarathy Ranganathan, Google VP and Technical Fellow and Amin Vahdat, VP/GM. Partha delivered a keynote address today at the&lt;/i&gt; &lt;a href="https://2023ocpglobal.fnvirtual.app/" target="_blank"&gt;&lt;i&gt;OCP Global Summit&lt;/i&gt;&lt;/a&gt;&lt;i&gt;, an annual conference for leaders, researchers, and pioneers in the open hardware industry. Partha served on the OCP Board of Directors from 2020 to earlier this year, when he was succeeded by Amber Huffman as Google’s representative. Read on to hear about the macro trends driving systems design today, and an overview of all of our activities in the community.&lt;/i&gt;&lt;/p&gt;&lt;hr/&gt;&lt;p data-block-key="9f29g"&gt;At Google, we build planet-scale computing for services that power billions of users, and these services have led to incredible opportunities for system designers to create hardware that operates with high performance, resilience, efficiency, and all at scale. In short, &lt;a href="https://cloud.google.com/blog/topics/systems/announcing-open-innovations-for-a-new-era-of-systems-design"&gt;we have embraced open innovation for a new era of systems design&lt;/a&gt;.&lt;/p&gt;&lt;p data-block-key="4iqbh"&gt;Today, we are at a new fundamental inflection point in computing: the rise of AI. Google products have always had a strong AI component, but in the past year, we have seen a tectonic shift in the industry and have supercharged our core products with the power of generative AI.&lt;/p&gt;&lt;p data-block-key="fr03k"&gt;These advances have shown up across our computing systems and workloads, from the original &lt;a href="https://blog.research.google/2017/08/transformer-novel-neural-network.html" target="_blank"&gt;Transformer model&lt;/a&gt; in 2017, to PaLM in 2022, to Bard today. Large language models have grown from having hundreds of millions of parameters to trillions of parameters, growing by almost an order of magnitude every year. As model sizes increase, so does the computation needed to run these models. That, in essence, sets up the challenge and opportunity that the open innovation community needs to solve together.&lt;/p&gt;&lt;p data-block-key="c32eq"&gt;AI isn’t just an enabler of new applications — it also represents a fundamental platform shift — something that we need to innovate on across hardware and software. Together, we need to build the hardware and software platforms that deliver powerful AI solutions across complex machine-learning supercomputers, all in a sustainable, secure, and scalable manner.&lt;/p&gt;&lt;h3 data-block-key="8e66b"&gt;&lt;b&gt;Towards sustainable systems&lt;/b&gt;&lt;/h3&gt;&lt;p data-block-key="qp58"&gt;Sustainability is an imperative that we all share. Here are several efforts we are engaged in to help our industry towards achieving net-zero emissions:&lt;/p&gt;&lt;ul&gt;&lt;li data-block-key="2ndh5"&gt;&lt;b&gt;Net Zero Innovation Hub:&lt;/b&gt; The industry answered our call from the OCP Regional Summit in April for a pan-European public and private collaboration to advance sustainability at a regional level. 
We launched the &lt;a href="https://www.netzerodatacenters.com/" target="_blank"&gt;Net Zero Innovation Hub&lt;/a&gt; with co-founders Danfoss, Google, Microsoft, and Schneider Electric on September 28 with an ambitious agenda across all scopes, including waste-heat reuse and grid availability.&lt;/li&gt;&lt;li data-block-key="2o69e"&gt;&lt;b&gt;Greener concrete:&lt;/b&gt; In collaboration with iMasons Climate Accord, AWS, Google, Meta, and Microsoft, we delivered an ambitious &lt;a href="https://climateaccord.org/news/greener-concrete-for-data-centers-an-open-letter/" target="_blank"&gt;technology roadmap to decarbonize concrete&lt;/a&gt;. We invite the community to partner with us to execute this roadmap together.&lt;/li&gt;&lt;li data-block-key="6qdbs"&gt;&lt;b&gt;Sustainability metrics:&lt;/b&gt; Last year, we formed the &lt;a href="https://www.opencompute.org/projects/dcf-sustainability" target="_blank"&gt;OCP Data Center Facilities Sustainability Subproject&lt;/a&gt;, co-led by Google and Microsoft. The group is making important progress on establishing clear, consistent and standardized metrics for emissions/carbon, energy, water, and beyond. This work will enable an apples-to-apples data-driven approach to assess the best approaches to help achieve our shared goals.&lt;/li&gt;&lt;/ul&gt;&lt;h3 data-block-key="6t5aa"&gt;&lt;b&gt;Enhancing security across the systems stack&lt;/b&gt;&lt;/h3&gt;&lt;p data-block-key="1ubi2"&gt;Security includes both trusted computing and &lt;a href="https://research.google/pubs/pub50337/" target="_blank"&gt;reliable computing&lt;/a&gt;, and there are several exciting developments coming in this space, including:&lt;/p&gt;&lt;ul&gt;&lt;li data-block-key="3tfiq"&gt;&lt;b&gt;Caliptra:&lt;/b&gt; &lt;a href="http://www.caliptra.org/" target="_blank"&gt;Caliptra&lt;/a&gt; is a re-usable IP block for root-of-trust management. Last year, with industry leaders, AMD, Microsoft, and NVIDIA, we &lt;a href="https://www.opencompute.org/blog/cloud-security-integrating-trust-into-every-chip" target="_blank"&gt;contributed the draft Caliptra specification&lt;/a&gt; to OCP. The Caliptra specification will be complete this year, with the IP block ready for integration into CPUs, GPUs, and other devices. Check out the code repository at &lt;a href="https://github.com/chipsalliance/caliptra" target="_blank"&gt;https://github.com/chipsalliance/caliptra&lt;/a&gt;.&lt;/li&gt;&lt;li data-block-key="5t64"&gt;&lt;b&gt;OCP S.A.F.E.:&lt;/b&gt; In partnership with OCP and Microsoft, we have developed the OCP Security Appraisal Framework and Enablement (S.A.F.E.) program. OCP S.A.F.E. provides a standardized approach for provenance, code quality, and software supply chain for firmware releases. Learn more at &lt;a href="https://www.opencompute.org/projects/ocp-safe-program" target="_blank"&gt;https://www.opencompute.org/projects/ocp-safe-program&lt;/a&gt;.&lt;/li&gt;&lt;li data-block-key="a6m41"&gt;&lt;b&gt;Reliable Computing:&lt;/b&gt; Last year, we formed a server-component resilience workstream at OCP along with AMD, ARM, Intel, Meta, Microsoft, and NVIDIA to take a systems approach to addressing silicon faults and silent data errors. 
The team has made great strides, including publishing the &lt;a href="https://www.opencompute.org/documents/external-ver-0-3open-compute-specification-server-component-resilience-workstream-pdf" target="_blank"&gt;draft specification&lt;/a&gt; and open-sourcing Silent Data Corruption (SDC) frameworks (e.g., Intel and ARM collaborating on &lt;a href="https://github.com/opendcdiag/opendcdiag" target="_blank"&gt;Open Datacenter Diagnostics&lt;/a&gt;, AMD’s &lt;a href="https://github.com/amd/Open-Field-Health-Check" target="_blank"&gt;Open Field Health Check&lt;/a&gt;, and NVIDIA’s &lt;a href="https://github.com/NVIDIA/dcgm" target="_blank"&gt;Datacenter GPU Manager&lt;/a&gt;). To advance this important area faster, we are launching a new academic grant program — the first of its kind at OCP — with member companies supporting significant academic research in this area.&lt;/li&gt;&lt;/ul&gt;&lt;h3 data-block-key="4bjn9"&gt;&lt;b&gt;Scalability from silicon to the cloud&lt;/b&gt;&lt;/h3&gt;&lt;p data-block-key="fik0g"&gt;Scalable infrastructure is a primary area of focus for both Google and OCP, from silicon all the way to the cloud. At the OCP Summit this week, we will discuss a few advancements, specifically:&lt;/p&gt;&lt;ul&gt;&lt;li data-block-key="5mtvj"&gt;&lt;b&gt;Accelerators&lt;/b&gt;: This year, we partnered with AMD, ARM, Intel, Meta, and NVIDIA to deliver the &lt;a href="https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-06-20-pdf" target="_blank"&gt;OCP 8-bit Floating Point specification&lt;/a&gt; to enable training on one accelerator and serving on another. We partnered with Microsoft and NVIDIA to deliver a set of firmware specifications for GPUs and accelerators covering &lt;a href="https://www.opencompute.org/documents/finalocp-gpu-and-accelerators-ras-requirements-0-5-pdf" target="_blank"&gt;reliability&lt;/a&gt;, &lt;a href="https://www.opencompute.org/documents/ocp-gpu-accelerator-management-interfaces-v-5-pdf" target="_blank"&gt;manageability&lt;/a&gt;, and &lt;a href="https://www.opencompute.org/documents/ocp-gpu-fw-update-specification-v0-7-pdf" target="_blank"&gt;updates&lt;/a&gt;.&lt;/li&gt;&lt;li data-block-key="a4tpe"&gt;&lt;b&gt;AI:&lt;/b&gt; During the AI Track, we are highlighting the progress we are making with partners in the &lt;a href="https://opensource.googleblog.com/2023/03/openxla-is-ready-to-accelerate-and-simplify-ml-development.html" target="_blank"&gt;OpenXLA&lt;/a&gt; ecosystem. We are also discussing the &lt;a href="https://blog.research.google/2023/07/an-open-source-gymnasium-for-computer.html" target="_blank"&gt;Architecture Gym&lt;/a&gt;, a new effort in collaboration with MLCommons to go beyond systems for AI, to AI for systems, looking at how AI can transform systems design.&lt;/li&gt;&lt;li data-block-key="6eol"&gt;&lt;b&gt;Networking:&lt;/b&gt; To truly build large-scale AI infrastructure, you need world-class networking systems innovation. To help with this, we are opening Falcon, Google’s reliable low-latency hardware transport, and sharing some of the advances we have made over the past 10 years on performance, latency, traffic control, etc. This is part of our ongoing effort to advance Ethernet to the industry as a high-performance, low-latency fabric for hyperscaler environments. 
Learn more in the blog “&lt;a href="https://cloud.google.com/blog/topics/systems/introducing-falcon-a-reliable-low-latency-hardware-transport"&gt;Google opens Falcon, a reliable low-latency hardware transport, to the ecosystem&lt;/a&gt;”.&lt;/li&gt;&lt;li data-block-key="238rl"&gt;&lt;b&gt;Storage:&lt;/b&gt; Google is joining the OCP Data Center NVM Express™ (NVMe) specification, working group with Meta, Microsoft, Dell, and HPE to provide clear requirements for features in datacenter SSDs including Flexible Data Placement, security, and telemetry. We are also kicking off a new open-source hardware effort to develop an NVMe Key Management block with partners Microsoft, Samsung, Kioxia and Solidigm.&lt;/li&gt;&lt;/ul&gt;&lt;p data-block-key="fajip"&gt;There is tremendous opportunity for all of us in the industry to create even more open ecosystems for innovation. At Google, we have a legacy of embracing and fostering open ecosystems, whether it’s Android, Chromium, Kubernetes, Kaggle, Tensorflow, or Jax. We set industry standards, grow communities, and share our innovations broadly. Our contributions to the Open Compute Project Foundation go back several years, from our first &lt;a href="https://www.opencompute.org/files/External-2018-OCP-Summit-Google-48V-Update-Flatbed-and-STC-20180321.pdf" target="_blank"&gt;48V contribution&lt;/a&gt; to today, sitting on the OCP Board and being one of its largest contributors. We believe the best is yet to come, through codesign and collaboration across hardware and software, multiple layers of the stack, compute, network, storage, infrastructure, industry and academia, and of course, across companies.&lt;/p&gt;&lt;p data-block-key="akl62"&gt;It is exciting to be in an era where we are literally inventing the future with new AI advances every day. All these amazing AI advances in turn need a healthy innovation ecosystem around infrastructure, from all of us — to build the sustainable, secure, scalable &lt;i&gt;societal infrastructure&lt;/i&gt; that we need for this AI-driven future. And all of this will be possible only through collaboration across all of us in the community. You can learn more about the OCP Global Summit agenda &lt;a href="https://2023ocpglobal.fnvirtual.app/a/schedule/" target="_blank"&gt;here&lt;/a&gt; and talks by Google &lt;a href="https://2023ocpglobal.fnvirtual.app/a/schedule/#view=calendar&amp;amp;company=google%2Cgoogle%20deepmind" target="_blank"&gt;here&lt;/a&gt;. 
We are looking forward to the vibrant discussions this week.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 17 Oct 2023 15:30:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/google-systems-innovations-at-ocp-global-summit/</guid><category>Sustainability</category><category>Networking</category><category>Security &amp; Identity</category><category>Systems</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>How we’ll build sustainable, scalable, secure infrastructure for an AI-driven future</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/google-systems-innovations-at-ocp-global-summit/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Parthasarathy Ranganathan</name><title>VP, Engineering Fellow</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amin Vahdat</name><title>SVP and Chief Technologist, AI and Infrastructure</title><department></department><company></company></author></item><item><title>Google opens Falcon, a reliable low-latency hardware transport, to the ecosystem</title><link>https://cloud.google.com/blog/topics/systems/introducing-falcon-a-reliable-low-latency-hardware-transport/</link><description>&lt;div class="block-paragraph"&gt;&lt;p data-block-key="s2u0m"&gt;At Google, we have a long history of solving problems at scale using Ethernet, and rethinking the transport layer to satisfy demanding workloads that require high burst bandwidth, high message rates, and low latency. Workloads such as storage have needed some of these attributes for a long time, however, with newer use cases such as massive-scale AI/ML training and high performance computing (HPC), the need has grown significantly. In the past, we’ve openly shared our learnings in traffic shaping, congestion control, load balancing, and more with the industry by contributing our ideas to the &lt;a href="https://www.acm.org/" target="_blank"&gt;Association for Computing Machinery&lt;/a&gt; and &lt;a href="https://ietf.org/" target="_blank"&gt;Internet Engineering Task Force&lt;/a&gt;. These ideas have been implemented in software and a few in hardware for several years. But going forward, we believe the industry at large will see more gains by implementing the set with dedicated and flexible hardware assist.&lt;/p&gt;&lt;p data-block-key="6nrfu"&gt;To achieve this goal, we developed Falcon to enable a step function in performance over software-only transports. 
Today at the &lt;a href="https://www.opencompute.org/summit/global-summit" target="_blank"&gt;OCP Global Summit&lt;/a&gt;, we are excited to open Falcon to the ecosystem through the &lt;a href="https://www.opencompute.org/" target="_blank"&gt;Open Compute Project&lt;/a&gt;, the natural venue to empower the community with Google’s production learnings to help modernize Ethernet.&lt;/p&gt;&lt;p data-block-key="2lr6f"&gt;As a hardware-assisted transport layer, Falcon is designed to be reliable, high performance, and low latency and leverages production-proven technologies including &lt;a href="https://research.google/pubs/pub46460/" target="_blank"&gt;Carousel&lt;/a&gt;, &lt;a href="https://research.google/pubs/pub48630/" target="_blank"&gt;Snap&lt;/a&gt;, &lt;a href="https://research.google/pubs/pub49448/" target="_blank"&gt;Swift&lt;/a&gt;, &lt;a href="https://research.google/pubs/pub52149/" target="_blank"&gt;PLB&lt;/a&gt;, and &lt;a href="https://datatracker.ietf.org/doc/html/draft-ravi-ippm-csig-00" target="_blank"&gt;CSIG&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Falcon.max-1000x1000.jpg"
        
          alt="1 Falcon"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p data-block-key="s2u0m"&gt;Falcon’s layers are illustrated in the figure below, including their associated function. We show the RDMA and NVM Express™ Upper layer protocols (ULPs), however, Falcon is extensible to additional ULPs as needed by the ecosystem.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_Falcon.max-1000x1000.jpg"
        
          alt="2 Falcon"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p data-block-key="s2u0m"&gt;The lower layers of Falcon use three key insights to achieve low latency in high-bandwidth, yet lossy, Ethernet data center networks. Fine-grained hardware-assisted round-trip time (RTT) measurements with flexible, per-flow hardware-enforced traffic shaping, and fast and accurate packet retransmissions, are combined with multipath-capable and &lt;a href="https://cloud.google.com/blog/products/identity-security/announcing-psp-security-protocol-is-now-open-source"&gt;PSP-encrypted&lt;/a&gt; Falcon connections. On top of this foundation, Falcon has been designed from the ground up as a multi-protocol transport capable of supporting ULPs with widely varying performance requirements and application semantics. The ULP mapping layer not only provides out-of-the-box compatibility with Infiniband Verbs RDMA and NVMe ULPs, but also includes additional innovations critical for warehouse-scale applications such as flexible ordering semantics and graceful error handling. Last but not least, the hardware and software are co-designed to work together to help achieve the desired attributes of high message rate, low latency, and high bandwidth, while maintaining flexibility for programmability and continued innovation.&lt;/p&gt;&lt;p data-block-key="dap5b"&gt;Falcon reflects the central role that Ethernet continues to play in our industry. Falcon is designed for predictable high performance at warehouse scale, as well as flexibility and extensibility. We look forward to working with the community and industry partners to modernize Ethernet to serve the networking requirements of our AI-driven future. We believe that Falcon will be a valuable addition to the other ongoing efforts in this space.&lt;/p&gt;&lt;h3 data-block-key="p8og"&gt;Industry perspectives&lt;/h3&gt;&lt;p data-block-key="ed1k6"&gt;Our partners across the industry are enthusiastic about the promise that Falcon holds for developing the next generation of Ethernet.&lt;/p&gt;&lt;p data-block-key="b1lro"&gt;&lt;i&gt;“We welcome Google’s contribution of Falcon as it shares the Ultra Ethernet Consortium’s vision to drive Ethernet as the best data center fabric for AI and HPC, and look forward to continuing industry innovations in this important space.”&lt;/i&gt; - Dr. J Metz, Chair, Ultra Ethernet Consortium (led by AMD, Arista, Broadcom, Cisco, Eviden, Hewlett Packard Enterprise, Intel, Meta, Microsoft, and Oracle)&lt;/p&gt;&lt;p data-block-key="3m1kp"&gt;&lt;i&gt;“Falcon is first available in the Intel IPU E2000 series of products. The value of these IPUs is further enhanced as the first instance of an Ethernet transport to add low tail latency and congestion handling at scale. Intel is a Steering Member of Ultra Ethernet Consortium, which is working to evolve Ethernet for high performance AI and HPC workloads. We plan to deploy the resulting standards-based enhancements in future IPU and Ethernet products.”&lt;/i&gt; - Sachin Katti, SVP &amp;amp; GM, Network and Edge Group, Intel&lt;/p&gt;&lt;p data-block-key="ftfle"&gt;&lt;i&gt;"We are pleased to see a high-performance transport protocol for critical workloads such as AI and HPC that works over standard Ethernet/IP networks and enables massive application bandwidth at scale."&lt;/i&gt; - Hugh Holbrook, Group VP, SW Eng., Arista Networks&lt;/p&gt;&lt;p data-block-key="6a9pt"&gt;&lt;i&gt;“Cisco is pleased to see the contribution of Falcon to the OCP. Cisco has long supported open standards and believes in broad ecosystems. 
The rate and scale of modern data center networks and particularly AI/ML networks is unprecedented, presenting a challenge and opportunity to the industry. Falcon addresses many of the challenges of these networks, enabling efficient network utilization.”&lt;/i&gt; - Ofer Iny, Cisco Fellow, Cisco&lt;/p&gt;&lt;p data-block-key="c2se4"&gt;&lt;i&gt;“Juniper is a strong supporter of open ecosystems, and therefore we are pleased to see Falcon being opened to the OCP community. Falcon allows Ethernet to serve as the data center network-of-choice for demanding workloads, providing high-bandwidth, low tail latency and congestion mitigation. Falcon provides the industry with a proven solution today for demanding AI &amp;amp; ML workloads.”&lt;/i&gt; - Raj Yavatkar, Chief Technology Officer, Juniper&lt;/p&gt;&lt;p data-block-key="crjoq"&gt;&lt;i&gt;“Marvell strongly supports and is committed to the open Ethernet ecosystem as it evolves to support emerging, demanding workloads such as AI. We applaud the contribution of Falcon to OCP and welcome Google sharing practical experiences with the industry.”&lt;/i&gt; - Nick Kucharewski, SVP &amp;amp; GM Network Switching Group, Marvell&lt;/p&gt;&lt;h3 data-block-key="1g0i"&gt;Learn more&lt;/h3&gt;&lt;p data-block-key="d9t24"&gt;Networking is a foundational component in building the sustainable, secure, scalable societal infrastructure that we need for this AI-driven future. To learn more about Falcon, join us for the OCP Summit presentation, “A Reliable and Low Latency Ethernet Hardware Transport” by Google’s Nandita Dukkipati at 11:45am at the Expo Hall. We’ll contribute the Falcon specification to OCP in the first quarter of 2024.&lt;/p&gt;&lt;p data-block-key="2dmgl"&gt;To learn more about Google’s contributions to the Open Compute Project and our presence at the OCP Global Summit, check out the blog “&lt;a href="https://cloud.google.com/blog/topics/systems/google-systems-innovations-at-ocp-global-summit"&gt;How we’ll build sustainable, scalable, secure infrastructure for an AI-driven future&lt;/a&gt;”.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 17 Oct 2023 15:30:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/introducing-falcon-a-reliable-low-latency-hardware-transport/</guid><category>Networking</category><category>HPC</category><category>Systems</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Google opens Falcon, a reliable low-latency hardware transport, to the ecosystem</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/introducing-falcon-a-reliable-low-latency-hardware-transport/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Dan Lenoski</name><title>VP of Engineering, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nandita Dukkipati</name><title>Principal Software Engineer, Google Cloud</title><department></department><company></company></author></item><item><title>Google’s Cloud TPU v4 provides exaFLOPS-scale ML with industry-leading efficiency</title><link>https://cloud.google.com/blog/topics/systems/tpu-v4-enables-performance-energy-and-co2e-efficiency-gains/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;&lt;i&gt;&lt;b&gt;Editor’s note&lt;/b&gt;: Today, two legendary Google engineers describe the “secret sauce” that has made TPU v4 a platform of choice for the world’s leading AI researchers and developers for training 
machine learning models at scale. &lt;a href="https://en.wikipedia.org/wiki/Norman_Jouppi" target="_blank"&gt;Norm Jouppi&lt;/a&gt; is the chief architect for all Google’s TPUs, from TPU v1 to TPU v4. He is a Google Fellow and a member of the National Academy of Engineering (NAE). &lt;a href="https://en.wikipedia.org/wiki/David_Patterson_(computer_scientist)" target="_blank"&gt;David Patterson&lt;/a&gt;, a Google Distinguished Engineer, shared the &lt;a href="https://www.nytimes.com/2018/03/21/technology/computer-chips-turing-award.html" target="_blank"&gt;ACM A.M. Turing Award&lt;/a&gt; and the &lt;a href="https://www.nae.edu/266390/RISC-Chip-Innovators-Receive-the-2022-Charles-Stark-Draper-Prize-for-Engineering" target="_blank"&gt;NAE Charles Draper Prize&lt;/a&gt;. David is one of the creators of RISC and RAID, and his recent research has been on the &lt;a href="https://ieeexplore.ieee.org/document/9810097" target="_blank"&gt;CO2e emissions from machine learning&lt;/a&gt;. &lt;/i&gt;&lt;/p&gt;&lt;hr/&gt;&lt;p&gt;Scaling computing performance is foundational to advancing the state of the art in machine learning (ML). Thanks to key innovations in interconnect technologies and domain specific accelerators (DSA), the Google Cloud TPU v4 enabled: &lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;a nearly 10x leap forward in scaling ML system performance over TPU v3 &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;boosting energy efficiency ~2-3x compared to contemporary ML DSAs, and &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;reducing CO2e as much as ~20x over these DSAs in typical on-premise data centers&lt;sup&gt;1&lt;/sup&gt;.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;As such, the performance, scalability, efficiency, and availability of TPU v4 make it an ideal vehicle for large language models.&lt;/p&gt;&lt;p&gt;TPU v4 provides exascale ML performance, with 4096 chips interconnected by an internally-developed industry-leading optical circuit switch (OCS). You can see one eighth of a TPU v4 pod below. Google’s Cloud TPU v4 outperforms TPU v3 by 2.1x on average on a per-chip basis and improves performance/Watt by 2.7x. The mean TPU v4 chip power is typically only 200W.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Cloud_TPU_v4.max-1000x1000.jpg"
        
          alt="1 Cloud TPU v4.jpg"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;i&gt;One eighth of a TPU v4 pod from Google's &lt;a href="https://cloud.google.com/blog/products/compute/google-unveils-worlds-largest-publicly-available-ml-cluster"&gt;world’s largest publicly available ML cluster&lt;/a&gt; &lt;br/&gt;located in Oklahoma, which runs on ~90% carbon-free energy.&lt;/i&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
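&lt;div class="block-paragraph"&gt;&lt;p&gt;&lt;i&gt;As a rough back-of-the-envelope illustration (our own arithmetic, not figures from the paper): combining the pod size and mean chip power quoted above with the publicly listed ~275 bf16 TFLOPS peak per TPU v4 chip (an assumption not stated in this post) gives a sense of the scale of a full pod.&lt;/i&gt;&lt;/p&gt;&lt;pre&gt;
# Back-of-the-envelope pod math from the figures quoted above.
# The 275 bf16 TFLOPS per-chip peak is an assumption taken from public
# Cloud TPU v4 specifications, not from this post.
chips_per_pod = 4096          # TPU v4 pod size (from the post)
mean_chip_power_w = 200       # typical per-chip power (from the post)
peak_tflops_per_chip = 275    # assumed public bf16 peak per chip

pod_chip_power_mw = chips_per_pod * mean_chip_power_w / 1e6
pod_peak_exaflops = chips_per_pod * peak_tflops_per_chip * 1e12 / 1e18

print(round(pod_chip_power_mw, 2))   # ~0.82 MW of aggregate chip power
print(round(pod_peak_exaflops, 1))   # ~1.1 exaFLOPS of aggregate peak compute
&lt;/pre&gt;&lt;/div&gt;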
&lt;div class="block-paragraph"&gt;&lt;p&gt;TPU v4 is the first supercomputer to deploy a reconfigurable OCS. OCSes dynamically reconfigure their interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance. Much cheaper, lower power, and faster than Infiniband, OCSes and underlying optical components are &amp;lt;5% of TPU v4’s system cost and &amp;lt;5% of system power. &lt;a href="https://dl.acm.org/doi/pdf/10.1145/2829988.2787508" target="_blank"&gt;The figure below shows how an OCS works&lt;/a&gt;, using two MEMs arrays. No optical to electrical to optical conversion or power-hungry network packet switches are required, saving power.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_Cloud_TPU_v4.max-1000x1000.jpg"
        
          alt="2 Cloud TPU v4.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
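&lt;div class="block-paragraph"&gt;&lt;p&gt;&lt;i&gt;Conceptually, an OCS behaves like a remotely programmable patch panel: MEMS mirrors steer light from each input fiber to one output fiber, so the switch state is simply a permutation of ports that can be rewritten without terminating the optical signal. The toy sketch below captures that abstraction only; it is purely illustrative and models none of the real MEMS control or fabric management.&lt;/i&gt;&lt;/p&gt;&lt;pre&gt;
# Toy model of an optical circuit switch as a reconfigurable port permutation.
# Purely illustrative; real OCS control involves MEMS mirror calibration,
# telemetry, and a fabric-wide topology manager that are not modeled here.

class OpticalCircuitSwitch:
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.mapping = {}  # input port to output port (a partial permutation)

    def connect(self, in_port, out_port):
        # Steering a mirror pair creates a light path: no packet processing,
        # no optical-electrical-optical conversion along the way.
        if out_port in self.mapping.values():
            raise ValueError("output port already in use")
        self.mapping[in_port] = out_port

    def reconfigure(self, new_mapping):
        # Rewriting the whole permutation changes the fabric topology,
        # e.g. to route around a failed block or match a job's shape.
        self.mapping = dict(new_mapping)

ocs = OpticalCircuitSwitch(num_ports=8)
ocs.connect(0, 5)
ocs.reconfigure({0: 5, 1: 4, 2: 7})   # new topology in a single step
&lt;/pre&gt;&lt;/div&gt;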
&lt;div class="block-paragraph"&gt;&lt;p&gt;The combination of powerful yet efficient processors and a distributed shared memory system provides remarkable scalability for deep neural network models. The scalability of TPU v4 production workloads on a variety of model types is shown below on a log-log scale.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_Cloud_TPU_v4.max-1000x1000.jpg"
        
          alt="3 Cloud TPU v4.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;Dynamic OCS reconfigurability also helps with availability. Circuit switching makes it easy to route around failed components so that long-running tasks like ML training can utilize thousands of processors for weeks at a time. This flexibility even allows us to change the topology of the supercomputer interconnect to accelerate the performance of an ML model.&lt;/p&gt;&lt;p&gt;The performance, scalability, and availability make TPU supercomputers the workhorses of large language models like LaMDA, MUM, and &lt;a href="https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html" target="_blank"&gt;PaLM&lt;/a&gt;. The 540B-parameter PaLM model sustained a remarkable &lt;i&gt;57.8% of the peak hardware floating point performance over 50 days&lt;/i&gt; while training on TPU v4 supercomputers. TPU v4’s scalable interconnect also unlocks multidimensional model-partitioning techniques that enable &lt;a href="https://arxiv.org/abs/2211.05102" target="_blank"&gt;low-latency, high-throughput inference&lt;/a&gt; for these LMs.&lt;/p&gt;&lt;p&gt;TPU supercomputers are also the first with hardware support for embeddings, a key component of Deep Learning Recommendation Models (DLRMs) used in advertising, search ranking, YouTube, and Google Play. Each TPU v4 includes third-generation SparseCores, dataflow processors that accelerate models that rely on embeddings by 5x–7x yet use only 5% of die area and power. &lt;/p&gt;&lt;p&gt;The performance of an internal recommendation model on CPUs, TPU v3, TPU v4, and TPU v4 with embeddings in CPU memory (not using SparseCore) is shown below. The TPU v4 SparseCore is 3X faster than TPU v3 on recommendation models, and 5–30X faster than systems using CPUs.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_Cloud_TPU_v4.max-1000x1000.jpg"
        
          alt="4 Cloud TPU v4.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;Embeddings processing requires significant all-to-all communication, since the embeddings are distributed around TPU chips working together on a model. This pattern stresses the bandwidth of the shared memory interconnect. That’s why TPU v4 uses a 3D torus interconnect (vs. TPU v2 and v3 which used a 2D torus). TPU v4’s 3D torus provides a higher bisection bandwidth — i.e., the bandwidth from one half of the chips to the other half across the middle of the interconnect — to help support the larger number of chips and the higher SparseCore v3 performance. The figure below shows the significant bandwidth and performance increase from the 3D torus.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_Cloud_TPU_v4.max-1000x1000.jpg"
        
          alt="5 Cloud TPU v4.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
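&lt;div class="block-paragraph"&gt;&lt;p&gt;&lt;i&gt;To make the bisection-bandwidth advantage concrete, here is a rough illustration using the standard bisection-width formula for a wraparound torus (our own arithmetic, not numbers from the TPU v4 paper): at 4096 chips, a 64x64 2D torus is cut by 2 × 64 = 128 links across its bisection, while a 16x16x16 3D torus is cut by 2 × 16 × 16 = 512 links, roughly 4x more capacity across the middle of the machine for all-to-all traffic.&lt;/i&gt;&lt;/p&gt;&lt;pre&gt;
# Bisection width of a k-ary n-cube (a wraparound torus) with even k:
# cutting the machine in half severs 2 * k**(n-1) bidirectional links.
# Illustrative comparison at 4096 chips; not taken from the TPU v4 paper.

def torus_bisection_links(k, n):
    return 2 * k ** (n - 1)

two_d = torus_bisection_links(64, 2)    # 64 x 64 torus: 128 links
three_d = torus_bisection_links(16, 3)  # 16 x 16 x 16 torus: 512 links
print(two_d, three_d, three_d / two_d)  # 128 512 4.0
&lt;/pre&gt;&lt;/div&gt;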
&lt;div class="block-paragraph"&gt;&lt;p&gt;TPU v4 has been operational at Google since 2020 and became available for customers on Google Cloud &lt;a href="https://cloud.google.com/blog/products/compute/google-unveils-worlds-largest-publicly-available-ml-cluster"&gt;last year&lt;/a&gt;. Since its launch, TPU v4 supercomputers have been actively used by leading AI teams around the globe for cutting-edge ML research and production workloads across language models, recommender systems and generative AI. For example, the &lt;a href="https://blog.allenai.org/cloud-tpus-unlock-many-large-scale-high-impact-projects-at-ai2-7aca9229e2c6" target="_blank"&gt;Allen Institute for AI&lt;/a&gt;, a non-profit institute founded by Paul Allen with the mission of conducting high-impact AI research for the common good, greatly benefitted from TPU v4 architecture and was able to unlock many of their large-scale, high-impact research initiatives. &lt;/p&gt;&lt;p&gt;“More recently, a number of researchers have turned to Cloud TPUs for their easy ability to distribute across many processing units. With GPUs, once you scale beyond a single machine you need to adjust your code for distribution and you might be disappointed by the connection speeds between your servers,” said Michael Schmitz, Senior Director of Engineering, Allen Institute for AI. “But with Cloud TPUs you can seamlessly scale individual workloads to thousands of chips, where all chips are directly connected to each other via a high-speed mesh network.”&lt;/p&gt;&lt;p&gt;&lt;a href="https://www.googlecloudpresscorner.com/2023-03-14-Midjourney-Selects-Google-Cloud-to-Power-AI-Generated-Creative-Platform" target="_blank"&gt;Midjourney&lt;/a&gt;, one of the leading text-to-image AI startups, have been using Cloud TPU v4 to train their state-of-the-art model, coincidentally also called “version four”. &lt;/p&gt;&lt;p&gt;"We’re proud to work with Google Cloud to deliver a seamless experience for our creative community powered by Google’s globally scalable infrastructure,” said David Holz, founder and CEO of Midjourney. “From training the fourth version of our algorithm on the latest v4 TPUs with JAX, to running inference on GPUs, we have been impressed by the speed at which TPU v4 allows our users to bring their vibrant ideas to life.”&lt;/p&gt;&lt;p&gt;We are proud to share additional details of our TPU v4 research in a &lt;a href="https://arxiv.org/abs/2304.01433" target="_blank"&gt;paper&lt;/a&gt; that will be presented at the &lt;a href="https://iscaconf.org/isca2023/" target="_blank"&gt;International Symposium on Computer Architecture&lt;/a&gt;, and we look forward to discussing our findings with the community. &lt;/p&gt;&lt;hr/&gt;&lt;p&gt;&lt;i&gt;&lt;sup&gt;The authors thank many of Google's engineering and product teams for making TPU v4 a success story. We also want to thank Amin Vahdat, Mark Lohmeyer, Maud Texier, James Bradbury and Max Sapozhnikov for their contributions to this blog post.&lt;/sup&gt;&lt;/i&gt;&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;i&gt;&lt;sup&gt;1. 
This ~20x improvement comes from a combination of: ~2-3x more energy efficient TPUs, ~1.4x lower &lt;a href="https://en.wikipedia.org/wiki/Power_usage_effectiveness" target="_blank"&gt;PUE&lt;/a&gt; of Google data centers relative to on-premise data centers, and ~6x for the cleaner energy in Oklahoma that houses all Cloud TPU v4 supercomputers versus the average energy cleanliness of the typical on-premise data center.&lt;/sup&gt;&lt;/i&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 05 Apr 2023 15:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/systems/tpu-v4-enables-performance-energy-and-co2e-efficiency-gains/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><category>Systems</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/tpu-v2-6.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Google’s Cloud TPU v4 provides exaFLOPS-scale ML with industry-leading efficiency</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/tpu-v2-6.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/systems/tpu-v4-enables-performance-energy-and-co2e-efficiency-gains/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Norm Jouppi</name><title>Google Fellow</title><department>Google</department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>David Patterson</name><title>Google Distinguished Engineer, Google</title><department></department><company></company></author></item></channel></rss>