<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>High Performance Computing</title><link>https://cloud.google.com/blog/topics/hpc/</link><description>High Performance Computing</description><atom:link href="https://cloudblog.withgoogle.com/blog/topics/hpc/rss/" rel="self"></atom:link><language>en</language><lastBuildDate>Wed, 04 Mar 2026 17:00:05 +0000</lastBuildDate><image><url>https://cloud.google.com/blog/topics/hpc/static/blog/images/google.a51985becaa6.png</url><title>High Performance Computing</title><link>https://cloud.google.com/blog/topics/hpc/</link></image><item><title>H4D VMs, now GA, deliver exceptional performance and scaling for HPC workloads</title><link>https://cloud.google.com/blog/products/compute/h4d-vms-now-ga/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we’re announcing  the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;general availability of H4D VMs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, our latest high performance computing (HPC)-optimized VM, powered by the 5th Generation AMD EPYC&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;™ processors&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. H4D VMs deliver exceptional performance, scalability, and value for industries like manufacturing, health care and life sciences, weather forecasting, and electronic design automation (EDA).&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; H4D supports orchestration via Cluster Toolkit with Slurm and via Google Kubernetes Engine (GKE). Each approach allows for near-instant deployment and scaling of demanding workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For the first time, the Google Cloud CPU portfolio features a VM family with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;C&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;loud Remote Direct Memory Access (RDMA).&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;H4D’s RDMA is on the &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium network adapter&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and lets you scale single-node H4D performance to multiple nodes, accelerating large production workloads. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Faster time to solution across domains and scales&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Powered by the high core density of the 5th Gen AMD EPYC CPU and Google’s innovative, low-latency &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/introducing-falcon-a-reliable-low-latency-hardware-transport"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Falcon hardware transport&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; H4D VMs enable you to iterate and discover faster than ever before.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We demonstrated H4D performance through a series of industry-standard benchmarks, showing its capabilities across diverse domains and problem sizes.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Healthcare and life sciences&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For researchers in healthcare and life sciences (HCLS), H4D VMs accelerate complex molecular simulations critical to scientific discovery. Compared to our previous C2D VMs, H4D VMs deliver up to a 4.3X speedup running LAMMPS (LJ benchmark) at 96 VMs, delivering 95% parallel efficiency on 18k cores. For drug discovery, we demonstrated a 5.8X speedup using GROMACS (water_33m) at 32 VMs, delivering 72% parallel efficiency on 6k cores. H4D also delivers further scalability, which we demonstrated by running the LAMMPS LJ benchmark on 192 VMs (&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;~37k cores) while maintaining 92% parallel efficiency (see Figure 3).&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;
    &lt;div class="h-c-grid"&gt;
      &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
        &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_JTLuwUW.max-1000x1000.jpg" alt="Figures 1 and 2"&gt;
      &lt;/figure&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;
    &lt;div class="h-c-grid"&gt;
      &lt;figure class="article-image--medium h-c-grid__col h-c-grid__col--4 h-c-grid__col--offset-4"&gt;
        &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_RA1vjLg.jpg" alt="Figure 3"&gt;
      &lt;/figure&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Manufacturing&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For manufacturing, H4D VMs help engineers shorten design cycles, run larger simulations, and iterate faster by delivering a strong performance boost for mission-critical Computer-Aided Engineering (CAE) workflows. When running complex Computational Fluid Dynamics (CFD) simulations with Ansys Fluent (F1_RaceCar_140m benchmark) on 32 VMs, H4D VMs deliver a 4.1X speedup over our previous C2D VMs with 85% parallel efficiency. Running open-source OpenFOAM (Motorbike_100m), we demonstrated a 5.2X speedup over C2D using 16 VMs, achieving a superlinear parallel efficiency of 122%.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;
    &lt;div class="h-c-grid"&gt;
      &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
        &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/3_9YSJuty.jpg" alt="Figures 4 and 5"&gt;
      &lt;/figure&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;A new standard for HPC price/performance&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;H4D VMs are designed to deliver the best price-performance for HPC workloads on Google Cloud by pairing superior performance with flexible consumption models. H4D supports Dynamic Workload Scheduler (DWS), which adapts to your workflow with Flex Start mode for just-in-time capacity and Calendar mode for guaranteed reservations. This allows you to access compute for as low as 3 cents per core-hour without long-term commitments. The resulting performance and cost efficiencies over previous generation VMs are detailed in Figures 6 and 7. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;
    &lt;div class="h-c-grid"&gt;
      &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
        &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/4_VFxG3YM.jpg" alt="Figure 6"&gt;
      &lt;/figure&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;
  &lt;div class="article-module h-c-page"&gt;
    &lt;div class="h-c-grid"&gt;
      &lt;figure class="article-image--large h-c-grid__col h-c-grid__col--6 h-c-grid__col--offset-3"&gt;
        &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/5_FKrLh4Z.jpg" alt="Figure 7"&gt;
      &lt;/figure&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Comprehensive HPC management&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To manage and deploy large, dense clusters of H4D VMs, you can leverage Google Cloud’s &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/cluster-capabilities"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Director&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which offers advanced maintenance capabilities (you can sign up for the preview &lt;/span&gt;&lt;a href="https://forms.gle/dppWNms5DF44gCwV9" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) alongside the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/cluster-toolkit/docs/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Toolkit&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for rapid cluster deployment  via turnkey system blueprints. For job and workload management, H4D VMs integrate with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/batch/docs/get-started"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Batch&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Google Cloud’s fully managed, cloud-native service that handles queuing, scheduling, and resource provisioning. Additionally, there’s support for &lt;/span&gt;&lt;a href="https://cloud.google.com/products/dws/pricing?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DWS&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which can be used in both Calendar mode for future reservations and Flex Start mode for time-limited, on-demand usage.&lt;/span&gt;&lt;/p&gt;
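&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As an illustration of the Batch workflow, a job defined in a JSON config file can be submitted from the gcloud CLI. This is a minimal sketch: the job name, region, and config file below are placeholders rather than H4D-specific settings, and the job definition itself (machine type, task script, and so on) lives in the referenced file.&lt;/span&gt;&lt;/p&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;# Submit a Batch job described in job.json (placeholder names; adjust the region and config to your setup)
gcloud batch jobs submit my-hpc-job \
    --location=us-central1 \
    --config=job.json

# Check the job status once it has been queued and scheduled
gcloud batch jobs describe my-hpc-job --location=us-central1&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;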
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;What customers and partners are saying&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/jump.max-1000x1000.jpg" alt="jump"&gt;
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;i&gt;“We were able to test the H4D platform in early access at&lt;/i&gt; &lt;a href="https://www.jumptrading.com/"&gt;&lt;i&gt;Jump Trading&lt;/i&gt;&lt;/a&gt;&lt;i&gt;, and were extremely impressed with the results. The successful testing process demonstrated that H4D offers the performance, stability, and efficiency we require for demanding, high-volume operations. We see up to 50% better price/performance compared to prior generation machines and are now accelerating integration with our critical grid workloads on Google Cloud."&lt;/i&gt; &lt;b&gt;- Alex Davies, Chief Technology Officer &amp;amp; Benjamin Stromski, HPC Linux Engineering, Jump Trading&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/hmx_labs.max-1000x1000.jpg" alt="hmx labs"&gt;
    &lt;/figure&gt;

  





&lt;p data-block-key="ciutv"&gt;&lt;i&gt;“There lingers, especially in large-scale and compute-intensive domains, the idea that the fastest systems can only be built on premises and run on bare metal hardware. Terms such as ‘hypervisor tax’ are often thrown around as justification for operating with bare metal. Our testing paints a different picture. The Google H4D VM performs better on our financial risk benchmark than the bare metal top of stack AMD CPU of the same generation.”&lt;/i&gt; &lt;b&gt;- Hamza Mian/CEO, HMxLabs&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/totalcare.max-1000x1000.jpg" alt="totalcare"&gt;
    &lt;/figure&gt;

  





      &lt;p data-block-key="ciutv"&gt;&lt;i&gt;"As a leading provider of managed HPC solutions for the demanding CAE and manufacturing sectors, our evaluation of the H4D platform was focused heavily on its ability to handle our clients' largest, most tightly-coupled simulation workloads. We are extremely impressed with the results. The testing confirmed that the underlying RDMA fabric exhibits the outstanding low-latency and high-bandwidth performance required for massive parallel processing. This level of interconnect efficiency is non-negotiable for speeding up critical manufacturing simulations like crash testing and CFD. H4D has proven itself to be a true accelerator for high-throughput engineering workloads, and we are excited about its potential to redefine the performance ceiling for HPC in the engineering world."&lt;/i&gt; &lt;b&gt;- Rodney Mach/President, TotalCAE&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small"&gt;
      &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Google.max-1000x1000.jpg" alt="Google"&gt;
    &lt;/figure&gt;

  





&lt;p data-block-key="ciutv"&gt;&lt;i&gt;“The new H4D instances are a significant step forward for our demanding next-generation TPU simulation workloads. We've seen a 30% performance improvement across a variety of EDA benchmarks compared to C2D, demonstrating the strong single-core performance of H4D. This directly translates to faster development cycles and allows our engineering teams to iterate more quickly.”&lt;/i&gt; &lt;b&gt;- Trevor Switkowski, Technical Lead of Chip Design Methodology, Google Cloud&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Experience H4D today&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;H4D is now available in &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;us-central1-a (Iowa), europe-west4-b (Netherlands) and asia-southeast1-a (Singapore)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with additional regions coming soon. Check regional availability on our &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/regions-zones#available"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Regions and Zones page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and deploy your most demanding HPC workloads by leveraging &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/create-vm-with-rdma"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud RDMA&lt;/span&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;. &lt;/strong&gt;&lt;/p&gt;
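&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a minimal sketch, a single H4D instance can be created with the gcloud CLI using one of the zones above and a machine shape such as h4d-highmem-192 (the instance name below is a placeholder, and multi-node Cloud RDMA setups require the additional network configuration described in the guide linked above).&lt;/span&gt;&lt;/p&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;# Create a single H4D VM (placeholder instance name; pick a zone where H4D is available)
gcloud compute instances create my-h4d-vm \
    --zone=us-central1-a \
    --machine-type=h4d-highmem-192&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;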
&lt;hr/&gt;
&lt;p&gt;&lt;sub&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;The following configurations were run for the above benchmarks: &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;LAMMPS version 20250722, GROMACS version 2023.1, OpenFOAM version 2312, Ansys Fluent version 2024R1. All runs used Intel MPI 2021.17.2. C2D/C3D/C4D used TCP; H4D used RDMA with RXM &amp;amp; SAR_LIMIT=2G. All runs used the full ppn (processes-per-node) available on each platform (56, 180, and 192 for C2D, C3D, and C4D/H4D, respectively). Ansys Fluent runs used 168 ppn on H4D and variable ppn for C4D. SMT off for all. Cost comparison across single nodes of h4d-highmem-192 with DWS Flex Start price, c3d-standard-360 and c2d-standard-112 on-demand price.&lt;/span&gt;&lt;/em&gt;&lt;/sub&gt;&lt;/p&gt;
&lt;p&gt;&lt;sub&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Parallel efficiency and optimal node count depend on input size and communication patterns, and therefore vary across workloads.&lt;/span&gt;&lt;/em&gt;&lt;/sub&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 04 Mar 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/h4d-vms-now-ga/</guid><category>HPC</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>H4D VMs, now GA, deliver exceptional performance and scaling for HPC workloads</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/h4d-vms-now-ga/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Aysha Keen</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Felix Schürmann</name><title>Senior HPC Technologist</title><department></department><company></company></author></item><item><title>Accelerating discovery at the speed of cloud: What’s New for HPC at Google Cloud for SC25</title><link>https://cloud.google.com/blog/topics/hpc/accelerating-innovation-and-discovery-at-sc25/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the pace of scientific discovery moving faster than ever, we’re excited to join the supercomputing community as it gets ready for its annual flagship event, &lt;/span&gt;&lt;a href="https://sc25.supercomputing.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;SC25&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, in St. Louis from November 16-21, 2025. There, we’ll share how Google Cloud is poised to help with our lineup of HPC and AI technologies and innovations, helping researchers, scientists, and engineers solve some of humanity's biggest challenges.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Redefining supercomputing with cloud-native HPC&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;S&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;upercomputers are evolving from a rigid, capital-intensive resource into an adaptable, scalable service. To go from “HPC in the cloud” to “cloud-native HPC,” we leverage core principles of automation and elastic infrastructure to fundamentally change how you consume HPC resources, allowing you to spin up purpose-built clusters in minutes with the exact resources you need. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This cloud-native model is very flexible. You can augment an on-premises cluster to meet peak demand or build a cloud-native system tailored with the right mix of hardware for your specific problem — be it the latest CPUs, GPUs, or TPUs. With this approach, we’re democratizing HPC, putting world-class capabilities into the hands of startups, academics, labs, and enterprise teams alike. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Key highlights at SC25:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Next-generation infrastructure: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We’ll be showcasing our latest &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/compute-optimized-machines#h4d_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;H4D VMs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, powered by 5th generation AMD EPYC processors and featuring Cloud RDMA for low-latency networking. You’ll also see our latest accelerated compute resources including &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A4X&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/now-shipping-a4x-max-vertex-ai-training-and-more"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A4X Max&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; VMs featuring the latest NVIDIA GPUs with RDMA.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Powering your essential applications: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Run your most demanding simulations at massive scale — from Computational Fluid Dynamics (CFD) with Ansys, to Computer-Aided Engineering with Siemens, computational chemistry with Schrödinger, and risk modeling in financial services (FSI).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Dynamic Workload Scheduler:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Discover how &lt;/span&gt;&lt;a href="https://cloud.google.com/products/dws/pricing"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Workload Scheduler&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and its innovative &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/dws"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Flex Start mode&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, integrated with familiar schedulers like Slurm, is reshaping HPC consumption. Move beyond static queues toward flexible, cost-effective, and efficient access to high-demand compute resources. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Easier HPC with Cluster Toolkit: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Learn how &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/cluster-toolkit/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Toolkit&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; can help you deploy a supercomputer-scale cluster with fewer than 50 lines of code (see the sketch after this list).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;High-throughput, scalable storage:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Get a deep dive into &lt;/span&gt;&lt;a href="https://cloud.google.com/products/managed-lustre"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Managed Lustre&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a fully managed, high-performance parallel file system that can handle your most demanding HPC and AI workloads.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hybrid for the enterprise: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For our enterprise customers, especially in financial services, we're enabling hybrid cloud with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/instances/ibm-symphony"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;IBM Spectrum Symphony Connectors&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, allowing you to migrate or burst workloads to Google Cloud and reduce time-to-solution.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
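&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a rough sketch of the Cluster Toolkit flow referenced above: you write a short YAML blueprint describing the cluster and hand it to the Toolkit binary, which provisions the deployment. The binary invocation, blueprint file, and project ID below are assumptions for illustration only; see the Cluster Toolkit documentation linked above for the current commands and example blueprints.&lt;/span&gt;&lt;/p&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;# Illustrative only: deploy a cluster from a blueprint with the Cluster Toolkit binary
# (blueprint file and project ID are placeholders; verify command names against the docs)
./gcluster deploy my-hpc-blueprint.yaml --vars project_id=my-project&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;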
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;AI-powered scientific discovery&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;There’s a powerful synergy between HPC and AI — where HPC builds more powerful AI, and AI makes HPC faster and more insightful. This complementary relationship is fundamentally changing how research is done, accelerating discovery in everything from drug development and climate modeling to new materials and engineering. At Google Cloud, we’re at the forefront of this transformation, building the models, tools, and platforms that make it possible. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;What to look for: &lt;/strong&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;AI for scientific productivity: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We’ll be showcasing Google’s suite of AI tools designed to enhance the entire research lifecycle. From &lt;/span&gt;&lt;a href="https://cloud.google.com/agentspace/docs/idea-generation"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Idea Generation agent&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to &lt;/span&gt;&lt;a href="https://codeassist.google/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Code Assist&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with &lt;/span&gt;&lt;a href="https://cloud.google.com/gemini-enterprise?_gl=1*9qpvwe*_up*MQ..&amp;amp;gclid=EAIaIQobChMIptGF-7qrkAMVRyvUAR0VwSw1EAAYASAAEgIZMPD_BwE&amp;amp;gclsrc=aw.ds#module-7"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Enterprise&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, you’ll see how AI can augment your capabilities and accelerate discovery. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;AI-powered scientific applications: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Learn about the latest advancements in our AI-powered scientific applications, including AlphaFold 3 and WeatherNext.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The power of TPUs:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Explore Google's &lt;/span&gt;&lt;a href="https://cloud.google.com/tpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;TPUs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, including the latest seventh-generation &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ironwood&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; model, and discover how they can enhance AI workload performance and efficiency.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;Join Google Cloud at SC25: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud, we believe the cloud is the supercomputer of the future. From purpose-built HPC and AI infrastructure to quantum breakthroughs and simplified open-source tools, let Google Cloud be the platform for your next discovery.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We invite you to connect with our experts and learn more. Join the &lt;/span&gt;&lt;a href="https://sites.google.com/view/advancedcomputingcommunity/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Advanced Computing Community&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to engage in discussions with our partners and the broader HPC, AI, and quantum communities.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We can’t wait to see what you discover.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;See us at the show:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Visit us in booth #3724: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Stop by for live demos of our latest HPC and AI solutions, including Dynamic Workload Scheduler, Cluster Toolkit, our latest AI agents, and even see our TPUs. Our team of experts will be on hand to answer your questions and discuss how Google Cloud can meet your needs.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Attend our technical talks:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Keep an eye on &lt;/span&gt;&lt;a href="https://rsvp.withgoogle.com/events/google-cloud-sc-25" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;our SC25 schedule&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for Google Cloud presentations and technical talks, where our leaders and partners will share deep dives, insights, and best practices.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Passport program: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Grab a passport card from the Google booth and visit our demos, labs, and talks to collect stamps and learn about how we’re working with organizations across the HPC ecosystem to democratize HPC. Come back to the Google booth with your completed passport card to choose your prize!&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Play a game:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Join us in the Google booth and at our events to enjoy some Gemini-driven games — test your tech trivia knowledge or compete head-to-head with others to build the best LEGO creation!&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Join our community kickoff: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Are you a member of the Google Cloud Advanced Computing Community? Secure your spot today for our &lt;/span&gt;&lt;a href="https://rsvp.withgoogle.com/events/google-cloud-advanced-computing-community-sc25" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;SC25 Kickoff Happy Hour&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;!&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;Celebrate with NVIDIA and Google Cloud: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We’re proud to co-host a &lt;/span&gt;&lt;a href="https://rsvp.withgoogle.com/events/google-cloud-nvidia-sc25" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;reception with NVIDIA&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and we look forward to toasting another year of innovation with our customers and partners. Register today to secure your spot!&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><pubDate>Fri, 14 Nov 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/hpc/accelerating-innovation-and-discovery-at-sc25/</guid><category>HPC</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Accelerating discovery at the speed of cloud: What’s New for HPC at Google Cloud for SC25</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/hpc/accelerating-innovation-and-discovery-at-sc25/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Megan Gawlik</name><title>Outbound Product Manager</title><department></department><company></company></author></item><item><title>How scientists can leverage AI agents using Gemini Enterprise, Gemini Code Assist, and Gemini CLI</title><link>https://cloud.google.com/blog/products/ai-machine-learning/how-scientists-can-use-gemini-enterprise-for-ai-workflows/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Scientific inquiry has always been a journey of curiosity, meticulous effort, and groundbreaking discoveries. Today, that journey is being redefined, fueled by the incredible capabilities of AI. It’s moving beyond simply processing data to actively participating in every stage of discovery, and Google Cloud is at the forefront of this transformation, building the tools and platforms that make it possible. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The sheer volume of data generated by modern research is immense, often too vast for human analysis alone. This is where AI steps in, not just as a tool, but as a collaborative force. We’re seeing powerful new models and AI agents assist with everything from identifying relevant literature and generating novel hypotheses to designing experiments, running simulations, and making sense of complex results. This collaboration doesn’t replace human intellect; it amplifies it, allowing researchers to explore more avenues, more quickly, and with greater precision. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud, we’re bringing together high-performance computing (HPC) and advanced AI on a single, integrated platform. This means you can seamlessly move from running massive-scale simulations to applying sophisticated machine learning models, all in one environment. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;So, how can you leverage these capabilities to get to insights faster? The journey begins at the foundation of scientific inquiry: the hypothesis.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;AI-enhanced scientific inquiry&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Every great discovery starts with a powerful hypothesis. With millions of research papers published annually, identifying novel opportunities is a monumental task. To overcome this information overload, scientists can now turn to AI as a powerful research partner.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our &lt;/span&gt;&lt;a href="https://cloud.google.com/agentspace/docs/research-assistant"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Deep Research&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; agent tackles the first step: performing a comprehensive analysis of published literature to produce detailed reports on a given topic that would otherwise take months to compile. Building on that foundation, our &lt;/span&gt;&lt;a href="https://cloud.google.com/agentspace/docs/idea-generation"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Idea Generation agent&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; then deploys an ensemble of AI collaborators to brainstorm, evaluate, propose, debate, and rank novel hypotheses. This powerful combination, available in &lt;/span&gt;&lt;a href="https://cloud.google.com/gemini-enterprise?_gl=1*9qpvwe*_up*MQ..&amp;amp;gclid=EAIaIQobChMIptGF-7qrkAMVRyvUAR0VwSw1EAAYASAAEgIZMPD_BwE&amp;amp;gclsrc=aw.ds#module-7"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Enterprise&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, transforms the initial phase of scientific inquiry, empowering researchers to augment their expertise and find connections they might otherwise miss.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Go from hypothesis to results, faster&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once a hypothesis is formed, the work of translating it into executable code begins. This is where AI coding assistants, such as &lt;/span&gt;&lt;a href="https://codeassist.google/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Code Assist&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, excel. They automate the tedious tasks of writing analysis scripts and simulation models by generating code from natural language and providing real-time suggestions, dramatically speeding up the core development process. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But modern research is more than just a single script; it’s a complete workflow of data, environments, and results managed from the command line. For this, &lt;/span&gt;&lt;a href="https://cloud.google.com/gemini/docs/codeassist/gemini-cli"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini CLI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; brings that same conversational power directly to your terminal. It acts as the ultimate workflow accelerator, allowing you to instantly synthesize research and generate hypotheses with simple commands, then seamlessly transition to experimentation by generating sophisticated analysis scripts, and debugging errors on the fly, all without ever breaking your focus. Gemini CLI can further accelerate your path to impact by transforming raw results into publication-ready text, generating the code for figures and tables, and refining your work for submission. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This capability extends to automating the entire research environment. Beyond single commands, Gemini CLI can manage complex, multi-step processes like cloning a scientific application, installing its dependencies, and then building and testing it—all with a simple prompt, maximizing your productivity.&lt;/span&gt;&lt;/p&gt;
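&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For example, a workflow like the one above can be kicked off non-interactively from the terminal. This is a sketch only: the repository URL is a placeholder, and the exact flags may differ between Gemini CLI versions.&lt;/span&gt;&lt;/p&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;# Ask Gemini CLI to set up and validate a scientific application in one prompt
# (repository URL is a placeholder)
gemini -p "Clone https://github.com/example/science-app, install its dependencies, build it, and run its test suite. Summarize any failures."&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;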
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;The new era of discovery: Your expertise, AI agents, and Google Cloud&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The new era of scientific discovery is here. By embedding AI into every stage of the scientific process - from sparking the initial idea to accelerating the final analysis - Google Cloud provides a single, unified platform for discovery. This new era of AI-enhanced scientific inquiry is built on a robust, intelligent infrastructure that combines the strengths of HPC simulation and AI. This includes purpose-built solutions like our H4D VMs optimized for scientific simulations, alongside the latest A4 and A4X VMs, powered by the latest NVIDIA GPUs, and Google Cloud Managed Lustre, a parallel file system that eliminates storage bottlenecks and allows your HPC and AI workloads to create and analyze massive datasets simultaneously. We provide the power to streamline the entire process so you can focus on scientific creativity - and changing the world! &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Join the &lt;/span&gt;&lt;a href="https://sites.google.com/view/advancedcomputingcommunity/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Advanced Computing Community&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to connect with other researchers, share best practices, and stay up to date on the latest advancements in AI for scientific and technical computing, or &lt;/span&gt;&lt;a href="https://cloud.google.com/contact"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;contact sales&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to get started today. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 03 Nov 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/ai-machine-learning/how-scientists-can-use-gemini-enterprise-for-ai-workflows/</guid><category>HPC</category><category>Google Cloud</category><category>AI &amp; Machine Learning</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>How scientists can leverage AI agents using Gemini Enterprise, Gemini Code Assist, and Gemini CLI</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/ai-machine-learning/how-scientists-can-use-gemini-enterprise-for-ai-workflows/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Megan Gawlik</name><title>Outbound Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Jay Boisseau</name><title>Advanced Computing Strategist</title><department></department><company></company></author></item><item><title>Evolving Ray and Kubernetes together for the future of distributed AI and ML</title><link>https://cloud.google.com/blog/products/containers-kubernetes/ray-on-gke-new-features-for-ai-scheduling-and-scaling/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ray is an OSS compute engine that is popular among Google Cloud developers to handle complex distributed AI workloads across CPUs, GPUs, and TPUs. Similarly, platform engineers have long trusted Kubernetes, and specifically Google Kubernetes Engine, for powerful and reliable infrastructure orchestration. Earlier this year, we &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/partnering-with-anyscale-to-integrate-rayturbo-with-gke"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;announced a partnership&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with Anyscale to bring the best of Ray and Kubernetes together, forming a distributed operating system for the most demanding AI workloads. Today, we are excited to share some of the open-source enhancements we have built together across Ray and Kubernetes.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Ray and Kubernetes label-based scheduling&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One of the key benefits of Ray is its flexible set of primitives that enable developers to write distributed applications without thinking directly about the underlying hardware. However, there are some use cases that weren’t very well covered by the existing support for virtual resources in Ray.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To improve scheduling flexibility and empower the Ray and Kubernetes schedulers to perform better autoscaling for Ray applications, we are &lt;/span&gt;&lt;a href="https://www.anyscale.com/blog/introducing-label-selectors-scheduling-ray" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;introducing label selectors to Ray&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Ray label selectors are heavily inspired by Kubernetes &lt;/span&gt;&lt;a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;labels and selectors&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and are intended to offer a familiar experience and smooth integration between the two systems. The Ray Label Selector API is available starting in Ray v2.49 and offers improved scheduling flexibility for distributed tasks and actors.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the new &lt;/span&gt;&lt;a href="https://docs.ray.io/en/latest/ray-core/scheduling/labels.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Label Selector API&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Ray now directly helps developers accomplish things like: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Assign labels to nodes in your Ray cluster (e.g. &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gpu-family=L4, market-type=spot, region=us-west-1&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;When launching tasks, actors or placement groups, declare which zones, regions or accelerator types to run on.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Use custom labels to define topologies and advanced scheduling policies.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For scheduling distributed applications on GKE, you can use &lt;/span&gt;&lt;a href="https://docs.ray.io/en/master/cluster/kubernetes/user-guides/label-based-scheduling.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ray and Kubernetes label selectors&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; together to gain full control over both the application and the underlying infrastructure. You can also use this combination with GKE &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-custom-compute-classes"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;custom compute classes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to define fallback behavior when specific GPU types are unavailable. Let’s dive into a specific example.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Below is an example Ray remote task that could run on various GPU types depending on available capacity. Starting in Ray v2.49, you can now define the accelerator type to bind GPUs with fallback behavior in cases where the primary GPU type or market type is not available. In this example, the remote task is targeting spot capacity with L4 GPUs but with a fallback to on-demand:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;@ray.remote(\r\n  label_selector={\r\n      &amp;quot;ray.io/accelerator&amp;quot;: &amp;quot;L4&amp;quot;\r\n       &amp;quot;ray.io/market-type&amp;quot;: &amp;quot;spot&amp;quot;\r\n  },\r\n  fallback_strategy=[\r\n    {\r\n      &amp;quot;label_selector&amp;quot;: {\r\n        &amp;quot;ray.io/accelerator&amp;quot;: &amp;quot;L4&amp;quot;\r\n        &amp;quot;ray.io/market-type&amp;quot;: &amp;quot;on-demand&amp;quot;\r\n       }\r\n    },\r\n  ]\r\n)\r\ndef func():\r\n    pass&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f17587bcc10&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;On GKE, you can couple the same fallback logic using custom compute classes such that the underlying infrastructure for the Ray cluster matches the same fallback behavior:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;apiVersion: cloud.google.com/v1\r\nkind: ComputeClass\r\nmetadata:\r\n  name: gpu-compute-class\r\nspec:\r\n  priorities:\r\n  - gpu:\r\n      type: nvidia-l4\r\n      count: 1\r\n    spot: true\r\n  - gpu:\r\n      type: nvidia-l4\r\n      count: 1\r\n    spot: false\r\n  nodePoolAutoCreation:\r\n    enabled: true\r\n  whenUnsatisfiable: DoNotScaleUp&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1744790790&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Refer to the &lt;/span&gt;&lt;a href="https://docs.ray.io/en/master/cluster/kubernetes/user-guides/label-based-scheduling.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ray documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to get started with Ray label selectors.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Advancing accelerator support in Ray and Kubernetes&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Earlier this year we demonstrated the ability to use the new Ray Serve LLM APIs to deploy large models such as &lt;/span&gt;&lt;a href="https://www.anyscale.com/blog/deepseek-vllm-ray-google-kubernetes" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DeepSeek-R1 on GKE&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with A3 High and A3 Mega machine instances. Starting on GKE v1.33 and KubeRay v1.4, you can use &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-dynamic-resource-allocation"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Resource Allocation (DRA)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for flexible scheduling and sharing of hardware accelerators, enabling the use of the next-generation of AI accelerators with Ray. Specifically, you can now use DRA to deploy Ray clusters on A4X series machines utilizing the NVIDIA GB200 NVL72 rack-scale architecture. To use DRA with Ray on A4X, &lt;/span&gt;&lt;a href="https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom-a4x"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;create an AI-optimized GKE cluster on A4X&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and define a ComputeDomain resource representing your NVL72 rack:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;apiVersion: resource.nvidia.com/v1beta1\r\nkind: ComputeDomain\r\nmetadata:\r\n  name: a4x-compute-domain\r\nspec:\r\n  numNodes: 18\r\n  channel:\r\n    resourceClaimTemplate:\r\n      name: a4x-compute-domain-channel&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1744790250&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;And then specify the claim in your Ray worker’s Pod template:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;workerGroupSpecs:\r\n    ...\r\n    template:\r\n...\r\nspec:\r\n  ...\r\n  volumes:\r\n    ...\r\n  containers:\r\n    - name: ray-container\r\n      ...\r\n      resources:\r\n        limits:\r\n          nvidia.com/gpu: 4\r\n\t claims:\r\n        - name: compute-domain-channel\r\n        ...\r\nresourceClaims:\r\n  - name: compute-domain-channel\r\n    resourceClaimTemplateName: a4x-compute-domain-channel&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1744790a90&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Combining DRA with Ray ensures that Ray worker groups are correctly scheduled on the same GB200 NVL72 rack for optimal GPU performance for the most demanding Ray workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’re also partnering with Anyscale to bring a more native TPU experience to Ray and closer ecosystem integrations with frameworks like JAX. Ray Train introduced a &lt;/span&gt;&lt;a href="https://docs.ray.io/en/latest/train/getting-started-jax.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;JAXTrainer API&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; starting in Ray v2.49, streamlining model training on TPUs using JAX. For more information on these TPU improvements in Ray, read &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/ray-on-tpus-with-gke-a-more-native-experience"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A More Native Experience for Cloud TPUs with Ray&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Ray-native resource isolation with Kubernetes writable cgroups&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Writable cgroups allow a container's root process to create nested cgroups within the same container without requiring privileged capabilities. This is especially important for Ray, which runs multiple control-plane processes alongside user code inside the same container. Even under the most intensive workloads, Ray can dynamically reserve a portion of the container's total resources for system-critical tasks, significantly improving the reliability of your Ray clusters.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Starting on GKE v1.34, you can enable writable cgroups for Ray clusters. This first requires a one-time setup on your node pools by customizing the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;containerd&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; configuration. Add the following to your containerd configuration file:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;writableCgroups:
  enabled: true&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You then specify this updated configuration when you create or update a cluster or node pool. Once your nodes are configured, you can enable writable cgroups for Ray clusters by adding the following annotations:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;metadata:
  annotations:
    node.gke.io/enable-writable-cgroups.test-container: "true"&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To enable Ray resource isolation using writable cgroups, set the following flags in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;ray start&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;ray start --head --enable-resource-isolation&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
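&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Putting the pieces together, the snippet below sketches where the writable-cgroups annotation and the resource-isolation flag sit in a KubeRay RayCluster manifest. It is illustrative rather than complete: the cluster, container, and image names are placeholders, and you should confirm how boolean ray start flags are passed via rayStartParams against the KubeRay documentation.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;# Illustrative sketch (not a complete manifest): where the writable-cgroups
# annotation and the resource-isolation flag fit in a KubeRay RayCluster.
# Names such as "ray-head" are placeholders.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: isolated-ray-cluster
spec:
  headGroupSpec:
    rayStartParams:
      # Passed through to `ray start` on the head node; an empty value renders
      # the bare flag. Check the KubeRay docs for exact flag handling.
      enable-resource-isolation: ""
    template:
      metadata:
        annotations:
          # Enable writable cgroups for the container named "ray-head" (GKE v1.34+).
          node.gke.io/enable-writable-cgroups.ray-head: "true"
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.49.0
            resources:
              limits:
                cpu: "8"
                memory: 16Gi&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;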
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This capability is one such example of how we’re evolving Ray and Kubernetes to improve reliability across the stack without compromising on security.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the near future, we plan to also introduce support for per-task and per-actor resource limits and requirements, a long requested feature in Ray. Additionally, we are collaborating with the open-source Kubernetes community to upstream this feature. To learn more, check out the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/writable-cgroups"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Ray vertical autoscaling with in-place pod resizing&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the &lt;/span&gt;&lt;a href="https://kubernetes.io/blog/2025/05/16/kubernetes-v1-33-in-place-pod-resize-beta/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;introduction of in-place pod resizing in Kubernetes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; v1.33, we’re in the early stages of integrating vertical scaling capabilities for Ray when running on Kubernetes. Our early benchmarks show a 30% increase in workload efficiency due to scaling pods vertically before scaling horizontally. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
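&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In-place resizing builds on the container-level resizePolicy field. As a minimal sketch (the names and resource values are illustrative, not our benchmark configuration), a Ray worker pod that allows CPU and memory to be resized without restarting the container could look like this:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;# Minimal sketch of a pod that permits in-place resource resizing (Kubernetes v1.33+).
# The pod name, container name, image, and resource values are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: ray-worker-resizable
spec:
  containers:
    - name: ray-worker
      image: rayproject/ray:2.49.0
      resizePolicy:
        # Allow CPU and memory changes without restarting the container.
        - resourceName: cpu
          restartPolicy: NotRequired
        - resourceName: memory
          restartPolicy: NotRequired
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 8Gi&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A controller can then adjust the pod’s CPU and memory in place, for example by patching the pod through the resize subresource, rather than recreating it.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;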
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_abzFIQW.max-1000x1000.png"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="bev4j"&gt;Benchmark based on completing two TPC-H workloads (Query 1 and 5) with Ray, 3 times on a GKE cluster with 3 worker nodes, each with 32 CPUs and 32 GB of memory.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In-place pod resizing enhances workload efficiency in the following ways:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Faster task/actor scale-up:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; With in-place resizing, Ray workers can scale up their available resources in seconds, an improvement over the minutes it could take to provision new nodes. This capability significantly accelerates the scheduling time for new Ray tasks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Enhanced bin-packing and resource utilization:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; In-place pod resizing enables more efficient bin-packing of Ray workers onto Kubernetes nodes. As new Ray workers scale up, they can reserve smaller portions of the available node capacity, freeing up the remaining capacity for other workloads.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Improved reliability and reduced failures:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; In-place scaling of memory can significantly reduce out-of-memory (OOM) errors. By avoiding the need to restart failed jobs, this capability&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;improves overall workload efficiency and stability.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Ray + Kubernetes = The distributed OS for AI&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are excited to highlight the recent joint innovations from our partnership with Anyscale. The powerful synergy between Ray and Kubernetes positions them as the distributed operating system for modern AI/ML. We believe our continued partnership will accelerate innovation within the open-source Ray and Kubernetes ecosystems, ultimately driving the future of distributed AI/ML.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Together, these updates are a significant step toward Ray working seamlessly on GKE. Here’s how to get started:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Request capacity:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Get started quickly with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dynamic Workload Scheduler Flex Start&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/dws-flex-start-training-tpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;TPUs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/dws-flex-start-training"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GPUs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which provides access to compute for jobs that run for less than 7 days.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Get started with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/concepts/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ray on GKE&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style="vertical-align: baseline;"&gt;Try out &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/tutorials/distributed-training-tpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;JaxTrainer with TPUs&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/ray-on-tpus-with-gke-a-more-native-experience/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;A more native experience for Cloud TPUs with Ray on GKE&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;Ray on GKE has new features: label-based scheduling, atomic slice reservations, JaxTrainer, built-in TPU awareness (topologies/SPMD/metri...&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Mon, 03 Nov 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/containers-kubernetes/ray-on-gke-new-features-for-ai-scheduling-and-scaling/</guid><category>AI &amp; Machine Learning</category><category>Containers &amp; Kubernetes</category><category>HPC</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Evolving Ray and Kubernetes together for the future of distributed AI and ML</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/containers-kubernetes/ray-on-gke-new-features-for-ai-scheduling-and-scaling/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Andrew Sy Kim</name><title>Staff Software Engineer, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Edward Oakes</name><title>Staff Software Engineer, Anyscale</title><department></department><company></company></author></item><item><title>Google Cloud and AMD at STAC Summit NYC: H4D VMs for Finance</title><link>https://cloud.google.com/blog/topics/hpc/h4d-delivers-strong-performance-for-financial-services-workloads/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In capital markets, the race for low latency and high performance is relentless. That’s why Google Cloud is partnering with AMD at the premier &lt;/span&gt;&lt;a href="https://stacresearch.com/events/fall2025nyc/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;STAC Summit NYC&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; on Tuesday, October 28th! We’re joining forces to demonstrate how our combined innovations are tackling the most demanding workloads in the financial services industry, from real-time risk analysis to algorithmic trading. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;H4D VMs for financial services&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At the core of our offerings are the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-h4d-vms-optimized-for-hpc?e=0"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud H4D VMs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, now in Preview, powered by 5th Gen AMD EPYC processors (codenamed Turin).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The financial world operates at lightning speed, where every millisecond counts. The H4D VM series is purpose-built to deliver the extreme performance required for high-frequency trading (HFT), backtesting, market risk simulations (e.g., Monte Carlo), and derivatives pricing. With fast, efficient core-to-core communication, massive memory capacity, and optimized network throughput, the H4D series is designed to execute complex computations faster, reduce simulation times, and ultimately deliver a competitive edge.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;H4D: Superior performance for financial workloads&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To quantify the generational performance leap, we commissioned performance testing by AMD. They compared the new H4D VM directly against the previous generation C3D VM (powered by 4th Gen AMD EPYC processors), using the &lt;/span&gt;&lt;a href="https://github.com/KxSystems/nano" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;KX Nano open-source &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;benchmark. This benchmark utility is designed to test the raw CPU, memory, and I/O performance of systems running data operations for kdb+ databases. These high-performance, column-based time series databases are widely used by major financial institutions, including investment banks and hedge funds, to handle large volumes of time-series data like stock market trades and quotes.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The results demonstrated a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;significant, out-of-the-box performance gain&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for the H4D series. With no additional system tuning, the&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; H4D VM outperformed the C3D VM by an average of ~34% across all KX Nano test scenarios&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Scenario1.max-1000x1000.png"
        
          alt="Scenario1"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="cbtjk"&gt;Figure 1: Per-core, cache-sensitive operations (Scenario 1) showed H4D's generational lead with a ~1.36x uplift in performance across all test types, confirming superior speed and efficiency of communication between cores and memory latency for key financial modeling functions. *1&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Scenario2.max-1000x1000.png"
        
          alt="Scenario2"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="cbtjk"&gt;Figure 2: Multi-core scalability with the number of processors set to the max core count and 1 kdb worker per thread (Scenario 2) delivered a ~1.33x performance uplift across all test types, demonstrating H4D's strong capability for parallel processing across all available cores. *2&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Scenario3.max-1000x1000.png"
        
          alt="Scenario3"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="cbtjk"&gt;Figure 3: For heavy, concurrent multi-threaded workloads with 8 threads per kdb+ instance and 1 thread per core (Scenario 3), H4D sustained substantial leadership, delivering relative gains of ~1.33x uplift across all test types. *3&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These benchmark results demonstrate the H4D VMs are built to accelerate your most demanding, low-latency workloads, providing the performance required for high-frequency trading, risk simulations, and quantitative analysis.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A full spectrum of financial services solutions&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The H4D VMs will be a major highlight for Google Cloud and AMD at the STAC Summit next Tuesday. Our booths will also showcase our full spectrum of solutions for financial institutions. Stop by to discuss how we can help optimize your entire technology stack, from data storage to advanced computation:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/blog/topics/hpc/announcing-new-ibm-spectrum-symphony-hostfactory-connectors"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;IBM Symphony GCE and GKE Connectors&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Discover how to extend and manage your existing Platform Symphony grid compute environments by bursting jobs to Compute Engine or Google Kubernetes Engine (GKE).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/products/managed-lustre?e=48754805&amp;amp;hl=en"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Managed Lustre&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Get extreme performance file storage for your most demanding HPC and quantitative workloads without the operational overhead.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/gpu?e=48754805&amp;amp;hl=en"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;GPUs&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; and &lt;/strong&gt;&lt;a href="https://cloud.google.com/tpu?e=48754805&amp;amp;hl=en"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;TPUs&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Learn how our powerful accelerators can dramatically speed up machine learning, AI, and risk analysis tasks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/blog/products/compute/managed-slurm-and-other-cluster-director-enhancements?e=48754805#:~:text=Cluster%20Director%20provides%20fault%2Dtolerant,%2C%20and%20boot%2Ddisk%20size."&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Director with Managed Slurm&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Easily deploy and manage your HPC cluster workloads with our integration for the popular Slurm workload manager.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Come talk to experts!&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We know that performance, security, and compliance are non-negotiable in financial services. Our team will be on site to discuss your specific challenges and demonstrate how Google Cloud, in partnership with AMD, provides the robust, high-performance foundation your firm needs to innovate and thrive.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;We look forward to connecting with you at the Google Cloud and AMD booths at &lt;/strong&gt;&lt;a href="https://stacresearch.com/events/fall2025nyc/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;STAC Summit NYC&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; on October 28th!&lt;/strong&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt; &lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_RsToAkv.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_a8ogcdA.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_TVF43or.max-1000x1000.png"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;</description><pubDate>Wed, 22 Oct 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/hpc/h4d-delivers-strong-performance-for-financial-services-workloads/</guid><category>Compute</category><category>Financial Services</category><category>HPC</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Google Cloud and AMD at STAC Summit NYC: H4D VMs for Finance</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/hpc/h4d-delivers-strong-performance-for-financial-services-workloads/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Annie Ma-Weaver</name><title>Group Product Manager, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Anthony Frery</name><title>Customer Engineer, Google Cloud HPC</title><department></department><company></company></author></item><item><title>G4 VMs under the hood: A custom, high-performance P2P fabric for multi-GPU workloads</title><link>https://cloud.google.com/blog/products/compute/g4-vms-p2p-fabric-boosts-multi-gpu-workloads/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we announced the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/g4-vms-powered-by-nvidia-rtx-6000-blackwell-gpus-are-ga"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;general availability of the G4 VM family&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; based on NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. Thanks to unique platform optimizations only available in Google Cloud, G4 VMs deliver the best performance of any commercially available NVIDIA RTX PRO 6000 Blackwell GPU offering for inference and fine-tuning on a wide range of models, from less than 30B to over 100B parameters. In this blog, we discuss the need for these platform optimizations, how they work, and how to use them in your own environment. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Collective communications performance matters &lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Large language models (LLMs) vary significantly in size, as characterized by their number of parameters: small (~7B), medium (~70B), and large (~350B+). LLMs often exceed the memory capacity of a single GPU, including the NVIDIA RTX PRO 6000 Blackwell, with its 96 GB of GDDR7 memory. A common solution is tensor parallelism (TP), which distributes individual model layers across multiple GPUs. This involves partitioning a layer's weight matrices, allowing each GPU to perform a partial computation in parallel. However, a significant performance bottleneck arises from the subsequent need to combine these partial results using collective communication operations like All-Gather or All-Reduce.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The G4 family of GPU virtual machines utilizes a PCIe-only interconnect. We drew on our extensive infrastructure expertise to develop this high-performance, software-defined PCIe fabric that supports peer-to-peer (P2P) communication. Crucially, G4’s platform-level P2P optimization substantially accelerates collective communications for workloads that require multi-GPU scaling, resulting in a notable boost for both inference and fine-tuning of LLMs.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;How G4 accelerates multi-GPU performance&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Multi-GPU G4 VM shapes get their significantly enhanced PCIe P2P capabilities from a combination of both custom hardware and software. This advancement directly optimizes collective communications, including All-to-All, All-Reduce, and All-Gather collectives for managing GPU data exchange. The result is a low-latency data path that delivers a substantial performance increase for critical workloads like multi-GPU inference and fine-tuning.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In fact, across all major collectives, the enhanced G4 P2P capability provides an acceleration of &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;up to 2.2x&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; without requiring any changes to the code or workload.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
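&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you want to observe the collective-communication uplift yourself, the open-source nccl-tests suite is one way to measure it. The commands below are an illustrative sketch for an 8-GPU g4-standard-384 VM; the paths and message sizes are placeholders.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;# Illustrative sketch: measure All-Reduce bandwidth with the open-source
# nccl-tests suite on an 8-GPU G4 VM. Paths and message sizes are placeholders.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda        # build against the locally installed CUDA toolkit
# Sweep message sizes from 8 bytes to 8 GB across all 8 local GPUs.
./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;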
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/01_collective_communications.max-1000x1000.jpg"
        
          alt="01_collective_communications"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Inference performance boost by P2P on G4&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;On G4 instances, enhanced peer-to-peer communication directly boosts multi-GPU workload performance, particularly for tensor parallel inference with vLLM, with up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;168% higher throughput&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, and up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;41% lower inter-token latency &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;(ITL).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We observe these improvements when using tensor parallelism for model serving, especially when compared to standard non-P2P offerings.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
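&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For context, the serving pattern behind these measurements is tensor-parallel vLLM. A minimal sketch of launching such a server across the four GPUs of a g4-standard-192 VM is shown below; the model and port are illustrative, not the exact benchmark configuration.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;# Illustrative sketch: serve a model with vLLM using tensor parallelism across
# 4 local GPUs. The model and port are placeholders, not the benchmarked setup.
pip install vllm
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --port 8000&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;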
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/02_throughput.max-1000x1000.jpg"
        
          alt="02_throughput"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At the same time, G4 coupled with software-defined PCIe and P2P innovation, significantly enhances inference throughput and reduces latency, giving you the control to optimize your inference deployment for your business needs.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/03_latency.max-1000x1000.jpg"
        
          alt="03_latency"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Throughput or speed: G4 with P2P lets you choose&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The platform-level optimizations on G4 VMs translate directly into a flexible and powerful competitive advantage. For interactive generative AI applications, where user experience is paramount, G4’s P2P technology delivers up to 41% less inter-token latency — the critical delay between generating each part of a response. This results in a noticeably snappier and more reactive end-user experience, increasing their satisfaction with your AI application.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Alternatively, for workloads where raw throughput is the priority, such as batch inference, G4 with P2P enables customers to serve up to 168% more requests than comparable offerings. This means you can either increase the number of users served by each model instance, or significantly improve the responsiveness of your AI applications. Whether your focus is on latency-sensitive interactions or high-volume throughput, G4 provides a superior return on investment compared to other NVIDIA RTX PRO 6000 offerings in the market.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Scale further with G4 and GKE Inference Gateway&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While P2P optimizes performance for a single model replica, scaling to meet production demand often requires multiple replicas. This is where the &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GKE Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; really shines. It acts as an intelligent traffic manager for your models, using advanced features like prefix-cache-aware routing and custom scheduling to maximize throughput and slash latency across your entire deployment.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By combining the vertical scaling of G4's P2P with the horizontal scaling of the Inference Gateway, you can build an end-to-end serving solution that is exceptionally performant and cost-effective for the most demanding generative AI applications. For instance, you can use G4's P2P to efficiently run a 2-GPU Llama-3.1-70B model replica with 66% higher throughput, and then use GKE Inference Gateway to intelligently manage and autoscale multiple of these replicas to meet global user demand.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
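&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As an illustration, a single 2-GPU model replica of this kind could be expressed as the GKE Deployment sketched below; the image, model, labels, and node selector are placeholders. GKE Inference Gateway then routes traffic across however many of these replicas you autoscale.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;# Illustrative sketch of one 2-GPU vLLM replica on GKE. Names, image, model,
# and the node selector are placeholders; Inference Gateway would sit in front
# of the Service created for this Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-70b-vllm
  template:
    metadata:
      labels:
        app: llama-70b-vllm
    spec:
      nodeSelector:
        # Schedule onto G4 nodes (assumes the standard machine-family node label).
        cloud.google.com/machine-family: g4
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model=meta-llama/Llama-3.1-70B-Instruct"
            - "--tensor-parallel-size=2"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "2"&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;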
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/04_inference_gateway.max-1000x1000.jpg"
        
          alt="04_inference_gateway"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;G4 P2P supported VM Shapes&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Peer-to-peer capabilities for NVIDIA RTX PRO 6000 Blackwell are available with the following multi-GPU G4 VM shapes:&lt;/span&gt;&lt;/p&gt;
&lt;div align="left"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table&gt;&lt;colgroup&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Machine Type&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;GPUs&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Peer-to-Peer&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;GPU Memory (GB)&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;vCPUs&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Host Memory (GB)&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Local SSD (GB)&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;g4-standard-96&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;2&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;Yes&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;192&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;96&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;360&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;3,000&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;g4-standard-192&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;4&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;Yes&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;384&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;192&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;720&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;6,000&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;g4-standard-384&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;8&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;Yes&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;768&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;384&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;1,440&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: middle; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span style="vertical-align: baseline;"&gt;12,000&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For VM shapes smaller than 8 GPUs, our software defined PCIe fabric ensures path isolation between GPUs assigned to different VMs on the same physical machine. PCIe paths are created dynamically at VM creation and are dependent on the VM shape, ensuring isolation on multiple levels of the platform stack to prevent communication between GPUs that are not assigned to the same VM.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started with P2P on G4&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The G4 peer-to-peer capability is transparent to the workload, and requires no changes to the application code or to libraries such as the &lt;/span&gt;&lt;a href="https://developer.nvidia.com/nccl" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NVIDIA Collective Communications Library&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (NCCL). All peer-to-peer paths are automatically set up during VM creation. You can find more information about enabling peer-to-peer for NCCL-based workloads in the &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/accelerator-optimized-machines?hl=en#g4-gpu-p2p"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;G4 documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
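&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For example, a multi-GPU G4 VM can be created with a command along these lines; the zone and boot image are illustrative, so pick a zone with G4 capacity and an image with the NVIDIA drivers you need:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;# Illustrative sketch: create a 4-GPU g4-standard-192 VM. The zone and image
# are placeholders; GPU VMs require a TERMINATE host-maintenance policy.
gcloud compute instances create g4-p2p-demo \
    --machine-type=g4-standard-192 \
    --zone=us-central1-b \
    --image-family=debian-12 \
    --image-project=debian-cloud \
    --maintenance-policy=TERMINATE&lt;/pre&gt;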
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Try &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/accelerator-optimized-machines#g4-series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud G4 VMs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with P2P from the Google Cloud console today, and start building your inference platform with GKE Inference Gateway. For more information, please contact your Google Cloud sales team or reseller.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 20 Oct 2025 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/g4-vms-p2p-fabric-boosts-multi-gpu-workloads/</guid><category>AI &amp; Machine Learning</category><category>HPC</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>G4 VMs under the hood: A custom, high-performance P2P fabric for multi-GPU workloads</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/g4-vms-p2p-fabric-boosts-multi-gpu-workloads/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Cyrill Hug</name><title>Sr. Product Manager Accelerator Software, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Prashanth Prakash</name><title>Software Engineer, Google</title><department></department><company></company></author></item><item><title>Open-source and enterprise-ready: IBM Spectrum Symphony connectors for Google Cloud</title><link>https://cloud.google.com/blog/topics/hpc/announcing-new-ibm-spectrum-symphony-hostfactory-connectors/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud, we are committed to helping customers deploy their high performance computing (HPC) grid workloads to our platform. Today, we are thrilled to announce the general availability of open-source &lt;/span&gt;&lt;a href="https://cloud.google.com/cluster-toolkit/docs/ibm-symphony/ibm-symphony"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;IBM Spectrum Symphony HostFactory connectors for Google Compute Engine and Google Kubernetes Engine (GKE)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This integration between Google Cloud and IBM Spectrum Symphony gives you access to the benefits of Google Cloud for your grid workloads by supporting common architectures and requirements, namely:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Extending your on-premises cluster to Google Cloud and automatically adding compute capacity to reduce execution time of your jobs, or&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Deploying an entire cluster in Google Cloud and automatically provisioning and decommissioning compute resources based on your workloads&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These connectors are provided in the form of IBM Spectrum Symphony HostFactory custom cloud providers. They are open source and can be easily deployed either via &lt;/span&gt;&lt;a href="https://cloud.google.com/cluster-toolkit/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Toolkit&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or manually.&lt;/span&gt;&lt;/p&gt;
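&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a rough sketch of the Cluster Toolkit path (the blueprint file name below is a placeholder; the actual Symphony blueprint is described in the documentation linked above), deployment looks like this:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;# Illustrative sketch, not the verbatim procedure: build the Cluster Toolkit CLI
# and deploy a blueprint. The blueprint file name is a placeholder; see the IBM
# Spectrum Symphony documentation for the blueprint to use.
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
cd cluster-toolkit
make
./gcluster deploy my-symphony-blueprint.yaml&lt;/pre&gt;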
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Partner-built and tested for enterprise scale&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To deliver robust, production-ready connectors, we collaborated with key partners who have deep expertise in financial services and HPC. Accenture built the Compute Engine and GKE connectors and Aneo performed rigorous user acceptance testing to ensure they met the stringent demands of our enterprise customers.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Accenture is proud to have collaborated with Google Cloud to help develop the IBM Spectrum Symphony connectors. Our expertise in both financial services and cloud solutions allows us to enable customers to seamlessly migrate their critical HPC workloads to Google Cloud's high-performance infrastructure." &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- Keith Jackson, Managing Director - Financial Services, Accenture&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“At Aneo, we subjected the IBM Spectrum Symphony connectors to rigorous, large-scale testing to ensure they meet the demanding performance and scalability requirements of enterprise HPC. We validated the connector's ability to efficiently manage up to 5,000 server nodes, confirming its readiness for production workloads." &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- William Simon Horn, Cloud HPC Engineer, and Wilfried Kirschenmann, CTO, Aneo&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud rapidly scales to meet extreme HPC demands, provisioning over 100,000 vCPUs across 5,000 compute pods in under 8 minutes with the new IBM Spectrum Symphony connector for GKE. IBM has tested and supports Spectrum Symphony up to 5,000 compute nodes, so we set this as our target for scale testing the new Google Cloud connectors.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image2_3DA0tSq.max-1000x1000.png"
        
          alt="image2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The GCE connector demonstrates excellent provisioning speed and stability up to the mid-scale range. The connector also successfully scales to over 5,000 nodes and 125,000 vCPUs in less than 2 minutes.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_VF5jiVC.max-1000x1000.png"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We achieved this performance by leveraging innovative GKE features like image preloading and custom compute classes, enabling customers in demanding sectors like FSI to accelerate mission-critical workloads while optimizing for cost and hybrid cloud flexibility. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Powerful features to run your way&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The connectors are built to provide the flexibility and control needed to manage complex HPC environments. They are available as open-source software in a Google-owned repository. Key features include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Support for Compute Engine and GKE&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Separate IBM Spectrum Symphony Host Factory cloud providers for Compute Engine and GKE allow you to scale your cluster across both virtual machines and containerized environments.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Flexible consumption models&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Support for Spot VMs, on-demand VMs, or a mix of both lets you optimize cost and performance.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Template-based provisioning&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Use configurable resource templates that align with your workloads requirements.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Comprehensive instance support&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Full integration with managed instance group (MIG) APIs, GPUs, Local SSD, and Confidential Computing VMs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Event-driven management&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Pub/Sub integration allows for event-driven resource management for Compute Engine instances.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Kubernetes-native&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The GKE connector uses a custom Kubernetes operator with Custom Resource Definitions (CRDs) to manage the entire lifecycle of Symphony compute pods. Leverage GKE’s scaling capabilities and custom hardware like GPUs and TPUs through transparent compatibility with GKE &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-custom-compute-classes"&gt;custom compute classes (CCC)&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and Node Pool Autoscaler.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;High-scalability&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The connectors are built for high-performance with asynchronous operations to handle large-scale deployments.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Resiliency&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Automatic detection and handling of Spot VM preemptions helps ensure workload reliability.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Logging and monitoring&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Integrated with Google Cloud's operations suite for observability and reporting.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Enterprise support&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The connectors are supported as a first-party solution by Google Cloud, with an established escalation path to our development partner, Accenture.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
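&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For context on the custom compute class integration mentioned above, here is a minimal sketch of a compute class and a pod that targets it. This is illustrative only: the resource name, machine family, and field names reflect our reading of the GKE custom compute class API and should be verified against the current documentation.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: symphony-cost-optimized        # illustrative name
spec:
  # Prefer Spot capacity, then fall back to on-demand nodes of the same
  # machine family (field names assumed; verify in the GKE docs).
  priorities:
  - machineFamily: c2d
    spot: true
  - machineFamily: c2d
    spot: false
---
apiVersion: v1
kind: Pod
metadata:
  name: symphony-compute-pod           # illustrative name
spec:
  nodeSelector:
    cloud.google.com/compute-class: symphony-cost-optimized
  containers:
  - name: worker
    image: busybox                     # placeholder for a Symphony compute image
    command: ["sleep", "3600"]&lt;/pre&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In practice, the connector’s operator manages Symphony compute pods for you; the sketch only illustrates how a pod selects a compute class.&lt;/span&gt;&lt;/p&gt;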
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Getting started&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can begin using the IBM Spectrum Symphony connectors for Google Cloud today.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Find the connectors&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in the &lt;/span&gt;&lt;a href="https://github.com/google/symphony-gcp/tree/main" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud repository&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Explore the &lt;/strong&gt;&lt;a href="https://cloud.google.com/cluster-toolkit/docs/ibm-symphony/ibm-symphony"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;technical documentation&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, including the reference architecture, to get started.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/contact"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Contact Google Cloud&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; or your Google Cloud account team&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to learn more about how to migrate your HPC workloads.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To help ensure your success, we will continue to invest in the solutions you need to accelerate your research and business goals. We look forward to seeing what you can achieve with the scale and power of Google Cloud.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 14 Oct 2025 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/hpc/announcing-new-ibm-spectrum-symphony-hostfactory-connectors/</guid><category>Compute</category><category>Containers &amp; Kubernetes</category><category>GKE</category><category>HPC</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Open-source and enterprise-ready: IBM Spectrum Symphony connectors for Google Cloud</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/hpc/announcing-new-ibm-spectrum-symphony-hostfactory-connectors/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Annie Ma-Weaver</name><title>Group Product Manager, Google Cloud HPC</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Anthony Frery</name><title>Customer Engineer, Google Cloud HPC</title><department></department><company></company></author></item><item><title>5 best practices for Managed Lustre on Google Kubernetes Engine</title><link>https://cloud.google.com/blog/products/containers-kubernetes/gke-managed-lustre-csi-driver-for-aiml-and-hpc-workloads/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google Kubernetes Engine (GKE) is a powerful platform for orchestrating scalable AI and high-performance computing (HPC) workloads. But as clusters grow and jobs become more data-intensive, storage I/O can become a bottleneck. Your powerful GPUs and TPUs can end up idle, while waiting for data, driving up costs and slowing down innovation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://goo.gle/managed-lustre-overview" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Managed Lustre&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is designed to solve this problem. Many on-premises HPC environments already use parallel file systems, and Managed Lustre makes it easier to bring those workloads to the cloud. With its managed Container Storage Interface (CSI) driver, Managed Lustre and GKE operations are fully integrated.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Optimizing your move to a high-performance parallel file system can help you get the most out of your investment from day one. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before deploying, it's helpful to know when to use Managed Lustre versus other options like Google Cloud Storage. For most AI and ML workloads, Managed Lustre is the recommended solution. It excels in training and checkpointing scenarios that require very low latency (less than a millisecond) and high throughput for small files, which keeps your expensive accelerators fully utilized. For data archiving or workloads with large files (over 50 MB) that can tolerate higher latency, Cloud Storage FUSE with Anywhere Cache can be another choice.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Based on our work with early customers and the learnings from our teams, here are five best practices to ensure you get the most out of Managed Lustre on GKE.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;$300 in free credit to try Google Cloud containers and Kubernetes&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1744e9fa30&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Design for data locality &lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For performance-sensitive applications, you want your compute resources and storage to be as close as possible, ideally within the same zone in a given region. When provisioning volumes dynamically, the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;volumeBindingMode&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; parameter in your &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;StorageClass&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; is your most important tool. We strongly recommend setting it to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;WaitForFirstConsumer&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. GKE provides a built-in StorageClass for Managed Lustre that uses &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;WaitForFirstConsumer&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; binding mode by default.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Example StorageClass YAML:&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;apiVersion: storage.k8s.io/v1\r\nkind: StorageClass\r\nmetadata:\r\n  name: lustre-regional-wait\r\nprovisioner: lustre.csi.storage.gke.io\r\nvolumeBindingMode: WaitForFirstConsumer\r\n...&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1744e9ffd0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Why it’s a best practice:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;WaitForFirstConsumer&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; instructs GKE to delay &lt;/span&gt;&lt;span style="text-decoration: line-through; vertical-align: baseline;"&gt;the&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; provisioning &lt;/span&gt;&lt;span style="text-decoration: line-through; vertical-align: baseline;"&gt;of&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; the Lustre instance until a pod that needs it is scheduled. The scheduler then uses the pod's topology constraints (i.e., the zone it's scheduled in) to create the Lustre instance in that exact same zone. This guarantees co-location of your storage and compute, minimizing network latency.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Right-size your performance with tiers&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Not all high-performance workloads are the same. Managed Lustre offers multiple &lt;/span&gt;&lt;a href="https://cloud.google.com/managed-lustre/docs/performance"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;performance tiers&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (read and write throughput in MB/s per TiB of storage) so you can align cost directly with your performance requirements.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;1000 &amp;amp; 500 MB/s/TiB:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Ideal for throughput-critical workloads like foundation model training or large-scale physics simulations where I/O bandwidth is the primary bottleneck.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;250 MB/s/TiB:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A balanced, cost-effective tier great for many general HPC workloads and AI inference serving, and data-heavy analytics pipelines.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;125 MB/s/TiB:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Best for large-capacity use cases where having a massive, POSIX-compliant file system is more important than achieving peak throughput. This is also useful for migrating on-premises containerized applications without modification,&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;making it easier to migrate on-premises workloads to the cloud storage.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_JuBQFJn.max-1000x1000.png"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Why it’s a best practice: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Defaulting to the highest tier isn't always the most cost-effective strategy. By analyzing your workload’s I/O profile, you can significantly optimize your total cost of ownership. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Master your networking foundation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A parallel file system is a network-attached resource. Getting the networking right up front will save you days of troubleshooting. Before provisioning, ensure your VPC is correctly configured by following the setup steps in our &lt;/span&gt;&lt;a href="https://cloud.google.com/managed-lustre/docs/vpc#create_and_configure_the_vpc"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;This involves three key steps detailed in our documentation:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Enable Service Networking.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Create an IP range&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for VPC peering.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Create a firewall rule&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to allow traffic from that range on the Lustre network port (TCP 988 or 6988).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Why it’s a best practice:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This is a one-time setup per VPC that establishes the secure peering connection that allows your GKE nodes to communicate with the Managed Lustre service. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;4. Use dynamic provisioning for simplicity, static for long-lived shared data&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Managed Lustre CSI driver supports &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/lustre-csi-driver-new-volume"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;two modes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for connecting storage to your GKE workloads.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Dynamic provisioning:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Use when your storage is tightly coupled to the lifecycle of a specific workload or application. By defining a StorageClass and PersistentVolumeClaim (PVC), GKE will automatically manage the Lustre instance lifecycle for you. This is the simplest, most automated approach.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Static provisioning:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Use when you have a long-lived Lustre instance that needs to be shared across multiple GKE clusters and jobs. You create the Lustre instance once, then create a PersistentVolume (PV) and PVC in your cluster to mount it. This decouples the storage lifecycle from any single workload.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Why it’s a best practice:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Thinking about your data’s lifecycle helps you choose the right pattern. Use dynamic provisioning as your default because of simplicity, and opt for static provisioning when you need to treat your file system as a persistent, shared resource across your organization.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;5. &lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;Architecting for parallelism with Kubernetes Jobs&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Many AI and HPC tasks, like data preprocessing or batch inference, are suited for parallel execution. Instead of running a single, large pod, use the Kubernetes Job resource to divide the work across many smaller pods.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Consider this pattern:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Create a single PersistentVolumeClaim for your Managed Lustre instance, making it available to your cluster.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Define a Kubernetes job with parallelism set to a high number (e.g., 100).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Each pod created by the Job mounts the same Lustre PVC.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Design your application so that each pod works on a different subset of the data (e.g., processing a different range of files or data chunks).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Why it’s a best practice: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;In this pattern, you create a single PVC for your Lustre instance and have each pod created by the Job mount that same PVC. By designing your application so that each pod works on a different subset of the data, you turn your GKE cluster into a powerful, distributed data processing engine. The GKE Job controller acts as the parallel task orchestrator, while Managed Lustre serves as the high-speed data backbone, allowing you to achieve massive aggregate throughput.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started today&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By combining the orchestration power of GKE with the performance of Managed Lustre, you can build a truly scalable and efficient platform for AI and HPC. Following these best practices will help you create a solution that is not only powerful, but also efficient, cost-effective, and easy to manage.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to get started? Explore the &lt;/span&gt;&lt;a href="https://cloud.google.com/managed-lustre/docs/overview"&gt;&lt;span style="vertical-align: baseline;"&gt;Managed Lustre documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and provision your first instance today.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 19 Sep 2025 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/containers-kubernetes/gke-managed-lustre-csi-driver-for-aiml-and-hpc-workloads/</guid><category>Storage &amp; Data Transfer</category><category>HPC</category><category>Containers &amp; Kubernetes</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>5 best practices for Managed Lustre on Google Kubernetes Engine</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/containers-kubernetes/gke-managed-lustre-csi-driver-for-aiml-and-hpc-workloads/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nishtha Jain</name><title>Engineering Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Dan Eawaz</name><title>Senior Product Manager</title><department></department><company></company></author></item><item><title>Accelerate your AI workloads with the Google Cloud Managed Lustre</title><link>https://cloud.google.com/blog/products/storage-data-transfer/google-cloud-managed-lustre-for-ai-hpc/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we're making it even easier to achieve breakthrough performance for your AI/ML workloads: &lt;/span&gt;&lt;a href="https://cloud.google.com/products/managed-lustre"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Managed Lustre&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is now GA, and available in four distinct performance tiers that deliver throughput ranging from 125 MB/s, 250 MB/s, 500 MB/s, to 1000 MB/s per TiB of capacity — with the ability to scale up to 8 PB of storage capacity. The Managed Lustre solution is powered by DDN’s EXAScaler, combining DDN's decades of leadership in high-performance storage with Google Cloud's expertise in cloud infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Managed Lustre provides a POSIX-compliant, parallel file system that delivers consistently high throughput and low latency, essential for:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;High-throughput inference:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; For applications that require near-real-time inference on large datasets, Lustre provides high parallel throughput and sub-millisecond read latency.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Large-scale model training:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Accelerate the training cycles of deep learning models by providing rapid access to petabytes-sized datasets. Lustre's parallel architecture ensures GPUs and TPUs are fed with data, minimizing idle time.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Checkpointing and restarting large models:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Save and restore the state of large models during training faster, improving goodput and allowing for more efficient experimentation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Data preprocessing and feature engineering:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Process raw data, extract features, and prepare datasets for training, reducing the time spent on data pipelines.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Scientific simulations and research:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Beyond AI/ML, Lustre excels in traditional HPC scenarios like computational fluid dynamics, genomic sequencing, and climate modeling, where massive datasets and high-concurrency access are critical.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Lustre is designed for the highly parallel and random I/O that characterizes many AI/ML training and inference tasks. This parallel processing capability across multiple clients ensures your compute resources are never starved for data.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Performance tiers and pricing&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Managed Lustre offers flexible pricing and performance tiers designed to meet the diverse needs of your workloads, whether you're focused on capacity or highest throughput density. &lt;/span&gt;&lt;/p&gt;
&lt;div align="left"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table style="width: 98.4334%;"&gt;&lt;colgroup&gt;&lt;col style="width: 56.3665%;"/&gt;&lt;col style="width: 43.6335%;"/&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Throughput &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;MB/s&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;per TiB &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;of storage capacity&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Storage pricing per GiB per month&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;125&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;$0.145&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;250&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;$0.21&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;500&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;$0.34&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;1000&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;$0.60&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Please see more details at the &lt;/span&gt;&lt;a href="https://cloud.google.com/products/managed-lustre/pricing"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Managed Lustre pricing page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Irrespective of the aggregate throughput, all tiers come with sub-millisecond read latency, high single-stream throughput, and are perfect for parallel access to many small files.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Driving innovation together: partnering with DDN&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud’s Managed Lustre is powered by &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;DDN’s EXAScaler&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, bringing together two industry leaders in high-performance computing and elastic cloud infrastructure. This partnership represents a joint commitment to simplifying the deployment and management of large-scale AI and HPC workloads in the cloud, thanks to:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Trusted leaders:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By combining DDN's decades of expertise in high-performance Lustre with Google Cloud's global infrastructure and AI ecosystem, we are delivering a foundational capability that removes storage bottlenecks and helps our customers solve their most complex challenges in AI and HPC.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Fully managed and supported solution:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Enjoy the benefits of a fully managed service from Google, with comprehensive support from both Google and DDN, for seamless operations and peace of mind.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Global availability and ecosystem integration:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Managed Lustre is now globally accessible in &lt;/span&gt;&lt;a href="https://cloud.google.com/managed-lustre/docs/locations"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;multiple Google Cloud regions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and integrates with the broader Google Cloud ecosystem, including Google Kubernetes Engine (GKE) and TPUs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These benefits caught the attention of one of our largest partners, NVIDIA, who is looking forward to having it as part of its NVIDIA AI platform. &lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="vertical-align: baseline;"&gt;"&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Enterprises today demand AI infrastructure that combines accelerated computing with high-performance storage solutions to deliver uncompromising speed, seamless scalability and cost efficiency at scale. Google and DDN’s collaboration on Google Cloud Managed Lustre creates a better-together solution uniquely suited to meet these needs. By integrating DDN’s enterprise-grade data platforms and Google’s global cloud capabilities, organizations can readily access vast amounts of data and unlock the full potential of AI with the NVIDIA AI platform (or NVIDIA accelerated computing platform) on Google Cloud — reducing time-to-insight, maximizing GPU utilization, and lowering total cost of ownership.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;” - Dave Salvator, Director of Accelerated Computing Products, NVIDIA&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started today!&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to supercharge your AI/ML and HPC workloads? Getting started with Managed Lustre is simple:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Navigate to &lt;/span&gt;&lt;a href="https://console.cloud.google.com/managed-lustre/"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Managed Lustre in the Google Cloud console&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Provision your Managed Lustre instance, choosing the performance tier and size that best fits your needs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Connect your compute instances, GKE clusters to your new high-performance file system.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For detailed instructions and documentation, please visit the Managed Lustre &lt;/span&gt;&lt;a href="https://cloud.google.com/managed-lustre/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. And if needed, &lt;/span&gt;&lt;a href="https://cloud.google.com/contact"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;reach out to Google Cloud sales specialists&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Watch the Fireside Chat&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Don't miss the opportunity to learn more about the strategic partnership between Google Cloud and DDN, and the unique capabilities of Managed Lustre. Read the official DDN press release &lt;/span&gt;&lt;a href="https://www.ddn.com/press-releases/google-cloud-launches-general-availability-of-managed-lustre-powered-by-ddns-exascaler-technology/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Watch the fireside chat with Sameet Agarwal, VP/GM Storage and Sven Oehme, CTO of DDN, &lt;/strong&gt;&lt;a href="https://www.youtube.com/watch?v=i6gEHUzIo1w" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 08 Jul 2025 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/storage-data-transfer/google-cloud-managed-lustre-for-ai-hpc/</guid><category>AI &amp; Machine Learning</category><category>HPC</category><category>Storage &amp; Data Transfer</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Accelerate your AI workloads with the Google Cloud Managed Lustre</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/storage-data-transfer/google-cloud-managed-lustre-for-ai-hpc/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Asad Khan</name><title>Sr. Director of Product Management, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Kirill Tropin</name><title>Group Product Manager</title><department></department><company></company></author></item><item><title>SandboxAQ: Accelerating drug discovery through cloud integration</title><link>https://cloud.google.com/blog/products/infrastructure-modernization/sandboxaq-speeds-up-drug-discovery-with-the-cloud/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The traditional drug discovery process involves massive capital investments, prolonged timelines, and is plagued with daunting failure rates. From initial research to obtaining regulatory approval, bringing a new drug to market can take decades. During this time, many drug candidates that had seemed very promising fail to deliver, either due to inefficacy or safety concerns. Only a small fraction of candidates successfully make it through clinical trials and regulatory hurdles. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Enter &lt;/span&gt;&lt;a href="https://www.sandboxaq.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;SandboxAQ&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which is helping researchers explore vast chemical spaces, gain deep insights into molecular interactions, and predict biological outcomes with precision. It does so with cutting-edge computational approaches such as active learning, &lt;/span&gt;&lt;a href="https://pubs.acs.org/doi/10.1021/acs.jctc.4c00399" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;absolute free energy perturbation solution (AQFEP)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://arxiv.org/abs/2405.11785" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;generative AI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, structural analysis, and predictive data analytics, ultimately reducing drug discovery and development timelines. And it does all this on a cloud-native foundation. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Drug design involves an iterative cycle of designing, synthesizing, and testing molecules referred to as the Design-Make-Test cycle. Many customers approach SandboxAQ during the design phase, often when their computational methods are falling short. By improving and accelerating this part of the cycle, SandboxAQ helps medicinal chemists bring innovative and effective molecules to market. For example, in a project related to neurodegenerative disease, SandboxAQ’s approach expanded chemical space from 250,000 to 5.6 million molecules, achieving a 30-fold increase in hit rate and dramatically accelerating the discovery of candidate molecules. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_OZJ38Qu.max-1000x1000.png"
        
          alt="1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud-native development for scientific insight&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;SandboxAQ’s software relies on large-scale computation and to maximize flexibility and scale, they use a cloud strategy,  which includes Google Cloud infrastructure and tools. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The technologies in large-scale virtual screening campaigns need to be agile and scale cost-effectively. Specifically, SandboxAQ engineers need to be able to quickly iterate on scientific code, immediately run that code at scale cost-effectively, and store and organize all of the data it produces. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;SandboxAQ achieved a significant boost in efficiency and scalability with Google Cloud infrastructure. They scaled their computational throughput by 100X to leverage tens of thousands of virtual machines (VMs) in parallel. They also improved utilization by reducing idle time by 90%. By consolidating development and deployment on Google Cloud, SandboxAQ streamlined its workflows, from code development and testing to large-scale batch processing and machine-learning model training. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Try Google Cloud for free&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1758720d60&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;Get started for free&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;https://console.cloud.google.com/freetrial?redirectPath=/welcome&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;All of SandboxAQ’s development and deployment takes place in the cloud. Code and data live in cloud-based services, and development is done on a cloud-based platform that provides scientists and engineers with self-service VMs with standardized and centrally maintained environments and tools. This is important, because scientific code often requires heavy-duty computing hardware. Scientists have access to hefty 96-core machines, or instances with large GPUs. They can also create new machines with alternate configurations or CPU types as depicted below, enabling low-friction testing and development processes across heterogeneous resources.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_mgPMly4.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;SandboxAQ scientists and developers manage and access their Bench machines (see above) using the company’s `bench` client. They can connect to machines via SSH or use any number of managed tools, for example a browser-based VNC service for instant remote desktop, or JupyterLab for a familiar notebook development flow.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As code is ready to be run at a larger scale, researchers can dispatch SandboxAQ parameterized sets of computations as jobs on an internal tool powered by &lt;/span&gt;&lt;a href="https://cloud.google.com/batch?e=48754805&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Batch&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a fully managed service to schedule, queue, and execute batch jobs on Google infrastructure. With development and batch runtime environments closely synced, changes can be quickly run at scale. Code developed on bench machines is pushed to GitHub and immediately available for batch execution. Then, as tools are reviewed and merged into `main` of the company’s monorepo, the new tools become automatically available on SandboxAQ scientists’ bench machines, who can launch parallel jobs processing millions of molecules on any kind of Google Cloud VM resource in any global zone, utilizing either on-demand or &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/instances/spot"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Spot VMs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;SandboxAQ's implementation of a globally resolved transitive dependency tree, enables simple package and dependency management. With this practice, Google Batch can seamlessly integrate with individual tools developed by engineers to train many instances of a model in parallel.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Machine learning is a core component of SandoxAQ’s strategy, making easy data access especially important. At the same time, SandboxAQ’s Drug Discovery team also works with clients who have sensitive data. To secure customers’ data, bench and batch workloads read and write data from a unified interface that’s managed via IAM, allowing granular control of different data sources within the organization.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Meanwhile, Google Cloud services like Cloud Logging, Cloud Monitoring, Compute Engine and Cloud Run make it simple to develop tools to monitor these workloads, easily surface logs to SandboxAQ scientists, and comb through huge amounts of output data. As new features are tested or bugs show up, changes are made immediately available to the scientific team, without having to wrangle infrastructure. Then, as code becomes stable, they can incorporate it into downstream production applications, all in a centrally secured, unified way on Google Cloud.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In short, having a unified development, batch compute, and production environment on Google Cloud reduces the friction SandboxAQ faces to develop new workloads and run them at scale. With shared environments for scientific workload development and engineering, SandboxAQ makes it quick and easy for customers to move from experimentation to production, delivering the results customers want, fast.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;SandboxAQ solution in the real world&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;SandboxAQ is already having a profound impact on drug discovery programs targeting a range of hard-to-treat diseases. For example, there are advanced collaborations with Professor Stanley Pruisner's lab at University of California San Francisco (&lt;/span&gt;&lt;a href="https://www.sandboxaq.com/press/sandboxaq-announces-bio-pharma-molecular-simulation-division-to-speed-life-saving-drugs-to-patients-through-ai-and-quantum-solutions" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;UCSF&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;), &lt;/span&gt;&lt;a href="https://www.sandboxaq.com/press/sandboxaq-announces-bio-pharma-molecular-simulation-division-to-speed-life-saving-drugs-to-patients-through-ai-and-quantum-solutions" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Riboscience&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://www.sandboxaq.com/press/sandboxaq-selected-by-sanofi-for-quantitative-ai-driven-biomarker-identification" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Sanofi&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and with the &lt;/span&gt;&lt;a href="https://www.sandboxaq.com/press/the-michael-j-fox-foundation-selects-sandboxaq-partner-for-25-million-initiative-to-develop-novel-parkinsons-disease-treatments" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Michael J Fox Foundation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, to name a few. With this approach built on Google CloudSandboxAQ has achieved &lt;/span&gt;&lt;a href="https://www.sandboxaq.com/post/biopharmas-quantum-leap-2" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;a superior hit rate compared to other methods like high throughput screening&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, demonstrating the transformative potential of SandboxAQ on drug discovery and bringing cures to patients faster. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Visit the &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/ai-hypercomputer?hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud AI Hypercomputer web page&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to learn about Google Cloud AI infrastructure.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 29 Apr 2025 15:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/infrastructure-modernization/sandboxaq-speeds-up-drug-discovery-with-the-cloud/</guid><category>AI &amp; Machine Learning</category><category>HPC</category><category>Infrastructure Modernization</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>SandboxAQ: Accelerating drug discovery through cloud integration</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/infrastructure-modernization/sandboxaq-speeds-up-drug-discovery-with-the-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ruslan Mursalzade</name><title>Product Marketing Lead, Google Cloud AI Infrastructure</title><department></department><company></company></author></item><item><title>H4D VMs: Next-generation HPC-optimized VMs</title><link>https://cloud.google.com/blog/products/compute/new-h4d-vms-optimized-for-hpc/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud Next, we introduced H4D VMs, our latest machine type for high performance computing (HPC). Building upon existing &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/compute-optimized-machines"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;HPC offerings&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, H4D VMs are designed to address the evolving needs of demanding workloads in industries such as manufacturing, weather forecasting, EDA, and healthcare and life sciences.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;H4D VMs are powered by the &lt;/span&gt;&lt;a href="https://www.amd.com/en/products/processors/server/epyc/9005-series.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;5th Generation AMD EPYC&lt;/span&gt;&lt;/a&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;TM&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt; Processors, offering improved &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;whole-node VM performance &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;of more than 12,000 &lt;span style="vertical-align: baseline;"&gt;gflops&lt;/span&gt;&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;and&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; improved memory bandwidth &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;of more than 950 GB/s&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; H4D provides &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;low-latency and 200 Gbps network bandwidth using Cloud Remote Direct Memory Access (RDMA) on Titanium, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;the first of our CPU-based VMs to do so.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;This powerful combination enables you to efficiently scale your HPC workloads and achieve insights faster. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/1_H4D_Performance_Overview_70YwFM8.max-2800x2800.jpg"
        
          alt="1 H4D Performance Overview"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="fvc15"&gt;VM and core performance, as well as memory bandwidth for H4D vs. C2D and C3D, showing generational improvement&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For open-source High-Performance Linpack (OSS-HPL), a widely-used benchmark for measuring the floating-point computing power of supercomputers, H4D offers 1.8x higher performance per VM and 1.6x higher performance per core compared to C3D. Additionally, H4D offers 5.8x higher performance per VM and 1.7x higher performance per core compared to C2D.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For STREAM Triad, a benchmark to measure memory bandwidth, H4D offers 1.3x higher performance per VM and 1.4x higher performance per core compared to C3D. Additionally, H4D offers 3x higher performance per VM and 1.4x higher performance per core compared to C2D.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;$300 in free credit to try Google Cloud infrastructure&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1744736a00&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;Start building for free&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;http://console.cloud.google.com/freetrial?redirectPath=/compute&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Improved HPC application performance&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;H4D VMs deliver strong compute performance and memory bandwidth, significantly outperforming previous generations of AMD-based VMs like C2D and C3D, allowing for faster simulations and analysis, and delivering significant performance gains &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;(relative to a prior generation AMD-based HPC VM, C2D)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; across various HPC applications and benchmarks, as illustrated below:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Manufacturing&lt;/span&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span style="vertical-align: baseline;"&gt;CFD apps like Siemens&lt;/span&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;TM&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt; Simcenter STAR-CCM+&lt;/span&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;TM&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt;/HIMach show up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;3.6x&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; improvement.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style="vertical-align: baseline;"&gt;CFD apps like Ansys Fluent/f1_racecar_140 show up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;3.6x&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; improvement.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style="vertical-align: baseline;"&gt;FEA Explicit apps like Altair Radioss/T10m show up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;3.6x&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; improvement.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style="vertical-align: baseline;"&gt;CFD apps like OpenFoam/Motorbike_20m show up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;2.9x&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; improvement. &lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style="vertical-align: baseline;"&gt;FEA Implicit apps like Ansys Mechanical/gearbox shows up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;2.7x&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; improvement.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Healthcare and life sciences:&lt;/span&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Molecular Dynamics (GROMACS) shows up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;5x&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; improvement.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Weather forecasting&lt;/span&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Industry standard benchmark WRFv4 shows up to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;3.6x&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; improvement.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_H4D_Performance_Overview__0J9kfD9.jpg"
        
          alt="2 H4D Performance Overview"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="t80p8"&gt;Figure 2: Single VM HPC Application performance (speed-up) of H4D, C3D and C2D relative to C2D. Applications ran on single VMs using all cores.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Our deep collaboration with Google Cloud powers the next generation of cloud-based HPC with the announcement of the new H4D VMs. Google Cloud has leveraged the architectural advances of our 5&lt;/span&gt;&lt;sup&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;th&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; Gen AMD EPYC CPUs to create an offering that delivers impressive performance uplift compared to previous generations across a variety of HPC benchmarks. This will empower customers to achieve fast insights and accelerate their most demanding HPC workloads.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- Ram Peddibhotla, corporate vice president, Cloud Business, AMD&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Faster HPC with Cloud RDMA on Titanium&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;H4D’s performance is made possible with Cloud RDMA, a new Titanium offload that’s available for the first time on these VMs. Cloud RDMA is specifically engineered to support HPC workloads that rely heavily on inter-node communication, such as computational fluid dynamics, weather modeling, molecular dynamics, and more. By offloading network processing, Cloud RDMA provides predictable, low-latency, high-bandwidth communication between compute nodes, thus minimizing host CPU bottlenecks. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Under the hood, Cloud RDMA uses Google’s innovative &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/introducing-falcon-a-reliable-low-latency-hardware-transport?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Falcon&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; hardware transport for reliable, low-latency communication over our Ethernet-based data center networks, effectively resolving the traditional challenges of RDMA over Ethernet while helping to ensure predictable, high performance at scale. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud RDMA over Falcon speeds up simulations by efficiently utilizing more computational resources. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;For example, for smaller CFD problems like OpenFoam/motorbike_20m and Simcenter Star-CCM+/HIMach10, which have limited inherent parallelism and are typically challenging to accelerate,&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; H4D results in 3.4x and 1.9x speedup, respectively, on four VMs compared to TCP.&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_H4D_Performance_Overview_ACW0JRf.max-280.max-1000x1000.jpg"
        
          alt="3 H4D Performance Overview"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="t80p8"&gt;Figure 3: Left: OpenFoam/Motorbike_20m offers a 3.4x improvement with H4D Cloud RDMA over TCP at four VMs. Right: Simcenter STAR-CCM+/HIMach10 offers a 1.9x improvement with H4D Cloud RDMA over TCP at four VMs.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For larger models, Falcon also helps maintain strong scaling. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Using 32 VMs, Falcon achieved a 2.8x speedup over TCP for GROMACS/Lignocellulose and a 1.3x speedup for WRFv4/Conus 2.5km.&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_H4D_Performance_Overview_xD2Vaok.max-280.max-1000x1000.jpg"
        
          alt="4 H4D Performance Overview"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="t80p8"&gt;Figure 4: Left: GROMACS/Lignocellulose offers a 2.8x improvement with H4D Cloud RDMA over TCP at 32 VMs. Right: WRFv4/Conus 2.5km offers a 1.3x improvement with H4D Cloud RDMA over TCP at 32 VMs.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Cluster management and scheduling capabilities&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;H4D VMs will support both &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/introducing-dynamic-workload-scheduler"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Workload Scheduler&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (DWS) and Cluster Director (formerly known as Hypercompute Cluster).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;DWS helps schedule HPC workloads for optimal performance and cost-effectiveness, providing resource availability for time-sensitive simulations and flexible HPC jobs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cluster Director, which lets you deploy and scale a large, physically-colocated accelerator cluster as a single unit, is now extending its capabilities to HPC environments. Cluster Director simplifies deploying and managing complex HPC clusters on H4D VMs by allowing researchers to easily set up and run large-scale simulations.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;VM sizes and regional availability&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We offer H4D VMs in both standard and high-memory configurations to cater to diverse workload requirements. We also provide options with local SSD for workloads that demand high-speed storage, such as CPU-based seismic processing and structural mechanics applications (e.g., Abaqus, NASTRAN, Altair OptiStruct and Ansys Mechanical).&lt;br/&gt;&lt;br/&gt;&lt;/span&gt;&lt;/p&gt;
&lt;div align="left"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table style="width: 98.1723%;"&gt;&lt;colgroup&gt;&lt;col style="width: 41.8699%;"/&gt;&lt;col style="width: 16.0569%;"/&gt;&lt;col style="width: 19.7154%;"/&gt;&lt;col style="width: 22.3577%;"/&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;VM&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Cores&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Memory&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Local SSD&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;h4d-highmem-192-lssd&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span data-rich-links='{"dde_di":"kix.v6nt3yh66eo1","dde-fdv":"192","dde-sii":"dropdownItem.qbe6z9jllztp","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.qbe6z9jllztp","di-v":"192","di-dv":"192","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Cores","type":"dropdown"}' style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span data-rich-links='{"dde_di":"kix.v6nt3yh66eo1","dde-fdv":"192","dde-sii":"dropdownItem.qbe6z9jllztp","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.qbe6z9jllztp","di-v":"192","di-dv":"192","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Cores","type":"dropdown"}' style="vertical-align: baseline;"&gt;192&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span data-rich-links='{"dde_di":"kix.jrgcwdxdw0th","dde-fdv":"1488","dde-sii":"dropdownItem.uf9c917igrjp","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.uf9c917igrjp","di-v":"1488","di-dv":"1488","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":"#e8eaed"},"ts_fgc2":{"clr_type":0,"hclr_color":"#000000"},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}},{"di-id":"dropdownItem.ctlrlddwu0uc","di-v":"720","di-dv":"720","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":"#e8eaed"},"ts_fgc2":{"clr_type":0,"hclr_color":"#000000"},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Memory","type":"dropdown"}' style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span data-rich-links='{"dde_di":"kix.jrgcwdxdw0th","dde-fdv":"1488","dde-sii":"dropdownItem.uf9c917igrjp","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.uf9c917igrjp","di-v":"1488","di-dv":"1488","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":"#e8eaed"},"ts_fgc2":{"clr_type":0,"hclr_color":"#000000"},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}},{"di-id":"dropdownItem.ctlrlddwu0uc","di-v":"720","di-dv":"720","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":"#e8eaed"},"ts_fgc2":{"clr_type":0,"hclr_color":"#000000"},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Memory","type":"dropdown"}' style="vertical-align: baseline;"&gt;1488&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span data-rich-links='{"dde_di":"kix.avrzv42pizzx","dde-fdv":"3.75TB","dde-sii":"dropdownItem.y7zm1kyovrnt","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.y7zm1kyovrnt","di-v":"3.75TB","di-dv":"3.75TB","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}},{"di-id":"dropdownItem.im98eal11wpt","di-v":"N/A","di-dv":"N/A","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Local SSD","type":"dropdown"}' style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span data-rich-links='{"dde_di":"kix.avrzv42pizzx","dde-fdv":"3.75TB","dde-sii":"dropdownItem.y7zm1kyovrnt","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.y7zm1kyovrnt","di-v":"3.75TB","di-dv":"3.75TB","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}},{"di-id":"dropdownItem.im98eal11wpt","di-v":"N/A","di-dv":"N/A","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Local SSD","type":"dropdown"}' style="vertical-align: baseline;"&gt;3.75TB&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;h4d-standard-192&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span data-rich-links='{"dde_di":"kix.v6nt3yh66eo1","dde-fdv":"192","dde-sii":"dropdownItem.qbe6z9jllztp","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.qbe6z9jllztp","di-v":"192","di-dv":"192","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Cores","type":"dropdown"}' style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span data-rich-links='{"dde_di":"kix.v6nt3yh66eo1","dde-fdv":"192","dde-sii":"dropdownItem.qbe6z9jllztp","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.qbe6z9jllztp","di-v":"192","di-dv":"192","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Cores","type":"dropdown"}' style="vertical-align: baseline;"&gt;192&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span data-rich-links='{"dde_di":"kix.jrgcwdxdw0th","dde-fdv":"720","dde-sii":"dropdownItem.ctlrlddwu0uc","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.uf9c917igrjp","di-v":"1488","di-dv":"1488","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":"#e8eaed"},"ts_fgc2":{"clr_type":0,"hclr_color":"#000000"},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}},{"di-id":"dropdownItem.ctlrlddwu0uc","di-v":"720","di-dv":"720","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":"#e8eaed"},"ts_fgc2":{"clr_type":0,"hclr_color":"#000000"},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Memory","type":"dropdown"}' style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span data-rich-links='{"dde_di":"kix.jrgcwdxdw0th","dde-fdv":"720","dde-sii":"dropdownItem.ctlrlddwu0uc","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.uf9c917igrjp","di-v":"1488","di-dv":"1488","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":"#e8eaed"},"ts_fgc2":{"clr_type":0,"hclr_color":"#000000"},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}},{"di-id":"dropdownItem.ctlrlddwu0uc","di-v":"720","di-dv":"720","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":"#e8eaed"},"ts_fgc2":{"clr_type":0,"hclr_color":"#000000"},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Memory","type":"dropdown"}' style="vertical-align: baseline;"&gt;720&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span data-rich-links='{"dde_di":"kix.avrzv42pizzx","dde-fdv":"N/A","dde-sii":"dropdownItem.im98eal11wpt","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.y7zm1kyovrnt","di-v":"3.75TB","di-dv":"3.75TB","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}},{"di-id":"dropdownItem.im98eal11wpt","di-v":"N/A","di-dv":"N/A","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Local SSD","type":"dropdown"}' style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span data-rich-links='{"dde_di":"kix.avrzv42pizzx","dde-fdv":"N/A","dde-sii":"dropdownItem.im98eal11wpt","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.y7zm1kyovrnt","di-v":"3.75TB","di-dv":"3.75TB","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}},{"di-id":"dropdownItem.im98eal11wpt","di-v":"N/A","di-dv":"N/A","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Local SSD","type":"dropdown"}' style="vertical-align: baseline;"&gt;N/A&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;h4d-highmem-192&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span data-rich-links='{"dde_di":"kix.v6nt3yh66eo1","dde-fdv":"192","dde-sii":"dropdownItem.qbe6z9jllztp","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.qbe6z9jllztp","di-v":"192","di-dv":"192","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Cores","type":"dropdown"}' style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span data-rich-links='{"dde_di":"kix.v6nt3yh66eo1","dde-fdv":"192","dde-sii":"dropdownItem.qbe6z9jllztp","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.qbe6z9jllztp","di-v":"192","di-dv":"192","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Cores","type":"dropdown"}' style="vertical-align: baseline;"&gt;192&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span data-rich-links='{"dde_di":"kix.jrgcwdxdw0th","dde-fdv":"1488","dde-sii":"dropdownItem.uf9c917igrjp","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.uf9c917igrjp","di-v":"1488","di-dv":"1488","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":"#e8eaed"},"ts_fgc2":{"clr_type":0,"hclr_color":"#000000"},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}},{"di-id":"dropdownItem.ctlrlddwu0uc","di-v":"720","di-dv":"720","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":"#e8eaed"},"ts_fgc2":{"clr_type":0,"hclr_color":"#000000"},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Memory","type":"dropdown"}' style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span data-rich-links='{"dde_di":"kix.jrgcwdxdw0th","dde-fdv":"1488","dde-sii":"dropdownItem.uf9c917igrjp","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.uf9c917igrjp","di-v":"1488","di-dv":"1488","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":"#e8eaed"},"ts_fgc2":{"clr_type":0,"hclr_color":"#000000"},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}},{"di-id":"dropdownItem.ctlrlddwu0uc","di-v":"720","di-dv":"720","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":"#e8eaed"},"ts_fgc2":{"clr_type":0,"hclr_color":"#000000"},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Memory","type":"dropdown"}' style="vertical-align: baseline;"&gt;1488&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p style="text-align: center;"&gt;&lt;span data-rich-links='{"dde_di":"kix.avrzv42pizzx","dde-fdv":"N/A","dde-sii":"dropdownItem.im98eal11wpt","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.y7zm1kyovrnt","di-v":"3.75TB","di-dv":"3.75TB","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}},{"di-id":"dropdownItem.im98eal11wpt","di-v":"N/A","di-dv":"N/A","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Local SSD","type":"dropdown"}' style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span data-rich-links='{"dde_di":"kix.avrzv42pizzx","dde-fdv":"N/A","dde-sii":"dropdownItem.im98eal11wpt","ddefe-ddi":{"cv":{"op":"set","opValue":[{"di-id":"dropdownItem.y7zm1kyovrnt","di-v":"3.75TB","di-dv":"3.75TB","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}},{"di-id":"dropdownItem.im98eal11wpt","di-v":"N/A","di-dv":"N/A","di-ts":{"ts_bd":false,"ts_fs":11,"ts_ff":"Arial","ts_it":false,"ts_sc":false,"ts_st":false,"ts_tw":400,"ts_un":false,"ts_va":"nor","ts_bgc2":{"clr_type":0,"hclr_color":null},"ts_fgc2":{"clr_type":0,"hclr_color":null},"ts_bd_i":false,"ts_fs_i":false,"ts_ff_i":false,"ts_it_i":false,"ts_sc_i":false,"ts_st_i":false,"ts_un_i":false,"ts_va_i":false,"ts_bgc2_i":false,"ts_fgc2_i":false},"di-cv":{"dicv_v":0,"dicv_ft":0}}]}},"ddefe-t":"Local SSD","type":"dropdown"}' style="vertical-align: baseline;"&gt;N/A&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;H4D VMs are currently available in us-central1-a (Iowa), and europe-west4-b (Netherlands), with additional regions in progress. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;What our customers and partners are saying&lt;/strong&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/harvard.max-1000x1000.jpg"
        
          alt="harvard"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="u212f"&gt;&lt;i&gt;"With the power of Google's new H4D-based clusters, we are poised to simulate systems approaching a trillion particles, unlocking unprecedented insights into circulatory functions and diseases. This leap in computational capability will dramatically accelerate our pursuit of breakthrough therapeutics, bringing us closer to effective precision therapies for blood vessel damage in heart disease."&lt;/i&gt; -&lt;b&gt; Petros Koumoutsakos, Jr. Professor of Computing in Science and Engineering, Harvard University&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/ansys_yPNqr91.max-1000x1000.jpg"
        
          alt="ansys"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="u67zu"&gt;&lt;i&gt;“The launch of Google Cloud's H4D platform marks a significant advancement in engineering simulation. As GCP’s first VM with RDMA over Ethernet, combined with higher memory bandwidth, generous L3 cache, and AVX-512 instruction support, H4D delivers up to 3.6x better performance for Ansys Fluent simulations compared to C2D VMs. This performance boost allows our customers to run simulations faster, explore a wider range of design options, and drive innovation with greater efficiency.”&lt;/i&gt; - &lt;b&gt;Wim Slagter, Senior Director of Partner Programs, Ansys&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Altair.max-1000x1000.jpg"
        
          alt="Altair.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="wk7cf"&gt;&lt;i&gt;"The generational performance leap achieved with Google H4D VMs, powered by the 5th Generation AMD EPYC™, is truly remarkable. For compute-intensive, highly non-linear simulations, such as car crash analysis, Altair® Radioss® delivers a stunning 3.6x speedup. This breakthrough paves the way for faster and more accurate simulations, which is crucial for our customers in the era of the digital thread!”&lt;/i&gt; –&lt;b&gt; Eric Lequiniou, SVP Radioss Development and Altair Solvers HPC&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/siemens.max-1000x1000.jpg"
        
          alt="siemens"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  





      &lt;p data-block-key="wk7cf"&gt;&lt;i&gt;“The latest H4D VMs, powered by 5th Generation AMD EPYC Processors and Cloud RDMA, allow our customers to realize faster time-to-results for their Simcenter STAR-CCM+ simulations. For HIMach10, we’re seeing up to 3.6x performance gains compared to the C2D instance and 1.9x speedup on four H4D Cloud RDMA VMs compared to TCP. Our partnership with Google has been key to achieving these reduced simulation times.”&lt;/i&gt; &lt;b&gt;- Lisa Mesaros, Vice President, Simcenter Solution Domains Product Management, Siemens&lt;/b&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Want to try it out?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We're excited to see how H4D VMs will empower you to achieve faster results with your HPC workloads! Sign up for the preview by filling out this&lt;/span&gt;&lt;a href="https://forms.gle/ky1R1VVR5VRsJqsCA" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;form&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 10 Apr 2025 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/new-h4d-vms-optimized-for-hpc/</guid><category>Google Cloud Next</category><category>HPC</category><category>Compute</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/H4D_VMs_optimized_for_HPC.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>H4D VMs: Next-generation HPC-optimized VMs</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/H4D_VMs_optimized_for_HPC.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/new-h4d-vms-optimized-for-hpc/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Aysha Keen</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Felix Schürmann</name><title>Senior HPC Technologist</title><department></department><company></company></author></item><item><title>Colossus: the secret ingredient in Rapid Storage’s high performance</title><link>https://cloud.google.com/blog/products/storage-data-transfer/how-the-colossus-stateful-protocol-benefits-rapid-storage/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As an object storage service, Google &lt;/span&gt;&lt;a href="https://cloud.google.com/storage"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Storage&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is popular for its simplicity and scale, a big part of which is due to the stateless REST protocols that you can use to read and write data. But with the rise of AI and as more customers look to run data-intensive workloads, two major obstacles to using object storage are its higher latency and lack of file-oriented semantics. With the launch of &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/high-performance-storage-innovations-for-ai-hpc"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Rapid Storage&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; on Google Cloud, we’ve added a stateful gRPC-based streaming protocol that provides sub-millisecond read/write latency and the ability to easily append data to an object, while maintaining the high aggregate throughput and scale of object storage. In this post, we’ll share an architectural perspective into how and why we went with this approach, and the new types of workloads it unlocks.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It all comes back to &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Colossus&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, Google’s internal zonal cluster-level file system that underpins most (if not all) of our products. As we discussed in a recent blog post, Colossus supports our most demanding performance-focused products with &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/how-colossus-optimizes-data-placement-for-performance"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;sophisticated SSD placement techniques&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; that deliver low latency and massive scale. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Another key ingredient in Colossus’s performance is its stateful protocol — and with Rapid Storage, we’re bringing the power of the Colossus stateful protocol directly to Google Cloud customers. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Try Google Cloud for free&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f17587367c0&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;Get started for free&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;https://console.cloud.google.com/freetrial?redirectPath=/welcome&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When a Colossus client creates or reads a file, the client first opens the file and gets a &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;handle&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, a collection of state that includes all the information about how that file is stored, including which disks the file’s data is stored on. Clients can use this handle when reading or writing to talk directly to the disks via an optimized RDMA-like network protocol, as we previously outlined in our &lt;/span&gt;&lt;a href="https://research.google/pubs/snap-a-microkernel-approach-to-host-networking/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Snap networking system paper&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Handles can also be used to support ultra-low latency durable appends, which is extremely useful for demanding database and streaming analytics applications. For example, Spanner and Bigtable both write transactions to a log file that requires durable storage and that is on the critical path for database mutations. Similarly, BigQuery supports streaming to a table while massively parallel batch jobs perform computations over recently ingested data. These applications open Colossus files in append mode, and the Colossus client running in the application uses the handle to write their database mutations and table data directly to disks over the network. To ensure the data is stored durably, Colossus replicates its data across several disks, performing writes in parallel and using a quorum technique to avoid waiting on stragglers. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Colossus_Rapid_Storage_Blog.max-1000x1000.jpg"
        
          alt="1 Colossus Rapid Storage Blog"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="l906q"&gt;Figure 1: Steps involved in appending data to a file in Colossus.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The above image shows the steps that are taken to append data to a file.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The application opens the file in append mode. The Colossus Curator constructs a handle and sends it to the Colossus Client running in-process, which caches the handle.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The application issues a write call for an arbitrary-sized log entry to the Colossus Client.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The Colossus Client, using the disk addresses in the handle, writes the log entry in parallel to all the disks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
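&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The toy Python sketch below is not Colossus code, just an illustration of steps 1-3: it models a handle as the set of replica disks returned by the open call, and shows an append fanned out to all replicas in parallel that returns once a quorum has acknowledged rather than waiting on stragglers.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Illustrative model only: a "handle" holding replica locations, and an
# append that writes to every replica in parallel and waits for a quorum.
import concurrent.futures
from dataclasses import dataclass, field

@dataclass
class Replica:
    entries: list = field(default_factory=list)   # stands in for one disk

    def append(self, offset, data):
        self.entries.append((offset, data))
        return True                                # acknowledge the write

@dataclass
class Handle:
    replicas: list                                 # disk addresses from "open"
    offset: int = 0

def quorum_append(handle, data, quorum=2):
    """Write one log entry to every replica in parallel; return once a
    quorum of acknowledgements has arrived."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(r.append, handle.offset, data)
                   for r in handle.replicas]
        acks = 0
        for done in concurrent.futures.as_completed(futures):
            if done.result():
                acks += 1
            if acks == quorum:
                break                              # don't wait on stragglers
    handle.offset += len(data)
    return handle.offset

handle = Handle(replicas=[Replica(), Replica(), Replica()])
quorum_append(handle, b"log-entry-1")
quorum_append(handle, b"log-entry-2")
&lt;/pre&gt;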
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Rapid Storage builds on Colossus’s stateful protocol, leveraging gRPC-based streaming for the underlying transport. When performing low-latency reads and writes to Rapid Storage objects, the Cloud Storage client establishes a stream, providing the same request parameters used in Cloud Storage’s REST protocols, such as the bucket and object name. Further, all the time-consuming Cloud Storage operations such as user authorization and metadata accesses are front-loaded and performed at stream creation time, so subsequent read and write operations go directly to Colossus without any additional overhead, allowing for appendable writes and repeated ranged reads with sub-millisecond latency.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;&lt;span style="vertical-align: baseline;"&gt;This Colossus architecture enables Rapid Storage to support 20 million requests per second in a single bucket — a scale that is extremely useful in a variety of AI/ML applications. For example, when pre-training a model, once data preparation is complete, a randomized set of data samples are fed into GPUs or TPUs, typically in large files that each contain hundreds of millions to billions of tokens. But the data is rarely read sequentially, for example, because different random samples are read in different orders as the training progresses. With Rapid Storage’s stateful protocol, a stream can be established at the start of the training run before executing massively parallel ranged-reads at sub-millisecond speeds. This helps to ensure that accelerators aren’t blocked on storage latency.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Likewise, with appends, Rapid Storage takes advantage of Colossus’s stateful protocol to provide durable writes with sub-millisecond latency, and supports unlimited appends to a single object up to the object size limit.  A major challenge with stateful append protocols is how to handle cases where the client or server hangs or crashes. With Rapid Storage, the client receives a handle from Cloud Storage when creating the stream. If the stream gets interrupted but the client wants to continue reading or appending to the object, the client can re-establish a new stream using this handle, which streamlines this flow and minimizes any latency hiccups. It gets trickier when there is a problem on the client, and the application wants to continue appending to an object from a new client. To simplify this, Rapid Storage guarantees that only one gRPC stream can write to an object at a time; each new stream takes over ownership of the object, transactionally locking out any prior stream. Finally, each append operation includes the offset that’s being written to, ensuring that data correctness is always preserved even in the face of network partitions and replays.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_Colossus_Rapid_Storage_Blog.max-1000x1000.jpg"
        
          alt="2 Colossus Rapid Storage Blog"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="l906q"&gt;Figure 2: A new client taking over ownership of an object.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the above image, a new client takes over ownership of an object, locking out the previous owner. &lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Initially, client 1 appends data to an object stored on three disks.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The application decides to fail over to client 2, which opens this object in append mode. The Colossus Curator transactionally locks out client 1 by increasing a version number on each object data replica.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Client 1 attempts to append more data to the object, but cannot because its ownership was tied to the old version number.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
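&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The toy Python sketch below is an illustration of the takeover sequence above, not the Rapid Storage API: each replica records the current writer version, opening for append bumps that version and locks out any stream still holding an older handle, and every append carries the offset it expects to write at.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
# Illustrative model only: version-number takeover and offset-checked appends.
from dataclasses import dataclass

class StaleHandleError(Exception):
    pass

@dataclass
class Replica:
    version: int = 0
    data: bytes = b""

class AppendableObject:
    def __init__(self, num_replicas=3):
        self.replicas = [Replica() for _ in range(num_replicas)]
        self.version = 0

    def open_for_append(self):
        # A new stream takes over ownership by bumping the version on
        # every replica (step 2 above).
        self.version += 1
        for r in self.replicas:
            r.version = self.version
        return self.version            # the writer's handle version

    def append(self, writer_version, payload, offset):
        for r in self.replicas:
            if r.version != writer_version:
                raise StaleHandleError("object is owned by a newer stream")
            if offset != len(r.data):
                raise ValueError("offset mismatch: replayed or reordered append")
            r.data += payload

obj = AppendableObject()
v1 = obj.open_for_append()              # client 1 starts appending
obj.append(v1, b"part-1", offset=0)
v2 = obj.open_for_append()              # client 2 takes over (step 2)
try:
    obj.append(v1, b"part-2", offset=6) # client 1 is locked out (step 3)
except StaleHandleError as err:
    print("rejected:", err)
obj.append(v2, b"part-2", offset=6)
&lt;/pre&gt;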
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make it as easy as possible to integrate Rapid Storage into your applications, we are also updating our SDKs to support gRPC streaming-based appends and expose a simple application-oriented API. Writing data using handles is a familiar concept in the filesystems world, so we’ve integrated Rapid Storage into &lt;/span&gt;&lt;a href="https://cloud.google.com/storage/docs/cloud-storage-fuse/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Storage FUSE&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which provides clients with file-like access to Cloud Storage buckets, for low-latency file-oriented workloads. Rapid Storage also natively enables &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/cloud-storage-hierarchical-namespace-improves-aiml-checkpointing?e=13802955"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hierarchical Namespace&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; as part of its zonal bucket type, providing enhanced performance, consistency, and folder-oriented APIs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In short, Rapid Storage combines the sub-millisecond latency of block-like storage, the throughput of a parallel filesystem, and the scalability and ease of use of object storage, and it does all this in large part due to Colossus. Here are some interesting workloads we've seen our customers explore during the preview:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;AI/ML data preparation, training, and checkpointing&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Distributed database architecture optimization&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Batch and streaming analytics processing&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Video live-streaming and transcoding&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Logging and monitoring&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Interested in trying Rapid Storage? Indicate your interest &lt;/span&gt;&lt;a href="https://forms.gle/S5kyQGWrcHtduTRN9" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or reach out through your Google Cloud representative. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Visit us at Google Cloud Next and attend the breakout sessions “&lt;/span&gt;&lt;a href="https://cloud.withgoogle.com/next/25/session-library?session=BRK2-025&amp;amp;utm_source=copylink&amp;amp;utm_medium=unpaidsoc&amp;amp;utm_campaign=FY25-Q2-global-EXP106-physicalevent-er-next25-mc&amp;amp;utm_content=reg-is-live-next-homepage-social-share&amp;amp;utm_term=-" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;What’s new with Google Cloud’s Storage&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;” (BRK2-025), “&lt;/span&gt;&lt;a href="https://cloud.withgoogle.com/next/25/session-library?session=BRK2-020&amp;amp;utm_source=copylink&amp;amp;utm_medium=unpaidsoc&amp;amp;utm_campaign=FY25-Q2-global-EXP106-physicalevent-er-next25-mc&amp;amp;utm_content=reg-is-live-next-homepage-social-share&amp;amp;utm_term=-" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer: Mastering your Storage Infrastructure&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;” (BRK2-020), and “&lt;/span&gt;&lt;a href="https://cloud.withgoogle.com/next/25/session-library?session=BRK2-026#all" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Under the Iceberg: Simple, unified Cloud Storage for analytics data lakes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;” (BRK2-026) to learn more.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 10 Apr 2025 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/storage-data-transfer/how-the-colossus-stateful-protocol-benefits-rapid-storage/</guid><category>HPC</category><category>AI &amp; Machine Learning</category><category>Google Cloud Next</category><category>Storage &amp; Data Transfer</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Colossus_for_Rapid_Storage.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Colossus: the secret ingredient in Rapid Storage’s high performance</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Colossus_for_Rapid_Storage.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/storage-data-transfer/how-the-colossus-stateful-protocol-benefits-rapid-storage/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Denis Serenyi</name><title>Distinguished Software Engineer, Storage</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Vivek Saraswat</name><title>Group Product Manager, Storage</title><department></department><company></company></author></item><item><title>Enabling global scientific discovery and innovation on Google Cloud</title><link>https://cloud.google.com/blog/topics/hpc/powering-scientific-discovery-with-google-cloud/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;From unraveling the mysteries of our planet and the universe, to accelerating medical research and industrial innovation, scientific discovery impacts nearly every facet of human life. 
Today, scientific progress depends on the interplay of theory, experimentation, and computation, and increasingly, the most important and challenging problems require high-performance computing (HPC) and other advanced computing technologies and techniques. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In recent years, artificial intelligence (AI) has emerged as a powerful tool for information assessment and generation, while also becoming a powerful tool for scientific discovery, business innovation, and productivity. More recently, advances in quantum computing are increasing our confidence in shortening the timelines to solving problems beyond the reach of classical computers. Quantum computers under development now will lead to larger production systems that will catalyze the creation of new drugs and materials, reduce costs and risks in complex financial and logistics scenarios, and enable the development of more capable AI models. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google, our vision is to be the most comprehensive, capable, and accessible platform for science. Since 2008, Google Cloud has powered scientific discoveries, providing computational and data storage capabilities — including HPC clusters — to scientists, engineers, and developers worldwide. And this week, to enable continued revolutionary new science, we are bringing the best of Google DeepMind and Google Research together with new infrastructure and AI capabilities in Google Cloud, providing researchers with highly capable, cloud-scale tools for scientific computing. These new capabilities include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Supercomputing-class infrastructure for scientific computing:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Researchers can now deploy and use supercomputing clusters powered by the latest H4D VMs powered by AMD CPUs, and A4/A4X VMs powered by the latest NVIDIA GPUs. These VMs have new low-latency networking that provides supercomputer-like scaling and performance. We’re also announcing Google Cloud Managed Lustre for high performance storage I/O. These resources will enable scientists to tackle large-scale, complex science problems.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Advanced scientific applications powered by AI models for weather forecasting and biology:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We’re now offering our first AI-powered science applications for the broader science community: &lt;/span&gt;&lt;a href="https://blog.google/technology/ai/google-deepmind-isomorphic-alphafold-3-ai-model/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AlphaFold 3&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for predicting the structure and interactions of biomolecules, and &lt;/span&gt;&lt;a href="https://deepmind.google/technologies/weathernext" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;WeatherNext&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; models for weather forecasting. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;AI agents for quicker ideas and faster discovery:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Two new AI agents in &lt;/span&gt;&lt;a href="https://cloud.google.com/products/agentspace"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Agentspace&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; – Deep Research and Idea Generation – can help prepare comprehensive research reports and rapidly generate new scientific hypotheses. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let’s take a look at these new capabilities in more detail.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Try Google Cloud for free&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1757bb7f70&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;Get started for free&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;https://console.cloud.google.com/freetrial?redirectPath=/welcome&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Supercomputing-class infrastructure and tools for science&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Supercomputers are designed to achieve maximum performance on very large problems, as well as to train large AI models. With ongoing advances in science and AI, quick and easy access to supercomputing resources is critical. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Researchers can now deploy and use supercomputering-class HPC clusters in Google Cloud based on new&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-h4d-vms-optimized-for-hpc"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;H4D&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; VMs (virtual machines), our most powerful CPU-based VMs that use 5th Generation AMD EPYC&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;TM&lt;/span&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; Processors. H4D clusters are connected with Remote Direct Memory Access (RDMA) networking utilizing Google’s &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/systems/introducing-falcon-a-reliable-low-latency-hardware-transport"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Falcon&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium?hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;offload technologies, providing low-latency communications for HPC applications. By using standard message-passing libraries over RDMA, H4D VMs can efficiently scale applications up to tens of thousands of cores, resulting in faster time-to-solution. You can register for the H4D VM preview &lt;/span&gt;&lt;a href="https://forms.gle/ky1R1VVR5VRsJqsCA" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.harvard.edu/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Harvard University&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is using Google Cloud to advance heart disease research by simulating large-scale systems of red blood cells and other structures, including magnetically controlled artificial bacterial flagella (ABF), with the goal of developing therapies to attack and dissolve blood clots and circulating tumor cells in human vasculatures.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"With the power of Google's new H4D-based clusters, we are poised to simulate systems approaching a trillion particles, unlocking unprecedented insights into circulatory functions and diseases. This leap in computational capability will dramatically accelerate our pursuit of breakthrough therapeutics, bringing us closer to effective precision therapies for blood vessel damage in heart disease." -&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; Petros Koumoutsakos, Harvard University&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_SdrVAFk.max-1000x1000.png"
        
          alt="1"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="awe8m"&gt;Professor Koumoutsakos’ research involves the simulation of blood flowing in a microfluidics device which is designed to capture circulating tumor cells.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;HPC clusters based on our recently announced &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/google-cloud-goes-to-nvidia-gtc"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A4 and A4X&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; VMs are also a critical component of our scientific discovery portfolio. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;A4 VMs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;built on NVIDIA’s latest HGX B200 GPUs, are a versatile and powerful tool for multiple scientific computing applications, offering excellent performance for direct numerical simulation, and for AI training. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;A4X VMs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, accelerated by NVIDIA GB200 NVL72 GPUs, are purpose-built for training and serving the most demanding, extra-large-scale AI workloads. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Clusters using these GPU-powered VMs can also unlock supercomputing-class performance for the next frontier of innovation: quantum computing. In the future, quantum computing systems will allow scientists to solve problems that are intractable even with the most powerful  traditional supercomputers. In the meantime, HPC clusters based on A-series VMs can be used to design tomorrow’s quantum computers and optimize quantum algorithms, by simulating large quantum circuits using the &lt;/span&gt;&lt;a href="https://goo.gle/quantumsimulation-a3" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;quantum simulation solution blueprint&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For example, &lt;/span&gt;&lt;a href="https://quantumai.google/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Research’s Quantum AI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; team leverages Google Cloud to simulate the intricate device physics of quantum hardware, develop sophisticated hybrid quantum-classical algorithms, and explore and test novel quantum algorithms. This robust simulation environment facilitates scientific breakthroughs by delivering the performance and scalability essential for demanding quantum research workflows.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"We observed excellent scalability simulating a 43-qubit circuit with a depth of 30 on Google Cloud's new GPU-based supercomputers. These results underscore the potential for researchers to develop and test larger and deeper quantum circuits, which is important for understanding the performance of quantum algorithms and accelerating progress toward applications for today’s quantum computers."&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - Sergio Boixo, Director, Computer Science, Google Quantum AI&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;HPC clusters demand high I/O performance to keep computational performance from stalling. Our new &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/high-performance-storage-innovations-for-ai-hpc"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Managed Lustre&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; storage service, developed in collaboration with DataDirect Networks and based on EXAScaler technology, provides the I/O performance needed for supercomputing-scale applications. Google Cloud Managed Lustre delivers a high-performance, fully-managed parallel file system optimized for HPC and AI applications. With petabyte-scale capacity and up to 1 TB/s throughput, Managed Lustre ensures researchers have the I/O performance they need to power their scientific discoveries. Request access to the Managed Lustre preview by &lt;/span&gt;&lt;a href="https://cloud.google.com/contact?e=48754805&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;contacting your account representative&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Advanced scientific applications powered by AI models&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We recently announced our first AI-powered science applications for researchers and enterprises on Google Cloud: the groundbreaking &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;AlphaFold 3&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; molecular structure and interaction prediction model, and the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;WeatherNext&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; weather forecasting models.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AlphaFold 3,&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;developed by Google DeepMind and Isomorphic Labs, is revolutionizing biology through its ability to predict the structure and interactions of all of life’s molecules with unprecedented accuracy. Understanding molecular structures and their interactions helps researchers better grasp complex interactions in human health and disease. AlphaFold 3 is now available for non-commercial use on Google Cloud.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Having access to the scientific capabilities of AlphaFold on Google Cloud can help our research rapidly predict and explore the structure and interactions of all biomolecule classes. This change in capability will accelerate our understanding of diseases and enable the generation of therapeutic hypotheses.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - Sumaiya Iqbal, Senior group lead of the Ladders to Cures Accelerator, Broad Institute&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To further support users, we’re simplifying access to AlphaFold 3 through a new &lt;/span&gt;&lt;a href="https://goo.gle/clustertoolkit-alphafold3" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;high-throughput solution&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; deployable via Cluster Toolkit. This turnkey solution enables efficient batch processing of hundreds to tens of thousands of sequences while minimizing costs by  autoscaling infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the domain of weather, Google DeepMind and Google Research WeatherNext models use AI for fast and accurate weather forecasting, and we recently released live WeatherNext AI forecasts on BigQuery and Earth Engine. Today, we’re introducing access to WeatherNext AI models via Google Cloud’s &lt;/span&gt;&lt;a href="https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/weathernext"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Model Garden&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, enabling practitioners to customize and deploy these advanced models for energy prediction, logistics, agriculture, risk management, and more.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With easier and more affordable access to faster and more accurate weather forecasting models, researchers can study far more scenarios, and organizations can better prepare for weather events — &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;such as heat waves, floods, and hurricanes —&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to reduce their impact on infrastructure, personnel, supply chains, and communities. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_bwclQqv.gif"
        
          alt="2"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="awe8m"&gt;WeatherNext Graph forecasts visualized in Google Earth Engine, showing forecasted wind speed, wind direction, and precipitation as of September 8, 2023. The visualization demonstrates the projected path of Hurricane Lee over the Atlantic Ocean.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For instance, &lt;/span&gt;&lt;a href="https://www.googlecloudpresscorner.com/2025-03-05-Carrier-and-Google-Cloud-Join-Forces-to-Strengthen-Grid-Resilience-with-AI-Powered-Home-Energy-Management-Systems" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Carrier&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; plans to leverage Google Cloud’s WeatherNext AI models as part of its Home Energy Management System (HEMS) to help enhance grid flexibility and enable smarter energy management. Once deployed, WeatherNext AI models are expected to help HEMS intelligently manage energy flows in real time — charging, discharging, and redirecting energy based on grid conditions, energy demands, and weather forecasts — contributing to a more balanced and sustainable energy grid. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Using AI as the ultimate research partner&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google's robust ecosystem of information, productivity, and advanced AI tools has long helped drive scientific research, providing researchers with information and insight. &lt;/span&gt;&lt;a href="https://scholar.google.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Scholar&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is an indispensable resource for navigating the vast landscape of scientific literature and for discovering and tracking relevant publications. Then there’s &lt;/span&gt;&lt;a href="https://gemini.google.com/app?is_sa=1&amp;amp;is_sa=1&amp;amp;android-min-version=301356232&amp;amp;ios-min-version=322.0&amp;amp;campaign_id=bkws&amp;amp;utm_source=sem&amp;amp;utm_source=google&amp;amp;utm_medium=paid-media&amp;amp;utm_medium=cpc&amp;amp;utm_campaign=bkws&amp;amp;utm_campaign=2024enUS_gemfeb&amp;amp;pt=9008&amp;amp;mt=8&amp;amp;ct=p-growth-sem-bkws&amp;amp;gad_source=1&amp;amp;gclid=EAIaIQobChMI1oPm19rIjAMV8HN_AB1J5xK2EAAYASAAEgJTa_D_BwE&amp;amp;gclsrc=aw.ds" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which can synthesize, summarize and explain information from highly scientific and technical content. And &lt;/span&gt;&lt;a href="https://notebooklm.google/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NotebookLM&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, an AI-powered research assistant, intelligently processes and summarizes selected research papers and datasets, dramatically accelerating literature reviews and extracting crucial information. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’re excited to announce two new AI agents in &lt;/span&gt;&lt;a href="https://cloud.google.com/products/agentspace"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agentspace&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; that have the potential to further accelerate scientific research and to revolutionize hypothesis generation. &lt;/span&gt;&lt;a href="https://gemini.google/overview/deep-research/?hl" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Deep Research&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;condenses hours of research by synthesizing information across internal and external sources to generate in-depth research reports. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Idea Generation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; helps rapidly develop novel ideas through AI agents that create ideas, then test them against each other to find the best hypotheses. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Scientists can also leverage &lt;/span&gt;&lt;a href="https://aistudio.google.com/welcome" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;AI Studio&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;and&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;a href="https://cloud.google.com/vertex-ai"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI&lt;/strong&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;on Google Cloud to develop customized AI applications and advanced machine learning workflows. We also recently announced &lt;/span&gt;&lt;a href="https://blog.google/technology/developers/gemma-3/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Gemma 3&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a collection of lightweight, state-of-the-art open models built from the same research and technology that powers our Gemini 2.0 models. These are our most advanced, portable and responsibly developed open models yet, and can be used to create scientific applications on local devices. Finally, Google Research’s &lt;/span&gt;&lt;a href="https://research.google/blog/geospatial-reasoning-unlocking-insights-with-generative-ai-and-multiple-foundation-models" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Geospatial Reasoning&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; framework, leveraging Vertex AI Agent Engine, will allow scientists and analysts to unlock powerful insights about the world through new geospatial foundation models and generative AI. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Enabling transformational science today and tomorrow&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Together, these new advanced infrastructure, AI applications, and AI productivity technologies provide new cloud-scale scientific capabilities for all kinds of computational science research. Combined with our discovery, collaboration, and productivity tools, we are providing scientists and researchers with a comprehensive array of cloud-powered scientific capabilities. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.anl.gov/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Argonne National Laboratory&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a leading laboratory for open science computational research, is working with Google Cloud to explore how advanced computing technologies and AI tools can empower scientists and engineers to make groundbreaking discoveries faster than ever. Through the collaboration, ANL will use and evaluate Google Cloud solutions for computational research, providing feedback and guidance to further advance the design, performance, and usefulness of Google Cloud for supercomputing-scale science. &lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Having access to powerful computational capabilities is critical for making new scientific discoveries and accelerating innovations that power business and society.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;We are eager to work with Google Cloud to leverage their comprehensive, global-scale AI and HPC infrastructure, software technologies and AI-powered applications such as AlphaFold 3. Argonne National Laboratory’s collaboration with Google Cloud will effectively drive innovation and enable discoveries that change the world — and bring these capabilities to researchers everywhere.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- Rick Stevens, Associate Laboratory Director for Computing, Environment and Life Sciences, Argonne National Laboratory&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Scientific discoveries are more important than ever for solving the world’s greatest challenges. At Google, we’re building powerful advanced computing technologies to enable scientific discoveries and innovations, and we are excited to bring all these capabilities together in Google Cloud.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Practitioners can get started today with credits, training, and more with &lt;/span&gt;&lt;a href="https://cloud.google.com/edu/researchers?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud for Researchers&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. To stay&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; informed and learn more about Google Cloud can help advance scientific research and discovery, join the &lt;/span&gt;&lt;a href="https://sites.google.com/corp/view/advancedcomputingcommunity/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Advanced Computing Community&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 10 Apr 2025 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/hpc/powering-scientific-discovery-with-google-cloud/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><category>Google Cloud Next</category><category>HPC</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Enabling global scientific discovery and innovation on Google Cloud</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/hpc/powering-scientific-discovery-with-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Bill Magro</name><title>Director &amp; Chief Technologist, High Performance Computing, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Gemma Jennings</name><title>Group Product Manager, Google DeepMind</title><department></department><company></company></author></item><item><title>Driving enterprise transformation with new compute innovations and offerings</title><link>https://cloud.google.com/blog/products/compute/delivering-new-compute-innovations-and-offerings/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the last 12 months, we’ve made incredible enhancements to our &lt;/span&gt;&lt;a href="https://cloud.google.com/products/compute"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Compute Engine platform&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This is driven most notably by new fourth-generation compute instances and Hyperdisk block storage as well as major customer experience enhancements. Across all workloads, Google Cloud’s compute portfolio can help you optimize your performance and costs, while delivering enterprise-grade scalability, reliability, security, and workload consistency, helping you grow efficiently and have more to invest for innovation. Let’s explore what we’ll be announcing today at Google Cloud Next 2025. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;New and enhanced compute for every workload&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;C4D offers 80% higher throughput per vCPU and stronger performance&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Our new C4D VMs are built on AMD's 5th Gen EPYC processors, paired with Google Titanium's latest advancements, and have a higher core frequency (up to 4.1 GHz). C4D delivers impressive performance gains over prior generations across a wide set of general computing workloads — &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;up to 30% vs C3D on the estimated SPECrate®2017_int_base benchmark&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; — helping you meet the needs of business-critical applications with fewer resources. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For databases, C4D achieves an &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;up to a 55% increase in queries per second on MySQL&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and a &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;35% performance improvement for Redis workloads compared to C3D&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. For web-serving workloads, C4D delivers up to 80% higher throughput per vCPU compared to previous generations, driving faster page rendering and a smoother end-user experience. C4D offers confidential computing and is available in 49 industry-standard shapes, with sizes ranging from 2 vCPU to 384 vCPU in three memory configurations of up to 3TB of DDR5 memory, and will include both our first AMD based bare metal offering and our new Titanium LSSD. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Now available in&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;preview&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/general-purpose-machines#c4d_series" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;try out C4D&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in Compute Engine and Google Kubernetes Engine (GKE) today.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="vertical-align: baseline;"&gt;“&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;AppLovin, a global leader in mobile advertising, is constantly looking for cutting-edge infrastructure innovations to deliver exceptional performance for our clients. Google Cloud's C4D VMs enable us to do just that — driving a ~40% improvement over the prior generation, which leads to significant efficiency gains and latency reduction.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- Basil Shikin, CTO, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;AppLovin&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;$300 in free credit to try Google Cloud infrastructure&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f17545297f0&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;Start building for free&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;http://console.cloud.google.com/freetrial?redirectPath=/compute&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;C4 VMs enable new capabilities and greater flexibility&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For demanding, low-latency tasks such as gaming, inference, large-scale data processing, and real-time workloads, our C4 machine series is expanding to enable new capabilities and configurations, including larger shapes, Local SSD, and bare metal. These new C4 shapes, built exclusively on the latest 6th generation Intel Granite Rapids CPUs, feature the highest frequency of any Compute Engine VM — up to 4.2 GHz.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;C4 shapes with Titanium Local SSD offer improved performance for I/O-intensive workloads like databases and caching layers, achieving Local SSD &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;latency reductions of up to 35%&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. New C4 bare metal instances provide performance gains of up to &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;35% for general compute&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and up to &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;65% for ML recommendation&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; workloads compared to the prior generation. The new, larger C4 VM shapes scale up to 288 vCPU, with 2.2TB of high-performing DDR5 memory and larger cache sizes, enabling better scalability for databases, data analytics, and other memory-constrained workloads. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Request preview access &lt;/span&gt;&lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSecsrgBtH-EJR1wZC5_m79NzHEblJ_3ocrbPfWwvd_cbz8xGA/viewform" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;H4Ds offer tremendous performance improvements for HPC workloads&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;S&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;cale your HPC workloads and get insights faster than ever before with H4D VMs.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;These VMs are built on the &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;5th gen AMD EPYC CPUs and &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;offer the highest whole-node VM performance of more than 12,000 gflops, the highest per-core performance, and the best memory bandwidth of more than 950 GB/s of our VM families&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. H4D VMs provide 200 Gbps of low latency Titanium RDMA network bandwidth to support clusters with over 10,000 cores and plans for even more scale. Learn more in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/hpc/powering-scientific-discovery-with-google-cloud"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Scientific Innovations blog&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or &lt;/span&gt;&lt;a href="https://forms.gle/ky1R1VVR5VRsJqsCA" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;sign up for the H4D &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;preview&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“The generational performance leap achieved with Google H4D VMs, powered by the 5th Generation AMD EPYC, is truly remarkable. For compute-intensive, highly non-linear simulations such as car crash analysis, Altair Radioss delivers a stunning 3.6x speedup. This breakthrough paves the way for faster and more accurate simulations, which is crucial for our customers in the era of the digital thread!” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;-&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; Eric Lequiniou, SVP Radioss Development and Altair Solvers HPC&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;M4 VMs double performance for demanding SAP workloads &lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Backed by Compute Engine’s memory-optimized 99.95% single instance SLA, &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/memory-optimized-machines#m4_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;M4&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; VMs offer up to 65% better price-performance and 2.25x more SAP Application Performance Standard (SAPS) compared to our previous memory optimized M3. Built on 5th Generation Intel Xeon Scalable processors, M4 VMs are certified for business-critical, in-memory SAP HANA workloads ranging from 744GB to 3TB, and for SAP NetWeaver Application Server.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Z3 for storage-intensive workloads&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For I/O-intensive workloads such as data warehouses, SQL, and NoSQL databases, our Z3 storage-optimized family now features new Titanium SSDs and offers nine new smaller shapes, ranging from 3TB to 18TB per instance. We are also introducing new storage-optimized bare-metal instance which include  up to 72TB of Titanium SSDs and direct access to the physical server CPUs. Now in preview, register your interest by&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; signing up &lt;/span&gt;&lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSexdRfC9-JfDRMEqjBy_fBukLUDkap290NvZSfZWNInwFJg2w/viewform" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Nutanix Cloud Clusters are now on Google Cloud&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We’re excited to partner with Nutanix, who selected the new Z3-metal instances to launch Nutanix Cloud Clusters (NC2) on Google Cloud. Nutanix NC2 is a hybrid cloud platform that simplifies the ability to run, manage, and operate apps, data, and AI across private and public clouds. NC2’s common operating model makes it easy to manage workloads in a consistent manner, accelerating customers’ migration to Google Cloud and helping them modernize their apps. &lt;/span&gt;&lt;a href="https://www.nutanix.com/products/nutanix-cloud-clusters/google-cloud" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Learn more&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and sign up for public preview. &lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"We are thrilled to announce the private preview of Nutanix Cloud Clusters on Google Cloud, marking a significant milestone in Nutanix’s commitment to delivering flexible, hybrid cloud solutions. Google Cloud’s Z3 instance types represent a perfect foundation for Nutanix to enable performance and resilience for enterprise applications. We’re excited about our partnership with Google Cloud in empowering our joint customers with greater choice and simplicity in their cloud journey." &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- Saveen Pakala, Vice President of Product Management, Nutanix&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;More options to optimize your VMware environment in the cloud&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;With &lt;/span&gt;&lt;a href="https://cloud.google.com/vmware-engine"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud VMware Engine&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, we provide one of the fastest ways to lift and transform your existing VMware estate into Google Cloud. Today, we are offering 18 additional node shapes, bringing the total number of node shapes across VMware Engine v1 and v2 to 26 — six times more node shapes than competitors. Now, you have the industry’s widest range of options to shape your capacity to your workloads’ needs and optimize your TCO. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Storage and platform capabilities for greater scale and efficiency&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our fourth-generation compute, networking, and block storage portfolio is built on several highly differentiated foundational technologies. &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is a system of purpose-built custom silicon and multiple tiers of scale-out offloads that free up the CPU, enhancing performance, reliability, security and maximizing workload efficiency. It is integrated across our compute, storage, and networking offerings, which you’ve seen in a number of the announcements above. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Recently, we also updated the Titanium ML Adapter to securely integrate NVIDIA ConnectX-7 network interface cards (NICs), providing 3.2 Tbps of non-blocking GPU-to-GPU bandwidth. In addition, Titanium Offload Processors now integrate our GPU clusters with the Jupiter data center fabric, providing greater cluster scale. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Next-generation block storage with Hyperdisk &lt;br/&gt;&lt;/strong&gt;&lt;a href="https://cloud.google.com/products/block-storage?e=48754805&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperdisk&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is Google Cloud’s workload-optimized, high-performance block storage that’s cost-efficient, easy-to-use and that delivers comprehensive data protection capabilities for your workloads. With unique capabilities like the ability to independently tune capacity and performance specific to your workloads, Hyperdisk &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/hyperdisk-storage-pools-is-now-generally-available?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Storage Pools&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; enable thin provisioning and data reduction, lowering TCO and simplifying management at scale. As customers move larger and larger workloads, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;we are expanding Storage Pools to store up to 5 PiB of data in a single pool — a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;5x &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;increase from before&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In addition, we are also introducing &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Hyperdisk Exapools,&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; a new variant of &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/hyperdisk-storage-pools-is-now-generally-available?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Storage Pools&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; purpose-built for the largest and most demanding AI training workloads. With Hyperdisk Exapools you can provision and manage block storage delivering multiple exabytes of capacity and terabytes per second of throughput for your biggest AI clusters, while leveraging thin-provisioning and data reduction to lower your TCO and simplify management.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/hyperdisk-ml"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperdisk ML&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; has also added new capabilities, including hydrating from Cloud Storage using GKE volume populator, attaching to the latest Compute Engine instances, and performing data loading acceleration from Hyperdisk ML to run training/inference on the latest TPU VM families. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Learn more in today’s &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/whats-new-with-ai-hypercomputer"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI infrastructure blog&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Match resources to your usage patterns&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Finally&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, we’re providing you with greater efficiency, flexibility, and control over demanding computing tasks with &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/instance-groups"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;managed instance groups&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (MIGs) — collections of virtual machines that you can manage as a single entity. For example, you can now configure MIGs to use multiple VM types and it automatically finds capacity — even during periods of high demand and rapid growth. You can also use stopped and suspended VMs in a MIG with pre-initialized VMs, to save cost and accelerate application startup. We also introduced committed use discounts (CUDs) and reservation sharing with Vertex AI and Autopilot, letting you purchase infrastructure once and utilize it across multiple services.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Invest for innovation with optimized compute&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Delivering infrastructure that provides the highest performance and flexibility for all of your workloads is our top commitment. From general-purpose VMs to specialized solutions for HPC, SAP, and databases, we offer workload-optimized solutions tailored to your needs, helping you &lt;/span&gt;&lt;a href="https://cloud.google.com/products/compute"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;unlock the innovation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; your business needs. Got questions? &lt;/span&gt;&lt;a href="https://cloud.google.com/contact"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Get in touch&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;!&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 09 Apr 2025 12:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/delivering-new-compute-innovations-and-offerings/</guid><category>Google Cloud Next</category><category>Storage &amp; Data Transfer</category><category>HPC</category><category>Compute</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/Blue_Lights_in_Server_Row.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Driving enterprise transformation with new compute innovations and offerings</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/original_images/Blue_Lights_in_Server_Row.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/delivering-new-compute-innovations-and-offerings/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nirav Mehta</name><title>VP Product Management, Google Compute Platforms</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Salil Suri</name><title>Director, Product Management, Compute Engine</title><department></department><company></company></author></item><item><title>What’s new with HPC and AI infrastructure at Google Cloud</title><link>https://cloud.google.com/blog/topics/hpc/whats-new-with-hpc/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google Cloud, we’re rapidly advancing our high-performance computing (HPC) capabilities, providing researchers and engineers with powerful tools and infrastructure to tackle the most demanding computational challenges. Here's a look at some of the key developments driving HPC innovation on Google Cloud, as well as our presence at Supercomputing 2024.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can also stay apprised of our HPC and AI advances by joining the new &lt;/span&gt;&lt;a href="https://rsvp.withgoogle.com/events/google-cloud-advanced-computing-community" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Advanced Computing Community&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (details below). &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Next-generation HPC VMs&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We began our H-series with &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/compute-optimized-machines#h3_series"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;H3 VMs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, specifically designed to meet the needs of demanding HPC workloads. Now, we’re excited to share some key features of the next generation of the H family, bringing even more innovation and performance to the table. The upcoming VMs will feature:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Improved workload scalability via RDMA-enabled 200 Gbps networking&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Native support to directly provision full, tightly-coupled HPC clusters on demand &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/blog/products/compute/introducing-dynamic-workload-scheduler?e=0"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Workload Scheduler&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to provision fixed-lifetime clusters now or in the future&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/titanium?e=0&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium&lt;/span&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;technology that delivers superior performance, reliability, and security &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We provide &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/hpc-enterprise-slurm.yaml" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;system blueprints&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for setting up turnkey, pre-configured HPC clusters on our H series VMs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The next generation of H series is coming in early 2025.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Try Google Cloud for free&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f175aa441c0&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;Get started for free&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;https://console.cloud.google.com/freetrial?redirectPath=/welcome&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Parallelstore: World’s first fully-managed DAOS offering&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/parallelstore?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Parallelstore&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is a fully managed, scalable, high-performance storage solution based on next-generation &lt;/span&gt;&lt;a href="https://daos.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DAOS technology&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, designed for demanding HPC and AI workloads. It is now generally available and provides:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Up to 6x greater read throughput performance compared to competitive Lustre scratch offerings&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Low latency (&amp;lt;0.5ms at p50) and high throughput (&amp;gt;1GiB/s per TiB) to access data with minimal delays, even at massive scale&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;High IOPS (30K IOPS per TiB) for metadata operations&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Simplified management that reduces operational overhead with a fully managed service  &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Parallelstore is great for applications requiring fast access to large datasets, such as:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Analyzing massive genomic datasets for personalized medicine&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Training large language models (LLMs) and other AI applications efficiently  &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Running complex HPC simulations with rapid data access&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;A3 Ultra VMs with NVIDIA H200 Tensor Core GPUs&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For GPU-based HPC workloads, we recently announced &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/trillium-sixth-generation-tpu-is-in-preview?e=0"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A3 Ultra VMs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which feature NVIDIA H200 Tensor Core GPUs. A3 Ultra VMs offer a significant leap in performance over previous generations. They are built on servers with our new &lt;/span&gt;&lt;a href="https://cloud.google.com/titanium"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium ML network adapter&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, optimized to deliver a secure, high-performance cloud experience for AI workloads, and powered by NVIDIA ConnectX-7 networking. Combined with our datacenter-wide 4-way rail-aligned network, A3 Ultra VMs deliver non-blocking 3.2 Tbps of GPU-to-GPU traffic with RDMA over Converged Ethernet (RoCE). &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Compared with A3 Mega, A3 Ultra offers: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;2x the GPU-to-GPU networking bandwidth, powered by Google Cloud’s Titanium ML network adapter and backed by our Jupiter data center network&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Up to 2x higher LLM inferencing performance with nearly double the memory capacity and 1.4x more memory bandwidth&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Ability to scale to tens of thousands of GPUs in a dense, performance-optimized cluster for large AI and HPC workloads&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With &lt;/span&gt;&lt;a href="https://cloud.google.com/cluster-toolkit/docs/deploy/a3-mega-cluster-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;system blueprints&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, available through &lt;/span&gt;&lt;a href="https://cloud.google.com/cluster-toolkit/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Toolkit&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, customers can quickly and easily create turnkey, pre-configured HPC clusters with Slurm support on A3 VMs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A3 Ultra VMs will also be available through &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine?e=0"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Kubernetes Engine&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (GKE), which provides an open, portable, extensible, and highly-scalable platform for large-scale training and serving of AI workloads.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Trillium: Ushering in a new era of TPU performance for AI&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Tensor Processing Units, or TPUs, power our most advanced AI models such as &lt;/span&gt;&lt;a href="https://cloud.google.com/products/gemini?e=0"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, popular Google services like Search, Photos, and Maps, as well as scientific breakthroughs like AlphaFold 2 — which &lt;/span&gt;&lt;a href="https://www.nature.com/articles/d41586-024-03214-7" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;led to a Nobel Prize this year&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We recently announced that &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/trillium-sixth-generation-tpu-is-in-preview?e=0"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Trillium, our sixth-generation TPU&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, is available to Google Cloud customers in preview. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Compared with TPU v5e, Trillium delivers: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Over 4x improvement in training performance &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Up to 3x increase in inference throughput &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;67% increase in energy efficiency&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;4.7x increase in peak compute performance per chip &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Double the high bandwidth memory capacity &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Double the interchip interconnect bandwidth &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Cluster Toolkit: Streamlining HPC deployments&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We continue to improve &lt;/span&gt;&lt;a href="https://cloud.google.com/cluster-toolkit/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Toolkit&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, providing open-source tools for deploying and managing HPC environments on Google Cloud. Recent updates include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples#major-changes-in-from-slurm-gcp-v5-to-v6" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Slurm-gcp V6&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is now generally available, providing faster deployments and robust reconfiguration among other benefits.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/support?e=48754805&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Customer Care&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is now available for Toolkit. You can find more information &lt;/span&gt;&lt;a href="https://cloud.google.com/cluster-toolkit/docs/getting-support"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; on how to get support via the Cloud Customer Care console.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/blog/topics/hpc/ga-rocky-linux-8-and-centos-7-versions-of-hpc-vm-image?e=0"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;HPC VM Image&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; Rocky Linux 8 is now generally available, making it easy to build an HPC-ready VM instance, incorporating our &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/hpc?hl=en&amp;amp;e=0#section-7"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;best practices running HPC on Google Cloud&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
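&lt;p&gt;As a minimal example of that last item, the following command creates a single HPC-ready instance from the Rocky Linux 8 HPC VM Image. The image family and project follow the HPC VM Image documentation; the machine type and zone are illustrative.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create an H3 instance booted from the Rocky Linux 8 HPC VM Image.
gcloud compute instances create hpc-node-1 \
  --zone=us-central1-a \
  --machine-type=h3-standard-88 \
  --image-family=hpc-rocky-linux-8 \
  --image-project=cloud-hpc-image-public&lt;/code&gt;&lt;/pre&gt;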
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE: Container orchestration with scale and performance&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;GKE continues to lead the way for containerized workloads with &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/gke-65k-nodes-and-counting?e=4875480"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;the support of the largest Kubernetes clusters in the industry&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. With support for up to 65,000 nodes, we believe GKE offers more than 10X larger scale than the other two largest public cloud providers.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At the same time, we continue to invest in automating and simplifying the building of HPC and AI platforms, with:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Secondary boot disk&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which provides faster workload startups through container image caching &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Fully-managed DCGM metrics&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for improved accelerator monitoring &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-custom-compute-classes"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Custom compute classes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, offering greater control over compute resource allocation and scaling&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Extensive innovations in &lt;/span&gt;&lt;a href="http://kueue.sh/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Kueue.sh&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which is becoming the de facto standard for job queueing on Kubernetes with topology-aware scheduling, priority and fairness in queueing, multi-cluster support (&lt;/span&gt;&lt;a href="https://www.youtube.com/watch?v=xMmskWIlktA&amp;amp;list=PLj6h78yzYM2Pw4mRw4S-1p_xLARMqPkA7&amp;amp;index=4" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;see demo by Google and CERN engineers&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;), and more&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
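&lt;p&gt;To make the Kueue item above concrete, here is a minimal, hedged sketch using the upstream v1beta1 API: a ResourceFlavor, a ClusterQueue with CPU and memory quota, and a namespaced LocalQueue that jobs are submitted against. The names and quotas are illustrative; see kueue.sh for installation and the current API.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;kubectl apply -f - &amp;lt;&amp;lt;'EOF'
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: hpc-cluster-queue
spec:
  namespaceSelector: {}          # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 1000
      - name: memory
        nominalQuota: 4Ti
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: default
spec:
  clusterQueue: hpc-cluster-queue
EOF&lt;/code&gt;&lt;/pre&gt;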
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Customer success stories: Atommap and beyond&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/blog/topics/hpc/atommap-builds-elastic-supercomputer-on-google-cloud?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Atommap&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a company specializing in atomic-scale materials design, is using Google Cloud HPC to accelerate its research and development efforts. With H3 VMs and Parallelstore, Atommap has achieved:  &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Significant speedup in simulations: Reduced time-to-results by more than half, enabling faster innovation &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Improved scalability: Easily scaled resources for 1,000s to 10,000s of molecular simulations, to meet growing computational demands &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Better cost-effectiveness: Optimized infrastructure costs, with savings of up to 80%, while achieving high performance &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Atommap's success story highlights the transformative potential of Google Cloud HPC for organizations pushing the boundaries of scientific discovery and technological advancement.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Looking ahead&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud is committed to continuous innovation for HPC. Expect further enhancements to HPC VMs, Parallelstore, Cluster Toolkit, Slurm-gcp, and other HPC products and solutions. With a focus on performance, scalability, compatibility, and ease of use, we’re empowering researchers and engineers to tackle the world's most complex computational challenges.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud Advanced Computing Community&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’re excited to announce the launch of the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud Advanced Computing Community&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, a new kind of community of practice for sharing and growing HPC, AI, and quantum computing expertise, innovation, and impact.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This community of practice will bring together thought leaders and experts from Google, its partners, and HPC, AI, and quantum computing organizations around the world for engaging presentations and panels on innovative technologies and their applications. The Community will also leverage Google’s powerful, comprehensive, and cloud-native tools to create an interactive, dynamic, and engaging forum for discussion and collaboration.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Community launches now, with meetings starting in December 2024 and a full rollout of learning and collaboration resources in early 2025. To learn more, register &lt;/span&gt;&lt;a href="https://rsvp.withgoogle.com/events/google-cloud-advanced-computing-community" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud at Supercomputing 2024&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The annual &lt;/span&gt;&lt;a href="https://supercomputing.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Supercomputing Conference&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; series brings together the global HPC community to showcase the latest advancements in HPC, networking, storage and data analysis. Google Cloud is excited to return to &lt;/span&gt;&lt;a href="https://sc24.supercomputing.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Supercomputing 2024&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in Atlanta with our largest presence ever. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Visit Google Cloud at &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;booth #1730&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to jump in and learn about our HPC, AI infrastructure, and quantum solutions. The booth will feature a Trillium TPU board, NVIDIA H200 GPU and ConnectX-7 NIC, hands-on labs, a full schedule of talks, a comfortable lounge space, and plenty of great swag!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The booth theater will include talks from ARM, Altair, Ansys, Intel, NAG, SchedMD, Siemens, Sycomp, Weka, and more. Booth labs will get you deploying Slurm clusters to fine-tune the Llama2 model or run GROMACS using Cloud Batch to run microbenchmarks or quantum simulations, and more.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’re also involved in several parts of SC24's technical program, including BoFs, User Groups, and Workshops. Googlers will participate in the following technical sessions: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://sc24.conference-program.com/presentation/?id=bof236&amp;amp;sess=sess586" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Converged HPC and Cloud Computing in the Era of Generative AI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;a href="https://sc24.conference-program.com/presenter/?uid=169204" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Bill Magro&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; speaking)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://sc24.conference-program.com/presentation/?id=bof239&amp;amp;sess=sess667" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;HPC &amp;amp; Cloud Convergence: drivers, triggers, and constraints&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;a href="https://sc24.conference-program.com/presenter/?uid=222953" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Felix Schürmann &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;speaking)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://daos.io/dug24" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DAOS User Group (DUG) ‘24&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;a href="https://sc24.conference-program.com/presenter/?uid=648153" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dean Hildebrand&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; speaking)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://sc24.conference-program.com/presentation/?id=bof199&amp;amp;sess=sess639" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DAOS BoF&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (Dean Hildebrand speaking)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://www.pdsw.org/index.shtml" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;9th International Parallel Data Systems Workshop (PDSW)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (Dean Hildebrand speaking)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://sc24.conference-program.com/presentation/?id=bof108&amp;amp;sess=sess606" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;IO500: The High-Performance Storage Community BoF&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (Dean Hildebrand speaking)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://sc24.conference-program.com/presentation/?id=tut143&amp;amp;sess=sess417" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;High-Performance Object Storage: I/O for the Exascale Era Tutorial&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (Dean Hildebrand speaking)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://womeninhpc.org/events/sc-2024-workshop" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Women in HPC Workshop&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google is also hosting or sponsoring the following exciting events during SC24. We’re looking forward to seeing you there!&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://d126qb04.na1.hubspotlinks.com/Ctc/2M+113/d126qB04/VVt2492XYN3JW1_zCzQ21wWdCW6tdK5Q5mCdXBN6D7jYv3qn9gW7Y8-PT6lZ3pXW1dLqXH8DHPZwW7MKvrq761rrQW2L76ML8K8xFDN8rtGLzR1rPDW2W_Vhd7WLvTMW1r77qY4xVGbdW7gb9d72rp-S7W4PjwX73Zbp5lW7qQb138JVdmjN4dzXC8KGkkwVqn3091JTxz4W1kPDm26rfKJjW1ps5d06tgM2VW49hWyz5G-vYpW6zFBT51tkwgbW6Y2x_33PdjMJW4Hn3xM672S4rW7cQz4S2CFDqRN6FRq-1lKCcqW2kjp7m8CZTq-W4x6nVm4yP08KW8_F1z518GbkjW29VsDr8CBfDbW246K4578Lm_dW4Q_kln19yjxBW7hS4bP5Z92wjf5XTdKd04" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Sycomp Reception&lt;/span&gt;&lt;/a&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://beowulfbash.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Beowulf Bash&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://hyperionresearch.com/register-breakfast-briefing/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Hyperion Research - Breakfast Briefing&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://womeninhpc.org/events/sc-2024-networking-reception" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Women in HPC Reception&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://carahevents.carahsoft.com/Event/Register/544427-google" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Carahsoft Reception&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Finally, we’ll be holding private meetings and roadmap briefings with our HPC leadership throughout the conference. To schedule a meeting, please contact &lt;/span&gt;&lt;a href="mailto:hpc-sales@google.com"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;hpc-sales@google.com&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 15 Nov 2024 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/hpc/whats-new-with-hpc/</guid><category>AI &amp; Machine Learning</category><category>Compute</category><category>HPC</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>What’s new with HPC and AI infrastructure at Google Cloud</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/hpc/whats-new-with-hpc/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Annie Ma-Weaver</name><title>Group Product Manager, Google Cloud HPC</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Wyatt Gorman</name><title>Solutions Manager, HPC &amp; AI Infrastructure, Google Cloud</title><department></department><company></company></author></item><item><title>Parallelstore is now GA, fueling the next generation of AI and HPC workloads</title><link>https://cloud.google.com/blog/products/storage-data-transfer/parallelstore-high-performance-file-service-for-hpc-and-ai-is-ga/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Organizations use artificial intelligence (AI) and high-performance computing (HPC) applications to process massive datasets, run complex simulations, and train generative models with billions of parameters for diverse use cases such as LLMs, genomic analysis, quantitative analysis, or real-time sports analytics. These workloads place big performance demands on their storage systems, requiring high throughput and I/O performance that scales and that maintains sub-millisecond latencies, even when thousands of clients are concurrently reading and writing the same shared files.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To power these next-generation AI and HPC workloads, we &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/storage-data-transfer/storage-announcements-at-next24"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;announced&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; Parallelstore at Google Cloud Next 2024, and today, we are excited to announce that it is now generally available. Built on the &lt;/span&gt;&lt;a href="https://docs.daos.io/v2.6/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Distributed Asynchronous Object Storage (DAOS) architecture&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://cloud.google.com/parallelstore"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Parallelstore&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; combines a fully distributed metadata and key-value architecture to deliver high-performance throughput and IOPS.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Read on to learn how Parallelstore serves the needs of complex AI and HPC workloads, allowing you to maximize &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;goodput&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and GPU/TPU utilization, programmatically move data in and out of Parallelstore, and provision &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Kubernetes Engine&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://cloud.google.com/products/compute"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Compute Engine&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; resources.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Maximize goodput and GPU/TPU utilization&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To overcome the performance limitations of traditional parallel file systems, Parallelstore uses a distributed metadata management system and a key-value store architecture. Parallelstore’s high-throughput parallel data access minimizes latency and I/O bottlenecks, and allows it to saturate the network bandwidth of individual compute clients. This efficient data delivery maximizes goodput to GPUs and TPUs, a critical factor for optimizing AI workload costs. Parallelstore can also provide continuous read/write access to thousands of VMs, GPUs and TPUs, satisfying modest-to-massive AI and HPC workload requirements. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For a 100 TiB deployment, the maximum Parallelstore deployment, throughput scales to ~115 GiB/s, ~3 million read IOPS, ~1 million write IOPS, and a low-latency of ~0.3 ms. This means that Parallelstore is also a good platform for small files and random, distributed access across a large number of clients. For AI use cases, Parallelstore’s performance with small files and metadata operations enables up to 3.9x faster training times and up to 3.7x higher training throughput compared to native ML framework data loaders, as measured by Google Cloud benchmarking.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Programmatically move data in and out of Parallelstore &lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Many AI and HPC workloads store data in &lt;/span&gt;&lt;a href="https://cloud.google.com/storage"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Storage&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for data preparation or archiving. You can use Parallelstore’s integrated import/export API to automate movement of the data you’d like to import to Parallelstore for processing. With the API, you can ingest massive datasets from Cloud Storage into Parallelstore at ~20GB/s for files larger than 32MB, and at ~5,000 files per second for files under 32MB.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud alpha parallelstore instances import-data $INSTANCE_ID\r\n--location=$LOCATION --source-gcs-bucket-uri=gs://$BUCKET_NAME\r\n[--destination-parallelstore-path=&amp;quot;/&amp;quot;] --project= $PROJECT_ID&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1744bad6a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When an AI training job or HPC workload is complete, you can export results programmatically to Cloud Storage for further assessment or longer-term storage. You can also automate data transfers via the API, minimizing manual intervention and streamlining data pipelines.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud alpha parallelstore instances export-data $INSTANCE_ID --location=$LOCATION --destination-gcs-bucket-uri=gs://$BUCKET_NAME\r\n[--source-parallelstore-path=&amp;quot;/&amp;quot;]&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1744bad7f0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Programmatically provision GKE resources through the CSI driver&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It’s easy to efficiently manage high-performance storage for containerized workloads through Parallelstores’ GKE CSI driver. You can dynamically provision and manage Parallelstore file systems as persistent volumes or access existing Parallelstore instances in Kubernetes workloads, directly within your GKE clusters using familiar Kubernetes APIs. This reduces the need to learn and manage a separate storage system, so you can focus on optimizing resources and lowering TCO. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;apiVersion: storage.k8s.io/v1\r\nkind: StorageClass\r\nmetadata:\r\n  name: parallelstore-class\r\nprovisioner: parallelstore.csi.storage.gke.io\r\nvolumeBindingMode: Immediate\r\nreclaimPolicy: Delete\r\nallowedTopologies:\r\n- matchLabelExpressions:\r\n  - key: topology.gke.io/zone\r\n    values:\r\n    - us-central1-a&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1744bad3d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
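&lt;p&gt;As a minimal sketch of how a workload could consume that storage class, the following creates a PersistentVolumeClaim bound to it. The claim name, requested size, and access mode are assumptions for illustration, not values taken from the Parallelstore documentation.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;kubectl apply -f - &amp;lt;&amp;lt;'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: parallelstore-pvc
spec:
  accessModes: ["ReadWriteMany"]     # assumed; Parallelstore is a shared file system
  storageClassName: parallelstore-class
  resources:
    requests:
      storage: 12Ti
EOF&lt;/code&gt;&lt;/pre&gt;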
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the coming months, you’ll be able to preload data from Cloud Storage via the fully managed GKE Volume Populator, which automates the preloading of data from Cloud Storage directly into Parallelstore during the PersistentVolumeClaim provisioning process. This helps ensure your training data is readily available, so you can minimize idle compute-resource time and maximize GPU and TPU utilization.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Programmatically provision Compute Engine resources with the Cluster Toolkit&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It’s easy to deploy Parallelstore instances for Compute Engine with the support of the &lt;/span&gt;&lt;a href="https://cloud.google.com/cluster-toolkit/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cluster Toolkit&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Formerly known as Cloud HPC Toolkit, Cluster Toolkit is open-source software for deploying HPC and AI workloads. Cluster Toolkit provisions compute, network, and storage resources for your cluster/workload following best practices. You can get started with Cluster Toolkit today by incorporating the&lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/modules/file-system/parallelstore" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Parallelstore&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; module into your blueprint with only a four-line change in your blueprint; we also provide &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/ps-slurm.yaml" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;starter blueprints&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for your convenience. In addition to the Cluster Toolkit, there are also &lt;/span&gt;&lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/parallelstore_instance" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Terraform templates&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for deploying Parallelstore, supporting operations and provisioning processes through code and minimizing manual operational overhead. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;resource &amp;quot;google_parallelstore_instance&amp;quot; &amp;quot;instance&amp;quot; { \r\ninstance_id = &amp;quot;instance&amp;quot; \r\nlocation = &amp;quot;us-central1-a&amp;quot; \r\ndescription = &amp;quot;test instance&amp;quot; \r\ncapacity_gib = 12000 \r\nnetwork = google_compute_network.network.name \r\nfile_stripe_level = &amp;quot;FILE_STRIPE_LEVEL_MIN&amp;quot; \r\ndirectory_stripe_level = &amp;quot;DIRECTORY_STRIPE_LEVEL_MIN&amp;quot; \r\nlabels = { \r\ntest = &amp;quot;value&amp;quot; \r\n} \r\nprovider = google-beta \r\ndepends_on = [google_service_networking_connection.default] \r\n} \r\n\r\nresource &amp;quot;google_compute_network&amp;quot; &amp;quot;network&amp;quot; { \r\nname = &amp;quot;network&amp;quot; \r\nauto_create_subnetworks = true \r\nmtu = 8896 \r\nprovider = google-beta \r\n} \r\n\r\n# Create an IP address \r\nresource &amp;quot;google_compute_global_address&amp;quot; &amp;quot;private_ip_alloc&amp;quot; { \r\nname = &amp;quot;address&amp;quot; \r\npurpose = &amp;quot;VPC_PEERING&amp;quot; \r\naddress_type = &amp;quot;INTERNAL&amp;quot; \r\nprefix_length = 24 \r\nnetwork = google_compute_network.network.id \r\nprovider = google-beta \r\n} \r\n\r\n# Create a private connection \r\nresource &amp;quot;google_service_networking_connection&amp;quot; &amp;quot;default&amp;quot; { \r\nnetwork = google_compute_network.network.id \r\nservice = &amp;quot;servicenetworking.googleapis.com&amp;quot;\r\nreserved_peering_ranges = [google_compute_global_address.private_ip_alloc.name] \r\nprovider = google-beta \r\n}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1744bad6d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Real-world impact: Respo.vision sees more with Parallelstore&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Respo.Vision, a leader in sports video analytics, is leveraging Parallelstore to accelerate an upgrade from 4K to 8K videos for their real-time system. By using Parallelstore as the transport layer, Respo.vision helps capture and label granular data markers, delivering actionable insights to coaches, scouts, and fans. With Parallelstore, Respo.vision avoided pricey infrastructure investments to manage surges of high-performance video processing, all while maintaining low compute latency. &lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Our goal was to process 8K video streams at 25 frames per second to deliver richer quality sports analytical data to our customers, and Parallelstore exceeded expectations by effortlessly handling the required volume and delivering an impressive read latency of 0.3 ms. The integration into our system was remarkably smooth and thanks to its distributed nature, Parallelstore has significantly enhanced our system's scalability and resilience.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - Wojtek Rosinski, CTO, Respo.vision &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;HPC and AI usage is growing rapidly. With its combination of innovative architecture, performance, and integration with Cloud Storage, GKE, and Compute Engine, Parallelstore is the storage solution you need to keep the demanding GPU/TPUs and workloads satisfied. To learn more about Parallelstore, check out the &lt;/span&gt;&lt;a href="https://cloud.google.com/parallelstore/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and reach out to your sales team for more information.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 04 Oct 2024 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/storage-data-transfer/parallelstore-high-performance-file-service-for-hpc-and-ai-is-ga/</guid><category>AI &amp; Machine Learning</category><category>HPC</category><category>Storage &amp; Data Transfer</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Parallelstore is now GA, fueling the next generation of AI and HPC workloads</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/storage-data-transfer/parallelstore-high-performance-file-service-for-hpc-and-ai-is-ga/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Barak Epstein</name><title>Sr Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Chinmayee Rathi</name><title>Product Manager</title><department></department><company></company></author></item><item><title>Boosting Google Cloud HPC performance with optimized Intel MPI</title><link>https://cloud.google.com/blog/topics/hpc/how-the-intel-mpi-library-boosts-hpc-performance/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/solutions/hpc?hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;High performance computing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (HPC) is central to fueling innovation across industries. Through simulation, HPC accelerates product design cycles, increases product safety, delivers timely weather predictions, enables training of AI foundation models, and unlocks scientific discoveries across disciplines to name but a few examples. HPC tackles these computationally demanding problems by employing large numbers of computing elements, servers, or virtual machines, in tight orchestration with one another and communicating via the Message Passing Interface (MPI). In this blog, we show how we boosted HPC performance on Google Cloud using Intel® MPI Library. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud offers a wide range of VM families that cater to demanding workloads, including &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-h3-vm-instances-are-optimized-for-hpc/"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;H3 compute optimized VMs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which are ideal for HPC workloads. These VMs feature Google’s &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/titanium-underpins-googles-workload-optimized-infrastructure"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Titanium&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; technology, for advanced network offloads and other functions, and are optimized by Intel software tools to bring together the latest innovations in computing, networking, and storage into one platform. In third-generation VMs such as H3, &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/introducing-c3-machines-with-googles-custom-intel-ipu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;C3&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, C3D or A3, the &lt;/span&gt;&lt;a href="https://www.intel.com/content/www/us/en/products/details/network-io/ipu.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Intel Infrastructure Processing Unit (IPU)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; E2000 offloads the networking from the CPU onto a dedicated device, securely enabling low latency 200G Ethernet. Further, integrated support for Titanium in the Intel MPI library, brings the benefits of network offload to HPC workloads such as molecular dynamics, computational geoscience, weather forecasting, front-end and back-end Electronic Design Automation (EDA), Computer Aided Engineering (CAE), and Computational Fluid Dynamics (CFD). The latest version of the Intel MPI Library is included in the Google Cloud &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/instances/create-hpc-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;HPC VM Image&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;MPI Library optimized for 3rd gen VMs and Titanium &lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Intel MPI Library&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is a multi-fabric message-passing library that implements the MPI API standard. It’s a commercial-grade MPI implementation based on the open-source &lt;/span&gt;&lt;a href="https://www.mpich.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;MPICH project&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and it uses the OpenFabrics Interface (OFI, aka libfabric) to handle fabric-specific communication details. Various libfabric providers are available, each optimized for a different set of fabrics and protocols. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Version 2021.11 of the Intel MPI Library specifically improves the PSM3 provider and provides tunings for the PSM3 and OFI/TCP providers for the Google Cloud environment, including the Intel IPU E2000. The Intel MPI Library 2021.11 also takes advantage of the high core counts and advanced features available on 4th Generation Intel Xeon Scalable Processors and supports newer Linux OS distributions and newer versions of applications and libraries. Taken together, these improvements unlock additional performance and application features on 3rd generation VMs with Titanium.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Boosting HPC application performance&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Applications like Siemens &lt;/span&gt;&lt;a href="https://plm.sw.siemens.com/en-US/simcenter/fluids-thermal-simulation/star-ccm/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Simcenter&lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;TM&lt;/span&gt;&lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; STAR-CCM+&lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;TM&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; software shorten the time-to-solution through parallel computing. For example, if doubling the computational resources solves the same problem in half the time, the parallel scaling is 100% efficient, and the speedup is 2x compared to the run with half the resources. In practice, a speedup of 2x per doubling may not be achieved for a variety of reasons, such as not exposing enough parallelism, or overhead from inter-node communication. An improved communication library directly improves the latter problem.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To demonstrate the performance improvements of the new Intel MPI Library, Google and Intel tested Simcenter STAR-CCM+ with several standard benchmarks on H3 instances. The figure shows five standard benchmarks up to 32 VMs (2,816 cores)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; As you can see, good speedups are achieved throughout the tested scenarios; only the smallest benchmark (LeMans_Poly_17M) stops scaling beyond 16 nodes due to its small problem size (which is not addressed by communication library performance). In some benchmarks (LeMans_100M_Coupled and AeroSuvSteadyCoupled106M), superlinear scaling can even be observed for some VM counts, likely due to the increased available cache.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/SimcenterTM_STAR-CCMTM_Wall_Clock_Speedup_.max-1000x1000.png"
        
          alt="Simcenterᵀᴹ STAR-CCM+ᵀᴹ Wall Clock Speedup Ratios_Intel MPI 2021.11+PSM3 vs Intel MPI 2021.7"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To show the improvements of Intel MPI 2021.11 over Intel MPI 2021.7, we used the ratio of runtimes between the two for each run. This speedup ratio is computed by dividing the parallel runtime of the older version by the parallel runtime of the newer version; we show those speedup ratios in the table below.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The table shows that for nearly all benchmarks and node counts, the optimized Intel MPI 2021.11 version delivers higher parallel scalability and absolute performance. This gain in efficiency — and thus shorter time-to-solution and lower cost — is already present at just two VMs (up to 1.06x improvement) and grows dramatically at larger VM counts (between 2.42x and 5.27x at 32 VMs). For the smallest benchmark (LeMans_Poly_17M) at 16 VMs, there’s an impressive improvement of 11.53x, which indicates that, unlike the older version, the newer MPI version allows good scaling up to 16 VMs. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These results demonstrate that the optimized Intel MPI Library increases the scalability of Simcenter STAR-CCM+ on Google Cloud, allowing for faster time-to-solution for end users and more efficient use of their cloud resources.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_x56vucJ.max-1000x1000.jpg"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Benchmarks were run using Intel MPI 2021.7 and its TCP provider and Intel MPI 2021.11 and the PSM3 libfabric provider. Simcenter STAR-CCM+ version 2306 (18.06.006) was tested&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;on &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/new-h3-vm-instances-are-optimized-for-hpc/"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud’s H3 instances&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, with 88 MPI processes per node and 200 Gbps networking, running CentOS Linux release 7.9.2009. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;What customers and partners are saying&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Intel is proud to collaborate with Google to deliver leadership software and hardware for the Google Cloud Platform and H3 VMs. Together, our work gives customers new levels of performance and efficiency for computational fluid dynamics and HPC workloads.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; -Sanjiv Shah, Vice President, Intel, Software and Advanced Technology Group, General Manager, Developer Software Engineering&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Trademarks&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A list of relevant Siemens trademarks can be found &lt;/span&gt;&lt;a href="https://www.sw.siemens.com/en-US/trademarks/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Other trademarks belong to their respective owners.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/topics/hpc/enhancements-to-cloud-hpc-toolkit-include-new-blueprint-catalog/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;Cloud HPC made easy: A Blueprint Catalog for Google&amp;#x27;s Cloud HPC Toolkit&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;Solutions in Cloud HPC Toolkit’s new Blueprint Catalog make it easy to get started with HPC on Google Cloud.&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Tue, 13 Aug 2024 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/hpc/how-the-intel-mpi-library-boosts-hpc-performance/</guid><category>Compute</category><category>HPC</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Boosting Google Cloud HPC performance with optimized Intel MPI</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/hpc/how-the-intel-mpi-library-boosts-hpc-performance/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mansoor Alicherry</name><title>HPC Software Engineer, Cloud ML Compute Services, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Todd Rimmer</name><title>Director of Software Architecture, Intel NEX Cloud Connectivity Group</title><department></department><company></company></author></item><item><title>Build large-scale AI/ML and HPC clusters with Cluster Toolkit (formerly HPC Toolkit)</title><link>https://cloud.google.com/blog/topics/hpc/build-aiml-hpc-clusters-with-cluster-toolkit/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;Update&lt;/strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;: Starting the week of September 16, 2024, Google Cloud customers with eligible support plans can access assistance for the Cluster Toolkit through the Google Cloud console. Cluster Toolkit, formerly known as Cloud HPC Toolkit, is open-source software offered by Google Cloud that simplifies the process for you to deploy HPC, AI and ML workloads on Google Cloud. The Cloud Support team will handle filed cases, ensuring that you receive timely and effective support for your Cluster Toolkit implementations. Select 'Cluster Toolkit' as the sub-category under 'Compute Engine' when creating a support ticket in the console to get in touch about any Cluster Toolkit issues.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Cloud HPC Toolkit, now rebranded as &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Cluster Toolkit&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, simplifies the creation and management of high performance computing environments on Google Cloud. Initially focused on scientific and technical computing workloads, it has expanded to encompass AI/ML applications, reflecting its widespread adoption across various domains.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Cluster Toolkit empowers users to focus on their workloads by streamlining cluster setup and deployment, leveraging Google Cloud's best practices, and offering flexibility for diverse computing tasks. Key benefits include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Easy deployment and management of clusters&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The Toolkit simplifies the process of setting up and maintaining clusters, allowing users to focus on their workloads rather than infrastructure management. The Toolkit supports multiple schedulers including Slurm, GKE, and Batch.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Quickstart options for HPC and AI/ML workloads:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The Toolkit has a library of pre-built blueprints and modules that let users begin running their workloads quickly, accelerating time-to-value. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Integration of Google Cloud best practices&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The aforementioned blueprints and modules incorporate Google Cloud's recommended configurations, ensuring that clusters are set up for optimal performance and efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Regular updates and new features&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The Toolkit is actively maintained and updated with new features and improvements, providing users with ongoing support and enhancements.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Open-source accessibility&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The Toolkit is open-source, allowing users to customize and extend its capabilities to meet their specific needs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;What's new in Cluster Toolkit&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In addition to a new name, Cluster Toolkit has&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; several new features for HPC and AI/ML workloads:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/examples/machine-learning/a3-megagpu-8g" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A3 Mega Blueprint&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: This blueprint makes it easy to deploy a cluster of A3 Mega VMs ready for training large language models (LLMs) and other AI/ML workloads. Earlier in the year, we also launched the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/hpc/cloud-hpc-toolkit-blueprint-deploys-nemo-framework-on-a3-vms?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A3 Blueprint&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/compute/docs/instances/create-hpc-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;HPC VM Image&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: This VM Image is pre-installed with popular HPC tools and libraries, ensuring you can begin running your HPC workloads quickly with assured performance. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/hpc/ga-rocky-linux-8-and-centos-7-versions-of-hpc-vm-image?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Rocky 8 version of the HPC VM Image is now GA&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Note that we have released the final CentOS 7 version of the &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/instances/create-hpc-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;HPC VM Image&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. CentOS &lt;/span&gt;&lt;a href="https://www.redhat.com/en/topics/linux/centos-linux-eol" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;reached end-of-life on June 30, 2024&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, meaning that it will no longer receive security updates. Going forward, we strongly recommend moving to Rocky 8 and will be releasing regular Rocky 8 versions of the HPC VM Image. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;We are releasing the ability to disable automatic updates in the HPC VM Image. Automatic updates can disrupt the performance of HPC applications, so we’re giving you the option to &lt;/span&gt;&lt;a href="https://cloud.google.com/compute/docs/instances/create-hpc-vm#disable_automatic_updates"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;turn them off via metadata&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/blog/topics/hpc/slurm-gcp-v6-is-now-ga?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Slurm-gcp v6&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: The latest version of the Slurm-gcp solution, which provides a seamless experience for running Slurm workloads on Google Cloud, is now GA. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
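&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a minimal sketch of the HPC VM Image options above, the following creates an instance from the Rocky Linux 8 image family with automatic updates turned off. The instance name, zone, and machine type are placeholders, and the metadata key is our assumption of the documented switch; confirm it against the linked documentation.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create a VM from the HPC VM Image (Rocky Linux 8) with automatic updates disabled.
# The metadata key below is an assumption; verify it in the HPC VM Image docs linked above.
gcloud compute instances create my-hpc-node \
    --zone=us-central1-a \
    --image-family=hpc-rocky-linux-8 \
    --image-project=cloud-hpc-image-public \
    --machine-type=c2-standard-60 \
    --metadata=google_disable_automatic_updates=TRUE&lt;/code&gt;&lt;/pre&gt;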
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Guidelines for existing Toolkit customers&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We've &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/cluster-toolkit/discussions/2844" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;renamed&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; our &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/cluster-toolkit" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GitHub repo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to “Cluster Toolkit” and renamed some commands (e.g., ghpc is now gcluster). Existing Git operations and commands will still work, but we strongly recommend updating local clones and command names to avoid confusion.&lt;/span&gt;&lt;/p&gt;
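&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For an existing clone, the update is mostly a matter of pointing the remote at the renamed repository and switching to the renamed binary; the directory name below is a placeholder, and the make-based build step is an assumption based on the Toolkit&amp;#x27;s standard workflow.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Point an existing clone at the renamed repository and rebuild the renamed binary.
cd hpc-toolkit   # placeholder path to your existing clone
git remote set-url origin https://github.com/GoogleCloudPlatform/cluster-toolkit.git
git pull
make             # builds the gcluster binary (formerly ghpc)
./gcluster --version&lt;/code&gt;&lt;/pre&gt;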
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;How to get started&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get started with the Cluster Toolkit, select one of our &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/examples" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;easy-to-use HPC and AI/ML blueprints&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, available through our &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/cluster-toolkit" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GitHub repo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and use it to set up a cluster. We also offer a variety of resources to help you get started, including &lt;/span&gt;&lt;a href="https://cloud.google.com/hpc-toolkit/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://cloud.google.com/cluster-toolkit/docs/quickstarts/slurm-cluster"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;quickstarts&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://www.youtube.com/playlist?list=PLIivdWyY5sqK8M7k7fZ_C8ZDaDDlJm-8q" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;videos&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 02 Aug 2024 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/hpc/build-aiml-hpc-clusters-with-cluster-toolkit/</guid><category>AI Hypercomputer</category><category>HPC</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Build large-scale AI/ML and HPC clusters with Cluster Toolkit (formerly HPC Toolkit)</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/hpc/build-aiml-hpc-clusters-with-cluster-toolkit/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Annie Ma-Weaver</name><title>Group Product Manager, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Shivani Matta</name><title>Software Engineering Manager, Google Cloud</title><department></department><company></company></author></item><item><title>Enhancing the HPC experience with Slurm-GCP v6 and TPU support</title><link>https://cloud.google.com/blog/topics/hpc/slurm-gcp-v6-is-now-ga/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;On Google Cloud, our HPC-optimized infrastructure, including the &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/ai-hypercomputer?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, can be deployed in multiple ways according to user preferences. 
For customers that want a &lt;/span&gt;&lt;a href="https://slurm.schedmd.com/overview.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Slurm&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;-based environment, we recommend using the &lt;/span&gt;&lt;a href="https://cloud.google.com/hpc-toolkit/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud HPC Toolkit&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a Google product that helps simplify the creation and management of HPC systems for AI/ML and traditional HPC workloads. The Toolkit features our &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/slurm-gcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Slurm-GCP offering&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a set of Slurm scripts that helps automate the installation, deployment, and certain operational aspects of Slurm on Google Cloud.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today we’re excited to announce the general availability of &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/examples#major-changes-in-from-slurm-gcp-v5-to-v6" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Slurm-GCP v6&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, the latest and recommended version, which will run on Slurm 23.11. This release is the result of our ongoing multi-year collaboration with the engineering experts at &lt;/span&gt;&lt;a href="https://www.schedmd.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;SchedMD&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Slurm-GCP v6 provides the following benefits, compared with v5:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Faster deployments &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A simple cluster, consisting of Slurm infrastructure with a pre-existing VPC and without deploying any file systems in parallel or using autoscaling clusters, now deploys 3x faster than the previous version.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Robust reconfiguration &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Reconfiguration is the Slurm-GCP mechanism for making changes to a running cluster. This process is now managed by a service that runs on each instance, providing a more consistent experience, and reconfiguration is now enabled by default, making it easier to modify a running cluster.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;More deployments in a single project &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We have lifted the restriction on the number of clusters that can be deployed in a single project.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Fewer dependencies in the deployment environment &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Reconfiguration and compute node cleanup features are now enabled by default and no longer require users to set them up, making it easier to manage Slurm clusters. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Full support for TPU v3 and v4 &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;TPU v3 and v4 are now fully supported, allowing TPU and GPU partitions to be configured alongside each other for maximum flexibility in choosing your preferred accelerators.&lt;/span&gt;&lt;/p&gt;
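&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For example, once a cluster defines both kinds of partitions, jobs can target whichever accelerator fits the workload; the partition names and job scripts below are placeholders that depend on your blueprint:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Submit to a GPU partition and a TPU partition of the same Slurm cluster (placeholder names).
sbatch --partition=gpu --gres=gpu:1 train_model.sh
sbatch --partition=tpu run_maxtext.sh&lt;/code&gt;&lt;/pre&gt;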
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Start using v6 today by navigating to the &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/examples" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Toolkit blueprint library&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. These include blueprints like &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/examples#hpc-slurm6-tpu-maxtextyaml--" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Running the MaxText ML Benchmark on TPUs with Slurm&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/examples#hpc-slurm6-apptaineryaml--" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Running Apptainer Containers with Slurm&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Blueprints using a prior version of Slurm-gcp will contain “v5” in the name and be supported through November 2024. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 10 Jun 2024 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/hpc/slurm-gcp-v6-is-now-ga/</guid><category>Compute</category><category>HPC</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Enhancing the HPC experience with Slurm-GCP v6 and TPU support</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/hpc/slurm-gcp-v6-is-now-ga/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Annie Ma-Weaver</name><title>Group Product Manager, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nick Stroud</name><title>Tech Lead, Google Cloud HPC</title><department></department><company></company></author></item><item><title>Performing large-scale computation-driven drug discovery on Google Cloud</title><link>https://cloud.google.com/blog/topics/hpc/atommap-builds-elastic-supercomputer-on-google-cloud/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;Editor’s note:&lt;/strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; Today we hear from Atommap, a computational drug discovery company that has built an elastic supercomputing cluster on the Google Cloud to empower large-scale, computation-driven drug discovery. Read on to learn more.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Bringing a new medicine to patients typically happens in four stages: (1) &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;target identification&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that selects the protein target associated with the disease, (2) &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;molecular discovery&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that finds the new molecule modulating the function of the target, (3) &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;clinical trial&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that tests the candidate drug molecule’s safety and efficacy in patients, and (4) &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;commercialization&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that distributes the drug to patients in need. The molecular discovery stage, in which novel drug molecules are invented, involves solving two problems: first, we need to establish an effective mechanism to modulate the target function that maximizes the therapeutic efficacy and minimizes the adverse effect; second, we need to design, select, and make the right drug molecule that faithfully implements the mechanism, is bioavailable, and has acceptable toxicity.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;What makes molecular discovery hard?&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A protein is in constant thermal motion, which changes its shape (conformation) and binding partners (other biomolecules), thus affecting its functions. Structural detail of a protein’s conformational dynamics time and again suggests novel mechanisms of functional modulation. But such information often eludes experimental determination, despite tremendous progress in experimental techniques in recent years.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The chemical “universe” of all possible distinct small molecules — estimated to number 10&lt;/span&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;60&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;a href="https://doi.org/10.1039/C0MD00020E" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Reymond et al. 2010&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) — is vast. Chemists have made probably ten billion so far, so we still have about 10&lt;/span&gt;&lt;sup&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: super;"&gt;60&lt;/span&gt;&lt;/span&gt;&lt;/sup&gt;&lt;span style="vertical-align: baseline;"&gt; to go.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Therein lie the two major challenges of molecular discovery, and its endless opportunity: chances are that we have not yet considered all the mechanisms of action or found the best molecules, so we can always invent a better drug. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Atommap’s computation-driven approach to molecular discovery&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Harnessing the power of high-performance computing, Atommap’s molecular engineering platform enables the discovery of novel drug molecules against previously intractable targets through new mechanisms, making the process faster, cheaper, and more likely to succeed. In past projects, Atommap’s platform has dramatically reduced both the time (by more than half) and cost (by 80%) of molecular discovery. For example, it played a pivotal role in advancing a molecule against a challenging therapeutic target to the clinical trial in 17 months (&lt;/span&gt;&lt;a href="https://clinicaltrials.gov/study/NCT04609579" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NCT04609579&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) and it substantially accelerated the discovery of novel molecules that degrade high-valued oncological targets (&lt;/span&gt;&lt;a href="https://doi.org/10.1021/acs.jcim.3c00603" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Mostofian et al. 2023&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Atommap achieves this by:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Advanced molecular dynamics (MD) simulations&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that unveil complex conformational dynamics of the protein target and its interactions with the drug molecules and other biomolecules. They establish the dynamics-function relationship for the target protein, which is instrumental to choosing the best mechanism of action for the drug molecules. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Generative models&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that enumerate novel molecules. Beginning with a three-dimensional blueprint of a drug molecule's interaction with its target, our models computationally generate thousands to hundreds of thousands of new virtual molecules, which are designed to form the desired interactions and to satisfy both synthetic feasibility and favorable drug-like properties. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Physics-based, ML-enhanced predictive models&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that accurately predict molecular potencies and other properties. Every molecular design is evaluated computationally for its target-binding affinity, its effects on the target, and its drug-likeness. This allows us to explore many times more molecules than can be synthesized and tested in the wet lab, and to perform multiple rounds of designs while waiting for often-lengthy experimental evaluation, leading to compressed timelines and increased probability of success.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Computation as a Service and Molecular Discovery as a Service&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To truly, broadly impact drug discovery, Atommap needs to augment its deep expertise in molecular discovery by partnering with external expertise in the other stages — target identification, clinical trials, and commercialization. We form partnerships in two ways: Computation as a Service (CaaS) and Molecular Discovery as a Service (MDaaS, pronounced &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Midas&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;), which make it easy and economically attractive for every drug discovery organization to access our computation-driven molecular engineering platform. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of selling software subscriptions, Atommap’s pay-as-you-go CaaS model lets any discovery project first try our computational tools at a small and affordable scale, without committing too much budget. Not every project is amenable to computational solutions, but most are. This approach allows every drug discovery project to introduce the appropriate computations cheaply and quickly, with demonstrable impact, and then deploy them at scale to amplify their benefits.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For drug hunters who would like to convert their biological and clinical hypotheses into drug candidates, our MDaaS partnership allows them to quickly identify potent molecules with novel intellectual property for clinical trials. Atommap executes the molecular discovery project from the first molecule (initial hits) to the last molecule (development candidates), freeing our partners to focus on biological and clinical validation. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;The need for elastic computing&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_apPqgNo.max-1000x1000.png"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="eug34"&gt;Figure 1. Diverse computational tasks in Atommap’s molecular engineering platform require elastic computing resources.&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For Atommap, the number of partnership projects and the scale of computation in each project fluctuate over time. In building structural models to enable structure-based drug design, we run hundreds of long-timescale MD simulations on high-performance GPUs to explore the conformational ensembles of proteins and complexes between proteins and small molecules, each of which can last hours to days. Our NetBFE platform for predicting the binding affinities invokes thousands, sometimes tens of thousands, of MD simulations, although each one is relatively short and completes in a few hours. Atommap’s machine learning (ML) models take days to weeks to train on high-memory GPUs, but once trained and deployed in a project, run in seconds to minutes. Balancing the different computational loads associated with different applications poses a challenge to the computing infrastructure. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To meet this elastic demand, we chose to supplement our internal computer clusters with &lt;/span&gt;&lt;a href="https://cloud.google.com/"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;How to build an elastic supercomputer on Google Cloud&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It took us several steps to move our computing platform from our internal cluster to a hybrid environment that includes Google Cloud.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Slurm&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Many workflows in our platform depended on &lt;/span&gt;&lt;a href="https://slurm.schedmd.com/overview.html" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Slurm&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for managing the computing jobs. To migrate to Google Cloud, we built a cloud-based Slurm cluster using &lt;/span&gt;&lt;a href="https://cloud.google.com/hpc-toolkit/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud HPC Toolkit&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, an open-source utility developed by Google. Cloud HPC Toolkit is a command line tool that makes it easy to stand up connected and secure cloud HPC systems. With this Slurm cluster up and running in minutes, we quickly put it to use with our Slurm-native tooling to set up computing jobs for our discovery projects.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud HPC Toolkit naturally fits our DevOps function into best practices. We defined our compute clusters as “blueprints” within YAML files that allow us to simply and transparently configure specific details of individual Google Cloud products. The Toolkit transpiles blueprints into input scripts that are executed with Hashicorp’s &lt;/span&gt;&lt;a href="https://www.terraform.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Terraform&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, an industry standard tool for defining “infrastructure-as-code” such that it can be committed, reviewed, and version-controlled. Within the blueprint we also defined our compute machine image through a startup script that’s compatible with Hashicorp’s &lt;/span&gt;&lt;a href="https://www.packer.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Packer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This allowed us to easily “bake in” the software our jobs typically need, such as conda, Docker, and Docker container images that provide dependencies such as &lt;/span&gt;&lt;a href="https://ambermd.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AMBER&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://openmm.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OpenMM&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://pytorch.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;PyTorch&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
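&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a rough sketch of that flow (the blueprint file name and project ID are placeholders, and the exact ghpc commands may differ by Toolkit version), a blueprint is expanded into a Terraform deployment and then deployed:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Expand a blueprint into a Terraform deployment folder, then deploy it (placeholder names).
./ghpc create ./my-slurm-cluster.yaml --vars project_id=my-gcp-project
./ghpc deploy my-slurm-cluster&lt;/code&gt;&lt;/pre&gt;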
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The deployed Slurm cloud system is as accessible and user-friendly as any Slurm system we have used before. The compute nodes are not deployed until requested and are spun down when finished, thus we only pay for what we use; the only persistent nodes are the head and controller nodes that we log into and deploy from.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Batch&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Compared to Slurm, the cloud-native Google &lt;/span&gt;&lt;a href="https://cloud.google.com/batch"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Batch&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; gives us even greater flexibility in accessing the computing resources. Batch is a managed cloud job-scheduling service, meaning it can be used to schedule cloud resources for long-running scientific computing jobs. Virtual machines that Batch spins up can easily mount either NFS stores or &lt;/span&gt;&lt;a href="https://cloud.google.com/storage/docs/json_api/v1/buckets"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Storage buckets&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, the latter of which are particularly suitable for holding our multi-gigabyte MD trajectories and thus useful as output directories for our long-running simulations.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Running our workflows on Google Cloud through Batch involves two steps: 1) copying the input files to Google Cloud Storage, and 2) submitting the Batch job.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;$ gcloud storage cp -R ./local_input_dir gs://my-gcs-bucket/work_dir\r\n$ gcloud batch jobs submit example-job --config=./job_cfg.json --location=us-central1&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1757ee0f40&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
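&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The actual job_cfg.json is not reproduced here; the following is only a rough sketch of what a minimal Batch job config of this shape might look like, with a placeholder machine type, bucket path, and script, using field names from the public Batch API.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;pre&gt;&lt;code&gt;# Rough sketch of a minimal job_cfg.json for Google Cloud Batch (placeholder values).
cat &gt; job_cfg.json &lt;&lt;'EOF'
{
  "taskGroups": [{
    "taskCount": 1,
    "taskSpec": {
      "runnables": [{ "script": { "text": "bash /mnt/share/run_simulation.sh" } }],
      "volumes": [{
        "gcs": { "remotePath": "my-gcs-bucket/work_dir" },
        "mountPath": "/mnt/share"
      }]
    }
  }],
  "allocationPolicy": {
    "instances": [{ "policy": { "machineType": "c2-standard-16" } }]
  },
  "logsPolicy": { "destination": "CLOUD_LOGGING" }
}
EOF&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;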
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;SURF&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A common pattern has emerged in most of our computational workflows. First, each job has a set of complex input files, including the sequence and structures of the target protein, a list of small molecules and their valence and three-dimensional structures, and the simulation and model parameters. Second, most computing jobs take hours to days to finish even on the highest-performance machines. Third, the computing jobs produce output datasets of substantial volume that are subject to a variety of analyses.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Accordingly, we have recently developed a new computing infrastructure, SURF (submit, upload, run, fetch), which seamlessly integrates our internal cluster and Google Cloud through one simple interface and automatically brings the data to where it is needed by computation or the computation to where the data resides. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;$ am-surf submit my_job/ --num-cpus 16 --num-gpus 8 --where [gcp|internal]&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f1757ee0820&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;SURF submits jobs to Google Cloud Batch using Google’s Python API.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We now have an elastic supercomputer on the cloud that gives us massive computing power when we need it. It empowers us to explore the vast chemical space at an unprecedented scale and to invent molecules that better human health and life.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Andrew Sabol supported us from the very beginning of Atommap, even before we knew whether we could afford the computing bills. Without the guidance and technical support of Vincent Beltrani, Mike Sabol, and other Google colleagues, we could not have rebuilt our computing platform on Google Cloud in such a short time. Our discovery partners put their trust in our young company and our burgeoning platform; their collaborations helped us validate our platform in real discovery projects and substantially improve its throughput, robustness, and predictive accuracy. &lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 13 May 2024 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/hpc/atommap-builds-elastic-supercomputer-on-google-cloud/</guid><category>Customers</category><category>HPC</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Performing large-scale computation-driven drug discovery on Google Cloud</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/hpc/atommap-builds-elastic-supercomputer-on-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Huafeng Xu</name><title>CEO, Atommap</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Christopher Ryan</name><title>Director of Machine Learning and Data Sciences, Atommap</title><department></department><company></company></author></item></channel></rss>