Networking

C4N, now GA: Delivering cloud’s highest per vCPU network and block storage I/O for x86 workloads

Wed, 08 Jul 2026 20:00:00 +0000

As organizations scale modern workloads — from high-throughput databases and network/security appliances to real-time analytics and AI/ML inference — network and block storage performance can quickly become a bottleneck. Standard virtual machines often struggle to balance compute efficiency with the high-volume data-transfer demands of these applications.

At Google Cloud Next ‘26, we announced C4N in preview, our first network- and block-storage-optimized Google Compute Engine instance that’s purpose-built to eliminate I/O bottlenecks for demanding enterprise applications, and today, it is generally available. Built on Google's custom-designed Titanium offload architecture, C4N instances offload network and storage tasks to dedicated hardware to unlock incredible performance and compute efficiency. C4N offers up to 400 Gbps of network bandwidth and a market-leading 95 million packets per second (MPPS) — nearly 33% higher network bandwidth per vCPU and 224% faster packet processing performance than comparable Intel-based offerings at other hyperscalers. This performance makes C4N a great fit for network-intensive applications such as virtual appliances (e.g., next-gen firewalls, virtual routers, load balancers, DDoS mitigation), large-scale data analytics, telco applications (5G UPF), distributed compute and CPU-based AI/ML workloads.

Paired with Hyperdisk Extreme, our high-performance block storage, C4N also delivers Compute Engine’s highest block storage performance, scaling up to 25 GiB/s of storage bandwidth and 1M IOPS. That is nearly 33% higher storage bandwidth and 39% more IOPS per vCPU than comparable Intel-based offerings, which makes C4N a strong choice for large-scale databases, high-performance file systems, in-memory databases, and other workloads that benefit from high block storage performance.

Engineered specifically to deliver predictable, high-throughput I/O performance for networking, packets-per-second-bound and storage-optimized applications, C4N allows customers to scale network, storage, and compute resources more precisely to meet specific workload requirements, delivering significant TCO benefits by eliminating the need to over-provision resources just to meet I/O demands.

C4N is powered by 5th Gen Intel® Xeon® Scalable processors (code-named Emerald Rapids).

“Google Cloud’s introduction of C4N highlights how infrastructure innovation and a strong silicon foundation can help customers address increasingly data-intensive workloads. With Intel Xeon and Custom Infrastructure Processing Unit (IPU), C4N delivers the performance and efficiency needed for demanding network optimized environments.” – Srini Krishna, Intel Fellow, Data Center products, Intel

What’s new: Scaling massive data layers with C4N

Our network-optimized C4N instances are designed to deliver predictable, high-performance I/O at scale. By providing consistent bandwidth, packet-processing performance (PPS), and IOPS scaling across all VM shapes and sizes, C4N helps ensure your most demanding data workloads run reliably. To achieve this, we have built deep resiliency into every layer of our infrastructure — from the host and fabric layers to redundant top-of-rack (ToR) switches — delivering continuous performance for your applications.

Compared to general-purpose C4 VMs, the network-optimized C4N delivers significant performance gains across both network and block storage vectors.

Next-generation network performance

Superior VM-to-VM network bandwidth: Achieves up to 400 Gbps of VM-to-VM network bandwidth (an almost 4x increase in BW-per-vCPU over standard C4) and supports up to 50 Gbps single-flow bandwidth between C4N instances routed within the same VPC network. This provides non-blocking data delivery for high-throughput single-stream and multi-stream applications.
Enhanced VM-to-internet performance: Benefits from an 8x increase in internet egress network bandwidth, reaching up to 200 Gbps. It also features a nearly 32x increase in internet egress packet processing performance, scaling up to 48 MPPS.
Optimized I/O for smaller shapes: Keeps your cloud bill lean by delivering up to 25–50 Gbps of network bandwidth specifically for 2–16 vCPU shapes, great for accelerating I/O-bound tasks without needing to over-provision compute. Furthermore, these smaller shapes introduce predictable, steady-state baseline bandwidth limits to provide consistent performance at a lower cost.
Enhanced out-of-the-box networking: gVNIC interfaces on C4N now start with more Tx/Rx queues by default, scaling with vCPUs up to a maximum of 64 (in comparison to 16 queues on C4/C4D).
Shorter Google Cloud Storage transfer times: C4N VMs now offer up to a 2x increase in bandwidth to retrieve and store large volumes of data from Cloud Storage, boosting performance for analytics, AI/ML, and backup workloads.

Better yet, this performance is available out of the box, with no add-ons. Designed for high performance from the get-go, C4N offers maximum performance without needing to purchase or configure premium add-ons like Tier_1 networking.

Dynamic storage performance with Hyperdisk

The C4N instance family, when combined with Hyperdisk, can help dynamically tune storage performance, latency, and throughput independently of your compute instance sizing to deliver high block storage performance for your applications. C4N supports the complete Hyperdisk portfolio, including Hyperdisk Balanced, Balanced High Availability, Extreme, Throughput, and ML block storage options.

Hyperdisk Extreme: C4N with Hyperdisk Extreme provides low-latency, high-speed data access for modern databases and enterprise AI applications, with up to 25 GiB/s of block storage throughput and nearly 1M IOPS, a 2x increase in storage performance over C4. Also, exclusive to network optimized machine series such as C4N, we now offer Hyperdisk Extreme across all machine sizes — even down to the smallest 2 vCPU sizes.
Hyperdisk Balanced: Delivering the highest throughput and IOPS for general-purpose block storage in the Compute Engine portfolio, Hyperdisk Balanced on C4N scales up to 20 GiB/s of block storage throughput and nearly 640K IOPS. This makes it a highly cost-effective option for running storage-intensive applications at scale.

Together, C4N’s network and storage optimizations combine for tremendous impact in real-world applications:

Web serving: Up to 1.5x additional Nginx requests per second compared to C4 for typical web request sizes (100–300Kb), significantly boosting capacity for network-bound web applications
Databases: Up to 45% better queries per second (QPS) for MySQL when data resides primarily on disk than equivalent C4 VMs

What customers are saying

Industry leaders are already proving that workload-optimized infrastructure is the engine for transformation. Here is how our customers are leveraging the network-optimized power of C4N:

“5G Core workloads are inherently network-heavy, demanding high-throughput packet processing and deterministic latency that standard public cloud instances often struggle to maintain at scale. By leveraging the Google Cloud C4N compute family, we’ve found the ideal engine for Ericsson On-Demand. The C4N’s architectural focus on network-optimized compute allows our 5G Core-as-a-Service to reach unprecedented throughput levels — like our recent 1 Tbps milestone — while maintaining the carrier-grade reliability our customers expect. It’s no longer just about cloud-native; with C4N, we are delivering network-native performance in a public cloud environment.” - Eric Parsons, VP, Head of Ericsson On-Demand, Ericsson

“Teradata's Autonomous Knowledge Platform unifies production-grade AI, analytics, and data into a single integrated system — providing the context, governance, and performance backbone autonomous AI demands at scale. Customers rely on Teradata to run mission-critical, highly I/O-intensive workloads where performance and cost control directly determine value.

Google Cloud C4N instances are well suited for these demanding workloads, delivering strong price-performance and supporting more efficient, optimized deployments. By leveraging C4N on Google Cloud, Teradata Cloud can help customers accelerate from insight to action — scaling enterprise intelligence with confidence and driving greater impact from their data and AI investments” - Kevin Dougherty, Senior Director of Product Management, Core Platform, Teradata

“With the next-generation network and storage bandwidth of C4N VMs, Google Cloud NetApp Volumes will unlock new levels of performance to support our customers’ most demanding AI workloads. By collaborating to extend Google Cloud NetApp Volumes support for the C4N VM family, Google and NetApp are deepening our partnership to address real customer challenges. Together, we’re delivering data-in-place AI and analytics solutions that simplify architectures, maximize performance, and turn data into impact.” - Pravjit Tiwana, Senior Vice President and General Manager of Cloud Storage and Services, NetApp

"Most Compute Engine instances ship with a single high-speed network interface. The new C4N doubles the bandwidth potential with two 200 GbE interfaces. That architectural shift is significant. It means we can dedicate both networks entirely to storage traffic, doubling the available bandwidth for data-intensive workloads, and achieving 2x storage performance over the previous generation. The C4N was announced just weeks ago and is already active in Sycomp's test environment, ensuring our customers can evaluate the latest GCP capabilities without delay. Google Cloud’s published maximum hyperdisk balanced performance for the C4N is 20 GiB/s. In our tests, with three storage servers Sycomp achieved 58.5 GiB/s on read and 58.6 GiB/s on write, with ten C4N storage servers we achieved 195 GiB/s read and write — 97% of the theoretical ceiling with zero platform-specific tuning. That's a strong starting point, and there's measurable room to close the remaining gap through configuration work we can finetune. The C4N isn't just faster — it changes the price-performance equation for storage workloads on Google Cloud." - Scott Fadden, Senior HPC Solutions Architect, Sycomp

“At ClipperDB Technologies, our mission is to drive down the cost and drive up the performance of large-scale Spark analytics. Google Cloud’s C4N instances are the perfect compute engine for our fully native architecture. C4N’s substantial increase in network bandwidth per vCPU combined with large memory configurations and 5th Generation Intel Xeon processors align with ClipperDB’s precise parallel cloud-store prefetching and caching, concurrent dataflow native batch pipelines, streaming no-copy exchange, and cloud store checkpoint fault tolerance to radically accelerate and cost reduce Spark workloads with disaggregated Cloud Storage datalakes.

The results speak for themselves: across industry-standard TPC-DS benchmarks, ClipperDB+C4N delivered over 3x lower cost per query and up to 11x faster analytics, all while maintaining 100% Spark compatibility. We can’t wait to see customers dramatically improve their Spark workload price-performance with C4N coupled with Clipper DB Accelerator." - John Busch, CEO, ClipperDB Technologies

A deeper look at C4N shapes and specs

C4N instances are available in nine different sizes ranging from 2-192 vCPUs and up to 1.5 TB of DDR5 memory, offering predefined shapes in high-cpu, standard, and high-mem configurations.

For applications that benefit from caching and high-speed, low-latency local storage, C4N VM instances are equipped with up to 12 TiB of the latest Titanium SSDs (coming soon; sign up to request C4N Local SSD preview access). We are also introducing C4N bare metal shapes for workloads that require direct access to the machine's resources (e.g., hypervisors, container platforms), that cannot meet their performance requirements with nested virtualization, or that have special performance monitoring or licensing needs. Coming soon, these native bare metal shapes will offer the same network and storage I/O performance as their virtual machine counterparts. Google Cloud customers can use C4N instances with Compute Engine and Google Kubernetes Engine (GKE), with support for other services coming soon.

Name	vCPUs	Memory (GB)	Local Storage (GiB)	VM-VM Network Bandwidth (Gbps)	VM-Internet Network Bandwidth (Gbps)	Hyperdisk Extreme Bandwidth (MiB/s)	Hyperdisk Extreme IOPS
C4n-highcpu	2 - 192	4 - 384	N/A	25 - 400	7 - 200	1,000 - 25,000	80,000 - 1M
C4n-standard	2 - 192	7 - 720	N/A	25 - 400	7 - 200	1,000 - 25,000	80,000 - 1M
C4n-standard-lssd	4 - 192	15 - 720	375 - 12,000	30 - 400	7 - 200	1,000 - 25,000	100,000 - 1M
C4n-highmem	2 - 192	15 - 1,488	N/A	25 - 400	7 - 200	1,000 - 25,000	80,000 - 1M
C4n-highmem-lssd	4 - 192	31 - 1,488	375 - 12,000	30 - 400	7 - 200	1,000 - 25,000	100,000 - 1M

C4N machine series performance and specifications
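
To make the per-vCPU framing concrete, here is a small arithmetic sketch using the VM-to-VM figures from the table above. It assumes the smallest vCPU count pairs with the lowest bandwidth figure and the largest with the highest; the table lists ranges only, so that pairing is an assumption, and the shape labels are illustrative rather than official machine type names.

```python
# Figures copied from the shape table; pairing each vCPU count with a
# bandwidth figure (smallest with smallest, largest with largest) is an assumption.
SHAPES = {
    "smallest shape": {"vcpus": 2, "vm_to_vm_gbps": 25},
    "largest shape": {"vcpus": 192, "vm_to_vm_gbps": 400},
}


def gbps_per_vcpu(vcpus: int, gbps: float) -> float:
    """Network bandwidth available per vCPU, in Gbps."""
    return gbps / vcpus


for name, spec in SHAPES.items():
    ratio = gbps_per_vcpu(spec["vcpus"], spec["vm_to_vm_gbps"])
    print(f"{name}: {ratio:.2f} Gbps per vCPU")
```

Under that assumption, the smallest shapes carry far more bandwidth per vCPU than the largest, which is the property that lets I/O-bound tasks avoid over-provisioning compute.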

How to get started

Whether you’re hosting heavy-duty distributed databases, running network virtualization appliances, or orchestrating large-scale data pipelines for AI, C4N is engineered to provide the throughput, scale, and efficiency your business demands. C4N instances are now generally available on demand, as Spot VMs, and via reservations. You can also take advantage of further cost savings by purchasing Committed Use Discounts (CUDs) or FlexCUDs in one- and three-year terms in the us-central1 (Iowa), us-east1 (South Carolina), us-east5 (Ohio), us-west1 (Oregon), and europe-west2 (London) regions. For more information, visit Network Optimized Machine Type.

Ready to establish a high-performance launchpad for innovation? Head straight to the Google Cloud console to spin up a C4N VM under the “Network Optimized” machine family. Stay up-to-date on regional availability by visiting our regions and zones page or contact your Google Cloud sales representative for more information.

BGP route policies: Top 3 use cases by customer demand

Tue, 07 Jul 2026 16:00:00 +0000

When we first made BGP route policies for Cloud Router generally available over a year ago, our goal was to give network administrators deep, programmable control over how network paths are evaluated and propagated. Since then, we’ve been watching closely how our customers have adopted this feature. We've seen network engineering teams build incredibly sophisticated, resilient routing architectures that were previously difficult to achieve without third-party virtual appliances.

This year, we launched policy named sets for Cloud Router. As routing environments grow more complex, managing individual prefixes or communities within these policies can become cumbersome.

Policy named sets solve this by allowing you to group lists of IPv4/IPv6 prefixes or BGP communities into a single, reusable entity. This significantly simplifies your configurations, making it easier to scale, manage, and update your routing rules across multiple Cloud Routers.

Powered by the Common Expression Language (CEL), BGP route policies allow you to define fine-grained, ordered rules to filter BGP routes and modify route attributes directly within Cloud Router.

To celebrate the launch of policy named sets, we want to highlight three of the most impactful ways we've seen customers use BGP route policies over the past year, along with resources on how you can build them yourself.

1. The foundation: Route filtering and network protection

Before manipulating traffic paths, network stability requires strict control over which routes are allowed into and out of your network. We've seen customers extensively use BGP route policies to filter out unwanted learned routes from peers or prevent specific subnet prefixes from being advertised out of their Virtual Private Cloud (VPC).

BGP route policies operate on a "fail open" model by default, so many security-conscious organizations create a "fail closed" environment by appending a "drop all" policy as the final term in their evaluation list. This gives them explicit control over which routes are accepted, which helps prevent routing loops and keeps traffic from being hijacked through BGP or inadvertently blackholed.
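
The difference between the two models can be sketched in a few lines of Python. This is a toy simulation of ordered term evaluation, not Cloud Router's implementation or its CEL syntax: the `Term` class, the `evaluate` function, and the `ALLOWED_ONPREM` prefix list are all hypothetical, and the sketch assumes that "fail open" means a route matching no term is accepted.

```python
import ipaddress
from dataclasses import dataclass
from typing import Callable, List

Network = ipaddress.IPv4Network


@dataclass
class Term:
    """One ordered rule: if `match` is true for a prefix, apply `action`."""
    name: str
    match: Callable[[Network], bool]
    action: str  # "accept" or "drop"


def evaluate(prefix: str, terms: List[Term], default_action: str = "accept") -> str:
    """Return the action of the first matching term, else the default.

    default_action="accept" models the assumed fail-open behaviour.
    """
    net = ipaddress.ip_network(prefix)
    for term in terms:
        if term.match(net):
            return term.action
    return default_action


# Hypothetical list of prefixes we expect to learn from the peer (IPv4 only here).
ALLOWED_ONPREM = [ipaddress.ip_network("10.0.0.0/8")]

allow_onprem = Term(
    "allow-onprem",
    lambda net: any(net.subnet_of(allowed) for allowed in ALLOWED_ONPREM),
    "accept",
)
drop_all = Term("drop-all", lambda net: True, "drop")

fail_open = [allow_onprem]
fail_closed = [allow_onprem, drop_all]  # "drop all" appended as the final term

for prefix in ("10.20.0.0/16", "192.0.2.0/24"):
    print(prefix, evaluate(prefix, fail_open), evaluate(prefix, fail_closed))
```

The unexpected prefix 192.0.2.0/24 slips through the fail-open list but is dropped once the catch-all term is appended, while the expected 10.20.0.0/16 is accepted either way.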

Dive deeper: For a foundational look at how to set up CEL expressions for route filtering, check out our deep-dive guide: Introduction to BGP policies.

2. Influencing traffic paths for active/standby architectures

Achieving optimal traffic distribution often requires forcing traffic down a specific path, whether for cost optimization or managing active/standby interconnects. Customers have used BGP route policies to influence the preferred BGP route without touching their on-premises hardware.

By dynamically modifying the BGP multi-exit discriminator (MED) attribute, network teams can make a specific peer preferred for incoming traffic. Conversely, to steer traffic away from a congested or backup link, they use AS-Path prepending: adding one or more AS numbers to the route's AS-Path to deprioritize it across the broader network.
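
As a rough illustration of why both levers work, the sketch below models a heavily simplified best-path comparison: shorter AS path first, then lower MED. Real BGP best-path selection has more steps (local preference, origin, and others), and this is not Cloud Router's algorithm; the `Route` class, peer names, and AS numbers are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    peer: str
    as_path: Tuple[int, ...]
    med: int


def best_route(routes: List[Route]) -> Route:
    """Pick the route with the shortest AS path, breaking ties on lowest MED."""
    return min(routes, key=lambda r: (len(r.as_path), r.med))


def prepend(route: Route, asn: int, times: int) -> Route:
    """Return a copy of the route with `asn` prepended `times` times."""
    return replace(route, as_path=(asn,) * times + route.as_path)


primary = Route("interconnect-a", (64512,), med=100)
backup = Route("interconnect-b", (64512,), med=100)

# Lower MED wins when AS-path lengths are equal.
print(best_route([replace(primary, med=50), backup]).peer)  # interconnect-a

# Prepending lengthens the path, so the prepended route loses even with a lower MED.
deprioritized = prepend(replace(primary, med=50), 64512, 2)
print(best_route([deprioritized, backup]).peer)  # interconnect-b
```

In this simplified ordering, prepending is the blunter tool because path length is compared before MED, which is why it suits deprioritizing a backup link.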

Dive deeper: To see the configuration steps for managing MED and AS-Path prepending, read: Using BGP policies to influence traffic paths.

3. Solving asymmetric routing with BGP communities

One of the most advanced and highly requested use cases we’ve seen over the last year is achieving traffic symmetry. When enterprises use stateful firewalls or specific network appliances on-premises, return traffic must flow back through the exact same appliance it originated from. If it doesn't, the traffic is dropped.

Customers are successfully solving this by using BGP route policies to match against specific standard BGP communities. By tagging routes with specific communities on-premises, Cloud Router can read those tags via inbound policies and adjust the route preference by manipulating the MED accordingly. This helps ensure that Google Cloud inherently understands the stateful topology of the on-premises network and routes the return traffic symmetrically.
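
A minimal sketch of that inbound logic follows, written as plain Python rather than Cloud Router's CEL policy syntax. The community values, the MED numbers, and the `LearnedRoute` and `apply_inbound_policy` names are hypothetical; the point is only that a tag set on-premises can drive the MED applied to a learned route.

```python
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class LearnedRoute:
    prefix: str
    next_hop: str
    communities: FrozenSet[str]
    med: int


# Hypothetical mapping from on-premises community tags to the MED to apply.
COMMUNITY_TO_MED: Dict[str, int] = {
    "65000:100": 10,   # tagged by the site whose firewall holds the session state
    "65000:200": 200,  # tagged by the standby site
}


def apply_inbound_policy(route: LearnedRoute) -> LearnedRoute:
    """Set MED from the first matching community tag; leave other routes unchanged."""
    for community, med in COMMUNITY_TO_MED.items():
        if community in route.communities:
            return replace(route, med=med)
    return route


learned = [
    LearnedRoute("172.16.0.0/16", "site-b", frozenset({"65000:200"}), med=0),
    LearnedRoute("172.16.0.0/16", "site-a", frozenset({"65000:100"}), med=0),
]
adjusted = [apply_inbound_policy(r) for r in learned]
preferred = min(adjusted, key=lambda r: r.med)
print(preferred.next_hop)  # site-a
```

Both routes arrive with the same MED, so without the policy the choice between sites would not reflect the on-premises topology; after tagging, return traffic prefers the site that the community marks as primary.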

Dive deeper: To learn how to architect stateful traffic symmetry using BGP community tags, explore: Using BGP communities to create traffic symmetry.

Get started today

Taking control of your dynamic routing is now easier and more robust than ever. With BGP route policies and policy named sets, it's a great time to optimize and secure your hybrid cloud connectivity.

We recommend testing your BGP route policies in a staging environment to verify your CEL expressions and routing logic before rolling them out to production. To explore the technical documentation, check out the BGP route policies overview.

Cloud Network Insights: end-to-end observability for the Cross-Cloud Network

Wed, 17 Jun 2026 19:30:00 +0000

In today’s digital landscape, the network is no longer confined to a single data center or even a single cloud provider. Enterprises are increasingly adopting cross-cloud strategies, connecting Google Cloud workloads to on-premises environments, other clouds like AWS and Azure, and a vast array of internet-facing applications. While this flexibility drives innovation, it can also introduce significant operational complexity. When a user experiences degradation in application performance, the critical question remains: Is it the network, the application, or something else?

We are excited to announce the general availability of Cloud Network Insights, an out-of-the-box, Google Cloud-native solution that provides comprehensive visibility into network and digital experience performance across complex multi-cloud and hybrid environments.

Closing the visibility gap with active monitoring

Cloud Network Insights, offered in partnership with Broadcom AppNeta, expands your observability beyond Google Cloud to your entire global deployment. By utilizing active synthetic probing, the solution monitors network routes even when no user traffic is present, allowing teams to be proactive rather than reactive.

Whether the source of degradation is in the cloud, on-premises data centers, internet applications, ISPs, or last-mile connectivity, Cloud Network Insights helps you pinpoint the exact location of the bottleneck.

Cloud Network Insights integrates directly into the Google Cloud Observability suite, bringing sophisticated network intelligence into the tools you already use. With Cloud Network Insights, you get:

End-to-end network path visibility: Gain a hop-by-hop visualization of the network path between your sources and destinations. Monitor critical metrics like round-trip time (RTT), packet loss, and jitter across networks you don’t directly manage.
Digital experience insights: Go beyond the network layer to monitor digital experience for web applications. Measure DNS resolution times, HTTP response codes, and full browser page-load times to identify whether an application's degradation is due to the network or the application itself.
Proactive detection and alerting: Use synthetic testing to identify performance dips before they impact your customers. Alarms are integrated with Cloud Monitoring and Cloud Logging, enabling alerting via email, Slack, or PagerDuty.
SLA validation: Arm your team with the data needed to verify if ISPs and service providers are meeting their performance commitments.
Rapid root-cause analysis: Quickly differentiate between network problems, application-level issues, or browser performance impacts.
Integrated monitoring: Access metrics and logs directly within Google Cloud, leveraging Cloud Monitoring and Cloud Logging for dashboards and alerting. Utilize the open partner ecosystem of Google Cloud as well as support for the OpenTelemetry protocol for metrics and logs, allowing direct ingestion by OTel SDKs and collectors.
Agentic workload monitoring: Use synthetic testing to monitor connectivity and network performance to help ensure optimal connectivity to your agents and tools.

Network performance and multi-path routes to/from Google Cloud, AWS, and Azure in one view

How it works: active synthetic probing

Cloud Network Insights uses active synthetic probing technology that consists of three main components:

Monitoring Points: You deploy lightweight software agents, called Monitoring Points, into critical network segments, such as a central VPC, a remote branch, or an on-premises data center. These can be deployed as containers or virtual machines.
Synthetic probes: These Monitoring Points send small, frequent bursts of synthetic traffic (simulating a user or application) to a target destination. This allows you to monitor performance 24/7, even when no real users are on the network.
Data synchronization: The Monitoring Points send real-time performance telemetry to a central backend service. This data is then synchronized back to Google Cloud, with metrics exported to Cloud Monitoring, and alarms and events sent to Cloud Logging.

Core capabilities

Cloud Network Insights supports two primary types of monitoring to give you a full picture of your infrastructure:

1. Network performance monitoring (Layers 3 and 4)

This provides a hop-by-hop visualization of the network between a source and a destination, including:

Metrics captured: Round-trip time (RTT), packet loss, jitter, and path changes.
Single-ended mode: The agent probes an external target (like a URL, IP address or an API endpoint) that doesn't have a Monitoring Point installed.
Dual-ended mode: The Monitoring Point probes another Monitoring Point. This provides richer data, including precise one-way latency and the ability to detect asymmetric routing (when data takes a different path going out than it does coming back).
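
To show how metrics like these can be derived from raw probe results, here is a short sketch. It is not how Cloud Network Insights computes them: the `summarize_probes` function is hypothetical, and jitter is defined here as the mean absolute difference between consecutive round-trip times, which is one common convention among several.

```python
from statistics import mean
from typing import List, Optional


def summarize_probes(rtts_ms: List[Optional[float]]) -> dict:
    """Summarize one burst of probes; None marks a probe that got no reply."""
    if not rtts_ms:
        raise ValueError("at least one probe result is required")
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(replies)) / len(rtts_ms)
    # Jitter here is the mean absolute difference between consecutive replies.
    deltas = [abs(b - a) for a, b in zip(replies, replies[1:])]
    return {
        "rtt_avg_ms": mean(replies) if replies else None,
        "loss_pct": loss_pct,
        "jitter_ms": mean(deltas) if deltas else 0.0,
    }


# Five probes, one of which was lost.
print(summarize_probes([20.0, 22.0, None, 21.0, 25.0]))
```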

Network path metrics in Google Cloud console

2. Digital experience monitoring (Layer 7)

With digital experience monitoring, you can track the end-to-end experience of a web application. Here, you can choose from:

Browser mode: Uses a real browser engine (Selenium) to load full web pages, execute JavaScript, and render content. It measures complete page-load times to validate the actual user experience.
HTTP mode: Sends synthetic HTTP/S requests to a URL or API endpoint. This is a lightweight check for server availability, response time, and DNS/TLS performance.

Intelligence and automation

Cloud Network Insights also offers a variety of monitoring and troubleshooting capabilities.

Proactive alarms: Cloud Network Insights leverages auto-baselining to establish dynamic performance thresholds based on your historical metric data. If a metric deviates from your defined parameters, the system instantly triggers an event in Google Cloud, routing alerts directly to your team via email, Slack, or PagerDuty.
Monitoring policies: You can automate monitoring setups across large-scale environments by defining policies that dynamically create or remove paths based on custom tags. For instance, you can automatically track a core web application's performance from specific geographic regions.
Root-cause analysis: Because Cloud Network Insights extends visibility into traditionally "unwatched" areas like ISPs and transit networks, it instantly pinpoints whether a slowdown is occurring within Google Cloud, at the ISP level, or inside another cloud environment like AWS or Azure.
AI-driven insights: With integration to Gemini Cloud Assist, you can use natural language to interrogate Cloud Network Insights telemetry alongside your broader infrastructure data. Rather than manually pivoting between dashboards, ask Gemini to cross-reference specific Cloud Network Insights metrics against other Google Cloud metrics, reducing mean time to resolution (MTTR).
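
The idea behind auto-baselining can be sketched as a threshold derived from history rather than a fixed number. The source does not describe the actual algorithm, so the rule below (mean plus `k` standard deviations) is an assumption chosen for illustration, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def dynamic_threshold(history: List[float], k: float = 3.0) -> float:
    """Baseline plus k standard deviations of the historical samples."""
    return mean(history) + k * pstdev(history)


def is_anomalous(value: float, history: List[float], k: float = 3.0) -> bool:
    """True when a new sample exceeds the threshold learned from history."""
    return value > dynamic_threshold(history, k)


# Hypothetical round-trip times (ms) observed on one path.
rtt_history_ms = [20.0, 21.0, 19.0, 22.0, 20.0, 21.0, 20.0, 19.0]

print(is_anomalous(22.0, rtt_history_ms))  # within normal variation
print(is_anomalous(35.0, rtt_history_ms))  # well above the learned baseline
```

Because the threshold follows the data, a path that normally runs at 200 ms and one that normally runs at 20 ms each get an alert level suited to their own behaviour.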

What customers are saying

We are already seeing strong interest from customers looking to simplify their cross-cloud operations. Organizations like Sabre and Pexip are already using Cloud Network Insights to gain clarity in their hybrid environments.

"In an environment as complex and high-scale as Sabre’s, total visibility isn't just a luxury — it's a requirement for operational resilience. Cloud Network Insights will enable us to further shift our posture towards proactive optimization. By providing granular, real-time telemetry across our global cloud footprint, it helps eliminate the traditional 'black box' of the network, allowing our teams to resolve bottlenecks before they impact the traveler experience." - Alfredo Rodriguez, VP of Cloud and Infrastructure, Sabre

“Cloud Network Insights closes the 'visibility gap' between the private corporate network and the public cloud, empowering our joint customers to pinpoint performance bottlenecks in seconds rather than hours.” - Alan Davidson, CIO, Broadcom

Get started today

Navigating complex digital ecosystems shouldn't mean sacrificing visibility. Cloud Network Insights bridges the gap across multi-cloud and hybrid environments by combining deep network performance metrics with digital experience monitoring. Coupled with direct integrations into Google Cloud Observability and Gemini Cloud Assist, your teams are empowered with intelligent alerting, robust SLA validation, and rapid root-cause analysis. We look forward to helping you gain a clearer, unified view of your Cross-Cloud Network.

You can get started in the Google Cloud console today. To learn more:

Explore our product documentation for deep dives into deploying Monitoring Points and configuring policies.
Check out the latest release notes to stay updated on new features.
Watch the overview video
Hear more about the partnership between Google Cloud and Broadcom.

Report: GKE Inference Gateway delivers up to 92% faster AI responses

Tue, 09 Jun 2026 16:00:00 +0000

As generative AI moves from experimental pilots to massive production environments, the efficiency of your infrastructure becomes the ultimate differentiator. One way to get the most out of it and minimize costly accelerator idle time is to leverage the Google Kubernetes Engine (GKE) Inference Gateway, which intelligently routes generative AI workloads based on real-time model server metrics.

Instead of relying on traditional, naive round-robin load balancing — which frequently triggers expensive accelerator recomputation and spikes user latency — this native extension of the GKE Gateway utilizes advanced capabilities like prefix caching and model-aware routing. By ensuring requests land on the exact accelerator that is primed to process them right away, GKE transforms how you can serve your large language models (LLMs), with excellent hardware utilization and ultra-fast response times.

In fact, according to an independent benchmark report, GKE Inference Gateway outperforms the next leading managed Kubernetes service with 15.7% higher throughput, 92.8% shorter wait times, and 62.6% lower inter-token latency. This performance takes LLM-based applications from sluggish and expensive to fast and production-grade.

That performance tracks with Snap’s experience using GKE Inference Gateway.

“At Snap, we are integrating llm-d into our production AI infrastructure to facilitate high-performance inference at scale. By employing prefix-cache-aware routing, we have achieved prefix cache hit rates ranging up to 75-80%. We appreciate the open-source nature of llm-d, as it enables seamless integration with our Envoy-based Service Mesh.” - Vinay Kola, Senior Manager, Software Engineering, Snap Inc.

In this blog, we take a closer look at GKE Inference Gateway’s prefix caching, complete with examples. We also provide more details about its benchmark results. Let’s jump in.

The secret to low-latency AI: Prefix caching

Prefix caching optimizes LLM performance by storing the KV cache (activation states) of long, repetitive prompt prefixes. When consecutive user requests share the same system instructions, context, or documentation, the model entirely skips reprocessing those tokens. GKE Inference Gateway reads incoming request prefixes and matches them to the specific pods that already hold that data in memory. This eliminates the "thinking" tax on your GPUs and TPUs, turning heavy reasoning loops into near-instant answers.
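
The routing idea can be illustrated with a toy router that remembers which prompt prefixes each pod has already seen. This is a simplified sketch, not GKE Inference Gateway's implementation: the `PrefixAwareRouter` class, the pod names, and the character-based block size are hypothetical, and a production router would work on token blocks and also weigh live model server metrics.

```python
import hashlib
from typing import Dict, List, Set

BLOCK_CHARS = 64  # hypothetical block size; real servers cache blocks of tokens


def prefix_hashes(prompt: str) -> List[str]:
    """Hash each cumulative prefix of the prompt, one hash per full block."""
    return [
        hashlib.sha256(prompt[:end].encode()).hexdigest()
        for end in range(BLOCK_CHARS, len(prompt) + 1, BLOCK_CHARS)
    ]


class PrefixAwareRouter:
    """Send a request to the pod holding the longest matching cached prefix."""

    def __init__(self, pods: List[str]) -> None:
        self.pods = pods
        self.cached: Dict[str, Set[str]] = {pod: set() for pod in pods}
        self._next = 0  # round-robin cursor used when no pod has a match

    def _matched_blocks(self, pod: str, hashes: List[str]) -> int:
        count = 0
        for h in hashes:
            if h not in self.cached[pod]:
                break
            count += 1
        return count

    def route(self, prompt: str) -> str:
        hashes = prefix_hashes(prompt)
        best = max(self.pods, key=lambda pod: self._matched_blocks(pod, hashes))
        if self._matched_blocks(best, hashes) == 0:
            best = self.pods[self._next % len(self.pods)]
            self._next += 1
        self.cached[best].update(hashes)  # the chosen pod now holds this prefix
        return best


router = PrefixAwareRouter(["pod-0", "pod-1"])
static_prefix = "You are an expert assistant for our API documentation. " * 6
print(router.route(static_prefix + "How do I handle a 429 error?"))
print(router.route(static_prefix + "How do I rotate an API key?"))
print(router.route("An unrelated prompt with a different beginning. " * 6))
```

The two questions that share the static prefix land on the same pod, while the unrelated prompt falls back to round-robin and goes to the other pod.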

Use case 1: Documentation and codebase Q&A with retrieval-augmented generation (RAG)

When querying massive enterprise repositories, you can ground your LLMs’ responses without any added latency by pinning entire documentation sets as static cached prefixes, using RAG.

Instead of forcing an LLM to re-read thousands of lines of API references or corporate wikis for every single user question, GKE Inference Gateway routes the query to a pod that already has that specific context warmed up in its KV cache. The LLM only has to compute the user's brief, dynamic question, completely bypassing expensive document re-evaluation.

[STATIC PREFIX - STAYS IN CACHE] You are an expert AI assistant specializing in technical documentation. Below is the complete API documentation for our software platform. Use this context to answer the user's questions accurately. If the answer cannot be found in the documentation, say "I cannot find that in the provided context."

<documentation> [10,000+ words of API reference documentation, endpoints, error codes, etc.] </documentation>

[DYNAMIC SUFFIX - CHANGES PER REQUEST] User Question: How do I handle a 429 rate limit error using the Python SDK?

Use case 2: Multi-turn chat

You can also use prefix caching to maintain customer service interactions across thousands of simultaneous sessions without compounding compute costs. You can do so by caching permanent system personas and core business rules directly on the LLM server.

In enterprise chat architectures, the base system prompt and reference tables remain completely identical across millions of customer interactions. GKE Inference Gateway handles these multi-turn conversations using context-aware routing to bypass repetitive token processing, so that your chatbot stays ultra-responsive even under peak traffic.

```
[STATIC PREFIX - STAYS IN CACHE]
-System Persona: You are "FinBot", a helpful, empathetic, and compliant virtual assistant for ABC Banking Solutions. You must strictly adhere to the following rules: 1. Never provide concrete investment advice. 2. Always verify if the user is asking about checking or savings. 3. Keep your answers under 3 sentences. 4. If a user is angry, offer to connect them to a human manager.

Here is the current interest rate table for May 2026:
- Savings: 4.2% APR
- Checking: 0.5% APR
- CD (12-month): 5.1% APR

[DYNAMIC SUFFIX - CHANGES PER REQUEST] User: Hi, I'm trying to figure out how much I'd make if I locked away $10,000 for a year?
```
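To see why multi-turn chat benefits, it helps to check how much of each prompt is shared. The sketch below is illustrative only: the `render` helper, the shortened persona text, and the sample turns are hypothetical, and a real server compares token sequences rather than characters.

```python
import os

# Shortened stand-in for the static persona and rate table (hypothetical text).
SYSTEM_PROMPT = (
    'System Persona: You are "FinBot", a virtual assistant for ABC Banking Solutions.\n'
    "Savings: 4.2% APR, Checking: 0.5% APR, CD (12-month): 5.1% APR"
)


def render(system_prompt, turns):
    """Put the static prefix first, then the conversation turns in order."""
    lines = [system_prompt]
    for role, text in turns:
        lines.append(f"{role}: {text}")
    return "\n".join(lines)


# Session A, turn 1 and turn 2: the second prompt extends the first.
session_a_turn1 = [("User", "How much would I make on $10,000 locked away for a year?")]
session_a_turn2 = session_a_turn1 + [
    ("Assistant", "Are you asking about a savings account or a CD?"),
    ("User", "A 12-month CD."),
]
# Session B: a different customer, same persona and rate table.
session_b_turn1 = [("User", "What is the checking rate?")]

prompt_a1 = render(SYSTEM_PROMPT, session_a_turn1)
prompt_a2 = render(SYSTEM_PROMPT, session_a_turn2)
prompt_b1 = render(SYSTEM_PROMPT, session_b_turn1)

# os.path.commonprefix works character by character on any list of strings.
shared_within_session = len(os.path.commonprefix([prompt_a1, prompt_a2]))
shared_across_sessions = len(os.path.commonprefix([prompt_a1, prompt_b1]))

print(shared_within_session == len(prompt_a1))       # whole earlier prompt is reusable
print(shared_across_sessions >= len(SYSTEM_PROMPT))  # at least the persona is reusable
```

Within a session, the entire previous prompt is a prefix of the next one. Across sessions, at least the persona and reference table are shared, which is the part the cache can keep warm.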

GKE outperforms alternative managed Kubernetes solutions

To validate these architectural advantages, Principled Technologies recently released an independent benchmark report comparing GKE (equipped with the GKE Inference Gateway) against a standard third-party managed Kubernetes service utilizing conventional round-robin HTTP load balancing.

Tested on a Llama 3.1 8B Instruct shared prefix workload using identical hardware (eight NVIDIA A100 40GB GPUs), the two Kubernetes services showed a large performance gap. GKE came out ahead on three critical metrics:

Higher throughput: 15.7% more tokens processed per second, enabling higher request capacity or reduced hardware needs for the same workload
Much faster time to first token (TTFT): 92.8% shorter wait times, producing dramatically quicker perceived response starts for interactive scenarios
Lower inter-token latency (ITL): 62.6% reduction, resulting in smoother and faster token streaming after the first token

Figure 3: Mean latency (normalized time per output token) of GKE with GKE Inference Gateway and third-party managed Kubernetes service on the Llama 3.1 8B Instruct LLM on the shared prefix use case. Both solutions used the same hardware. Source: Principled Technologies

Metric	GKE	3rd party Managed Kubernetes Service	GKE Advantage
Mean output token throughput	7,169.21 output tokens per second	6,042.05 output tokens per second	15.7% more output token throughput
Mean time to first token (TTFT)	188.36 ms	2,624.73 ms	92.8% less TTFT
Mean inter-token latency (ITL)	30.20 ms	81.03 ms	62.6% lower ITL

Figure 4: GKE with GKE Inference Gateway delivered superior AI inference compared to a third-party managed Kubernetes service using standard HTTP LB.
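For readers who want to reproduce the percentages, the arithmetic can be recomputed from the rounded table values. This is a sketch, not part of the benchmark report; the dictionary keys are made up for the example. The TTFT figure comes out at about 92.8%, and the ITL figure at roughly 62.7%, which is within rounding of the published 62.6%. The 15.7% throughput figure matches the gap expressed as a share of GKE's throughput; expressed relative to the third-party baseline, the same two numbers give roughly 18.7%.

```python
gke = {"throughput_tps": 7169.21, "ttft_ms": 188.36, "itl_ms": 30.20}
third_party = {"throughput_tps": 6042.05, "ttft_ms": 2624.73, "itl_ms": 81.03}


def reduction_pct(new, baseline):
    """How much smaller `new` is than `baseline`, as a percentage of `baseline`."""
    return (baseline - new) / baseline * 100


ttft_reduction = reduction_pct(gke["ttft_ms"], third_party["ttft_ms"])
itl_reduction = reduction_pct(gke["itl_ms"], third_party["itl_ms"])

# Throughput gap, expressed against each of the two possible baselines.
gap = gke["throughput_tps"] - third_party["throughput_tps"]
throughput_gap_vs_gke = gap / gke["throughput_tps"] * 100
throughput_gain_vs_third_party = gap / third_party["throughput_tps"] * 100

print(f"TTFT reduction: {ttft_reduction:.1f}%")
print(f"ITL reduction: {itl_reduction:.1f}%")
print(f"Throughput gap as a share of GKE: {throughput_gap_vs_gke:.1f}%")
print(f"Throughput gain over third party: {throughput_gain_vs_third_party:.1f}%")
```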

Ready to accelerate your gen AI inference workloads?

Whether you’re deploying inference workloads such as real-time customer support agents, dynamic coding assistants, or sub-second fraud detection models, infrastructure latency dictates your user experience. By ensuring shared prompt prefixes hit the active cache nearly 100% of the time, GKE Inference Gateway transforms your LLMs from sluggish, expensive reasoning engines into rapid, capital-efficient, production-grade powerhouses.

Ready to explore the performance advantage that GKE Inference Gateway can bring to your gen AI workloads? Access the full benchmark report here and watch this explainer video to learn more.

A special thanks to Dan Sullivan, Senior Performance Architect, Principled Technologies.

Experimenting with TPUs, GKE Managed DRANET, and Multi-cluster Inference Gateway

Tue, 02 Jun 2026 07:00:00 +0000

What happens when your workload fails in one region but you still need access to the service? This is a common concern for availability and uptime. With recent enhancements to the Kubernetes ecosystem and capabilities like Dynamic Resource Allocation (DRA) and Inference Gateway, I decided to experiment with these capabilities in Google Cloud, using an AI inference workload as a simple test.

In this blog, we will explore this setup. You can also jump straight into the detailed configs in this codelab: Build multi-cluster GKE Inference Gateway, with TPUs, Cloud Storage FUSE and managed DRANET.

Building blocks

To build out this experiment, use the following products, features, and tools:

Google Kubernetes Engine (GKE) managed DRANET: A managed feature that lets you request and share resources among Pods. It supports GPUs and TPUs. In this test, TPUs were used in two different regions, with networking assigned using managed DRANET.
Multi-cluster GKE Inference Gateway: Load balances your AI/ML inference workloads across multiple GKE clusters. It also handles failover, which is what my experiment set out to test. The GatewayClass that supports this is the multi-cluster cross-region internal Application Load Balancer, gke-l7-cross-regional-internal-managed-mc.
Cloud Storage FUSE: Provides a way to store data, models, checkpoints, and logs directly in Cloud Storage. To speed up the deployment, an open source Gemma model was downloaded to this storage for retrieval.
Virtual Private Cloud (VPC): The foundational global network providing isolated, secure communication for the internal load balancers and compute nodes.
GKE Fleets: Fleets group the separate regional clusters under a unified management control plane.
TPU v6e: Google's custom AI accelerators that provide the high-performance compute required to serve the model. The VM family type used was the ct6e-standard-4t in a 2x2 slice.

Design pattern example

The aim is to deploy an LLM (Gemma 3) onto two GKE clusters in different regions. Each cluster will use four TPU v6e chips. The model should be stored in Cloud Storage. The workload is served using GKE Inference Gateway, which supports multiple clusters. Traffic should be routed to the region closest to the user and fail over to the other region if one region fails.

Putting it together

To get access to TPUs for your project in two regions, make sure you have the necessary quota in those regions.

Begin: Set up the environment.

Create a standard VPC with firewall rules and a subnet in the same region as the reservation.
Create a proxy-only subnet. This will be used by the internal regional Application Load Balancer attached to the GKE Inference Gateway.
Set up firewall rules allowing traffic and health checks.
Reserve static internal IP addresses in both regions for the Gateway.
Provision a Cloud Storage FUSE bucket and configure a dedicated IAM Service Account. Bind this to a Kubernetes Workload Identity so your pods can securely mount the bucket and read the model weights directly.

Next: Create standard GKE clusters and node pools.

Deploy two separate GKE clusters, one in each of your chosen regions.
Enable the Gateway API (--gateway-api=standard) and the Cloud Storage FUSE CSI driver (--addons GcsFuseCsiDriver) during cluster creation.
Create dedicated TPU v6e node pools (ct6e-standard-4t) for both clusters.
Enable managed DRANET on these TPU node pools by setting the flags --accelerator-network-profile=auto and --node-labels=cloud.google.com/gke-networking-dra-driver=true.
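The flags in the steps above can be assembled into command lines as follows. This is a sketch under assumptions: the cluster and node pool names are hypothetical, the regions are borrowed from the Gateway manifest later in this post, and the commands leave out other settings the codelab configures (for example node locations, TPU topology, and Workload Identity). The script only prints the commands; it does not run them.

```python
import shlex

# Regions taken from the Gateway manifest in this post; names below are hypothetical.
REGIONS = ["europe-west4", "us-east5"]


def cluster_create_cmd(cluster, region):
    """gcloud command for a Standard cluster with Gateway API and the GCS FUSE CSI driver."""
    return [
        "gcloud", "container", "clusters", "create", cluster,
        f"--region={region}",
        "--gateway-api=standard",
        "--addons", "GcsFuseCsiDriver",
    ]


def tpu_node_pool_cmd(pool, cluster, region):
    """gcloud command for a TPU v6e node pool with managed DRANET enabled."""
    return [
        "gcloud", "container", "node-pools", "create", pool,
        f"--cluster={cluster}",
        f"--region={region}",
        "--machine-type=ct6e-standard-4t",
        "--accelerator-network-profile=auto",
        "--node-labels=cloud.google.com/gke-networking-dra-driver=true",
    ]


commands = []
for region in REGIONS:
    cluster = f"gemma-cluster-{region}"  # hypothetical naming scheme
    commands.append(cluster_create_cmd(cluster, region))
    commands.append(tpu_node_pool_cmd(f"tpu-pool-{region}", cluster, region))

for cmd in commands:
    print(shlex.join(cmd))
```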

Next: Establish the global mesh via Fleet Registration.

Register both GKE clusters to a unified GKE Fleet by following the fleet creation and registration setup.
Enable Multi-Cluster Service Discovery and Multi-Cluster Ingress on your fleet.
Designate your primary region as the configuration hub to act as the control plane for routing rules across both regions.

Next: Deploy the AI workload.

Use a temporary Kubernetes job to download the Gemma 3 (gemma-3-27b-it) model weights directly into your Cloud Storage bucket.
Define a ResourceClaimTemplate that explicitly requests the managed DRANET device class (deviceClassName: netdev.google.com) with the allocation mode set to "All".

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: all-netdev
  namespace: default
spec:
  spec:
    devices:
      requests:
      - name: req-netdev
        exactly:
          deviceClassName: netdev.google.com
          allocationMode: All
```

Deploy your inference server (e.g. vLLM) on the TPU nodes in both regions. Ensure the pod spec utilizes node selectors for the 2x2 TPU topology, requests exactly 4 TPUs, and mounts the netdev claim. This guarantees your pods utilize the dedicated accelerator networking alongside standard Ethernet.
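The pod spec requirements in the previous paragraph could look roughly like the fragment below, built as a Python dictionary and printed as JSON (which kubectl also accepts). Treat it as a sketch: the container name, image placeholder, and claim name `netdev` are hypothetical, the node selector values are the ones GKE documents for TPU v6e to the best of my knowledge, and the codelab's actual manifest may differ.

```python
import json


def gemma_pod_spec(image):
    """Pod spec fragment: 2x2 TPU v6e node selectors, 4 TPU chips, and the netdev claim."""
    return {
        "nodeSelector": {
            "cloud.google.com/gke-tpu-accelerator": "tpu-v6e-slice",
            "cloud.google.com/gke-tpu-topology": "2x2",
        },
        # Pod-level claim created from the ResourceClaimTemplate named all-netdev.
        "resourceClaims": [
            {"name": "netdev", "resourceClaimTemplateName": "all-netdev"}
        ],
        "containers": [
            {
                "name": "vllm",   # hypothetical container name
                "image": image,   # placeholder; use your own inference server image
                "ports": [{"containerPort": 8000}],
                "resources": {
                    "requests": {"google.com/tpu": "4"},
                    "limits": {"google.com/tpu": "4"},
                    # Attach the pod-level claim to this container.
                    "claims": [{"name": "netdev"}],
                },
            }
        ],
    }


spec = gemma_pod_spec("YOUR_VLLM_TPU_IMAGE")
print(json.dumps(spec, indent=2))
```

Port 8000 matches the backend port used by the HTTPRoute and health check in the Gateway configuration.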

Next: Configure the Multi-Cluster Inference Gateway.

Install the necessary Custom Resource Definitions (CRDs) so Kubernetes can process specialized routing objects like the InferenceObjective.
Deploy an AutoscalingMetric to track hardware utilization, such as KV cache usage.
Use Helm to group the independent AI deployments from both regions into a single, logical InferencePool.
Deploy the Cross-Region Gateway and its associated HTTPRoute to manage incoming global traffic.
Apply health checks and backend policies to the pool to ensure load balancing relies on your custom hardware metrics.

Configure an InferenceObjective to instruct the gateway to route prompts to the region with the highest availability, avoiding overloaded TPUs.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: cross-region-gateway
  namespace: default
spec:
  gatewayClassName: gke-l7-cross-regional-internal-managed-mc
  addresses:
  - type: networking.gke.io/named-address-with-region
    value: "regions/europe-west4/addresses/gemma-gateway-ip-europe-west4"
  - type: networking.gke.io/named-address-with-region
    value: "regions/us-east5/addresses/gemma-gateway-ip-us-east5"
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: gemma-route
  namespace: default
spec:
  parentRefs:
  - name: cross-region-gateway
    kind: Gateway
  rules:
  - backendRefs:
    - group: networking.gke.io
      kind: GCPInferencePoolImport
      name: gemma-pool
      port: 8000
---
apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: gemma-health-check
  namespace: default
spec:
  targetRef:
    group: networking.gke.io
    kind: GCPInferencePoolImport
    name: gemma-pool
  default:
    config:
      type: HTTP
      httpHealthCheck:
        requestPath: /health
        port: 8000
---
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: gemma-backend-policy
  namespace: default
spec:
  targetRef:
    group: networking.gke.io
    kind: GCPInferencePoolImport
    name: gemma-pool
  default:
    timeoutSec: 100
    balancingMode: CUSTOM_METRICS
    trafficDuration: LONG
    customMetrics:
    - name: gke.named_metrics.tpu-cache
      dryRun: false
      maxUtilizationPercent: 60
---
apiVersion: autoscaling.gke.io/v1beta1
kind: AutoscalingMetric
metadata:
  name: tpu-cache
  namespace: default
spec:
  selector:
    matchLabels:
      app: gemma-server
  endpoints:
  - port: 8000
    path: /metrics
    metrics:
    - name: vllm:kv_cache_usage_perc
      exportName: tpu-cache
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: gemma-objective
  namespace: default
spec:
  priority: 10
  poolRef:
    name: gemma-pool
    group: "inference.networking.k8s.io"
```
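One way to reason about what the custom-metric policy is meant to achieve is a small simulation. The function below is a deliberately simplified model and not the load balancer's actual algorithm: it assumes regions are listed from nearest to farthest, treats the 60% `maxUtilizationPercent` as a hard spill-over threshold, and uses made-up KV cache readings.

```python
MAX_UTILIZATION = 0.60  # mirrors maxUtilizationPercent: 60 in the GCPBackendPolicy


def pick_region(regions_by_proximity, kv_cache_usage, healthy):
    """Pick the nearest healthy region under the threshold, else the least loaded healthy one."""
    candidates = [r for r in regions_by_proximity if healthy.get(r, False)]
    if not candidates:
        raise RuntimeError("no healthy region available")
    for region in candidates:
        if kv_cache_usage[region] < MAX_UTILIZATION:
            return region
    # Every healthy region is above the threshold: fall back to the least utilized.
    return min(candidates, key=lambda r: kv_cache_usage[r])


order = ["europe-west4", "us-east5"]  # nearest first, from a European client's view

normal = pick_region(order, {"europe-west4": 0.35, "us-east5": 0.20},
                     {"europe-west4": True, "us-east5": True})
overloaded = pick_region(order, {"europe-west4": 0.85, "us-east5": 0.20},
                         {"europe-west4": True, "us-east5": True})
outage = pick_region(order, {"europe-west4": 0.35, "us-east5": 0.20},
                     {"europe-west4": False, "us-east5": True})

print(normal)
print(overloaded)
print(outage)
```

The three calls cover the cases the design cares about: normal traffic stays local, an overloaded local region spills to the other region, and an unhealthy local region is skipped entirely.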

Testing the failover

Verify the highly available architecture by simulating a primary region outage. Once the primary deployment is taken offline, the Gateway automatically detects the failure and seamlessly reroutes all subsequent user requests to the active secondary cluster, ensuring continuous availability without dropping traffic.
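A simple probe loop can quantify what "without dropping traffic" means during such a test. The sketch below is an assumption-laden example: the gateway URL is a placeholder, the model name is a guess at how the server registers Gemma 3, and the request body assumes a vLLM server exposing its OpenAI-compatible `/v1/completions` endpoint. The `probe` function takes any callable, so it can be exercised without a live gateway.

```python
import json
import time
import urllib.request

GATEWAY_URL = "http://GATEWAY_IP/v1/completions"  # placeholder; substitute your gateway address


def send_prompt(url=GATEWAY_URL, prompt="Hello", timeout=10.0):
    """Send one completion request; return True on HTTP 200, False on any network error."""
    body = json.dumps(
        {"model": "google/gemma-3-27b-it", "prompt": prompt, "max_tokens": 16}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError, HTTPError, and timeouts all derive from OSError
        return False


def probe(send, attempts, pause_s=0.0):
    """Call send() repeatedly and summarize successes, failures, and the longest failure streak."""
    ok = failed = streak = longest = 0
    for _ in range(attempts):
        if send():
            ok += 1
            streak = 0
        else:
            failed += 1
            streak += 1
            longest = max(longest, streak)
        if pause_s:
            time.sleep(pause_s)
    return {"ok": ok, "failed": failed, "longest_failure_streak": longest}


# Dry run with a scripted sequence instead of a live gateway.
scripted = iter([True, True, False, False, True])
summary = probe(lambda: next(scripted), attempts=5)
print(summary)
```

Against a real deployment you would call `probe(send_prompt, attempts=..., pause_s=1.0)` while taking the primary deployment offline, then read the failure streak as a rough measure of the failover window.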

Next steps

To take a deeper dive with a hands-on codelab and find more information on these features, review the following:

Hands-on Codelab: Build multi-cluster GKE Inference Gateway, with TPUs, Cloud Storage FUSE and managed DRANET
Document set: DRANET
Documentation: AI Hypercomputer

Want to ask a question, find out more or share a thought? Please connect with me on LinkedIn.

How we evolved Google’s global and data center networks for the AI era

Tue, 26 May 2026 16:00:00 +0000

Over the last 25 years of building Google’s global network, we’ve navigated major architectural eras — from the Internet, to streaming, and the cloud. Today, we are squarely in the midst of a fourth: the AI era. The applications in the AI era are fundamentally different from the consumer and enterprise applications of the previous eras and impose a set of novel and demanding requirements — on compute resources, of course, but also on the network.

Consider the fundamental physical challenge, which is that it is far more difficult to move electrons (electrical power) than it is to move photons (data over fiber). Because the demand for AI compute frequently outpaces the space and power capacities of individual facilities, we strategically locate data centers near sustainable energy sources, or in locations with pathways to add clean energy sources to the local grid. Then, by utilizing the network to distribute AI workloads across campuses, we create a massive-scale, pooled hypercomputing resource that overcomes the power limitations of any single site.

To deliver this, we created an end-to-end, vertically integrated AI technology stack that comprises everything from chips to systems, to platforms and application and agentic ecosystems. This stack includes a portfolio of pre-built agents and applications; our Gemini Enterprise Agent Platform for you to build, scale, govern, and optimize your AI-enabled applications; world-class AI models; as well as our unified data platform. All this is anchored by our AI Hypercomputer, a unified infrastructure that combines purpose-built hardware and open software, and that comes with flexible consumption options. Our network, forged through decades of innovation, is the essential fabric of the AI Hypercomputer.

The network supporting this stack must meet the stringent bandwidth, scale, and performance needs of AI workloads. This applies not only within the campus, where the network must scale up and out, but also across the wide area network (WAN) along with high-bandwidth interconnects, to bring AI training data from its source to AI compute resources.

To address these challenges, we’ve reimagined three key pillars of our network infrastructure: the fabric inside the AI Hypercomputer, the fabric across the AI Hypercomputer, and our global network. Let’s take a closer look at each of these.

1. The fabric inside AI Hypercomputer

The massive scale of today’s AI models, fueled by the explosive growth of foundational AI model parameters, makes AI training very compute- and network-intensive.

This necessitates an exponential increase in required network bandwidth, with strict bounds on delay (e.g., tail latency) to accommodate AI workloads’ peculiar traffic patterns, which are characterized by sensitivity to performance variation and synchronized bursts, i.e., intense, coordinated, millisecond-level traffic spikes. Furthermore, since large-scale training jobs are uniquely vulnerable to failures and performance stragglers, maintaining high reliability and predictable performance is absolutely essential.

To address the scale, low latency, and high predictability that modern AI workloads require — as well as protection from extreme bursts — we’ve adopted a "campus as a computer" philosophy, decoupling our network into three distinct domains:

a scale-up domain for intra-pod connectivity
a dedicated east-west scale-out accelerator fabric
the Jupiter frontend network for north-south compute and storage access

This decoupled architecture provides three strategic advantages: it allows domains to evolve independently for faster innovation; provides a non-blocking scale-out network with massive training bandwidth; and helps ensure the network can be co-designed in lockstep with new ML accelerators, for superior hardware support.

Recently, we unveiled Virgo Network, our scale-out data center fabric specifically engineered for modern AI. Virgo utilizes high-radix switches and a flat, two-layer non-blocking topology to provide massive bisection bandwidth, while minimizing latency by reducing network tiers. Its multi-planar design, featuring independent control domains for each plane, provides hardware-level resilience and fault isolation. Furthermore, Virgo can expand across multiple data centers, removing physical building limitations and enabling flexible AI compute scaling.

The effectiveness of our network and accelerator codesign is perfectly illustrated by the recently debuted eighth-generation TPUs. Within this architecture, Virgo Network can link 134,000 TPU 8t chips with up to 47 petabits/sec of non-blocking bisection bandwidth in a single fabric. Virgo Network delivers up to 4x the bandwidth per TPU 8t accelerator over the previous generation, and 40% lower unloaded fabric latency for TPU 8t compared to the previous-generation network for TPUs. In this setup, Virgo Network manages the raw accelerator traffic, while Jupiter provides reliable and rapid access to the global WAN and storage. When integrated with Pathways and JAX, this AI Hypercomputer networking engine facilitates near-linear scaling for up to a million TPU 8t chips in a single logical cluster.

Autonomous reliability: protecting workload goodput

Building a resilient megascale fabric represents only part of the challenge. In a cluster of hundreds of thousands of chips, hardware failures are a statistical certainty. A single stalled instance can stop an entire synchronous training job, wasting valuable compute cycles. As such, efficient fault localization is critical.
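The "statistical certainty" claim follows from basic probability: if each chip fails independently with a small probability p in some time window, the chance that at least one of N chips fails is 1 - (1 - p)^N. The sketch below uses a purely hypothetical per-chip failure probability, chosen only to show the shape of the curve, together with the 134,000-chip fabric size mentioned earlier.

```python
import math


def p_any_failure(n_chips, p_per_chip):
    """Probability that at least one of n independent chips fails in the window."""
    # Computed as 1 - (1 - p)^n using log1p/expm1 for accuracy with tiny p.
    return -math.expm1(n_chips * math.log1p(-p_per_chip))


HYPOTHETICAL_P = 1e-5  # assumed per-chip failure probability per window, not a measured value

for n in (1_000, 10_000, 134_000):
    print(n, round(p_any_failure(n, HYPOTHETICAL_P), 3))
```

With these assumed numbers, a failure somewhere in a 134,000-chip fabric is more likely than not in any given window (about 0.74), even though each individual chip is very reliable.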

We engineered Virgo Network with autonomous reliability capabilities to maximize workload efficiency at scale, also known as goodput. Expanding on our existing straggler detection, Virgo Network now also features automated hang detection. The moment a fail-stop event occurs, our specialized agents immediately localize the fault, isolate the faulty instance, and enable you to restore the training job from a checkpoint, getting your training timeline back on track with minimal manual intervention.

To complement these capabilities, we also use high-resolution, sub-millisecond telemetry to identify elusive network micro-bursts that are usually missed by conventional 30-second monitoring intervals. These high-resolution telemetry advancements enable more efficient network operations, better provisioning, and a lower mean time to recovery.
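A small numeric example shows why coarse monitoring intervals hide microbursts. The traffic figures below are invented for illustration (a 10 Gbps baseline with a single 5 ms burst at 400 Gbps); only the 30-second interval comes from the text.

```python
WINDOW_S = 30          # conventional monitoring interval from the text
SAMPLES_PER_S = 1000   # 1 ms resolution for the fine-grained view

# Hypothetical traffic: steady 10 Gbps with one 5 ms burst at 400 Gbps.
samples_gbps = [10.0] * (WINDOW_S * SAMPLES_PER_S)
burst_start = 12_000
samples_gbps[burst_start:burst_start + 5] = [400.0] * 5

avg_30s_gbps = sum(samples_gbps) / len(samples_gbps)  # what a 30 s counter reports
peak_1ms_gbps = max(samples_gbps)                     # what 1 ms telemetry can see

print(f"30-second average: {avg_30s_gbps:.3f} Gbps")
print(f"1 ms peak: {peak_1ms_gbps:.0f} Gbps")
```

The 30-second average barely moves off the baseline, while the millisecond-level view shows a 40x spike.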

2. The fabric across AI Hypercomputer

The exponential growth of modern AI workloads requires us to scale and distribute AI workloads across multiple campuses over a WAN. At the same time, traditional networks weren’t built for the high bandwidth and extreme burstiness of AI traffic, and often fail to detect microbursts that can lead to severe performance degradation. We have developed a suite of innovations to optimize WAN performance for cross-site AI deployments, including:

A multi-shard global network that enables horizontal scaling. Our global network sustained 10x WAN traffic growth from 2020 to 2025.
Tuning the fabric for essential availability, latency, and quality of service (QoS) attributes. Real-time microburst management helps ensure fair bandwidth allocation and infrastructure isolation across our multi-tenant infrastructure.
Multi-shard isolation to ensure each network shard operates with its own control, data, and management planes.

Combined with regional isolation and Protective Reroute, this architecture minimizes failure impact and shortens user-visible outages — delivering the beyond-nines reliability essential for AI workloads.

Providing high-speed, flexible, and cost-effective interconnectivity is also a priority. AI training relies on vast datasets that are often located on-premises or across various clouds. Given the high cost of AI compute, minimizing idle time is essential; for instance, upgrading from a 100 Gbps link to a 3.2 Tbps connection reduces the time to transfer a petabyte of data from 22.2 hours to just 0.7 hours — a 97% reduction in AI compute idle time spent waiting for data. Our AI-native Cloud Interconnect is purpose-built for the high-bandwidth and low-latency needs of AI workloads, featuring an optimized data path with 400 Gbps links that scale in 3.2 Tbps increments to reach petabit-per-second capacity. It also offers traffic differentiation and flexible connection options, including direct fiber peering and colocation facilities. AI-native Cloud Interconnect supports petabit-scale data transfer with reliable, private connectivity necessary for your cross-cloud AI training and serving.
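The transfer-time figures above can be checked with a few lines of arithmetic. The sketch assumes a decimal petabyte (10^15 bytes) and a link that is fully utilized with no protocol overhead, which is the idealized case.

```python
PETABYTE_BYTES = 10**15  # decimal petabyte (assumption)


def transfer_hours(size_bytes, link_bits_per_s):
    """Idealized transfer time in hours at full line rate, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_s / 3600


hours_100g = transfer_hours(PETABYTE_BYTES, 100e9)   # 100 Gbps link
hours_3_2t = transfer_hours(PETABYTE_BYTES, 3.2e12)  # 3.2 Tbps connection
reduction_pct = (1 - hours_3_2t / hours_100g) * 100

print(f"100 Gbps: {hours_100g:.1f} h")
print(f"3.2 Tbps: {hours_3_2t:.1f} h")
print(f"Reduction: {reduction_pct:.0f}%")
```

Because the faster connection has 32 times the bandwidth, the transfer takes 1/32 of the time, which is where the roughly 97% reduction comes from.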

3. A resilient global network for the age of inference

Applications serving AI inference to a global user population or supporting an agentic enterprise are far more demanding than conventional web apps. The need for opportunistic use of expensive AI compute available at distant locations, distributed service dependencies, and the burstiness of the traffic all demand a high-bandwidth network with a global footprint, as well as deep peering to SaaS providers, ISPs, and hyperscalers. To maintain responsiveness and "always-on" availability, applications need low latency and a highly resilient network.

With its connectivity, scale, and resilience, Google’s global network is well-equipped to handle the demands of the age of AI inference. Our network spans more than 10 million kilometers of terrestrial and subsea fiber, connects our 43 cloud regions, and features 200+ edge locations, providing the essential footprint for serving AI inference. Our Premium Tier network delivers the low latency and reliability needed for consistent, high-quality global user experience. By optimizing traffic entry and exit points, the network significantly boosts application performance, with resilience at the core of this "always-on" infrastructure.

Building the future, together

As a Google Cloud customer, you get these network innovations built directly into your environment. Google’s network delivers the massive scale, capacity, reliability and performance essential for your AI workloads.

The AI era demands more than just raw compute; it necessitates a robust network fabric to scale. Our vertically integrated AI technology stack — from silicon to software ecosystems — is powered by the AI Hypercomputer to accelerate your transformation and make AI helpful for everyone. Whether through our megascale fabric, resilient global network for inference, or AI-native Cloud Interconnect, we ensure your AI journey is efficient and reliable. We look forward to building this future with you.

What’s new with the Cross-Cloud Network at Next ‘26

Wed, 22 Apr 2026 12:00:00 +0000

While generative AI sparked a revolution, the true paradigm shift is the rapid evolution from standalone AI models to multi-agent autonomous systems. In this new era, the network transcends basic connectivity to become the critical integration layer for your agentic enterprise.

As AI agents and services surge, your core applications remain as vital as ever. To thrive in this rapidly evolving landscape, you need a planet-scale network to connect, protect, govern, deliver, and secure all your users, data, agents, AI services, and core applications across clouds and on-premises.

Google Cloud's Cross-Cloud Network provides this unified foundation, and is now used by 65% of the Fortune 100 and handles up to 27 exabytes of data per month. At Google Cloud Next, we are introducing networking innovations to accelerate your AI infrastructure, strengthen security, and simplify operations.

Optimized networking infrastructure for AI

As we move toward an agentic world, the network must support massive-scale inference paired with reinforcement learning. At Google, we’ve spent years refining this cycle to power our own global AI services. Today, we’re announcing AI infrastructure network innovations that bring this same architecture directly to your workloads, across agents, inference, training, and beyond.

Networking for agents

The Gemini Enterprise Agent Platform is a comprehensive enterprise environment designed to build, scale, govern, and optimize the next generation of autonomous agents. Key innovations being announced in preview include:

Agent Gateway: Air-traffic control for agentic traffic

Agent Gateway understands MCP and A2A agentic protocols and provides an open, extensible, scalable way to enforce centralized governance policies to securely connect agents, models, and tools across runtimes.

Ambient networking: A seismic shift in service-to-service connectivity

Ambient networking, a new integrated data plane for Google Kubernetes Engine (GKE) and Cloud Run, provides service discovery, zero-trust access, and traffic management without the need for complex and resource-heavy sidecar proxies. It reduces operational overhead and enables up to a 10x reduction in GKE resource usage for layer 4 (L4) mesh capabilities.

Ambient networking underpins two new capabilities:

Service bindings automatically establish service-to-service connectivity, allowing developers to move faster to build and scale their agentic applications and services.
Network Services Monitoring bridges application and network observability gaps, resulting in faster root-cause analysis and simplified troubleshooting.

Rich partner integrations and customizations

With the help of Service Extensions, we are developing solutions for identity, governance, and AI security for agent-to-anywhere traffic. Coming soon in preview to Agent Gateway are:

Identity and governance administration: Offering delegated authorization to Cloud IAM and partner services from Okta, Ping, Saviynt, and Silverfort to enforce real-time, contextual governance policies based on application and business context.
Runtime security: Agent Gateway acts as a universal enforcement point by integrating with Google Cloud’s Model Armor and partner solutions from Broadcom, Check Point, Cisco, CrowdStrike, Exabeam, F5, Netskope, Palo Alto Networks, Thales, and Zscaler. Together, these can help to secure agentic communications against emerging AI attack vectors.

These innovations are built on an open foundation including Envoy and Kubernetes, providing strong, integrated governance in multicloud environments using standard Kubernetes Gateway APIs.

Networking for inference

At Google, we run inference at scale with optimized use of distributed GPU and TPU resources, automatic failover between regions for high availability, and optimized global request routing for fast end-user performance. GKE Inference Gateway delivers these capabilities to our cloud customers, including the following new innovations:

Multi-region support allows scaling inference services across regions, enabling cross-regional failover, optimized utilization, and reduced global latency (preview).
Predictive latency boost improves utilization with intelligent request routing based on predefined performance targets (preview).
Disaggregated serving leverages llm-d’s SGLang support, offering the flexibility to choose between vLLM and SGLang for model serving (GA).

Gemini Enterprise Agent Platform reduced Time to First Token (TTFT) latency by over 35% for Qwen3-Coder by using GKE Inference Gateway.

“Before GKE Inference Gateway, managing our inference stack with Ray Serve created a complex, dual-orchestration layer that was a significant burden on our small operations team. Moving to the Inference Gateway and native Kubernetes deployments was the 'North Star' architecture we needed to simplify management and achieve robust production stability with a GKE-native batteries-included solution.” - Mikhail Lubinets, Lead HPC Engineer, Technology Innovation Institute

Networking for training

At Google, we build and run the largest AI models in the world — and we built a network to support that. Some of the new enhancements are:

Massive scale with Virgo Network

This new non-blocking data center fabric removes latency barriers:

Virgo can link up to 134,000 chips with 47 petabits/sec of non-blocking bisection bandwidth to deliver 1.7K exaflops of compute.
With enhancements in Pathways and JAX, you can further connect these Virgo fabrics to scale to over 1 million TPU chips in a single training cluster.
We are also making Virgo Network available on NVIDIA Vera Rubin NVL72, supporting up to 960,000 GPUs.

For more on Virgo Network, check out this blog.

Accelerator network profiles

It’s easier than ever to handle the complex networking prerequisites for accelerator-equipped GKE node pools with DRANET, which improves bandwidth for distributed AI/ML workloads by up to 60% (GA).

AI-native Cloud Interconnect

SLA-backed, and optimized for efficiency, Cloud Interconnect supports petabit-scale data transfers and is available with a fixed price option. Cloud Interconnect now supports:

400 Gbps circuits with up to 3.2 Tbps in a single connection (GA)
Partner Cross-Cloud Interconnect for AWS (GA), CoreWeave (in preview soon), and Lumen (in preview soon)

Cross-Cloud Network for AI and core applications

The Cross-Cloud Network helps ensure you can securely connect users, data, locations, applications, services, and infrastructure anywhere in the world, at planetary scale. We designed our global multi-shard network to scale horizontally to meet the demands of the AI era and enable us to accommodate our 10x WAN traffic growth from 2020 to 2025.

These are some of the improvements we’re making to the Cross-Cloud Network:

Ultra Low Latency Solution for financial exchanges

In partnership with CME Group, we are bringing the world's leading derivatives marketplace to Google Cloud. To support CME Group’s performance requirements, we developed an ultra low latency (ULL) networking and compute solution. This fully managed cloud environment will allow CME Group and its clients to migrate its core trading systems to Google Cloud.

Now in preview, the solution is designed to meet the unique and exacting requirements of running financial exchanges in the cloud. It includes several new technologies:

Deterministic high-performance compute powered by ULL networking, with bare metal and VM form factors, delivers a comprehensive portfolio for your trading compute needs.
Scalable multicast data distribution with hardware-based ultra-low latency enables reliable one-to-many market data sharing.
Nanosecond-level clock sync enabled by Firefly, a novel clock synchronization system. Firefly achieves sub-10ns NIC-to-NIC synchronization to support high-frequency trading.
Advanced network observability with 64-bit nanosecond timestamps, support for multiple traffic-mirroring destinations and multicast traffic, and support for auditing and regulatory requirements.
Low-latency inference allowing exchange participants to connect their AI-driven services to the exchange’s infrastructure.

“The Google Cloud Ultra Low Latency Solution provides the level of performance necessary for CME Group futures and options markets to run in the cloud, expanding access to clients worldwide.” - Sunil Cutinho, CIO, CME Group

Cross-cloud observability for networks, applications, and agents

Whether you’re running core applications or new AI agents, you need visibility into your network infrastructure. Cloud Network Insights, now in preview, offers network performance monitoring (NPM) and digital experience monitoring (DEM) to dramatically reduce the mean time to detect and mitigate network-related agent, application, and API issues.

Cloud Network Insights is enabled by technologies from Broadcom’s AppNeta and powered by AI, enabling natural language queries through Gemini Cloud Assist.

"In an environment as complex and high-scale as Sabre’s, total visibility isn't just a luxury — it's a requirement for operational resilience. Cloud Network Insights will enable us to further shift our posture from reactive troubleshooting to proactive optimization. By providing granular, real-time telemetry across our global cloud footprint, it helps eliminate the traditional 'black box' of the network, allowing our teams to resolve bottlenecks before they impact the traveler experience." - Alfredo Rodriguez, VP Cloud Platform Infrastructure, Sabre Corporation

Cross-Cloud Network for distributed applications

Multicloud and hybrid networks require secure, reliable, and high-performance connectivity. New enhancements for our foundational networking services and tools include:

Private Service Connect

Private Service Connect traffic volume grew 4x in 2025 and it now supports 40+ Google and third-party published services, enabling secure private global access to your managed services.
Private Service Connect endpoint-based security allows for granular authorization policies for producer-to-consumer service communications (preview).
Gemini Cloud Assist for Private Service Connect provides automated troubleshooting (preview).

Cloud-native IP address management (IPAM)

Cloud Number Registry is an IPAM solution powered by agentic technologies. Network admins can easily find free IP ranges, track utilization, and allocate resources (preview). It also integrates with Infoblox Universal DDI for Cross-Cloud Network IPAM discovery and enforcement.
Hybrid Subnets allow you to migrate legacy workloads from on-premises to a VPC without needing to change hard-coded IP addresses (GA).
Cloud NAT allows you to connect your IPv6-only workloads to private IPv4 destinations using the combined power of DNS64 and private NAT64 (in preview soon).
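
As a rough illustration of how DNS64 and NAT64 fit together, the sketch below embeds an IPv4 address in the low 32 bits of a /96 IPv6 prefix, following the address format in RFC 6052. The default prefix is the RFC's well-known prefix, used here as an assumption; the source does not say which prefix Cloud NAT uses, and RFC 6052 reserves the well-known prefix for global IPv4 addresses, so private destinations would typically use a network-specific prefix passed in via `prefix`. Function names are hypothetical.

```python
import ipaddress

# RFC 6052 well-known prefix. Assumption: a real deployment may use a
# different, network-specific /96 prefix.
WELL_KNOWN_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def synthesize_ipv6(ipv4, prefix=WELL_KNOWN_PREFIX):
    """Build the IPv6 address a DNS64 resolver would return for an IPv4-only host."""
    if prefix.prefixlen != 96:
        raise ValueError("this sketch only handles /96 prefixes")
    v4 = ipaddress.IPv4Address(ipv4)
    # With a /96 prefix, the IPv4 address occupies the last 32 bits.
    return ipaddress.IPv6Address(int(prefix.network_address) | int(v4))


def extract_ipv4(ipv6):
    """Recover the embedded IPv4 address, as a NAT64 gateway would before translating."""
    v6 = ipaddress.IPv6Address(ipv6)
    return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)


if __name__ == "__main__":
    synthesized = synthesize_ipv6("192.0.2.33")
    print(synthesized)
    print(extract_ipv4(synthesized))
```

The IPv6-only client only ever sees the synthesized address; the translation back to IPv4 happens at the NAT64 gateway.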

Network Connectivity Center (NCC)

Partner Cross-Cloud Interconnect for AWS is available as a connectivity type in NCC (preview).
Support for static routes using an internal load balancer as the next hop allows the integration of Secure Web Proxy and third-party network security virtual appliances (GA).
Support for privately used public IP (PUPI) allows the exchange of PUPI IPv4 addresses with VPC spokes and producer VPC spokes (GA).

Granular networking charge visibility

Cost Explorer and the new App Optimize API now provide attribution of associated Data Transfer costs to the originating resources for Google Cloud products (in preview soon).

Cross-Cloud Network for internet-facing services

As part of Cross-Cloud Network, the Global Front End simplifies how you deliver, scale, and protect web, API, and AI workloads. New capabilities include:

Global Front End Enterprise delivers simplified consumption by combining capabilities from global Cloud Load Balancing, Google Cloud Armor, Cloud CDN, and Service Extensions with up to 15% lower TCO (in preview soon).
Post-quantum cryptography (PQC) helps secure your workloads with industry-standard algorithms that provide a layered defense against both classical and quantum adversaries.
Google tag gateway enables advertisers to serve tags from their own domain, which can significantly improve the accuracy and resilience of measurement signals (GA soon).

In addition, Cloud CDN, an important part of the Global Front End, now offers:

Built-in image optimization to help you deliver content that best fits your end users’ screens and saves on bandwidth costs (in preview soon).
GKE Gateway support so you can enable and manage caching services using GKE APIs (GA).

Cross-Cloud Network’s Cloud WAN for global enterprises

Cloud WAN is a fully managed, reliable global backbone to connect your enterprise. New capabilities include:

Expanded geographic reach: Our network spans more than 10 million kilometers of terrestrial and subsea fiber, and Network Connectivity Center’s site-to-site data transfer is now available in over 25 countries.
NCC Gateway enables third-party secure service edge (SSE) integrations from Palo Alto Networks (GA soon) and Symantec (preview).
The Verified Peering Provider program, which offers highly reliable internet connectivity to Google, now has dramatically expanded availability through 175+ providers worldwide.
Last mile connectivity: Provision site-to-cloud private connectivity in minutes with preferred partners from the Google Cloud console (in preview soon).

“Cloud WAN enables Dun & Bradstreet to evolve our global network via composable, cloud-native constructs. Leveraging NCC, we’ve built a resilient, high-performance platform that simplifies operations and optimizes costs. This foundation supports continued modernization and AI-driven workloads. We expect to extend this architecture as new patterns emerge, maintaining our blueprints-first approach.” - Josh Barry, VP, Network Engineering, Dun & Bradstreet

AI-powered security against evolving threats

The threat landscape is evolving faster than ever, with AI-driven attacks. Staying ahead requires the latest defenses. Cross-Cloud Network relies on Cloud NGFW and Cloud Armor for advanced security capabilities. Here’s the latest on those offerings.

Cloud NGFW

Advanced malware sandbox uses AI models trained on data from 70k+ customers to stop 99% of known and unknown malware, including evasive zero-days. Advanced malware sandbox is powered by Palo Alto Networks Advanced WildFire (in preview soon).
Internal Application and proxy Network Load Balancer support helps to enforce consistent, service-centric security for abstracted services like GKE, Cloud Run, and Private Service Connect traffic (preview).
Project-level policies allow for creating and managing Cloud NGFW endpoints, security profiles, and security profile groups at the project level (in preview soon).

Cloud Armor

Managed rules, built-in rulesets across 15 threat categories, deliver automated threat protection against a broad set of attacks and zero-day CVEs. This is powered by Thales Imperva, based on visibility into 1.5 trillion web requests each month (in preview soon).
Google Cloud Fraud Defense integration helps to discern the legitimacy and authorization of bots, humans, and agents. Fraud Defense is the evolution of reCAPTCHA, which protects over 14 million domains from fraud and abuse.
Adaptive protection for Network Load Balancers & VMs brings advanced machine learning to L3/L4 traffic, to detect and mitigate volumetric DDoS attacks (in preview soon).
A simplified user experience with a visual rule builder makes custom rule creation easier (in preview soon).

AI-powered network operations

Finally, new AI-powered technologies in Gemini Cloud Assist can help automate manual tasks, ease troubleshooting, predict reliability issues, improve security, and help optimize your network to reduce toil and improve reliability with new specialist agents. These include:

A network security agent that streamlines network security operations by assisting with policy generation, recommendations, and impact analysis (in preview soon).
A network agent that optimizes workload placement for performance and reliability, and also provides advanced cost estimation for observability services (in preview soon).

Additionally, to enable customers and partners to build their own agents, we are releasing Network observability MCP tools and agent skills. These let their agents run connectivity tests and query VPC Flow Logs in natural language (both in preview).

The network that scales with you

We built our Cross-Cloud Network on the same global infrastructure that powers Google’s largest AI and internet services. This provides you with a blazing-fast, planet-scale foundation that is both secure by design and open by principle, allowing you to integrate your trusted partners across any environment.

As we move into the agentic era, our flexible, future-proof solutions ensure you can quickly adopt the latest AI technologies while maintaining the reliability of your core applications.

Whatever comes next, we’ve built the network to help you lead it. To learn more, attend our networking sessions at Next ’26 or read about the Cross-Cloud Network.

Introducing Virgo Network, Google’s scale-out AI data center fabric

Wed, 22 Apr 2026 12:00:00 +0000

The AI era requires a fundamental rethink of physical cloud architecture — networking, in particular. With foundational model parameters growing exponentially, traditional general-purpose networks are reaching their breaking points. To fuel the next decade of machine learning, Google designed Virgo Network, a new megascale AI data center fabric that embraces a "campus-as-a-computer" philosophy, and that underpins our AI Hypercomputer.

Legacy network designs simply cannot handle some of the constraints of modern AI:

Massive scale: Training demands now exceed the power and space of a single data center, requiring unified, multi-data-center domains.
Explosive bandwidth growth: Because foundational model training is heavily network-bound, the required bandwidth per accelerator has surged significantly over the last few years, creating throughput and congestion bottlenecks for older architectures.
Synchronized bursts: Intense, millisecond-level traffic spikes (figure 1) put immense pressure on network buffers. The outcome is that even a single "straggler" node can throttle the entire cluster’s performance.
Low latency: ML serving requires fast, consistent response times to deliver real-time inference, making strict latency control a critical architectural constraint.
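
The straggler effect described above can be sketched in a few lines of Python. The model is a deliberate simplification and an assumption of this example: every node must finish its work before a synchronized step completes, so the step takes as long as the slowest node. The node timings are made up for illustration.

```python
from statistics import median


def synchronous_step_time(node_times_s):
    """A synchronized step finishes only when the slowest node finishes."""
    return max(node_times_s)


def straggler_slowdown(node_times_s):
    """Ratio of actual step time to the time a typical (median) node needs."""
    return synchronous_step_time(node_times_s) / median(node_times_s)


if __name__ == "__main__":
    # Hypothetical cluster: 999 healthy nodes at 1.0 s and one straggler at 1.5 s.
    times = [1.0] * 999 + [1.5]
    print(f"step time: {synchronous_step_time(times):.2f} s")
    print(f"slowdown vs. typical node: {straggler_slowdown(times):.2f}x")
```

In this toy model, one slow node out of a thousand stretches every step by 50%, which is why straggler detection matters more as clusters grow.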

Figure 1: Sub-millisecond line-rate bursts of an AI training workload

Reimagining the data center network

Meeting the demands of the AI era requires a fundamental shift away from general-purpose network design towards a specialized flat, low-latency network architecture. To address the unique scale and latency constraints, we leverage our proven Jupiter network for north-south traffic and are introducing a new fabric for east-west communication. The resulting architecture consists of three distinct and specialized layers that operate as one unified compute domain:

Scale-up domain: A high-bandwidth, low-latency interconnect fabric designed for tightly coupled communication between accelerators within a single pod.
Scale-out accelerator fabric (east-west): A dedicated accelerator-to-accelerator remote direct memory access (RDMA) fabric optimized for massive horizontal scale across pods. This layer is engineered for deterministic latency and maximum resilience, to provide high “goodput” for the ML workload.
Jupiter front-end network (north-south): A high-capacity fabric that provides fast, reliable access to distributed storage and general-purpose compute resources. It ensures that data access does not become a bottleneck for training and serving workloads, and is also used to scale across multiple sites for very large training runs.

This architectural decoupling provides key strategic advantages:

Independent evolution: We can evolve and upgrade each network domain independently, preventing system-wide disruptions while accelerating the innovation cycle.
Dedicated scale-out bandwidth: A non-blocking network delivers massive bisection bandwidth to accelerators for critical training tasks.
ML and network co-design: The network is built in lockstep with each new generation of ML accelerators, helping ensure the fabric is matched to the hardware it supports.

Figure 2: Data center network architecture

Introducing Virgo Network: Megascale data center fabric

Virgo Network is a scale-out fabric designed for the extreme requirements of modern AI workloads. Built on high-radix switches that reduce network layers by allowing more ports per switch, it employs a flat, two-layer non-blocking topology. Compared with traditional datacenter networks, this significantly reduces latency by minimizing network tiers. It features a multi-planar design with independent control domains to connect accelerators (figure 3). The accelerator racks also connect with the Jupiter north-south fabric to access compute and storage services. Together, this streamlined architecture delivers the massive bisection bandwidth and deterministic low latency necessary for both distributed training and serving workloads.
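
To see why switch radix matters for a flat topology, the sketch below uses the textbook folded-Clos (leaf-spine) capacity formulas. This is a generic model, not a description of Virgo Network itself: the source does not state the switch radix, the number of planes, or the exact wiring, and the function names are hypothetical.

```python
def two_tier_endpoints(radix: int) -> int:
    """Max endpoints in a non-blocking two-tier leaf-spine of radix-port switches.

    Each leaf uses half its ports for endpoints and half for uplinks, and each
    spine can reach at most `radix` leaves, giving radix * radix / 2 endpoints.
    """
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    return radix * (radix // 2)


def three_tier_endpoints(radix: int) -> int:
    """Max endpoints in a classic three-tier fat tree: radix^3 / 4."""
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    return radix ** 3 // 4


if __name__ == "__main__":
    for radix in (32, 64, 128):
        print(radix, two_tier_endpoints(radix), three_tier_endpoints(radix))
```

Because two-tier capacity grows with the square of the radix, doubling the ports per switch quadruples the endpoints reachable without adding a third tier and its extra hops.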

Figure 3: Megascale data center fabric (Virgo Network)

Virgo Network is the foundation of our next-generation accelerator designs and delivers the following advantages:

Massive fabric scale: Virgo Network can link 134,000 chips (TPU 8t) with up to 47 petabits/sec of non-blocking bisection bandwidth in a single fabric.
Generational performance leap: With up to 4x the bandwidth per accelerator (TPU 8t) over the previous generation, Virgo Network delivers the bandwidth you need to get the full power of every chip.
Predictable low latency: Virgo Network delivers 40% lower unloaded fabric latency for TPUs compared to the previous generation, leading to more predictable performance for latency-sensitive AI workloads.
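
For a sense of scale, the headline figures above can be turned into a rough per-chip number. The Python sketch below simply divides the quoted bisection bandwidth by the quoted chip count. Treat the result as an average share only: it is not a published per-chip specification, and the relationship between bisection bandwidth and per-chip injection bandwidth depends on how the bisection is defined, which the source does not specify.

```python
FABRIC_CHIPS = 134_000   # chips per fabric, as quoted above
BISECTION_BPS = 47e15    # 47 petabits/sec, as quoted above


def average_share_gbps(bisection_bps: float, chips: int) -> float:
    """Bisection bandwidth divided evenly across all chips, in Gbit/s."""
    if chips <= 0:
        raise ValueError("chips must be positive")
    return bisection_bps / chips / 1e9


if __name__ == "__main__":
    share = average_share_gbps(BISECTION_BPS, FABRIC_CHIPS)
    print(f"about {share:.0f} Gbit/s of bisection bandwidth per chip")
```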

Improving reliability at scale

In a system supporting hundreds of thousands of chips, hardware failures are inevitable. Because a single faulty component can disrupt a synchronized training job, reliability at scale is a primary focus. To maximize workload goodput, we designed the Virgo Network architecture around fault isolation, deep observability, and the rapid mitigation of hangs and stragglers.

At this scale, system-wide resilience requires a solid network foundation. Virgo Network integrates independent switching planes that provide robust fault isolation, protecting cluster-wide goodput from being degraded by localized hardware failures.

Figure 4: How fail-stop and fail-slow impact MTTR

Building on this foundation, we optimize the software and orchestration stack to maximize mean time between interruptions (MTBI) and minimize mean time to recovery (MTTR) through two primary areas:

Observability: Reliability at scale requires high-fidelity visibility. We use sub-millisecond telemetry to monitor network systems. This deep visibility allows us to detect transient congestion, optimize buffer management, and pinpoint the root causes of slowdowns across the hardware and software stack.
Identifying stragglers and hangs: Proactive monitoring is critical for identifying nodes that are experiencing performance degradation (stragglers) or that have stopped responding completely (hangs). By rapidly localizing these bottlenecks, with automated straggler and newly added hang detection, we accelerate the training job and protect it from localized slowdowns.
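
One way to see how MTBI and MTTR combine is the standard steady-state availability ratio, sketched below in Python. This is a simplified model chosen for illustration, not the goodput definition used for Virgo Network: it ignores work lost since the last checkpoint and fail-slow degradation, and the sample numbers are hypothetical.

```python
def goodput_fraction(mtbi_hours: float, mttr_hours: float) -> float:
    """Fraction of wall-clock time spent making progress: MTBI / (MTBI + MTTR)."""
    if mtbi_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBI must be positive and MTTR non-negative")
    return mtbi_hours / (mtbi_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical job: interrupted once a day on average.
    for mttr in (1.0, 0.5, 0.1):
        print(f"MTTR {mttr:>4} h -> goodput {goodput_fraction(24.0, mttr):.3%}")
```

In this model, raising MTBI and lowering MTTR both push the fraction toward 1, which is why the two are optimized together.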

The foundation of the AI Hypercomputer

Virgo Network is a reimagined scale-out data center network custom-built for the stringent demands of modern AI workloads. This flat, multi-planar architecture unifies accelerators across pods into a single compute domain, addressing the bandwidth and scale limitations of traditional networks. By providing robust fault isolation directly at the hardware level, Virgo Network serves as the foundation for system-wide resilience, protecting synchronized workloads from localized hardware faults.

Ultimately, Virgo Network delivers the scale, predictable latency, and reliability necessary to accelerate the agentic AI era. To learn more about how we are building infrastructure for the future of AI, visit our AI infrastructure solutions page, explore the technical documentation, or attend the dedicated breakout session at Google Cloud Next.

Next ‘26: Redefining security for the AI era with Google Cloud and Wiz

Wed, 22 Apr 2026 12:00:00 +0000

The AI era demands a new security era. Organizations are facing the dual challenge of harnessing the potential of AI while defending against its malicious use, and Google Cloud can help you adapt and thrive.

The latest research from Google Cloud shows that adversaries are using AI to accelerate the speed, scale, and sophistication of attacks. M-Trends 2026 also showed that increased coordination among threat actors has cut the time to hand off initial access to a secondary threat actor from eight hours to 22 seconds over the last three years.

Today at Google Cloud Next, we are showcasing how Google Cloud can help you defend against increasingly sophisticated threats at machine speed, protect AI and multicloud environments, and secure cloud workloads at scale.

Delivering agentic defense

Our full-stack AI approach, from the chips to the models, gives you a competitive advantage with better integration and velocity to help protect customers. Not only can Google act on insights from the world’s largest threat observatory and Mandiant frontline experts, but we also bring cutting-edge insights and breakthroughs from Google DeepMind to help make your platforms more secure.

Today we are introducing three new agents in Google Security Operations to help you defend at the speed of AI.

Threat Hunting agent, now in preview, can help teams proactively hunt for novel attack patterns and stealthy adversary behaviors that bypass traditional defenses.
Detection Engineering agent, now in preview, can identify coverage gaps and create new detections for threat scenarios, reducing toil and transforming detection creation from a manual craft into an automated science.
Third-Party Context agent, coming soon to preview, can enrich your workflows with contextual data from third-party content.

Initiating a threat hunt with the Threat Hunting agent

Our Triage and Investigation agent processed over 5 million alerts in the last year, reducing a typical 30-minute manual analysis to 60 seconds with Gemini.

“Operational resilience and cybersecurity are the bedrock of customer trust at BBVA. By integrating advanced artificial intelligence, such as the Triage and Investigation agent, we are able to scale in new ways,” said Diego Martinez Blanco, head of Security Technology, BBVA.

“It handles the initial heavy lifting and filters out false positives so we can prioritize issues that require human attention. The agent's transparent explanations allow our team to understand recommendations and ultimately dedicate our resources to more complex investigations,” he said.

You can build your own security agents with remote Google Cloud model context protocol (MCP) server support for Google Security Operations, now generally available. To make it even easier, you can also access the MCP server client directly from the Google Security Operations chat interface, available in preview.

Findings report created by the Threat Hunting agent

Security teams can also automate response actions with agentic automation in Google Security Operations. To further move teams from manual triage to agentic defense, we introduced dark web intelligence in Google Threat Intelligence, now in preview. Internal tests show it can analyze millions of daily external events with 98% accuracy to elevate threats that truly matter.

"IDC found that organizations experienced measurable operational gains, including substantial reductions in mean time to detect and mean time to respond, fewer false positives, and higher analyst productivity with AI-powered context and automation. These operational improvements translate into significant business outcomes, such as shorter disruption periods, lower incident-related costs, and improved executive confidence in security posture and decision-making," said Christopher Kissel, research vice president, IDC. "Organizations leveraging an intelligence-led, AI-augmented approach to modern security operations with Google Cloud's agentic defense can realize a strong ROI."

New partner-supported workflows for Google Security Operations

Today, we are also announcing a robust cohort of new partner integrations for Google Security Operations. Designed to deliver high-fidelity security workflows right out of the box, our latest participating Google Cloud Security integration ecosystem partners include Darktrace, Gigamon, and SAP.

Protecting AI and cloud applications across any infrastructure

AI and cloud applications are built across multiple platforms and models. To protect them end-to-end, we want to make it easier and faster to mitigate risk, regardless of where and how you build. This support includes major cloud environments like Amazon Web Services, Google Cloud, Microsoft Azure, and Oracle Cloud; software-as-a-service (SaaS) environments like OpenAI; and even custom hosted environments.

Wiz, now a part of Google Cloud, expands and deepens our ability to protect the apps you build and run. Wiz empowers you to quickly and securely adopt AI, while also helping protect the AI development lifecycle.

Wiz announced its AI-Application Protection Platform (AI-APP) at the RSA Conference, providing deep visibility, risk posture, and runtime analysis for your AI applications. Wiz also announced Wiz Security Agents and Wiz Workflows, helping you identify and respond to risks and threats at machine speed.

Today, we’re taking our commitment to secure customers in any cloud, platform, and AI environment further. Wiz now supports Databricks as well as new agent studios like AWS Agentcore, Gemini Enterprise Agent Platform, Microsoft Azure Copilot Studio, and Salesforce Agentforce, so customers gain visibility however their teams choose to build.

In addition, Wiz continues to support security ecosystems with integrations to the outer layer of the cloud, including Google Cloud Apigee, Cloudflare AI Security for Apps, and the Vercel platform, further extending the power of the Wiz Security Graph. We’ve also updated how we integrate security detections from Wiz Defend with Google Security Operations and Mandiant Threat Defense to help analysts more easily configure automatic threat information forwarding.

Wiz is also announcing new capabilities designed to secure the AI-native development lifecycle, helping teams to innovate faster and more securely:

Secure vibe-coded applications: Wiz is announcing a new integration, generally available in May, that runs Wiz security scanning directly inside the Lovable platform so vulnerabilities, secrets, and misconfigurations caught by Wiz surface in Lovable's built-in security view, right where teams are already building.
Secure AI-generated code: Wiz removes risks from AI-generated code the moment it is created. Inline AI security hooks integrate directly into IDEs and agent workflows to evaluate prompts and scan AI-generated output instantly, injecting security guardrails before the code is ever committed.
Agent-based remediation: Wiz Skills equip coding agents and AI-native IDEs with full code-to-cloud context and validated attack surface findings from the Wiz Security Graph. These capabilities enable teams to trigger automated, agent-driven remediation workflows either locally from the developer's individual IDE or globally at the repository and pull request level within your version control system.
Eliminate shadow AI: Wiz’s dynamic AI-Bill of Materials (AI-BOM) automatically inventories all AI frameworks, models, and IDE extensions across your environment. This provides complete visibility into what is writing code across your stack, allowing you to track sanctioned corporate tools like Gemini Code Assist and GitHub Copilot while simultaneously uncovering unapproved shadow AI plugins.

You can learn more about the Wiz announcements here.

Securing your agents and the agentic web

In addition to securing your cloud and AI workloads, Google Cloud’s secure-by-design foundation can help you innovate at the speed of AI — from agents to fraud defense to the web.

Securing and governing agents with the Gemini Enterprise Agent Platform
To build, orchestrate, govern, and optimize agents, today we are announcing Gemini Enterprise Agent Platform including:

Agent Identity to enable access management and AI governance at scale. Our new capability provides agents unique identities to operate autonomously with specific authentication flows, and with scoped human delegation.
Agent Gateway, which enables policy enforcement for all agent-to-agent and agent-to-tool connections. It governs your enterprise agent traffic and understands agent protocols like MCP and Agent2Agent (A2A) to inspect and secure every agent interaction.
Model Armor, our runtime protection for model and agent interactions, now integrates with Agent Gateway, Agent Runtime, and LangChain (in preview) and with Firebase (generally available) to help developers add inline enforcement and sanitization of agent traffic and interactions without changing code. These integrations expand Model Armor's protection against runtime risks such as prompt injections, tool poisoning, and sensitive data leakage across Google Cloud services and our AI portfolio.

Securing the agentic web with Google Cloud Fraud Defense and Chrome Enterprise
Today, we are evolving reCAPTCHA with the launch of Google Cloud Fraud Defense, generally available. This comprehensive platform is designed to discern the legitimacy and authorization of bots, humans, and agents. Using the same scale and signals that protect Google’s own ecosystem, Fraud Defense will soon offer in preview agent-specific capabilities for human users and AI agents that can help secure the digital commerce journey, from account creation and login to payment and checkout.

Our commitment to securing AI extends to the browser, a vital endpoint for interacting with AI. Chrome Enterprise provides comprehensive data protection for the AI era with the visibility and controls needed to embrace AI safely without compromising corporate data:

AI-aware extension threat detections, now in preview, can surface advanced extension telemetry that helps security teams detect and respond to anomalous AI agent activity.
New shadow AI reporting, generally available soon, can help you gain visibility into the shadow AI landscape by flagging employee use of unsanctioned web-based AI and SaaS applications.

What’s new in Trusted Cloud

We continue to offer new security controls and enhance capabilities across identity, data, and networking on our cloud platform to help you secure your environments. Today we’re announcing the following updates:

Simplifying permissions with modern IAM
To help achieve least privilege quickly and simply, we’ve streamlined our predefined roles catalog with easy-to-use administrator, editor, and viewer roles, and added capabilities such as the IAM role picker and the ability to re-authenticate sensitive actions.

Data security
We are announcing several new capabilities for our cloud platform data security portfolio to help protect your most sensitive data and accelerate AI transformation.

Confidential Computing: In partnership with NVIDIA, today we’re announcing Confidential Computing support for G4 VMs, featuring NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs on Google Compute Engine (GCE) Confidential G4 VMs, available in preview globally, to help strengthen confidentiality and integrity for a wide spectrum of sensitive AI workloads. In partnership with Intel, we’re also introducing the preview of C4 Confidential VMs, bringing Intel TDX to 6th Gen Xeon processors to help protect diverse AI and analytics workloads while providing industry-leading compute density and performance.
Cloud Key Management Services (KMS): We are announcing the new Confidential External Key Manager (cEKM) in preview, giving you the flexibility to host and protect external keys in any region and maintain verifiable control within a confidential environment.
Post-quantum cryptography (PQC): We are introducing KMS Quantum Safe Key Imports, available in preview, to help you bring your own keys with quantum-safe algorithms.
Secret Manager: To help prevent password leaks and mitigate prompt injection risks, we are announcing the general availability of the native integration of our Secret Manager with Agent Development Kit.

Network security
Google Cloud’s Cross-Cloud Network security products offer several new capabilities:

Cloud NGFW: We’re announcing the Cloud NGFW advanced malware sandbox, in preview later this year, to help defend against highly evasive zero-day threats. This capability is powered by Palo Alto Networks Advanced WildFire, trained on data from more than 70,000 Palo Alto Networks customers to stop 99% of known and unknown malware.
Cloud Armor: We have released new Cloud Armor managed rules, powered by Thales Imperva and available in preview, to detect Layer 7 application attacks and zero-day CVEs (like React2Shell).

Advancing Google Cloud security with SCC
As our Google Cloud-native security solution, Security Command Center (SCC) establishes a cloud security baseline to protect both your traditional and AI applications on Google Cloud:

AI agents, models, and MCP servers are secured by providing continuous discovery and comprehensive risk analysis to identify threats, vulnerabilities, and misconfigurations.
SCC will add deep runtime visibility to uncover shadow AI for your Google Cloud workloads. Coming soon in preview, SCC will automatically discover unmanaged agentic workloads (including agents and MCP servers hosted on Cloud Run and GKE, and inference endpoints running on GKE) and surface them as posture findings in SCC.
Our enhanced Security Command Center Standard tier provides data security posture management, compliance, vulnerability management, and risk analysis to help any Google Cloud customer establish strong security, compliance, and risk coverage from the start at no additional cost.

Take the next step

When you make Google part of your security team, you gain the power of an intelligence-driven, AI-native defense; the freedom of an open cloud that’s secure-by-design; and the industry's most battle-tested experts as an extension of your organization.

For more on these new innovations and how you can secure what’s next, tune in to watch our security spotlight. And be sure to check out the many great security breakout sessions — live and on-demand — to learn more about all of our Next ‘26 announcements.

Cross-cloud infrastructure innovation for the agentic enterprise

Wed, 22 Apr 2026 12:00:00 +0000

The era of agentic AI is accelerating from human- to machine-speed operations, while also creating profound stress on legacy technology infrastructure. This new reality pushes foundational systems to their limits: agents generate thousands of internal messages and complex queries, spawning more agents, all of which can rapidly overwhelm traditional networks and databases, and expose new security vulnerabilities.

Unlocking AI's full potential in the era of agents requires a secure, adaptive foundation. We call it cross-cloud infrastructure for the agentic enterprise – and at Google Cloud Next ‘26, we’re launching a powerful set of new innovations across four areas:

What’s new:

Fluid compute: Google Compute Engine and Kubernetes services work together to enable cost-effective, high-speed AI agents and enterprise workloads with new compute and orchestration capabilities.
Secure cross-cloud connectivity: Agent Gateway, Cloud Armor, and other tools deliver a secure, governed, and simplified networking foundation for AI agents, including observability of agentic traffic across clouds.
Unified data layer: Smart Storage, Knowledge Catalog, and other innovations transform passive data archives into dynamic reasoning engines, giving AI agents the context they need to execute.
Digital sovereignty: Confidential External Key Management and new features in Google Distributed Cloud bring Google’s leading models and AI enablers wherever your data lives.

Let’s take a closer look at all the news for each of these four areas.

Fluid compute

Agentic workloads are dynamic and unpredictable, impacting both traditional enterprise applications and the AI agents themselves. Fluid compute is enabled by Google Compute Engine and Google Kubernetes services working together to dynamically adapt and shift weight in real-time, enabling cost-effective, high-speed AI agents and operational enterprise workloads for all customers.

While our AI Hypercomputer delivers raw power for large-scale AI model training, fluid compute addresses the needs of operational workloads and agents. As agents move toward reasoning and reinforcement learning, CPUs are reclaiming a central role, excelling at the "branchy" logic, complex control flows, and secure execution sandboxes (like those for agentic orchestration, RL, SLM inference, and RAG) that agent workflows demand. CPUs also provide the critical isolation needed for secure agent execution, complementing the parallel processing strength of GPUs and TPUs used in training.

We are introducing new CPU families, GKE capabilities, and Hyperdisk block storage capabilities to run traditional workloads and AI agents securely at scale, including:

Google C4N Series: These VMs help ensure your enterprise workloads don't slow down under the demands of agentic AI by processing up to 95 million packets per second, up to 40% faster than other leading hyperscalers. This eliminates I/O bottlenecks for demanding workloads like security appliances, streaming media, and open source databases, even when utilizing smaller instance sizes.
Google M4N Series with Hyperdisk Extreme: M4N removes data pipeline bottlenecks and eliminates overprovisioning to deliver industry leading per-core IOPS and throughput required to handle massive data I/O from agents, analytics, and mission-critical databases. M4N provides 26.57 GB of RAM per vCPU, allowing you to scale mission-critical workloads cost-effectively on fewer cores. For example, M4N with Hyperdisk Extreme reduces Oracle workload total cost of ownership by over 20% compared to leading hyperscale clouds.
GKE Agent Sandbox: This solution secures agents with trusted gVisor isolation and handles demand spikes, launching up to 300 sandboxes per second, per cluster. Backed by the only managed sandbox technology available among leading hyperscale clouds, it achieves up to 30% better price-performance than competitors when running AI agents on GKE Agent Sandbox with Google Axion N4A.

“Wayfair's AI strategy is built on years of systematic infrastructure modernization on Google Cloud — migrating our core eCommerce engine and databases off legacy systems, decomposing monolithic services into cloud-native architecture, and unifying our data and analytics platform. That foundation is what makes everything else possible. Today, Gemini Enterprise Agent Platform is powering everything from catalog enrichment to generative shopping experiences that help customers create a home that's just right for them — and it's the same foundation preparing us for the agentic era, where AI doesn't just assist but actively drives discovery, personalization, and commerce across every customer touchpoint and across our business.” - Fiona Tan, Chief Technology Officer, Wayfair

Explore all our latest compute innovations in this blog.

Secure cross-cloud connectivity

Agentic AI replaces predictable human requests with autonomous “reasoning loops,” in which agents call other agents that, in turn, call LLMs, triggering massive, sudden surges in compute and machine-to-machine traffic. This shift creates unique challenges for network predictability and security of non-human identities. Optimized for agentic AI, our Cross-Cloud Network moves data across diverse environments, connecting employees, customers, and agents with visibility and security. New in Cross-Cloud Network are:

Agent Gateway: Governs and orchestrates your enterprise agentic traffic as the “air traffic controller” in Gemini Enterprise Agent Platform. It natively understands agent protocols like MCP and A2A to inspect and govern every agent interaction. By integrating with Google and third-party identity and AI safety services, it enables deep inspection to verify access, block attacks, and protect sensitive data, maintaining compliance across your core business.
Cloud Network Insights: Delivers broad visibility across your hybrid and multi-cloud infrastructure to drive faster troubleshooting and network resolutions. Continuously monitor your end-to-end agent, network and web performance across Google Cloud, AWS, Azure, data centers, internet applications, and agentic workloads. Using synthetic traffic analytics, Cloud Network Insights provides hop-by-hop network path visibility to help you pinpoint the source of degradations and is coupled with AI-powered insights from Gemini Cloud Assist to deliver more autonomous operations.
Enhanced Cloud Next Generation Firewall (NGFW) and Cloud Armor: Provide machine-speed, AI-powered protection to combat the rapid explosion of AI-generated polymorphic malware and zero-day exploits. Cloud NGFW advanced malware sandbox delivers real-time inline prevention of AI-generated threats, while Cloud Armor managed rules provide automated protection against both known and unknown Common Vulnerabilities and Exposures (CVEs). Together with Model Armor, these services analyze the intent and content of AI agent communications.

Discover more about how we optimized networking for AI in and outside of the data center.

Unified data layer

AI agents are only as powerful as the data they can access and the context they’re given. More applications and platforms are using structured and unstructured data, but it can be difficult to catalog, find, and act on that data at scale, leading to less effective agent interactions. To close the gap, your agents need all of your data brought together into a cohesive, queryable knowledge engine, or unified data layer. This way, your agents can identify and access accurate sources. At Next ‘26, we’re enhancing the unified data layer with:

Smart Storage: This solution transforms dark data into a powerful knowledge asset for AI agents and training by embedding new semantic intelligence directly into your data objects. With new Google Cloud Storage capabilities like automated annotation, entity extraction, and semantic search, your agents can instantly find and use the specific data they need — whether it's hidden in spreadsheets, PDFs, or other unstructured formats across your entire organization. This significantly speeds up the development and deployment of your AI solutions. Learn more about storage innovations to accelerate your AI workloads.
Knowledge Catalog: Knowledge Catalog maps business meaning across your entire data estate, providing a grounded source of truth so agents can deliver the most accurate results. This foundation enables AI training and inferencing and doesn’t require you to migrate your data; your agents interact with it directly, wherever it lives, with full context and governance, making modernization easier.

Part of our Agentic Data Cloud, Smart Storage and Knowledge Catalog can take your data from a passive archive into a dynamic reasoning engine.

“AI is critical to making our customers’ smart home and security solutions more intelligent and convenient. By leveraging Google Cloud’s Smart Storage, we auto-annotate rich metadata delivered in BigQuery. We’ve scaled and accelerated our data discovery and curation efforts, speeding up our AI development process from months to weeks, continuously delivering innovations that build trust and enhance the overall home experience.” - Brandon Bunker, VP of Product, AI, Vivint

Digital sovereignty

In the agentic era, digital sovereignty is a fundamental requirement for public sector and enterprise customers looking to accelerate innovation — without sacrificing control. There’s no one-size-fits-all solution, which is why we’ve designed a comprehensive set of offerings to meet different sovereign AI needs anywhere: public cloud, on-premises, or hybrid. New capabilities in our sovereign AI portfolio include:

Confidential External Key Management: With Confidential External Key Management, you maintain complete possession, custody, and control of your encryption keys and the policies that govern them. It leverages Confidential Compute to host the key management endpoint in a tamper-proof environment within Google Cloud. You are in control and determine where your keys are stored, who can access them, and under what circumstances. Even highly privileged Google administrators cannot access your keys without authorization, which you can revoke at any time. Your data, your control.
Gemini on Google Distributed Cloud: With Gemini on GDC, companies can securely deploy Gemini in sensitive environments, while meeting data sovereignty needs. Your choice of deployment models includes managed software on your connected hardware or a fully disconnected, air-gapped solution. You can now scale with Google's leading AI capabilities even in the most restricted, high-security environments — from powerful Gemini models to advanced coding, search, and other agentic capabilities.

In addition, Google Distributed Cloud supports an end-to-end AI stack, combining our latest-generation AI infrastructure with Gemini models to accelerate and enhance all your sovereign AI workloads. This stack includes:

NVIDIA Blackwell GPUs: NVIDIA Blackwell (NVIDIA HGX B200) and NVIDIA Blackwell Ultra platforms (NVIDIA HGX B300) GPUs accelerate AI performance, leveraging fifth-gen NVIDIA NVLink to deliver data-center scale bandwidth directly to your environment.
New VM families: New A4 family offerings provide the ability to handle the most demanding inference tasks, delivering a 2.25x increase in peak compute. Memory-Optimized M2 and M3 bring the high memory-to-vCPU ratios needed for massive ERP and data analytics workloads on-premises.
Enhanced storage: Eliminate storage bottlenecks with 6x storage capacity per zone and a 10x performance boost, giving you the ability to do AI reasoning on-premises. Now, your data infrastructure moves at the speed of AI reasoning.

"Our customers demand high-performance, private AI inference without the risks of multi-tenancy. Google Distributed Cloud allows us to provide dedicated, low-latency environments that meet strict sensitive data requirements. With the ability to run Gemini on B200s and B300s, we can significantly increase inference speeds and provide the token throughput our clients need to scale." - Dave Driggers, CEO & Co-founder, Cirrascale Cloud Services

Transforming vision into reality

When these product areas converge, your infrastructure evolves into a high-performing, secure, adaptive foundation for the agentic era. We're not just offering tools; we're providing the architectural blueprint to enable enterprises and the public sector to rapidly embrace the full power of AI and agents with confidence.

To learn more about key industry trends for AI Infrastructure, read our State of Infrastructure in the Agentic AI Era report.

Evolving Media CDN for the world’s most demanding broadcast and streaming workloads

Fri, 17 Apr 2026 17:30:00 +0000

Editor’s note: In this post, we share joint insights from Raj Gulani, Director of Product Management for Network Experiences, and Dan Rayburn, Industry analyst with 30-plus years of experience covering streaming media.

In our combined experience observing and building within the media industry, one truth remains constant: the landscape is always evolving. Audience expectations for flawless, broadcast-quality streaming have become the undisputed baseline, while the scale of global live events continues to push the technical boundaries of content delivery.

From our shared perspective, the most successful platforms are no longer defined solely by their ability to handle massive scale. Instead, they are distinguished by their evolution — how they adapt to solve the complex operational and financial challenges that broadcasters and streaming services face every day. This post offers a joint look at some of these key industry demands and how platforms are innovating to meet them.

The need for scale, flexibility, and efficiency

The need to support massive audiences during live global events like the Super Bowl, FIFA World Cup, and IPL is a given. In response to this clear industry trend, content delivery networks (CDNs) must continuously scale their infrastructure to support peak traffic demand. We’ve seen this firsthand with Google Cloud’s Media CDN, which shares infrastructure with YouTube and has had to actively respond to customer capacity needs with infrastructure presence in relevant regions, especially for live events.

Beyond raw capacity, however, a more nuanced story is unfolding around the need for greater architectural flexibility and more predictable cost models. We believe the focus has rightly shifted to providing smarter tools that help manage traffic, improve performance, and control costs. Here are a few examples of this:

Flexible caching architectures: One of the key challenges in global delivery is minimizing latency and cost. The introduction of features like flexible shielding – supported today in South Africa, the Middle East, and the US – is a direct answer to this. Such features allow traffic to be managed within a region, avoiding the performance and cost penalties of fetching content from a distant origin.
Solving for interoperability: As workflows become more complex, platforms need to be better integrated. We have seen a focus on addressing common origin compatibility issues through tactical engineering solutions. Examples include adding support for HEAD requests, increasing maximum segment sizes to 25MiB to accommodate 4K/8K content, and enabling multi-part range requests. These kinds of updates are crucial for ensuring a platform works with a customer’s existing infrastructure, not against it.
The shift to predictable cost models: In a maturing industry, operators need financial predictability. The move toward offering monthly savings plans, which provide TCO benefits for a committed level of use, is an important step beyond pure pay-as-you-go pricing models.
The critical need for broadcast-grade visibility: In our analysis of streaming operations, a lack of real-time visibility is a recurring point of failure. During a major live event, customers cannot wait for next-business-day response times; immediate intervention is a fundamental requirement for the event to run flawlessly. The use of tools like monitoring as a service (MaaS) during major live events highlights the industry's shift toward proactive, data-driven operations. By providing a "broadcast operating center" view into everything from origin health to end-user quality of service, such tools empower engineering teams to identify and mitigate potential problems before they impact the audience.
A shared outlook on the future: The evolution of content delivery platforms is a clear indicator of the media industry's priorities. The focus is increasingly on providing data-driven scaling, sophisticated operational tooling, and tangible architectural and financial benefits. This move toward solving specific, complex challenges demonstrates a maturing market, and it’s a direction we both believe is critical for the future of broadcasting and streaming.
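To make the interoperability point about multi-part range requests more concrete, here is a minimal sketch of how a "Range: bytes=..." header with several ranges resolves against a media segment. It follows general HTTP range semantics only; the function name parse_byte_ranges is hypothetical, and nothing here describes how Media CDN itself processes these requests.

```python
from typing import List, Optional, Tuple


def parse_byte_ranges(header: str, size: int) -> Optional[List[Tuple[int, int]]]:
    """Resolve a Range header value against an object of `size` bytes.

    Returns inclusive (start, end) offsets, or None when the value is
    malformed or no requested range can be satisfied.
    """
    unit, sep, spec = header.partition("=")
    if not sep or unit.strip().lower() != "bytes":
        return None
    ranges: List[Tuple[int, int]] = []
    for part in spec.split(","):
        first, dash, last = part.strip().partition("-")
        if not dash:
            return None
        if first == "":
            # Suffix form "-N": the final N bytes of the object.
            if not last.isdigit():
                return None
            start, end = max(size - int(last), 0), size - 1
        else:
            if not first.isdigit() or (last and not last.isdigit()):
                return None
            start = int(first)
            if last and int(last) < start:
                return None
            # An open-ended or oversized range is clamped to the last byte.
            end = min(int(last), size - 1) if last else size - 1
        if start <= end and start < size:
            ranges.append((start, end))
    return ranges or None


if __name__ == "__main__":
    segment_size = 25 * 1024 * 1024  # a 25 MiB segment, as in the example above
    print(parse_byte_ranges("bytes=0-1023,-1024", segment_size))
```

A single request can therefore ask for both the head and the tail of a large segment, which is the kind of access pattern an origin and a CDN need to agree on.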

For technical leaders looking to benchmark their current infrastructure against these trends, exploring modern edge architectures is a logical next step. You can learn more about implementing flexible caching and broadcast grade visibility by visiting the Media CDN documentation.

Migrating to Google Cloud’s Application Load Balancer: A practical guide

Fri, 10 Apr 2026 16:00:00 +0000

Migrating your existing application load balancer infrastructure from an on-premises hardware solution to Cloud Load Balancing offers substantial advantages in scalability, cost-efficiency, and tight integration within the Google Cloud ecosystem. Yet, a fundamental question often arises: "What about our current load balancer configurations?"

Existing on-premises load balancer configurations often contain years of business-critical logic for traffic manipulation. The good news is that not only can you fully migrate existing functionalities, but this migration also presents a significant opportunity to modernize and simplify your traffic management.

This guide outlines a practical approach for migrating your existing load balancer to Google Cloud’s Application Load Balancer. It addresses common functionalities, leveraging both its declarative configurations and the innovative, event-driven Service Extensions edge compute capability.

A simple, phased approach to migration

Transitioning from an imperative, script-based system to a cloud-native, declarative-first model requires a structured plan. We recommend a straightforward, four-phase approach.

Phase 1: Discovery and mapping

Before commencing any migration, you must understand what you have. Analyze and categorize your current load balancer configurations. What is each rule's intent? Is it performing a simple HTTP-to-HTTPS redirect? Is it engaged in HTTP header manipulation (addition or removal)? Or is it handling complex, custom authentication logic?

Most configurations typically fall into two primary categories:

Common patterns: Logic that is common to most web applications, such as redirects, URL rewrites, basic header manipulation, and IP-based access control lists (ACLs).
Bespoke business logic: Complex logic unique to your application, like custom proprietary token authentication, advanced header extraction / replacement, dynamic backend selection based on HTTP attributes, or HTTP response body manipulation.

Phase 2: Choose your Google Cloud equivalent

Once your rules are categorized, the next step involves mapping them to the appropriate Google Cloud feature. This is not a one-to-one replacement; it's a strategic choice.

Option 1: the declarative path (for ~80% of rules)
For the majority of common patterns, leveraging the Application Load Balancer's built-in declarative features is usually the best approach. Instead of a script, you define the desired state in a configuration file. This is simpler to manage, version-control, and scale.

Common patterns to declarative feature mapping:

Redirects/rewrites -> Application Load Balancer URL maps
ACLs/throttling -> Google Cloud Armor security policies
Session persistence -> backend service configuration

Option 2: The programmatic path (for complex, bespoke rules)
When dealing with complex, bespoke business logic, you have a programmatic equivalent: Service Extensions, a powerful edge compute capability that allows you to inject custom code (written in Rust, C++ or Go) directly into the load balancer's data path. This approach gives you flexibility in a modern, managed, and high-performance framework.

This flowchart helps you decide the appropriate Google Cloud feature for each configuration
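The same decision can be sketched in a few lines of code. The mapping below mirrors the pattern-to-feature list from Option 1, with everything else flagged as a candidate for Service Extensions. The intent labels, the Rule type, and choose_target are hypothetical names for illustration, not part of any Google Cloud API.

```python
from dataclasses import dataclass

# Intent labels are hypothetical; they stand for the categories assigned
# during the Phase 1 review of the existing configuration.
DECLARATIVE_TARGETS = {
    "redirect": "Application Load Balancer URL map",
    "rewrite": "Application Load Balancer URL map",
    "acl": "Google Cloud Armor security policy",
    "throttling": "Google Cloud Armor security policy",
    "session_persistence": "backend service configuration",
}

PROGRAMMATIC_TARGET = "review: candidate for Service Extensions"


@dataclass
class Rule:
    name: str
    intent: str


def choose_target(rule: Rule) -> str:
    """Prefer a declarative feature; fall back to the programmatic path."""
    return DECLARATIVE_TARGETS.get(rule.intent, PROGRAMMATIC_TARGET)


if __name__ == "__main__":
    inventory = [
        Rule("force-https", "redirect"),
        Rule("block-bad-ips", "acl"),
        Rule("proprietary-token-check", "custom_auth"),
    ]
    for rule in inventory:
        print(f"{rule.name}: {choose_target(rule)}")
```

Keeping the fallback as a review flag rather than an automatic choice leaves room to retire logic that is no longer needed.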

Phase 3: Test and validate

Once you’ve chosen the appropriate path for your configurations, you are ready to deploy your new Application Load Balancer configuration in a staging environment that mirrors your production setup. Thoroughly test all application functionality, paying close attention to the migrated logic. Use a combination of automated testing and manual QA to validate that the redirects, security policies, and custom Service Extensions logic behave as expected.

Phase 4: Phased cutover (canary deployment)

Don't flip a single switch for all your traffic; instead, implement a phased migration strategy. Start the transition by routing a small percentage of production traffic (e.g., 5-10%) to your new Google Cloud load balancer. During this initial period, be sure to monitor key metrics like latency, error rates, and application performance. As you gain confidence, you can progressively increase the percentage of traffic routed to the Application Load Balancer. Always have a clear rollback plan to revert to the legacy infrastructure in the event you encounter critical issues.
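As an illustration of the idea behind a percentage-based cutover, the sketch below assigns each client to a stable bucket so that raising the percentage only ever adds clients to the new path. In practice the split is typically configured in DNS or in the traffic-steering layer rather than in application code; the function name, the client key, and the ramp schedule are assumptions made for this example.

```python
import hashlib

# Hypothetical ramp schedule: percentage of traffic sent to the new load balancer.
RAMP_STEPS = [5, 10, 25, 50, 100]


def route_to_new_lb(client_key: str, percent_new: int) -> bool:
    """Return True when this client falls inside the canary percentage.

    Hashing the key gives every client a fixed bucket in [0, 100), so a
    client routed to the new path at 5% stays there at 10% and beyond.
    """
    digest = hashlib.sha256(client_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent_new


if __name__ == "__main__":
    clients = [f"client-{i}" for i in range(10_000)]
    for step in RAMP_STEPS:
        moved = sum(route_to_new_lb(c, step) for c in clients)
        print(f"{step:>3}% target -> {moved / len(clients):.1%} of clients on the new path")
```

A stable assignment also makes rollback predictable: lowering the percentage returns the most recently added clients to the legacy path first.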

Best practices for a smooth migration

Drawing from our practical experience, we have compiled the following recommendations to assist you in planning your load balancer migrations.

Analyze first, migrate second: A thorough analysis of your existing configurations is the most critical step. Don't "lift and shift" logic that is no longer needed.
Prefer declarative: Always default to Google Cloud's managed, declarative features (URL Maps, Cloud Armor) first. They are simpler, more scalable, and require less maintenance.
Use Service Extensions strategically: Reserve Service Extensions for the complex, bespoke business logic that declarative features cannot handle.
Monitor everything: Continuously monitor both your existing load balancers and Google Cloud load balancers during the migration. Watch key metrics like traffic volume, latency, and error rates to detect and address issues instantly.
Train your team: Ensure your team is trained on Cloud Load Balancing concepts. This will empower them to effectively operate and maintain the new infrastructure.
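The "monitor everything" recommendation pairs naturally with an explicit rollback rule. Here is a minimal sketch, assuming you can export request counts, error counts, and p95 latency for both environments; the Metrics type, the function name, and every threshold are hypothetical values to tune for your own application.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int
    p95_latency_ms: float


def should_roll_back(
    legacy: Metrics,
    canary: Metrics,
    max_error_delta: float = 0.01,   # assumed tolerance: +1 point of error rate
    max_latency_ratio: float = 1.2,  # assumed tolerance: 20% slower p95
    min_requests: int = 1000,        # assumed minimum sample before judging
) -> bool:
    """Compare the new load balancer (canary) against the legacy baseline."""
    if canary.requests < min_requests:
        return False  # too little traffic to draw a conclusion yet
    legacy_rate = legacy.errors / max(legacy.requests, 1)
    canary_rate = canary.errors / canary.requests
    if canary_rate - legacy_rate > max_error_delta:
        return True
    return canary.p95_latency_ms > legacy.p95_latency_ms * max_latency_ratio


if __name__ == "__main__":
    baseline = Metrics(requests=100_000, errors=100, p95_latency_ms=120.0)
    candidate = Metrics(requests=5_000, errors=200, p95_latency_ms=125.0)
    print("roll back" if should_roll_back(baseline, candidate) else "keep ramping")
```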

Migrating from the existing on-premises load balancer infrastructure is more than just a technical task; it's an opportunity to modernize your application delivery. By thoughtfully mapping your current load balancing configurations and capabilities to either declarative Application Load Balancer features or programmatic Service Extensions, you can build a more scalable, resilient, and cost-effective infrastructure that is ready for future demands.

To get started, review the Application Load Balancer and Service Extensions features and advanced capabilities to come up with the right design for your application. For more guidance and complex use cases, contact your Google Cloud team.

Experimenting with GPUs: GKE managed DRANET and Inference Gateway AI Deployment

Wed, 08 Apr 2026 10:05:00 +0000

Building and serving models on infrastructure is a strong use case for businesses. In Google Cloud, you have the ability to design your AI infrastructure to suit your workloads. Recently, I experimented with Google Kubernetes Engine (GKE) managed DRANET while deploying a model for inference with NVIDIA B200 GPUs on GKE. In this blog, we will explore this setup in easy-to-follow steps.

What is DRANET

Dynamic Resource Allocation (DRA) is a feature that lets you request and share resources among Pods. DRANET allows you to request and allocate networking resources for your Pods, including network interfaces that support TPUs and Remote Direct Memory Access (RDMA). In my case, I used it for high-end GPUs.

How GPU RDMA VPC works

The RDMA network is set up as an isolated VPC, which is regional and assigned a network profile type. In this case, the network profile type is RoCEv2. This VPC is dedicated to GPU-to-GPU communication. The GPU VM families have RDMA-capable NICs that connect to the RDMA VPC. The GPUs communicate between multiple nodes via this low-latency, high-speed, rail-aligned setup.

Design pattern example

Our aim was to deploy an LLM (DeepSeek) onto a GKE cluster with A4 nodes that support 8 B200 GPUs and serve it privately via GKE Inference Gateway. To set up an AI Hypercomputer GKE cluster, you can use the Cluster Toolkit, but in my case, I wanted to test how GKE managed DRANET dynamically sets up the networking that supports RDMA for GPU communication.

This design utilizes the following services to provide an end-to-end solution:

VPC: Three VPCs in total. One is created manually, and two are created automatically by GKE managed DRANET (one standard and one for RDMA).
GKE: To deploy the workload.
GKE Inference Gateway: To expose the workload internally using a regional internal Application Load Balancer (GatewayClass gke-l7-rilb).
A4 VMs: These support RoCEv2 with NVIDIA B200 GPUs.

Putting it together

To get access to the A4 VM, a future reservation was used. This is linked to a specific zone.

Begin: Set up the environment

Create a standard VPC with firewall rules and a subnet in the region that contains the reservation's zone.
Create a proxy-only subnet. This will be used by the regional internal Application Load Balancer attached to the GKE Inference Gateway.

Next: Create a standard GKE cluster with a default node pool.

```shell
gcloud container clusters create $CLUSTER_NAME \
  --location=$ZONE \
  --num-nodes=1 \
  --machine-type=e2-standard-16 \
  --network=${GVNIC_NETWORK_PREFIX}-main \
  --subnetwork=${GVNIC_NETWORK_PREFIX}-sub \
  --release-channel rapid \
  --enable-dataplane-v2 \
  --enable-ip-alias \
  --addons=HttpLoadBalancing,RayOperator \
  --gateway-api=standard \
  --enable-ray-cluster-logging \
  --enable-ray-cluster-monitoring \
  --enable-managed-prometheus \
  --enable-dataplane-v2-metrics \
  --monitoring=SYSTEM
```

Once that is complete you can connect to your cluster:

```shell
gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE --project $PROJECT
```

Create a GPU node pool (this example uses an A4 VM with a reservation) with these additional flags:

--accelerator-network-profile=auto (GKE automatically adds the gke.networks.io/accelerator-network-profile: auto label to the nodes)

--node-labels=cloud.google.com/gke-networking-dra-driver=true (Enables DRA for high-performance networking)

```shell
gcloud beta container node-pools create $NODE_POOL_NAME \
  --cluster $CLUSTER_NAME \
  --location $ZONE \
  --node-locations $ZONE \
  --machine-type a4-highgpu-8g \
  --accelerator type=nvidia-b200,count=8,gpu-driver-version=latest \
  --enable-autoscaling --num-nodes=1 --total-min-nodes=1 --total-max-nodes=3 \
  --reservation-affinity=specific \
  --reservation=projects/$PROJECT/reservations/$RESERVATION_NAME/reservationBlocks/$BLOCK_NAME \
  --accelerator-network-profile=auto \
  --node-labels=cloud.google.com/gke-networking-dra-driver=true
```

Next: Create a ResourceClaimTemplate, which will be used to attach the networking resources to your deployments. The deviceClassName: mrdma.google.com is used for GPU workloads:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: all-mrdma
spec:
  spec:
    devices:
      requests:
      - name: req-mrdma
        exactly:
          deviceClassName: mrdma.google.com
          allocationMode: All
```

Deploy model and inference

Now that the cluster and node pool are set up, we can deploy a model and serve it via GKE Inference Gateway. In my experiment I used DeepSeek, but this could be any model.

Deploy model and services

The nodeSelector gke.networks.io/accelerator-network-profile: auto schedules the Pod onto the GPU node.
The resourceClaims field attaches the networking resource defined in the ResourceClaimTemplate.

Create a secret (I used a Hugging Face token):

```shell
kubectl create secret generic hf-secret \
  --from-literal=hf_token=${HF_TOKEN}
```

Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-v3-1-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-v3-1
  template:
    metadata:
      labels:
        app: deepseek-v3-1
        ai.gke.io/model: deepseek-v3-1
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: vllm-inference
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250819_0916_RC01
        resources:
          requests:
            cpu: "190"
            memory: "1800Gi"
            ephemeral-storage: "1Ti"
            nvidia.com/gpu: "8"
          limits:
            cpu: "190"
            memory: "1800Gi"
            ephemeral-storage: "1Ti"
            nvidia.com/gpu: "8"
          claims:
          - name: rdma-claim
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=8
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=32768
        - --max-num-seqs=32
        - --gpu-memory-utilization=0.90
        - --enable-chunked-prefill
        - --enforce-eager
        - --trust-remote-code
        env:
        - name: MODEL_ID
          value: deepseek-ai/DeepSeek-V3.1
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 5
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        gke.networks.io/accelerator-network-profile: auto
      resourceClaims:
      - name: rdma-claim
        resourceClaimTemplateName: all-mrdma
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-v3-1-service
spec:
  selector:
    app: deepseek-v3-1
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: deepseek-v3-1-monitoring
spec:
  selector:
    matchLabels:
      app: deepseek-v3-1
  endpoints:
  - port: 8000
    path: /metrics
    interval: 30s
```
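The claim wiring in this manifest has three names that must line up: the container's resources.claims entry, the Pod-level resourceClaims entry, and the ResourceClaimTemplate it points to. As a rough pre-deployment check, the sketch below walks a Pod spec (represented as a plain dictionary) and reports mismatches. The function name is hypothetical, the check covers only template-based claims, and it is not a substitute for API server validation.

```python
from typing import Dict, List, Set


def claim_wiring_problems(pod_spec: Dict, known_templates: Set[str]) -> List[str]:
    """Report containers or Pod-level claims whose names do not resolve."""
    problems: List[str] = []
    declared: Dict[str, object] = {}
    for entry in pod_spec.get("resourceClaims", []):
        declared[entry["name"]] = entry.get("resourceClaimTemplateName")
    for name, template in declared.items():
        # Claims that reference an existing ResourceClaim directly are skipped.
        if template is not None and template not in known_templates:
            problems.append(f"pod claim {name!r} references unknown template {template!r}")
    for container in pod_spec.get("containers", []):
        for claim in container.get("resources", {}).get("claims", []):
            if claim["name"] not in declared:
                problems.append(
                    f"container {container['name']!r} uses undeclared claim {claim['name']!r}"
                )
    return problems


if __name__ == "__main__":
    # Mirrors the relevant fields of the Deployment's Pod template above.
    pod_spec = {
        "containers": [
            {"name": "vllm-inference", "resources": {"claims": [{"name": "rdma-claim"}]}}
        ],
        "resourceClaims": [
            {"name": "rdma-claim", "resourceClaimTemplateName": "all-mrdma"}
        ],
    }
    print(claim_wiring_problems(pod_spec, {"all-mrdma"}) or "claim wiring looks consistent")
```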

Deploy GKE Inference Gateway

This step requires installing Custom Resource Definitions (CRDs) in your GKE cluster:

For GKE versions 1.34.0-gke.1626000 or later, install only the alpha InferenceObjective CRD:

```shell
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml
```

Create Inference pool

```shell
helm install deepseek-v3-pool \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
  --version v1.0.1 \
  --set inferencePool.modelServers.matchLabels.app=deepseek-v3-1 \
  --set provider.name=gke \
  --set inferenceExtension.monitoring.gke.enabled=true
```

Create the Gateway, HTTPRoute and InferenceObjective

```yaml
# 1. The Regional Internal Gateway (ILB)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: deepseek-v3-gateway
  namespace: default
spec:
  gatewayClassName: gke-l7-rilb
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Same
---
# 2. The HTTPRoute (Routing to the Pool)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: deepseek-v3-route
  namespace: default
spec:
  parentRefs:
  - name: deepseek-v3-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: deepseek-v3-pool
---
# 3. The Inference Objective (Performance Logic)
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: deepseek-v3-objective
  namespace: default
spec:
  poolRef:
    name: deepseek-v3-pool
```

Once complete, you can create a test VM in your main VPC and make a call to the IP address of the GKE Inference Gateway:

    curl -N -s -X POST "http://$GATEWAY_IP/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "deepseek-ai/DeepSeek-V3.1",
        "messages": [{"role": "user", "content": "Box A: red. Box B: blue. Box C: empty. Move A to C, Move B to A, Swap B and C. Where is red?"}],
        "stream": true
      }' | stdbuf -oL grep "data: " | sed -u 's/^data: //' | grep -v "\[DONE\]" | \
      jq --unbuffered -rj '.choices[0].delta | (.reasoning_content // .reasoning // .content // empty)'
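If you would rather consume the stream from a script than from a shell pipeline, the same filtering can be sketched in Python. This is a minimal illustration that assumes the response is a server-sent event stream with the same `data: ` framing and delta fields the pipeline above expects; the function name `extract_deltas` and the sample lines are hypothetical and not part of any Google Cloud tooling.

```python
import json
from typing import Iterable, Iterator


def extract_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragment carried by each streamed chunk.

    Mirrors the grep/sed/jq pipeline: keep only "data: " lines, skip the
    "[DONE]" sentinel, then take the first usable delta field.
    """
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            continue
        choices = json.loads(payload).get("choices") or [{}]
        delta = choices[0].get("delta") or {}
        # Same preference order as the jq expression; jq's "//" skips
        # only null and false, so empty strings are still emitted.
        for key in ("reasoning_content", "reasoning", "content"):
            value = delta.get(key)
            if value is not None and value is not False:
                yield value
                break


# Hypothetical sample of what the gateway might stream back.
sample = [
    'data: {"choices":[{"delta":{"reasoning_content":"A is in C. "}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Box B"}}]}',
    "data: [DONE]",
]
answer = "".join(extract_deltas(sample))
```

In a real client, `lines` would come from iterating over the HTTP response body line by line instead of a hard-coded list.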

Next Steps

To take a deeper dive into GKE managed DRANET and GKE Inference Gateway, review the following:

Blog: DRA: A new era of Kubernetes device management with Dynamic Resource Allocation
Document set: DRANET
Documentation: AI Hypercomputer

Want to ask a question, find out more or share a thought? Please connect with me on LinkedIn.

See beyond the IP and secure URLs with Google Cloud NGFW

Tue, 07 Apr 2026 17:30:00 +0000

In a cloud-first world, traditional IP-based defenses are no longer enough to protect your perimeter. As services migrate to shared infrastructure and content delivery networks, relying on static IP addresses and FQDNs can create security gaps.

Because a single IP address can host multiple services, and IP addresses can change frequently, we are introducing domain filtering with a wildcard capability in Cloud Next Generation Firewall (NGFW) Enterprise. This new capability provides increased security and granular policy controls.

Why domain and SNI filtering matters

The Cloud NGFW URL filtering service performs deep inspections of HTTP payloads to secure workloads against threats from both public and internal networks. This service elevates security controls to the application layer and helps restrict access to malicious domains.

Key use cases include:

Granular egress control: This capability enables the precise allowing and blocking of connections based on domain names and SNI information found in egress HTTP(S) messages. By inspecting Layer 7 (L7) headers, it offers significantly finer control than traditional filtering based solely on IP addresses and FQDNs, which can be inefficient when a single IP hosts multiple services.
Control access without decrypting: For organizations that prefer not to perform full TLS decryption on their traffic, Cloud NGFW can still enforce security policies by controlling traffic based on SNI headers provided during the TLS handshake. This allows for effective domain-level filtering while maintaining end-to-end encryption for privacy or compliance reasons.
Reduced operational overhead: Implementing domain-based filtering helps reduce the constant maintenance typically required to track frequently changing IP addresses and DNS records. By focusing on stable domain identities rather than dynamic network attributes, security teams can minimize the manual effort involved in updating firewall rulebases.
Flexible matching: The service utilizes matcher strings within URL lists, supporting limited wildcard domains to define criteria for both domains and subdomains. For example, using a wildcard like *.example.com allows a single filter to cover all associated subdomains, providing a more scalable solution than defining thousands of individual FQDN entries.
Improved security: URL filtering significantly enhances the security posture by protecting against sophisticated attacks such as SNI header spoofing. By evaluating L7 headers before allowing access to an application, Cloud NGFW ensures that attackers cannot bypass security controls by simply spoofing lower-layer identifiers.
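To make the wildcard behavior concrete, here is a minimal Python sketch of how a matcher list might be evaluated against the SNI host of an egress connection. It illustrates the idea only and is not Cloud NGFW's implementation: the function names, the first-match ordering, the default action, and the choice that `*.example.com` does not cover the bare `example.com` are all assumptions.

```python
from typing import List, Tuple


def matches(pattern: str, host: str) -> bool:
    """Return True if host equals the pattern or falls under a leading wildcard."""
    pattern = pattern.lower().rstrip(".")
    host = host.lower().rstrip(".")
    if pattern.startswith("*."):
        # Assumption: "*.example.com" covers subdomains but not the apex.
        suffix = pattern[1:]  # ".example.com"
        return host.endswith(suffix) and len(host) > len(suffix)
    return host == pattern


def decide(url_list: List[Tuple[str, str]], sni: str, default: str = "deny") -> str:
    """Return the action of the first matcher that matches (assumed ordering)."""
    for pattern, action in url_list:
        if matches(pattern, sni):
            return action
    return default


# Hypothetical URL list: one wildcard entry replaces many individual FQDNs.
url_list = [("*.example.com", "allow"), ("updates.vendor.test", "allow")]
action_for_api = decide(url_list, "api.example.com")
action_for_other = decide(url_list, "tracker.evil.test")
```

The suffix check includes the leading dot, so a look-alike such as `badexample.com` does not match `*.example.com`.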

How Cloud NGFW URL filtering works

The URL filtering service functions by inspecting traffic at L7 using a distributed architecture.

Cloud NGFW URL filtering service

You can get started with URL filtering in three simple steps.

Deploy Cloud NGFW endpoints:

The first step is to create and deploy a Cloud NGFW endpoint in a zone. The NGFW endpoint is an organization-level resource, so make sure you have the right permissions before deploying it.
Once the endpoint is deployed, you can associate it with one or more VPCs of your choice.

Create security profiles and security profile groups:

The URL filtering security profile holds the URL filters with matcher strings and an action (allow or deny).
The security profile group acts as a container for these security profiles, and is then referenced by a firewall policy rule. Create URL filtering security profiles with the desired URLs and wildcard FQDNs, then add them to a security profile group.
Once the security profile group is created, you will need to reference the security profile group in firewall policies.

Policy enforcement:

You enable the service by configuring a hierarchical or global network firewall policy rule using the apply_security_profile_group action, specifying the name of your security profile group.

For more information about configuring a firewall policy rule, see the Cloud NGFW documentation.

Getting started

Get started with Cloud NGFW URL filtering by visiting our documentation and codelab.

Envoy: A future-ready foundation for agentic AI networking

Fri, 03 Apr 2026 16:00:00 +0000

In today's agentic AI environments, the network has a new set of responsibilities.

In a traditional application stack, the network mainly moves requests between services. But as discussed in a recent white paper, Cloud Infrastructure in the Agent-Native Era, in an agentic system the network sits in the middle of model calls, tool invocations, agent-to-agent interactions, and policy decisions that can shape what an agent is allowed to do. The rapid proliferation of agents, often built on diverse frameworks, necessitates a consistent enforcement of governance and security across all agentic paths at scale. To achieve this, the enforcement layer must shift from the application level to the underlying infrastructure. That means the network can no longer operate as a blind transport layer. It has to understand more, enforce better, and adapt faster. This shift is precisely where Envoy comes in.

As a high-performance distributed proxy and universal data plane, Envoy is built for massive scale. Trusted by demanding enterprise environments, including Google Cloud, it supports everything from single-service deployments to complex service meshes using Ingress, Egress, and Sidecar patterns. Because of its deep extensibility, robust policy integration, and operational maturity, Envoy is uniquely suited for an era where protocols change quickly and the cost of weak control is steep. For teams building agentic AI, Envoy is more than a concept: it's a practical, production-ready foundation.

Agentic AI changes the networking problem

Agentic workloads still often use HTTP as a transport, but they break some of the assumptions that traditional HTTP intermediaries rely on. Protocols such as Model Context Protocol (MCP) and Agent2Agent (A2A) use JSON-RPC or gRPC over HTTP, adding protocol-level phases such as MCP initialization, where client and server exchange their capabilities, on top of standard HTTP request/response semantics. The key aspects of agentic systems that require intermediaries to adapt include:

Diverse enterprise governance imperatives. The primary challenge is satisfying the wide spectrum of non-negotiable enterprise requirements for safety, security, data privacy, and regulatory compliance. These needs often go beyond standard network policies and require deep integration with internal systems, custom logic, and the ability to rapidly adapt to new organizational rules or external regulations. This demands a highly extensible framework where enterprises can plug in their specific governance models.
Policy attributes live inside message bodies, not headers. Unlike traditional web traffic where policy inputs like paths and headers are readily accessible, agentic protocols frequently bury critical attributes (e.g., model names, tool calls, resource IDs) deep within JSON-RPC or gRPC payloads. This shift requires intermediaries to possess the ability to parse and understand message contents to apply context-aware policies.
Handling diverse and evolving protocol characteristics. Agentic protocols are not uniform. Some, like MCP with Streamable HTTP, can introduce stateful interactions requiring session management across distributed proxies (e.g., using Mcp-Session-Id). The need to support such varied behaviors, along with future protocol innovations, reinforces the necessity of an inherently adaptable and extensible networking foundation.

These factors mean enterprises need more than just connectivity. The network must now serve as a central point for enforcing the crucial governance needs mentioned earlier. This includes providing capabilities like centralized security, comprehensive auditability, fine-grained policy enforcement, and dynamic guardrails, all while keeping pace with the rapid evolution of protocols and agent behaviors. Put simply, agentic AI transforms the network from a mere transit path into a critical control point.

Why Envoy fits this shift

Envoy is a strong fit for agentic AI networking for three reasons. Envoy is:

Battle-tested. Enterprises already rely on Envoy in high-scale, security-sensitive environments, making it a credible platform to anchor a new generation of traffic management and policy enforcement.
Extensible. Envoy can be extended through native filters, Rust modules, WebAssembly (Wasm) modules, and external processing patterns. That gives platform teams room to adopt new protocols without having to rebuild their networking layer every time the ecosystem changes.
Operationally useful today. Envoy already acts as a gateway, enforcement point, observability layer, and integration surface for control planes. That makes it a practical choice for organizations that need to move now, not after the standards settle.

Building on these core strengths, Envoy has introduced specific architectural advancements to meet the unique demands of agentic networking:

1. Envoy understands agent traffic

The first requirement for agentic networking is simple: The gateway needs to understand what the agent is actually trying to do.

That’s harder than it sounds. In protocols such as MCP, A2A, and OpenAI-style APIs, important policy signals may live inside the request body. Traditional HTTP proxies are optimized to treat bodies as opaque byte streams. That design is efficient, but it limits what the proxy can enforce. For protocols that use JSON messages, a proxy may need to buffer the entire request body to locate attribute values needed for policy application — especially when those attributes appear at the end of the JSON message. Business logic specific to gen AI protocols, such as rate limiting based on consumed tokens, may also require parsing server responses.

Envoy addresses this by deframing protocol messages carried over HTTP and exposing useful attributes to the rest of the filter chain. The extensibility model for gen AI protocols was guided by two goals:

Easy reuse of existing HTTP extensions that work with gen AI protocols out of the box, such as RBAC or tracers.
Easy access to deframed messages for gen-AI-specific extensions, so that developers can focus on gen AI business logic without needing to deal with HTTP or JSON envelopes.

Based on these goals, new extensions for gen AI protocols are still built as HTTP extensions and configured in the HTTP filter chain. This provides flexibility to mix HTTP-native business logic, such as OAuth or mTLS authorization, with gen AI protocol logic in a single chain. A deframing extension parses the protocol messages carried by HTTP and provides an ambient context with extracted attributes, or even the entirety of parsed messages, to downstream extensions via well-known filter state and metadata values.

Instead of forcing every policy component to parse JSON envelopes or protocol-specific message formats on its own, Envoy makes those attributes available as structured metadata. Once the gateway has deframed protocol messages, existing Envoy extensions such as ext_authz or RBAC can read protocol properties to evaluate policies using protocol-specific attributes such as tool names for MCP, message attributes for A2A, or model names for OpenAI.
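As a rough illustration of what deframing produces, the Python sketch below parses an MCP JSON-RPC request body and exposes the method and tool name as structured metadata. It is a conceptual model, not Envoy's filter code: the function name `deframe_mcp` is hypothetical, and the metadata layout is an assumption modeled on the `method` and `params.name` paths that the RBAC example later in this post reads.

```python
import json


def deframe_mcp(body: bytes) -> dict:
    """Parse a JSON-RPC request body and return policy-relevant attributes."""
    message = json.loads(body)
    params = message.get("params") or {}
    return {
        "method": message.get("method"),
        "params": {"name": params.get("name")},
    }


# A tools/call request as an agent might send it (sample values).
request_body = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "get_issue", "arguments": {"issue_number": 42}},
}).encode()

# Downstream policy extensions read attributes from a shared namespace
# instead of re-parsing the body themselves.
metadata = {"envoy.http.filters.mcp": deframe_mcp(request_body)}
```

Once the attributes live in metadata, an authorization or logging step only needs a dictionary lookup rather than its own JSON parser.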

Access logs can include message attributes for enhanced monitoring and auditing. The protocol attributes are also available to the Common Expression Language (CEL) runtime, simplifying creation of complex policy expressions in RBAC or composite extensions.

Buffering and memory management
Envoy is designed to use as little memory as possible when proxying HTTP requests. However, parsing agentic protocols may require an arbitrary amount of buffer space, especially when extensions require the entire message to be in memory. The flexibility of allowing extensions to use larger buffers needs to be balanced with adequate protection from memory exhaustion, especially in the presence of untrusted traffic.

To achieve this, Envoy now provides a per-request buffer size limit. Buffers that hold request data are also integrated with the overload manager, enabling a full range of protective actions under memory pressure, such as reducing idle timeouts or resetting requests that consume the most memory for an extended duration. These changes pave the way for Envoy to serve as a gateway and policy-enforcement point for gen AI protocols without compromising its resource efficiency.
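The idea of a per-request limit combined with an overload action can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Envoy's buffer implementation: the class and function names are hypothetical, and the overload action here looks only at current size, ignoring how long a request has held its memory.

```python
from typing import Dict, List


class BufferLimitExceeded(Exception):
    """Raised when a request body would grow past its per-request limit."""


class BoundedBuffer:
    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self._chunks: List[bytes] = []
        self.size = 0

    def append(self, chunk: bytes) -> None:
        # Reject the chunk before storing it, so memory never exceeds the limit.
        if self.size + len(chunk) > self.limit_bytes:
            raise BufferLimitExceeded(
                f"{self.size + len(chunk)} bytes exceeds limit of {self.limit_bytes}"
            )
        self._chunks.append(chunk)
        self.size += len(chunk)

    def body(self) -> bytes:
        return b"".join(self._chunks)


def heaviest_request(buffers: Dict[str, BoundedBuffer]) -> str:
    """Pick the request holding the most memory, a candidate for reset under pressure."""
    return max(buffers, key=lambda request_id: buffers[request_id].size)


# Two in-flight requests with a small illustrative limit.
buffers = {"req-1": BoundedBuffer(64), "req-2": BoundedBuffer(64)}
buffers["req-1"].append(b'{"jsonrpc":"2.0"')
buffers["req-2"].append(b'{"jsonrpc":"2.0","method":"tools/call"')
victim = heaviest_request(buffers)
```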

2. Envoy enforces policy on things that matter

Understanding traffic is only useful if the gateway can act on it.

In agentic systems, policy is not just about which service an agent can reach. It’s about which tools an agent can call, which models it can use, what identity it presents, how much it can consume, and what kinds of outputs require additional controls. Those are higher-value decisions than simple layer-4 or path-based controls, and they are exactly the kinds of controls enterprises care about when agents are allowed to take action on their behalf.

Envoy is well-positioned here because it can combine transport-level security with application-aware policy enforcement. Teams can authenticate workloads with mTLS and SPIFFE identities, then enforce protocol-specific rules with RBAC, external authorization, external processing, access logging, and CEL-based policy expressions.

This capability is crucial because it lets platform teams decouple agent development from enforcement. Developers can focus on building useful agents, while operators enforce a consistent zero-trust posture at the network layer, even as tools, models, and protocols continue to change.

A prime example of this zero-trust decoupling is the critical "user-behind-agent" scenario, where an AI agent must execute tasks on a human user's behalf. Traditionally, handing user credentials directly to an application introduces severe security risks: if the agent is compromised or manipulated via prompt injection, an attacker could exfiltrate or misuse those credentials. By offloading identity management to Envoy, the proxy can automatically insert user delegation tokens into outbound requests at the infrastructure layer. Because the agent never directly holds the sensitive credential, the risk of a compromised agent misusing or leaking the token is completely neutralized, ensuring actions remain strictly bound to the user's actual permissions.

Case study: Restricting an agent to specific GitHub MCP tools
Consider an agent that triages GitHub issues.

The GitHub MCP server may expose dozens of tools, but the agent may only need a small read-only subset, such as list_issues, get_issue, and get_issue_comments. In most enterprises, that difference matters. A useful agent should not automatically become an unrestricted one.

With Envoy in front of the MCP server, the gateway can verify the agent identity using SPIFFE during the mTLS handshake, parse the MCP message via the deframing filter, extract the requested method and tool name, and enforce a policy that allows only the approved tool calls for that specific agent identity. RBAC uses metadata created by the MCP deframing filter to check the method and tool name in the MCP message:

    envoy.filters.http.rbac:
      "@type": type.googleapis.com/envoy.extensions.filters.http.rbac.v3.RBACPerRoute
      rbac:
        rules:
          policies:
            github-issue-reader-policy:
              permissions:
              - and_rules:
                  rules:
                  - sourced_metadata:
                      metadata_matcher:
                        filter: envoy.http.filters.mcp
                        path: [{ key: "method" }]
                        value: { string_match: { exact: "tools/call" } }
                  - sourced_metadata:
                      metadata_matcher:
                        filter: envoy.http.filters.mcp
                        path: [{ key: "params" }, { key: "name" }]
                        value:
                          or_match:
                            value_matchers:
                            - string_match: { exact: "list_issues" }
                            - string_match: { exact: "get_issue" }
                            - string_match: { exact: "get_issue_comments" }
              principals:
              - authenticated:
                  principal_name:
                    exact: "spiffe://cluster.local/ns/github-agents/sa/issue-triage-agent"

That’s the real value: Policy is enforced centrally, close to the traffic, and in terms that match the agent's actual behavior.
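To spell out what the rule decides, here is the same logic as a minimal Python sketch: a request passes only when the caller's SPIFFE identity matches and the MCP method and tool name are on the allowlist. The function name `is_allowed` and the dictionary shape are hypothetical; the identity and tool names are taken from the policy in this case study.

```python
ALLOWED_TOOLS = {"list_issues", "get_issue", "get_issue_comments"}
AGENT_ID = "spiffe://cluster.local/ns/github-agents/sa/issue-triage-agent"


def is_allowed(principal: str, mcp: dict) -> bool:
    """Evaluate the principal check AND both metadata checks, as the policy does."""
    params = mcp.get("params") or {}
    return (
        principal == AGENT_ID                    # principals.authenticated
        and mcp.get("method") == "tools/call"    # first sourced_metadata rule
        and params.get("name") in ALLOWED_TOOLS  # second rule (or_match)
    )


read_call = {"method": "tools/call", "params": {"name": "get_issue"}}
write_call = {"method": "tools/call", "params": {"name": "create_issue"}}

read_ok = is_allowed(AGENT_ID, read_call)
write_ok = is_allowed(AGENT_ID, write_call)
```

As written, this check admits only `tools/call` requests, so other MCP methods would need their own permissions in a complete policy.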

Beyond static rules: External authorization
A complex compliance policy that can’t be expressed using RBAC rules can be implemented in an external authorization service using the ext_authz protocol. Envoy provides MCP message attributes along with HTTP headers in the context of the ext_authz RPC. It can also forward the agent's SPIFFE identity from the peer certificate:

    http_filters:
    - name: envoy.filters.http.ext_authz
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
        grpc_service:
          envoy_grpc:
            cluster_name: auth_service_cluster
        include_peer_certificate: true
        metadata_context_namespaces:
        - envoy.http.filters.mcp

This allows external services to make authorization decisions based on the full combination of agent identity, MCP method, tool name, and any other protocol attributes, without the agent or the MCP server needing to be aware of the policy layer.

Protocol-native error responses
When Envoy denies a request, the error should be meaningful to the calling agent. For MCP traffic, Envoy can use local_reply_config to map HTTP error codes to appropriate JSON-RPC error responses. For example, a 403 Forbidden can be mapped to a JSON-RPC response with isError: true and a human-readable message, ensuring the agent receives a protocol-appropriate denial rather than an opaque HTTP status code.
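For illustration, the Python sketch below builds the kind of body such a mapping might produce for a denied tool call. It assumes the denial is expressed as an MCP tool result with `isError` set to true and a text content item; the helper name `denial_response` is hypothetical, and the exact body Envoy emits depends on how `local_reply_config` is written.

```python
import json


def denial_response(request_id, http_status: int, reason: str) -> str:
    """Render a JSON-RPC reply that tells the agent its tool call was denied."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,  # echo the id of the request being answered
        "result": {
            "isError": True,
            "content": [
                {"type": "text", "text": f"Request denied ({http_status}): {reason}"}
            ],
        },
    })


reply = json.loads(
    denial_response(7, 403, "tool create_issue is not permitted for this agent")
)
```

Because the reply is valid JSON-RPC with a matching `id`, the agent can surface the denial as an ordinary tool result rather than handling a bare HTTP failure.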

3. Envoy supports stateful agent interactions at scale

Not all agent traffic is stateless. Some protocols, including Streamable HTTP for MCP, can rely on session-oriented behavior. That creates a new challenge for intermediaries, especially when traffic flows through multiple gateway instances to achieve scale and resilience. An MCP session effectively binds the agent to the server that established it, and all intermediaries need to know this to direct incoming MCP connections to the correct server.

If a session is established on one backend, later requests in that conversation need to reach the right destination. That sounds straightforward for a single-proxy deployment, but it becomes more complicated in horizontally scaled systems, where multiple Envoy instances may handle different requests from the same agent.

Passthrough gateway
In the simpler passthrough mode, Envoy establishes one upstream connection for each downstream connection. Its primary use is enforcing centralized policies, such as client authorization, RBAC, rate limiting, and authentication, for external MCP servers. The session state transferred between intermediaries needs to include only the address of the server that established the session over the initial HTTP connection, so that all session-related requests are directed to that server.

Session state transfer between different Envoy instances is achieved by appending encoded session state to the MCP session ID provided by the MCP server. Envoy removes the session-state suffix from the session ID before forwarding the request to the destination MCP server. This session stickiness is enabled by configuring Envoy's envoy.http.stateful_session.envelope extension.
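The envelope idea can be sketched in Python as follows. This is an illustration under assumptions, not the extension's actual wire format: the `.` separator, the base64url encoding, and the function names are hypothetical.

```python
import base64
from typing import Tuple

SEPARATOR = "."  # hypothetical; never appears in the base64url alphabet


def wrap_session_id(server_session_id: str, upstream_addr: str) -> str:
    """Append the encoded upstream address to the server-issued session ID."""
    suffix = base64.urlsafe_b64encode(upstream_addr.encode()).decode().rstrip("=")
    return f"{server_session_id}{SEPARATOR}{suffix}"


def unwrap_session_id(wrapped: str) -> Tuple[str, str]:
    """Split a wrapped ID into the original session ID and the upstream address."""
    server_session_id, _, suffix = wrapped.rpartition(SEPARATOR)
    padded = suffix + "=" * (-len(suffix) % 4)  # restore stripped padding
    return server_session_id, base64.urlsafe_b64decode(padded).decode()


# Any gateway instance can recover the destination from the ID alone.
wrapped = wrap_session_id("abc.123", "10.0.0.7:8080")
original_id, upstream = unwrap_session_id(wrapped)
```

Because the routing state travels inside the session ID, no shared store is needed between gateway instances, and the MCP server only ever sees the ID it issued.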

Aggregating gateway
In aggregating mode, Envoy acts as a single MCP server by aggregating the capabilities, tools, and resources of multiple backend MCP servers. In addition to enforcing policies, this simplifies agent configuration and unifies policy application for multiple MCP servers.

Session management in this mode is more complicated because the session state also needs to include mapping from tools and resources to the server addresses and session IDs that advertised them. The session ID that Envoy provides to the agent is created before tools or resources are known, and the mapping has to be established later, after the MCP initialization phases between Envoy and the backend MCP servers are complete.

One approach, currently implemented in Envoy, is to combine the name of a tool or resource with the identifier and session ID of its origin server. The exact tool or resource names are typically not meaningful to the agent and can carry this additional provenance information. If unmodified tool or resource names are desirable, another approach is to let an Envoy instance that does not have the mapping recreate it by issuing a tools/list request before calling a specific tool. This approach accepts extra latency in exchange for avoiding the complexity of deploying an external global store of MCP sessions, and is currently being planned based on user feedback.
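A minimal Python sketch of the first approach, where provenance is carried in the tool name itself, might look like this. The `__` delimiter and the function names are hypothetical, and the sketch assumes server identifiers and session IDs never contain the delimiter.

```python
from typing import Dict, List, Tuple

DELIM = "__"  # hypothetical delimiter


def qualify(server_id: str, session_id: str, tool: str) -> str:
    """Build the tool name advertised to the agent."""
    return f"{server_id}{DELIM}{session_id}{DELIM}{tool}"


def resolve(qualified: str) -> Tuple[str, str, str]:
    """Recover (server_id, session_id, tool) from an advertised name."""
    server_id, session_id, tool = qualified.split(DELIM, 2)
    return server_id, session_id, tool


def aggregate(backends: Dict[Tuple[str, str], List[str]]) -> List[str]:
    """Merge the tool lists of several backend MCP servers into one catalog."""
    return [
        qualify(server_id, session_id, tool)
        for (server_id, session_id), tools in backends.items()
        for tool in tools
    ]


# Two backends that happen to expose a tool with the same name.
catalog = aggregate({
    ("github", "s-1"): ["get_issue", "list_issues"],
    ("tracker", "s-9"): ["get_issue"],
})
target = resolve(catalog[2])
```

Any gateway instance that receives a call for a qualified name can route it without consulting shared state, and identically named tools from different servers stay distinct.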

This matters because it moves Envoy beyond simple traffic forwarding. It allows Envoy to serve as a reliable intermediary for real agent workflows, including those spanning multiple requests, tools, and backends.

4. Envoy supports agent discovery

Envoy is adding support for the A2A protocol and agent discovery via a well-known AgentCard endpoint. AgentCard, a JSON document with agent capabilities, enables discovery and multi-agent coordination by advertising skills, authentication requirements, and service endpoints. The AgentCard can be provisioned statically via direct response configuration or obtained from a centralized agent registry server via xDS or ext_proc APIs. A more detailed description of A2A implementation and agent discovery will be published in a forthcoming blog post.

5. Envoy is a complete solution for agentic networking challenges

Building on the same foundation that enabled policy application for the MCP protocol in demanding deployments, Envoy is adding support for OpenAI and transcoding of agentic protocols into RESTful HTTP APIs. This transcoding capability simplifies the integration of gen AI agents with existing RESTful applications, with out-of-the-box support for OpenAPI-based applications and custom options via dynamic modules or Wasm extensions. In addition to transcoding, Envoy is being strengthened in critical areas for production readiness, such as advanced policy applications like quota management, comprehensive telemetry adhering to OpenTelemetry semantic conventions for generative AI systems, and integrated guardrails for secure agent operation.

Guardrails for safe agents
The next significant area of investment is centralized management and application of guardrails for all agentic traffic. Integrating policy enforcement points with external guardrails presently requires bespoke implementation and this problem area is ripe for standardization.

Control planes make this operational

The gateway is only part of the story. To achieve this policy management and rollout at scale, a separate control plane is required to dynamically configure the data plane using the xDS protocol, also known as the universal data plane API.

That is where control planes become important. Cloud Service Mesh, alongside open-source projects such as Envoy AI Gateway and kube-agentic-networking, uses Envoy as the data plane while giving operators higher-level ways to define and manage policy for agentic workloads.

This combination is powerful: Envoy provides the enforcement and extensibility in the traffic path, while control planes provide the operating model teams need to deploy that capability consistently.

Why this matters now

The shift towards agentic systems and gen AI protocols such as MCP, A2A, and OpenAI necessitates an evolution in network intermediaries. The primary complexities Envoy addresses include:

Deep protocol inspection. Protocol deframing extensions extract policy-relevant attributes (tool names, model names, resource paths) from the body of HTTP requests, enabling precise policy enforcement where traditional proxies would only see an opaque byte stream.
Fine-grained policy enforcement. By exposing these internal attributes, existing Envoy extensions like RBAC and ext_authz can evaluate policies based on protocol-specific criteria. This allows network operators to enforce a unified, zero-trust security posture, ensuring agents comply with access policies for specific tools or resources.
Stateful transport management. Envoy supports managing session state for the Streamable HTTP transport used by MCP, enabling robust deployments in both passthrough and aggregating gateway modes, even across a fleet of intermediaries.

Agentic AI protocols are still in their early stages, and the protocol landscape will continue to evolve. That’s exactly why the networking layer needs to be adaptable. Enterprises should not have to rebuild their security and traffic infrastructure every time a new agent framework, transport pattern, or tool protocol gains traction. They need a foundation that can absorb change without sacrificing control.

Envoy brings together three qualities that are hard to get in one place: proven production maturity, deep extensibility, and growing protocol awareness for agentic workloads. By leveraging Envoy as an agent gateway, organizations can decouple security and policy enforcement from agent development code.

That makes Envoy more than just a proxy that happens to handle AI traffic. It makes Envoy a future-ready foundation for agentic AI networking.

Special thanks to the additional co-authors of this blog: Boteng Yao, Software Engineer, Google; Tianyu Xia, Software Engineer, Google; and Sisira Narayana, Sr Product Manager, Google.

Introducing multi-cluster GKE Inference Gateway: Scale AI workloads around the world

Tue, 17 Mar 2026 16:00:00 +0000

The world of artificial intelligence is moving fast, and so is the need to serve models reliably and at scale. Today, we're thrilled to announce the preview of multi-cluster GKE Inference Gateway to enhance the scalability, resilience, and efficiency of your AI/ML inference workloads across multiple Google Kubernetes Engine (GKE) clusters — even those spanning different Google Cloud regions.

Built as an extension of the GKE Gateway API, the multi-cluster Inference Gateway leverages the power of multi-cluster Gateways to provide intelligent, model-aware load balancing for your most demanding AI applications.

Why multi-cluster for AI inference?

As AI models grow in complexity and users become more global, single-cluster deployments can face limitations:

Availability risks: Regional outages or cluster maintenance can impact service.
Scalability caps: Hitting hardware limits (GPUs/TPUs) within a single cluster or region.
Resource silos: Underutilized accelerator capacity in one cluster can’t be used by another.
Latency: Users far from your serving cluster may experience higher latency.

The multi-cluster GKE Inference Gateway addresses these challenges head-on, providing a variety of features and benefits:

Enhanced reliability and fault tolerance: Intelligently route traffic across multiple GKE clusters, including across different regions. If one cluster or region experiences issues, traffic is automatically re-routed, minimizing downtime.
Improved scalability and optimized resource usage: Pool and leverage GPU/TPU resources from various clusters. Handle demand spikes by bursting beyond the capacity of a single cluster and efficiently utilize available accelerators across your entire fleet.
Globally optimized, model-aware routing: The Inference Gateway can make smart routing decisions using advanced signals. With GCPBackendPolicy, you can configure load balancing based on real-time custom metrics, such as the model server's KV cache utilization metric, so that requests are sent to the best-equipped backend instance. Other modes like in-flight request limits are also supported.
Simplified operations: Manage traffic to a globally distributed AI service through a single Inference Gateway configuration in a dedicated GKE "config cluster," while your models run in multiple "target clusters."

How it works

In GKE Inference Gateway there are two foundational resources, InferencePool and InferenceObjective. An InferencePool acts as a resource group for pods that share the same compute hardware (like GPUs or TPUs) and model configuration, helping to ensure scalable and high-availability serving. An InferenceObjective defines the specific model names and assigns serving priorities, allowing Inference Gateway to intelligently route traffic and multiplex latency-sensitive tasks alongside less urgent workloads.

With this release, the system uses Kubernetes Custom Resources to manage your distributed inference service. InferencePool resources in each "target cluster" group model-server backends. These backends are exported and become visible as GCPInferencePoolImport resources in the "config cluster." Standard Gateway and HTTPRoute resources in the config cluster define the entry point and routing rules, directing traffic to these imported pools. Fine-grained load-balancing behaviors, such as using CUSTOM_METRICS or IN_FLIGHT requests, are configured using the GCPBackendPolicy resource attached to GCPInferencePoolImport.
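To give a feel for the two balancing behaviors named above, here is a deliberately simplified Python sketch of choosing a backend by a custom metric or by in-flight requests. It does not reproduce the load balancer's real algorithm or the GCPBackendPolicy schema: the `Backend` fields, the `pick_backend` function, and the "lowest value wins" rule are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    name: str
    kv_cache_utilization: float  # 0.0 to 1.0, as reported by the model server
    in_flight: int               # requests currently being served


def pick_backend(
    backends: List[Backend], mode: str, max_in_flight: int = 0
) -> Optional[Backend]:
    """Choose a backend according to the configured balancing mode."""
    if mode == "CUSTOM_METRICS":
        # Prefer the backend with the most free KV cache.
        return min(backends, key=lambda b: b.kv_cache_utilization)
    if mode == "IN_FLIGHT":
        # Only consider backends below the per-backend request limit.
        eligible = [b for b in backends if b.in_flight < max_in_flight]
        return min(eligible, key=lambda b: b.in_flight) if eligible else None
    raise ValueError(f"unknown mode: {mode}")


# Sample fleet spanning two target clusters (names are made up).
fleet = [
    Backend("us-central1/pool-a", kv_cache_utilization=0.82, in_flight=3),
    Backend("europe-west4/pool-b", kv_cache_utilization=0.35, in_flight=9),
]
by_metric = pick_backend(fleet, "CUSTOM_METRICS")
by_in_flight = pick_backend(fleet, "IN_FLIGHT", max_in_flight=8)
```

The two modes can disagree, as they do here, which is why the signal you pick should reflect what actually limits your model server.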

This architecture enables use cases like global low-latency serving, disaster recovery, capacity bursting, and efficient use of heterogeneous hardware.

For more information about GKE Inference Gateway core concepts, check out our guide.

Get started today

As you scale your AI inference serving workloads to more users in more places, we're excited for you to try multi-cluster GKE Inference Gateway. To learn more and get started, check out the documentation.

The AI-native core: Highly resilient telco architecture using Google Kubernetes Engine

Wed, 04 Mar 2026 08:00:00 +0000

The telecommunications industry has reached a critical tipping point. Traditional, on-premises-heavy data center models are struggling under the weight of escalating infrastructure costs and underutilization driven by availability and compliance requirements. But the AI era demands exponential scale and beyond-nines reliability. The question for operators is no longer if they should modernize, but which architectural path will help them do that fastest.

Modernization isn't a "rip and replace" event; it’s a strategic choice. Today, we’re showcasing how Google Kubernetes Engine (GKE) can serve as a high-performance foundation for two versatile deployment strategies: cloud-centric evolution and strategic hybrid modernization.

The two paths to network modernization

Every operator has a unique appetite for risk, regulatory landscape, and investment base, with some prioritizing agility, and others emphasizing the need for local control. You can use GKE to support both approaches:

1. Cloud-centric modernization: Agility at scale

This path is for operators looking to fully harness the cloud's elasticity. Whether you’re migrating your own containerized network functions (CNFs) or building a cloud-native service like Ericsson-on-Demand, the goal is the same: move the heavy lifting to Google Cloud.

The benefit: By running mission-critical workloads like Voice Core or Policy Control Functions on Google's global fiber backbone, operators can scale instantly for peak events and move toward "zero-human-touch" operations.
The economics: Transition from heavy upfront CAPEX to a "pay-as-you-grow" model. You no longer need to over-provision hardware that sits idle; the cloud absorbs the bursts for you.
Time to market: Accelerate time to market for new services like fixed wireless access, IoT and private 5G.

2. Strategic hybrid modernization: Cloud agility, local control

For many telcos, a hybrid approach offers a better balance. Here, operators can selectively move agile control plane components and data analytics to the cloud while keeping latency-sensitive user-plane functions on premises or at the edge.

The benefit: Optimize for ultra-low latency and meet strict data sovereignty requirements by keeping data plane traffic local, while still gaining the AI-driven insights and orchestration power of the cloud.
The versatility: Using GKE, you can run your control plane workloads in the cloud and data plane services directly in your own data centers or at the network edge, enjoying a unified operational model across your environments.

Engineering the "telco-grade" foundation

Today, we are proud to showcase how GKE has evolved into the industry's most specialized platform for containerized network functions (CNFs), backed by massive momentum from operators and equipment vendor partners.

It’s achieved this thanks to a variety of capabilities.

Connectivity and isolation

Standard Kubernetes wasn't designed for the complex traffic separation that telcos require. GKE bridges this gap with:

Multi-networking API: A native Kubernetes way to manage multiple interfaces per Pod, bringing standard Network Policies to every interface.
Simulated L2 networking: A "migration superpower" that allows legacy applications to maintain their Layer-2 operational model while running on a modern cloud-native stack.
The telco CNI: Support for Multus, IPvlan, and Whereabouts on specialized Ubuntu images. This allows operators to isolate management, control, and user planes with surgical precision.

Persistent reachability

In a world of ephemeral containers, telco functions need stability. GKE enables this through:

GKE IP route: We’ve integrated equal-cost multi-path (ECMP)-like functionality directly into the GKE dataplane. If a workload fails, it is automatically and rapidly removed from the service path, providing high availability without complex external router configurations.
Persistent IP: GKE provides static IP support, which isn't available in standard Kubernetes, so that 5G core functions stay consistently reachable across their lifecycle without NAT.
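
The ECMP-like behavior can be pictured with a minimal sketch: hash each flow onto one of the currently healthy next hops, and stop considering a hop as soon as it fails. This is a generic illustration of equal-cost multi-path selection, not GKE's dataplane implementation; the pod names and the flow tuple are hypothetical.

```python
import hashlib


def pick_next_hop(flow, healthy_hops):
    """Hash a flow 5-tuple onto one of the healthy next hops."""
    if not healthy_hops:
        raise RuntimeError("no healthy next hop available")
    hops = sorted(healthy_hops)  # stable order so the choice is repeatable
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(hops)
    return hops[index]


# Hypothetical flow: (src IP, src port, dst IP, dst port, protocol).
flow = ("10.0.0.5", 40000, "203.0.113.10", 2152, "udp")
hops = {"pod-a", "pod-b", "pod-c"}

first = pick_next_hop(flow, hops)

# Simulate a failure of the chosen workload: it leaves the service path,
# and the same flow is hashed onto one of the remaining hops.
hops_after_failure = hops - {first}
second = pick_next_hop(flow, hops_after_failure)
print(first, "->", second)
```

Plain modulo hashing, as used here, can also move flows that were not on the failed hop. Production dataplanes often use consistent or resilient hashing to limit that churn.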

Sub-second convergence

For telcos, every millisecond of downtime is a lost connection. With HA Policy, GKE's dataplane is optimized for near-zero downtime through ultra-fast failure detection and convergence, and operators can choose between self-managed recovery and fully Google-managed failure detection.

Shifting from "saving" to "solving" with AI

For operators, the ultimate goal of modernization is to transition to an autonomous network. By running the core network functions on a platform adjacent to Google Cloud AI and data platforms such as Vertex AI and BigQuery, they can turn telemetry into actionable changes to optimize the network. Some use cases and benefits that modernization enables include:

Predictive AIOps: Use AI to identify performance degradation and trigger automated healing before a call ever drops. Use the cloud for on-demand burst capacity during sporting events or service launches. Or use the data from your GKE-hosted 5G core to fuel AI-powered automation that anticipates issues before they impact subscribers.
Intent-driven programmability: Shift away from expensive, reactive operations and cut new deployment setup times from several weeks to a couple of hours.
Monetize insights: Leverage AI on cloud-native data to identify and capture entirely new revenue opportunities in addition to rightsizing your networks.

Your journey, your terms

The future of telco is intelligent, resilient, and incredibly flexible. Whether you are taking your first step into a hybrid deployment or launching a fully cloud-hosted core, Google Cloud is your strategic partner.

Join us at MWC: Visit booth #2H40 in Hall 2 to see these solutions in action, including live demonstrations of mobile core running on GKE.

Designing private network connectivity for RAG-capable gen AI apps

Mon, 02 Mar 2026 17:00:00 +0000

The flexibility of Google Cloud allows enterprises to build secure and reliable architecture for their AI workloads. In this blog we will look at a reference architecture for private connectivity for retrieval-augmented generation (RAG)-capable generative AI applications. This architecture is for scenarios where communications of the overall system must use private IP addresses and must not traverse the internet.

The power of RAG

RAG is a powerful technique used to optimize the output of large language models (LLMs) by grounding them in specific, authoritative knowledge bases outside of their original training data. RAG allows an application to retrieve relevant information from your documents, datasources, or databases in real time. This retrieved context is then provided to the model alongside the user’s query, helping to ensure that the AI’s responses are accurate, verifiable, and highly relevant to your business. This improves the quality of responses and reduces hallucinations.

This approach is helpful because it allows you to direct generative AI to use a designated source of truth, rather than relying solely on the model's pre-existing knowledge, and without needing to retrain or fine-tune the model itself.
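
To make the retrieve-then-augment step concrete, here is a minimal sketch. It uses bag-of-words counts and cosine similarity as a stand-in for a real embedding model and vector database, so `embed`, `retrieve`, `build_prompt`, and the sample documents are all hypothetical, illustrative choices.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: term counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Augment the user's query with retrieved context before calling the model."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "Refunds are processed within 14 days of the return request.",
    "Our support desk is open Monday through Friday.",
    "Shipping to Europe takes five business days.",
]
question = "How long do refunds take?"
prompt = build_prompt(question, docs)
print(prompt)
```

The model never needs retraining here: changing what it can answer is a matter of changing `docs`, which is the property that makes RAG attractive.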

Design pattern example

To see how to set up private network connectivity for a RAG application in a regional design, let's look at the design pattern.

The setup comprises an external network (on-prem and other clouds) and Google Cloud environments consisting of a routing project, a Shared VPC host project for RAG, and three specialized service projects: data ingestion, serving, and frontend.

This design utilizes the following services to provide an end-to-end solution:

Cloud Interconnect or Cloud VPN: To securely connect from your on-premises or other clouds to the routing VPC network
Network Connectivity Center: Used as an orchestration framework to manage connectivity between the routing VPC network and the RAG VPC network via VPC spokes and hybrid spokes
Cloud Router: In the routing project, facilitates dynamic BGP route exchange between the external network and Google Cloud
Private Service Connect: Provides a private endpoint in the routing VPC network to reach the Cloud Storage bucket for data ingestion without traversing the public internet
Shared VPC: Host project architecture that allows multiple service projects to use a common, centralized VPC network
Google Cloud Armor and Application Load Balancer: Placed in the frontend service project to provide security and traffic management for user interaction
VPC Service Controls: Creates a managed security perimeter around all resources to mitigate data exfiltration risks

The traffic flow

RAG population flow

In the diagram, the green dashed line shows the RAG population flow, which describes how data travels from data engineers to vector storage.

From the external network, data travels over Cloud Interconnect or Cloud VPN.
In the routing project, it uses the Private Service Connect endpoint to get to the Cloud Storage bucket.
From the Cloud Storage bucket in the Data Ingestion service project, the data ingestion subsystem processes the raw data.
The AI model creates vectors from the chunks and returns them to the data ingestion subsystem, which writes them to the RAG datastore in the serving service project.
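
The last two steps of this flow (chunk the raw data, vectorize each chunk, write the result to the datastore) can be sketched as follows. The chunk size, the overlap, and the hash-based `toy_vector` function are assumptions for illustration only. In the reference architecture the vectors come from the AI model and the datastore is a managed service, not a Python list.

```python
def chunk_text(text, chunk_words=8, overlap=2):
    """Split text into word windows that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_words:
        raise ValueError("overlap must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_vector(chunk, dims=8):
    """Placeholder for the AI model: bucket words into a fixed-size count vector."""
    vector = [0] * dims
    for word in chunk.lower().split():
        vector[sum(word.encode()) % dims] += 1
    return vector


def ingest(raw_text, datastore):
    """Chunk the raw data, vectorize each chunk, and append it to the datastore."""
    for chunk in chunk_text(raw_text):
        datastore.append({"text": chunk, "vector": toy_vector(chunk)})
    return datastore


raw = " ".join(f"w{i}" for i in range(20))  # 20 placeholder words
store = ingest(raw, [])
for record in store:
    print(record["text"], record["vector"])
```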

Inference flow

In the diagram, the orange dashed line shows the inference flow, which describes customer or user requests.

The request travels over Cloud Interconnect or Cloud VPN to the routing VPC network and then over the VPC spoke to the RAG VPC network.
The request reaches the Application Load Balancer, which is protected by Cloud Armor; once the request is allowed, the load balancer passes it to the frontend subsystem.
The frontend subsystem forwards the request to the serving subsystem, which augments the prompt with data from the RAG datastore and generates a response via the AI model.
The grounded response is returned along the same path to the requestor.

Management and routing

In the diagram, the blue dotted lines represent the Network Connectivity Center hybrid and VPC spokes that manage the control plane and route orchestration between the routing network and the RAG VPC network. This ensures that routes learned from the external network are appropriately propagated across the environment.

Please read the entire architecture document, Private connectivity for RAG-capable generative AI applications, to understand the specifics, including IAM permissions, VPC Service Controls, and deployment considerations.

Next steps

Take a deeper dive into the Cross-Cloud Network, and other guides about generative AI with RAG:

Document set: Generative AI with RAG
Document: Cross-Cloud Network for distributed applications
Blog: Build Your First ADK Agent Workforce

Want to ask a question, find out more, or share a thought? Please connect with me on LinkedIn.

Firefly: Illuminating the path to nanosecond-level clock sync in the data center

Mon, 23 Feb 2026 17:00:00 +0000

From the high-frequency trading floors of Wall Street to the orchestration of cloud data centers, the ability to synchronize events with nanosecond accuracy is critical. Yet, achieving this level of temporal precision across thousands of interconnected devices in a modern data center is fraught with challenges like clock drift, network jitter, and path asymmetries. And doing so on cloud-hosted infrastructure has traditionally been impossible, preventing a certain class of applications from running there.

This is where Firefly, a clock synchronization system developed by researchers and engineers at Google, comes in. Firefly isn't just a clock synchronization protocol; it's a software-driven approach that combines theoretical insights and practical engineering to deliver ultra-accurate, scalable, and cost-effective time synchronization on commodity hardware within a demanding data center environment.

The nanosecond race: Why precise timing matters

Precise clock synchronization is the foundation of distributed systems. It is non-negotiable in financial exchanges, where regulatory requirements mandate sub-100µs external synchronization to Coordinated Universal Time, or UTC, and fairness demands sub-10ns internal clock synchronization. In high-frequency trading, a minuscule timing advantage can translate to significant financial gains, making accurate timestamping critical for market integrity. Beyond finance, numerous data center operations, including database consistency, distributed logging, virtual machine management, and network telemetry, rely on accurate temporal ordering of events. And as data centers scale, the need for a robust, scalable synchronization solution becomes even more important.

But achieving nanosecond-level synchronization in a dynamic data center environment is difficult. Several factors conspire to undermine precision:

Clock drift: Crystal oscillators, which are fundamental to all clocks, have inherent imperfections that cause them to gradually deviate over time. Although these deviations were previously considered minor, they are substantial when targeting sub-10ns synchronization.
Jitter: Network components such as switches and network interface cards (NICs) introduce unpredictable delays. These delays, often stemming from queuing in network buffers or the intricate processing of packets, can manifest as jitter, disrupting the timing of synchronization messages.
Asymmetry: The network path between two devices is rarely symmetrical. Differences in cable lengths, the number of hops, or the internal workings of network equipment can cause signals to take different amounts of time to travel in opposite directions. This asymmetry can introduce significant errors when estimating one-way delays and clock offsets.
Scalability: As data centers expand to house tens of thousands of servers, any synchronization solution must be able to scale efficiently without becoming a bottleneck or requiring disproportionate resources.
Fault tolerance: In a distributed system, failures are inevitable. A synchronization protocol must be resilient to the loss or misbehavior of individual nodes or network links, so that the overall synchronization accuracy is not compromised.
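
Two of these factors can be quantified with textbook arithmetic. The sketch below uses the standard two-way timestamp exchange (as in NTP and PTP) to show that path asymmetry biases the offset estimate by half the difference between the forward and reverse delays. It also shows how quickly an uncorrected oscillator uses up a 10ns budget. The delays, the 50ns true offset, and the 1 ppm drift rate are illustrative numbers, not Firefly measurements.

```python
def two_way_offset(t1, t2, t3, t4):
    """Estimate (offset, round_trip) from one probe exchange.

    t1: request sent (local clock)     t2: request received (peer clock)
    t3: reply sent (peer clock)        t4: reply received (local clock)
    The offset estimate assumes both directions take equally long.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    round_trip = (t4 - t1) - (t3 - t2)
    return offset, round_trip


def simulate_probe(true_offset, forward_delay, reverse_delay, turnaround=10):
    """Generate the four timestamps for a peer whose clock leads by true_offset."""
    t1 = 0
    t2 = t1 + forward_delay + true_offset    # arrival, read on the peer clock
    t3 = t2 + turnaround                     # reply leaves the peer
    t4 = (t3 - true_offset) + reverse_delay  # arrival, read on the local clock
    return t1, t2, t3, t4


def max_resync_interval_s(error_budget_ns, drift_ppm):
    """Longest gap between corrections before drift alone exceeds the budget."""
    return (error_budget_ns * 1e-9) / (drift_ppm * 1e-6)


# All times are in nanoseconds.
true_offset = 50
timestamps = simulate_probe(true_offset, forward_delay=500, reverse_delay=520)
estimate, rtt = two_way_offset(*timestamps)
asymmetry_error = estimate - true_offset  # equals (forward - reverse) / 2

interval = max_resync_interval_s(error_budget_ns=10, drift_ppm=1)
print(f"estimated offset {estimate} ns, error {asymmetry_error} ns, RTT {rtt} ns")
print(f"1 ppm of drift consumes 10 ns in about {interval * 1e3:.0f} ms")
```

With these inputs, a 20ns difference between the two directions produces a 10ns error, which by itself is the entire sub-10ns budget.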

Firefly: Bridging software and theory

Firefly uses a multi-faceted strategy to tackle these challenges, distinguishing itself from prior synchronization protocols. Its core innovations lie in its architectural design and its theoretical underpinnings.

1. Layered synchronization: Firefly employs a novel layered synchronization technique. Instead of relying on a central clock, which can be a single point of failure or introduce delays, it first establishes tight internal synchronization amongst NICs within the data center. Each NIC in the network constantly communicates with a set of its peers, comparing times and making adjustments. From this "swarm" of devices emerges a highly stable and accurate consensus time that the entire group agrees upon. This internal synchronization is rapid and robust, effectively shielding it from external timing disturbances. Concurrently, Firefly synchronizes the entire swarm to UTC. Decoupling these two processes is crucial, as it prevents external factors like time-server jitter or drift from directly impacting internal synchronization.

2. Distributed consensus over random graphs: Unlike traditional hierarchical approaches that can be brittle and susceptible to single points of failure, Firefly uses a distributed consensus algorithm built on a d-regular random graph. This means each NIC communicates with a randomly selected set of 'd' peers. Theoretical analysis, as presented in the Firefly research paper, demonstrates that such random graphs offer significant advantages:

Faster convergence: Random graphs promote a more rapid dissemination of clock information across the network, leading to quicker synchronization.
Scalability: The theoretical bounds show that random graphs can maintain synchronization accuracy even as the size of the network grows, provided the number of peers ('d') scales logarithmically with the total number of nodes.
Resilience to asymmetry: The diverse probing paths inherent in random graphs help to average out and mitigate the impact of path asymmetries.
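
The idea of consensus over a random d-regular graph can be explored with a small simulation. This is a sketch under stated assumptions rather than Firefly's algorithm. The graph is built as a union of random Hamiltonian cycles (simple, connected, and d-regular, but not sampled uniformly). The update rule is plain neighbor averaging with an assumed gain of 0.5. Real NICs would also have to measure peer offsets through probes instead of reading them directly.

```python
import random


def random_regular_graph(n, d, rng):
    """Union of d/2 random Hamiltonian cycles, retried until no edge repeats."""
    if d % 2 or d < 2 or d >= n:
        raise ValueError("d must be even with 2 <= d < n")
    while True:
        neighbors = {i: set() for i in range(n)}
        for _ in range(d // 2):
            order = list(range(n))
            rng.shuffle(order)
            for a, b in zip(order, order[1:] + order[:1]):
                neighbors[a].add(b)
                neighbors[b].add(a)
        # A repeated edge would leave some node with fewer than d peers.
        if all(len(peers) == d for peers in neighbors.values()):
            return neighbors


def consensus_round(offsets, neighbors, gain=0.5):
    """Every node moves part of the way toward the mean of its peers."""
    return [
        x + gain * (sum(offsets[j] for j in neighbors[i]) / len(neighbors[i]) - x)
        for i, x in enumerate(offsets)
    ]


def spread(values):
    return max(values) - min(values)


rng = random.Random(7)
n, d = 32, 4
graph = random_regular_graph(n, d, rng)

# Start every clock with a random offset of up to +/- 1000 ns.
offsets = [rng.uniform(-1000.0, 1000.0) for _ in range(n)]
initial_spread = spread(offsets)
initial_mean = sum(offsets) / n

for _ in range(200):
    offsets = consensus_round(offsets, graph)

print(f"spread: {initial_spread:.1f} ns -> {spread(offsets):.6f} ns")
print(f"mean:   {initial_mean:.3f} ns -> {sum(offsets) / n:.3f} ns")
```

Because every node has the same degree, the averaging step preserves the mean of all offsets, so the swarm converges to a shared consensus value without any node acting as a master.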

3. Mitigating jitter and asymmetry in practice: Beyond the theoretical advantages of random graphs, Firefly incorporates practical techniques to further refine accuracy:

RTT filtering: By analyzing round-trip time (RTT) measurements, Firefly can identify and discard probe samples that are likely affected by queuing jitter, thereby improving the accuracy of delay estimations.
Path profiling: Firefly actively probes network paths to identify and favor those with minimal asymmetry. This proactive approach helps to select the most reliable paths for synchronization.
Leveraging hardware: Where available, Firefly can utilize features like Transparent Clock (TC) in network switches to accurately account for in-switch delays, further reducing measurement error.
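
RTT filtering can be illustrated with a simple rule: treat the smallest observed round-trip time as the uncongested baseline and keep only probes close to it. The "minimum plus slack" threshold, the 50ns slack, and the sample values are assumptions for this sketch. The source does not specify Firefly's actual filtering criterion.

```python
def filter_probes(probes, slack_ns=50):
    """Keep probes whose RTT is within slack_ns of the minimum observed RTT."""
    if not probes:
        return []
    floor = min(p["rtt_ns"] for p in probes)
    return [p for p in probes if p["rtt_ns"] <= floor + slack_ns]


def estimate_offset(probes, slack_ns=50):
    """Average the offset over the probes that survive RTT filtering."""
    kept = filter_probes(probes, slack_ns)
    if not kept:
        raise ValueError("no probes to estimate from")
    return sum(p["offset_ns"] for p in kept) / len(kept)


# Hypothetical samples: the third probe sat in a queue, which inflated its
# RTT and skewed its offset measurement.
probes = [
    {"rtt_ns": 1000, "offset_ns": 12},
    {"rtt_ns": 1020, "offset_ns": 8},
    {"rtt_ns": 4800, "offset_ns": 950},
    {"rtt_ns": 1010, "offset_ns": 10},
]

naive = sum(p["offset_ns"] for p in probes) / len(probes)
filtered = estimate_offset(probes)
print(f"unfiltered mean offset: {naive} ns, filtered: {filtered} ns")
```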

4. Robustness and fault tolerance: Firefly’s use of distributed consensus, combined with its averaging mechanisms, makes it inherently resilient to failures. By not relying on a single time server or a fixed hierarchical structure, the system can gracefully handle the loss or misbehavior of individual nodes.

Performance in the real world

The results discussed in our Firefly research paper are compelling:

Internal synchronization: Firefly consistently achieves sub-10ns NIC-to-NIC synchronization when used in conjunction with Google's latest data center fabric technology. This can be used to determine the order of events such as packets, logs, and remote procedure calls (RPCs) across machines.
External synchronization: The system also delivers significantly better synchronization to UTC than the 100µs regulatory requirement for financial exchanges.

The offset between a pair of clocks that are six hops away in a Firefly-synced network, measured by an oscilloscope via 1 pulse per second.

The accompanying video illustrates the accuracy of NIC-to-NIC synchronization, as quantified by an oscilloscope utilizing a one-pulse-per-second (1PPS) signal from the NICs. Each row corresponds to a NIC clock, with the rising edge indicating the precise moment the NIC clock attains an integer second. The oscilloscope observations confirm that all measured NICs exhibit close synchronization, maintaining alignment within a few nanoseconds.

These results are particularly impressive given that Firefly operates purely in software on commodity hardware, avoiding the need for expensive, specialized synchronization equipment. This makes ultra-accurate time synchronization accessible to a broader range of data center applications.

A foundation for future applications

Firefly's success in delivering nanosecond-level accuracy in a scalable and cost-effective manner has far-reaching implications:

Democratizing high-precision timing: Firefly allows cloud-hosted financial services, which traditionally rely on expensive dedicated hardware, to achieve the required precision using standard cloud infrastructure.
Enabling new applications: The availability of precise, synchronized clocks across data center devices can unlock new possibilities in areas like fine-grained network telemetry and congestion control, time-coordinated distributed systems, and deterministic fabric for ML workloads.
Transforming data center operations: By creating a tightly integrated and precisely timed computing entity, Firefly can enhance data centers’ overall efficiency, reliability, and performance.

In conclusion, Firefly represents a significant advancement in the field of clock synchronization. By ingeniously combining theoretical insights into graph theory and consensus algorithms with practical network engineering techniques, it overcomes the long-standing challenges of achieving nanosecond-level precision in complex, distributed environments. As data centers continue to evolve, systems like Firefly will be instrumental in building the high-performance, reliable, and fair infrastructure of the future.

Google Distributed Cloud brings public-cloud-like networking to air-gapped environments

Tue, 10 Feb 2026 17:00:00 +0000

Organizations in highly regulated industries often struggle to balance the rigid security of air-gapped environments with the need for the agility and flexibility that the cloud provides. To address this, Google Distributed Cloud (GDC) air-gapped 1.15 introduces new networking features in preview that give you more direct control and visibility without compromising your security posture, as well as a new IPAM feature in general availability that simplifies subnet management. These preview features are Cloud NAT, enhanced connectivity for standard clusters, and advanced HTTP and HTTPS health checks in load balancers. Together, they make it easier for you to manage complex workloads in a secure environment.

Manage outbound traffic with Cloud NAT

Cloud NAT for GDC air-gapped replaces previous egress solutions and gives you more control over how your instances communicate with other networks, on par with public cloud functionality. Cloud NAT provides several benefits:

Configurable egress IPs: You can assign and manage multiple egress IP addresses for your outbound traffic so you can identify exactly which workloads are communicating.
Customizable timeouts: Manage connection lifecycles by adjusting timeouts for different types of traffic.
Granular control: Administrators can create specific subnets for egress IPs, while application operators define how pods and VMs route their traffic.

Connect standard clusters directly to your organization

In a secure environment, isolation should not result in disconnected silos. With the latest release, standard clusters include networking updates that help you communicate across your organization while maintaining strict security boundaries, helping you manage your environment more effectively. The updates include:

Direct pod communication: Your standard cluster pods can now communicate directly with workloads in your organization’s Default VPC. This simplifies how you connect standard clusters and shared clusters.
Flexible firewall policies: You can use both Project Network Policy and Kubernetes Network Policy APIs to set granular rules for traffic entering and leaving your pods and nodes.
Managed load balancing: You can create internal and external load balancers using standard Kubernetes Service APIs, while GDC manages the underlying configuration for you.

Pods within a standard cluster can now connect to other pods directly or through a ClusterIP. While traffic to the Infra VPC remains blocked, you can send traffic to shared cluster workloads through GDC internal load balancers. This ensures your applications can reach necessary services quickly.

Improve reliability with Load Balancer HTTP and HTTPS health checks

Previously, L4 load balancing health checks monitored only basic TCP connectivity, confirming that a port was open. GDC air-gapped load balancers now support HTTP and HTTPS health checks, which allow you to verify that an application is actually functioning correctly. By checking status codes and response content, you can:

Confirm application health: Verify that services are responding correctly, not just that the server is powered on.
Increase reliability: Automatically detect and route traffic away from applications experiencing internal errors.
Improve visibility: Access better data regarding the health of your VM-based workloads to manage performance before issues arise.
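
The difference between the two kinds of check is easy to demonstrate locally. The sketch below starts a hypothetical backend whose port is open but whose application answers every request with HTTP 500, then probes it both ways. It uses only Python's standard library and does not reflect how GDC load balancers implement their probes. The `/healthz` path is an assumed convention.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class FailingApp(BaseHTTPRequestHandler):
    """Hypothetical backend: the port is open, but the app returns errors."""

    def do_GET(self):
        self.send_response(500)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def tcp_check(host, port, timeout=2.0):
    """L4-style check: healthy if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(host, port, path="/healthz", timeout=2.0):
    """HTTP check: healthy only if the app answers the request with 200."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


# Bind to port 0 so the OS picks a free port for the demo server.
server = HTTPServer(("127.0.0.1", 0), FailingApp)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

tcp_ok = tcp_check(host, port)
http_ok = http_check(host, port)
server.shutdown()
server.server_close()

print(f"TCP check healthy: {tcp_ok}, HTTP check healthy: {http_ok}")
```

The TCP check reports the backend as healthy because the socket accepts connections, while the HTTP check catches the internal error and would let a load balancer route traffic elsewhere.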

Make subnet management easier with subnet groups

Previously, a child subnet could only reference a single parent subnet. With the introduction of the subnet group, a child subnet can now reference a subnet group that may contain multiple parent subnets. This provides the following benefits:

Overcome the challenges of immutable subnet CIDR: Although a subnet's CIDR range is immutable, a subnet group makes it simple to scale up IP resources: you attach a new subnet to the group. You can reference a subnet group instead of a single parent subnet for easy scale-up.
Automatically identify a parent subnet: You can now reference a subnet group, rather than a single subnet, as the parent. You then don't need to manually identify a parent subnet that has available IP resources; GDC IPAM automatically finds a subnet in the group with enough available IP space to act as the parent.
Start with smaller CIDRs for easier planning: Using subnet groups to scale IP resources also means that you can start with smaller and discontinuous CIDRs when creating new parent subnets, making IP resource utilization more efficient and the planning process easier.
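
The allocation behavior described above can be modeled in a few lines with Python's `ipaddress` module. This is a conceptual sketch, not the GDC IPAM API. The `ParentSubnet` and `SubnetGroup` classes, the first-fit search order, and the CIDR ranges are all hypothetical.

```python
import ipaddress


class ParentSubnet:
    """A parent subnet with a fixed (immutable) CIDR range."""

    def __init__(self, cidr):
        self.network = ipaddress.ip_network(cidr)
        self.children = []

    def allocate(self, prefix_len):
        """Return the first free child block of the given size, or None."""
        if prefix_len < self.network.prefixlen:
            return None  # the request is larger than this parent
        for candidate in self.network.subnets(new_prefix=prefix_len):
            if not any(candidate.overlaps(child) for child in self.children):
                self.children.append(candidate)
                return candidate
        return None


class SubnetGroup:
    """A group of parent subnets that children reference instead of one parent."""

    def __init__(self):
        self.parents = []

    def attach(self, cidr):
        """Scale up by adding a parent; existing CIDRs stay untouched."""
        self.parents.append(ParentSubnet(cidr))

    def allocate_child(self, prefix_len):
        """Find a parent with enough free space and carve out the child."""
        for parent in self.parents:
            child = parent.allocate(prefix_len)
            if child is not None:
                return child
        raise RuntimeError("no parent subnet in the group has enough free space")


group = SubnetGroup()
group.attach("10.0.0.0/25")  # start small: room for two /26 children
a = group.allocate_child(26)
b = group.allocate_child(26)

try:
    group.allocate_child(26)
    exhausted = False
except RuntimeError:
    exhausted = True  # the only parent is full

group.attach("10.8.0.0/25")  # a non-adjacent range is fine
c = group.allocate_child(26)
print(a, b, c)
```

The child never names a specific parent. Once the first range fills up, attaching a second, non-adjacent range to the group is enough for later requests to succeed.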

Get started

To learn more about these features, please refer to our documentation or contact your Google Cloud account team.