<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Developers &amp; Practitioners</title><link>https://cloud.google.com/blog/topics/developers-practitioners/</link><description>Developers &amp; Practitioners</description><atom:link href="https://cloudblog.withgoogle.com/blog/topics/developers-practitioners/rss/" rel="self"></atom:link><language>en</language><lastBuildDate>Fri, 10 Apr 2026 16:00:09 +0000</lastBuildDate><image><url>https://cloud.google.com/blog/topics/developers-practitioners/static/blog/images/google.a51985becaa6.png</url><title>Developers &amp; Practitioners</title><link>https://cloud.google.com/blog/topics/developers-practitioners/</link></image><item><title>Migrating to Google Cloud’s Application Load Balancer: A practical guide</title><link>https://cloud.google.com/blog/products/networking/migrate-on-prem-application-load-balancing-to-google-cloud/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Migrating your existing application load balancer infrastructure from an on-premises hardware solution to Cloud Load Balancing offers substantial advantages in scalability, cost-efficiency, and tight integration within the Google Cloud ecosystem. Yet, a fundamental question often arises: "What about our current load balancer configurations?"&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Existing on-premises load balancer configurations often contain years of business-critical logic for traffic manipulation. The good news is that not only can you fully migrate existing functionalities, but this migration also presents a significant opportunity to modernize and simplify your traffic management.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;This guide outlines a practical approach for migrating your existing load balancer to Google Cloud’s Application Load Balancer. It addresses common functionalities, leveraging both its declarative configurations and the innovative, event-driven Service Extensions edge compute capability.&lt;/span&gt;&lt;/p&gt;
&lt;h3 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;A simple, phased approach to migration&lt;/span&gt;&lt;/h3&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Transitioning from an imperative, script-based system to a cloud-native, declarative-first model requires a structured plan. We recommend a straightforward, four-phase approach.&lt;/span&gt;&lt;/p&gt;
&lt;h4 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Phase 1: Discovery and mapping&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Before commencing any migration, you must understand what you have. Analyze and categorize your current load balancer configurations. What is each rule's intent? Is it performing a simple HTTP-to-HTTPS redirect? Is it engaged in HTTP header manipulation (addition or removal)? Or is it handling complex, custom authentication logic? &lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Most configurations typically fall into two primary categories:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Common patterns:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Logic that is common to most web applications, such as redirects, URL rewrites, basic header manipulation, and IP-based access control lists (ACLs).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Bespoke business logic:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Complex logic unique to your application, like custom proprietary token authentication, advanced header extraction / replacement, dynamic backend selection based on HTTP attributes, or HTTP response body manipulation. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
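&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;As a concrete illustration of this triage, the short sketch below sorts an inventory of rules into the two buckets. It is a toy example: the rule type names and the &lt;code&gt;categorize_rules&lt;/code&gt; helper are hypothetical and not part of any Google Cloud API.&lt;/span&gt;&lt;/p&gt;

```python
# Hypothetical Phase 1 triage helper: classify each load balancer rule as a
# "common pattern" (declarative candidate) or "bespoke logic" (programmatic
# candidate). The rule type names below are illustrative only.
COMMON_PATTERNS = {
    "redirect",        # e.g., HTTP-to-HTTPS redirects
    "url_rewrite",     # path or host rewrites
    "header_add",      # basic header manipulation
    "header_remove",
    "ip_acl",          # IP-based access control lists
}

def categorize_rules(rules):
    """Split parsed rules into (common, bespoke) lists by rule type."""
    common, bespoke = [], []
    for rule in rules:
        (common if rule["type"] in COMMON_PATTERNS else bespoke).append(rule)
    return common, bespoke

rules = [
    {"name": "force-https", "type": "redirect"},
    {"name": "strip-server-header", "type": "header_remove"},
    {"name": "legacy-token-auth", "type": "custom_auth"},
]
common, bespoke = categorize_rules(rules)
print([r["name"] for r in common])   # ['force-https', 'strip-server-header']
print([r["name"] for r in bespoke])  # ['legacy-token-auth']
```

A categorized inventory like this makes the Phase 2 mapping decision mechanical: everything in the common bucket goes to declarative features first.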
&lt;h4 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Phase 2: Choose your Google Cloud equivalent&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Once your rules are categorized, the next step involves mapping them to the appropriate Google Cloud feature. This is not a one-to-one replacement; it's a strategic choice.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Option 1: the declarative path (for ~80% of rules)&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For the majority of common patterns, leveraging the Application Load Balancer's built-in declarative features is usually the best approach. Instead of a script, you define the desired state in a configuration file. This is simpler to manage, version-control, and scale.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Common patterns to declarative feature mapping:  &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Redirects/rewrites&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; -&amp;gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Application Load Balancer URL maps&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;ACLs/throttling&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; -&amp;gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud Armor security policies&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Session persistence&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; -&amp;gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;backend service configuration&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
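&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;For instance, an on-premises HTTP-to-HTTPS redirect rule collapses into a few declarative lines in a URL map. The sketch below uses a placeholder resource name; a file like this can typically be applied with &lt;code&gt;gcloud compute url-maps import&lt;/code&gt;.&lt;/span&gt;&lt;/p&gt;

```yaml
# URL map sketch: redirect all HTTP requests to HTTPS.
# "http-redirect-map" is a placeholder name.
kind: compute#urlMap
name: http-redirect-map
defaultUrlRedirect:
  httpsRedirect: true
  redirectResponseCode: MOVED_PERMANENTLY_DEFAULT
  stripQuery: false
```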
&lt;p style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Option 2: The programmatic path (for complex, bespoke rules)&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;When dealing with complex, bespoke business logic, you have a programmatic equivalent: &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/service-extensions/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Service Extensions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a powerful edge compute capability that allows you to inject custom code (written in Rust, C++ or Go) directly into the load balancer's data path. This approach gives you flexibility in a modern, managed, and high-performance framework.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_bkebSe1.max-1000x1000.jpg" alt="image1"&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="s1mli"&gt;This flowchart helps you decide the appropriate Google Cloud feature for each configuration&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Phase 3: Test and validate&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Once you’ve chosen the appropriate path for your configurations, you are ready to &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;deploy your new Application Load Balancer configuration in a staging environment that mirrors your production setup. Thoroughly test all application functionality, paying close attention to the migrated logic. Use a combination of automated testing and manual QA to validate the redirects, security policies, and that the custom Service Extensions logic are behaving as expected.&lt;/span&gt;&lt;/p&gt;
&lt;h4 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Phase 4: Phased cutover (canary deployment)&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Don't flip a single switch for all your traffic; instead, implement a phased migration strategy. Start the transitioning process by routing a small percentage of production traffic (e.g., 5-10%) to your new Google Cloud load balancer. During this initial period, be sure to monitor key metrics like latency, error rates, and application performance. As you gain confidence, you can progressively increase the percentage of traffic routed to the Application Load Balancer. Always have a clear rollback plan to revert back to the legacy infrastructure in the event you encounter critical issues.&lt;/span&gt;&lt;/p&gt;
&lt;h3 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Best practices for a smooth migration&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Drawing from our practical experience, we have compiled the following recommendations to assist you in planning your load balancer migrations. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Analyze first, migrate second:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A thorough analysis of your existing configurations is the most critical step. Don't "lift and shift" logic that is no longer needed.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Prefer declarative:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Always default to Google Cloud's managed, declarative features (URL Maps, Cloud Armor) first. They are simpler, more scalable, and require less maintenance.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Use Service Extensions strategically:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Reserve Service Extensions for the complex, bespoke business logic that declarative features cannot handle.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Monitor everything:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Continuously monitor both your existing load balancers and Google Cloud load balancers during the migration. Watch key metrics like traffic volume, latency, and error rates to detect and address issues instantly.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Train your team:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Ensure your team is trained on Cloud Load Balancing concepts. This will empower them to effectively operate and maintain the new infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Migrating from the existing on-premises load balancer infrastructure is more than just a technical task, it's an opportunity to modernize your application delivery. By thoughtfully mapping your current load balancing configurations and capabilities to either declarative Application Load Balancer features or programmatic Service Extensions, you can build a more scalable, resilient, and cost-effective infrastructure destined for future demands.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get started, review the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/load-balancing/docs/application-load-balancer"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Application Load Balancer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/service-extensions/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Service Extensions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; features and advanced capabilities to come up with the right design for your application. For more guidance and complex use cases, contact your &lt;/span&gt;&lt;a href="https://cloud.google.com/contact"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud team&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 10 Apr 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/networking/migrate-on-prem-application-load-balancing-to-google-cloud/</guid><category>Cloud Migration</category><category>Developers &amp; Practitioners</category><category>Networking</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Migrating to Google Cloud’s Application Load Balancer: A practical guide</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/networking/migrate-on-prem-application-load-balancing-to-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Gopinath Balakrishnan</name><title>Customer Engineer, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Xiaozang Li</name><title>Customer Engineer, Google 
Cloud</title><department></department><company></company></author></item><item><title>Create Expert Content: Local Testing of a Multi-Agent System with Memory</title><link>https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In support of our mission to accelerate the developer journey on Google Cloud, we built Dev Signal: a multi-agent system designed to transform raw community signals into reliable technical guidance by automating the path from discovery to expert creation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1"&gt;part 1&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run?utm_campaign=CDR_0x91b1edb5_default_b8022895&amp;amp;utm_medium=external&amp;amp;utm_source=social"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;part 2&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; of this series, we established the essential groundwork by standardizing the core capabilities through the Model Context Protocol (MCP) and constructing a multi-agent architecture integrated with the Vertex AI memory bank to provide long-term intelligence and persistence. Now, we'll explore how to test your multi-agent system locally!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you’d like to dive straight into the code and explore it at your own pace, you can clone the repository &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Testing the agent locally&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before transitioning your agentic system to Google Cloud Run, it is essential to ensure that its specialized components work seamlessly together on your workstation. This testing phase allows you to validate trend discovery, technical grounding, and creative drafting within a local feedback loop, saving time and resources during the development process.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this section, you will configure your local secrets, implement environment-aware utilities, and use a dedicated test runner to verify that Dev Signal can correctly retrieve user preferences from the Vertex AI memory bank on the cloud. This local verification ensures that your agent's "brain" and "hands" are properly synchronized before moving to deployment.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Environment setup&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;.env&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; file in your project root. These variables are used for local development and will be replaced by Terraform/Secret Manager in production.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev-signal&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/.env &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;and update with your own details.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Note&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GOOGLE_CLOUD_LOCATION &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;is set as global because that is where Gemini-3-flash-preview is supported. We will use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GOOGLE_CLOUD_LOCATION &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;for the model location.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Google Cloud Configuration\r\nGOOGLE_CLOUD_PROJECT=your-project-id\r\nGOOGLE_CLOUD_LOCATION=global\r\nGOOGLE_CLOUD_REGION=us-central1\r\nGOOGLE_GENAI_USE_VERTEXAI=True\r\nAI_ASSETS_BUCKET=your_bucket_name\r\n\r\n# Reddit API Credentials\r\nREDDIT_CLIENT_ID=your_client_id\r\nREDDIT_CLIENT_SECRET=your_client_secret\r\nREDDIT_USER_AGENT=my-agent/0.1\r\n\r\n# Developer Knowledge API Key\r\nDK_API_KEY=your_api_key&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3602243e80&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Helper Utilities&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a new directory for your application utils.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;cd dev_signal_agent\r\nmkdir app_utils\r\ncd app_utils&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3602243f10&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Environment configuration &lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This module standardizes how the agent discovers the active Google Cloud Project and Region, ensuring a seamless transition between development environments. Using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;load_dotenv()&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, the script first checks for local configurations before falling back to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google.auth.default()&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; or environment variables to retrieve the Project ID. This automated approach ensures your agent is properly authenticated and grounded in the correct cloud context without requiring manual configuration changes.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond basic project discovery, the script provides a robust &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Secret Management&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; layer. It attempts to resolve sensitive credentials, such as Reddit API keys, first from the local environment (for rapid development) and then dynamically from the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/secret-manager/docs/reference/rest?rep_location=me-central2&amp;amp;utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Secret Manager API&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for production security. By returning these as a dictionary rather than injecting them into environment variables, the module maintains a clean security posture.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The script further calibrates the environment by distinguishing between global and regional requirements for different AI services. It specifically assigns the "global" location for models to access cutting-edge preview features while designating a regional location, such as &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;us-central1&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, for infrastructure like the Vertex AI Agent Engine. By finalizing this setup with a global SDK initialization, the module integrates these settings into the session, allowing the rest of your application to interact with models and memory banks without having to repeatedly pass project or location parameters.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/app_utils/env.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import os\r\nimport google.auth\r\nimport vertexai\r\nfrom google.cloud import secretmanager\r\nfrom dotenv import load_dotenv\r\n\r\ndef _fetch_secrets(project_id: str):\r\n    &amp;quot;&amp;quot;&amp;quot;Fetch secrets from Secret Manager and return them as a dictionary.&amp;quot;&amp;quot;&amp;quot;\r\n    secrets_to_fetch = [&amp;quot;REDDIT_CLIENT_ID&amp;quot;, &amp;quot;REDDIT_CLIENT_SECRET&amp;quot;, &amp;quot;REDDIT_USER_AGENT&amp;quot;, &amp;quot;DK_API_KEY&amp;quot;]\r\n    fetched_secrets = {}\r\n\r\n    # First, check local environment (for local development via .env)\r\n    for s in secrets_to_fetch:\r\n        val = os.getenv(s)\r\n        if val:\r\n            fetched_secrets[s] = val\r\n\r\n    # If keys are missing (common in production), fetch from Secret Manager API\r\n    if len(fetched_secrets) &amp;lt; len(secrets_to_fetch):\r\n        client = secretmanager.SecretManagerServiceClient()\r\n        for secret_id in secrets_to_fetch:\r\n            if secret_id not in fetched_secrets:\r\n                name = f&amp;quot;projects/{project_id}/secrets/{secret_id}/versions/latest&amp;quot;\r\n                try:\r\n                    response = client.access_secret_version(request={&amp;quot;name&amp;quot;: name})\r\n                    # DO NOT set os.environ[secret_id] here. 
\r\n                    # Keep it in this dictionary only.\r\n                    fetched_secrets[secret_id] = response.payload.data.decode(&amp;quot;UTF-8&amp;quot;)\r\n                except Exception as e:\r\n                    print(f&amp;quot;Warning: Could not fetch {secret_id} from Secret Manager: {e}&amp;quot;)\r\n\r\n    return fetched_secrets\r\n\r\ndef init_environment():\r\n    &amp;quot;&amp;quot;&amp;quot;Consolidated environment discovery.&amp;quot;&amp;quot;&amp;quot;\r\n    load_dotenv()\r\n    try:\r\n        _, project_id = google.auth.default()\r\n    except Exception:\r\n        project_id = os.getenv(&amp;quot;GOOGLE_CLOUD_PROJECT&amp;quot;)\r\n    \r\n    model_location = os.getenv(&amp;quot;GOOGLE_CLOUD_LOCATION&amp;quot;, &amp;quot;global&amp;quot;)\r\n    service_location = os.getenv(&amp;quot;GOOGLE_CLOUD_REGION&amp;quot;, &amp;quot;us-central1&amp;quot;)\r\n    \r\n    secrets = {}\r\n    if project_id:\r\n        vertexai.init(project=project_id, location=service_location)\r\n        # Fetch secrets into a local variable\r\n        secrets = _fetch_secrets(project_id)\r\n        \r\n    return project_id, model_location, service_location, secrets&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffceeb0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Local testing script&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Google ADK comes with a built-in Web UI, This UI is excellent for visualizing agent logic and tool composition.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can launch it by running in the project root:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;uv run adk web&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec3398e0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;However, the default Web UI will not test the long-term memory integration described in this tutorial because it is not pre-connected to a Vertex AI memory session. By default, the generic UI often relies on in-memory services that do not persist data across sessions. Therefore, we use the dedicated &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;test_local.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; script to explicitly initialize the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;VertexAiMemoryBankService&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. This ensures that even in a local environment, your agent is communicating with the real cloud-based memory bank to validate preference persistence.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;test_local.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; script:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Connects to the real &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Agent Engine&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the cloud for memory storage.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Uses an in-memory session service for local chat history (so you can wipe it easily).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Run a chat loop where you can talk to your agent.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Go back to the root folder  &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev-signal&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;:&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;cd ../..&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec339640&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev-signal&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/test_local.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import asyncio\r\nimport os\r\nimport google.auth\r\nimport vertexai\r\nimport uuid\r\nfrom dotenv import load_dotenv\r\nfrom google.adk.runners import Runner\r\nfrom google.adk.memory.vertex_ai_memory_bank_service import VertexAiMemoryBankService\r\nfrom google.adk.sessions import InMemorySessionService\r\nfrom vertexai import agent_engines\r\nfrom google.genai import types\r\nfrom dev_signal_agent.agent import root_agent\r\n\r\n# Load environment variables\r\nload_dotenv()\r\n\r\nasync def main():\r\n    # 1. Setup Configuration\r\n    project_id = os.getenv(&amp;quot;GOOGLE_CLOUD_PROJECT&amp;quot;)\r\n    # Agent Engine (Memory) MUST use a regional endpoint\r\n    resource_location = &amp;quot;us-central1&amp;quot;\r\n    agent_name = &amp;quot;dev-signal&amp;quot;\r\n    \r\n    print(f&amp;quot;--- Initializing Vertex AI in {resource_location} ---&amp;quot;)\r\n    vertexai.init(project=project_id, location=resource_location)\r\n\r\n    # 2. Find the Agent Engine Resource for Memory\r\n    existing_agents = list(agent_engines.list(filter=f&amp;quot;display_name={agent_name}&amp;quot;))\r\n    if existing_agents:\r\n        agent_engine = existing_agents[0]\r\n        agent_engine_id = agent_engine.resource_name.split(&amp;quot;/&amp;quot;)[-1]\r\n        print(f&amp;quot;✅ Using persistent Memory Bank from Agent: {agent_engine_id}&amp;quot;)\r\n    else:\r\n        print(f&amp;quot;❌ Error: Agent Engine \&amp;#x27;{agent_name}\&amp;#x27; not found. Please deploy with Terraform first.&amp;quot;)\r\n        return\r\n\r\n    # 3. 
Initialize Services\r\n    # We use InMemorySessionService for easier local testing (IDs are flexible)\r\n    # BUT we use VertexAiMemoryBankService for REAL cloud persistence\r\n    session_service = InMemorySessionService()\r\n    \r\n    memory_service = VertexAiMemoryBankService(\r\n        project=project_id,\r\n        location=resource_location,\r\n        agent_engine_id=agent_engine_id\r\n    )\r\n\r\n    # 4. Create a Runner\r\n    runner = Runner(\r\n        agent=root_agent,\r\n        app_name=&amp;quot;dev-signal&amp;quot;,\r\n        session_service=session_service,\r\n        memory_service=memory_service \r\n    )\r\n\r\n    # 5. Run a Test Loop\r\n    user_id = &amp;quot;local-tester&amp;quot;\r\n    \r\n    print(&amp;quot;\\n--- TEST SCENARIO ---&amp;quot;)\r\n    print(&amp;quot;1. Start a session, tell the agent your preference (e.g., \&amp;#x27;write in rhymes\&amp;#x27;).&amp;quot;)\r\n    print(&amp;quot;2. Type \&amp;#x27;new\&amp;#x27; to start a FRESH session (local state wiped).&amp;quot;)\r\n    print(&amp;quot;3. Ask for a blog post. 
The agent should retrieve your preference from the CLOUD memory.&amp;quot;)\r\n    \r\n    current_session_id = f&amp;quot;session-{str(uuid.uuid4())[:8]}&amp;quot;\r\n    await session_service.create_session(\r\n        app_name=&amp;quot;dev-signal&amp;quot;,\r\n        user_id=user_id,\r\n        session_id=current_session_id\r\n    )\r\n    print(f&amp;quot;\\n--- Chat Session (ID: {current_session_id}) ---&amp;quot;)\r\n\r\n    while True:\r\n        user_input = input(&amp;quot;\\nYou: &amp;quot;)\r\n        \r\n        if user_input.lower() in [&amp;quot;exit&amp;quot;, &amp;quot;quit&amp;quot;]:\r\n            break\r\n            \r\n        if user_input.lower() == &amp;quot;new&amp;quot;:\r\n            # Simulate starting a completely fresh session\r\n            current_session_id = f&amp;quot;session-{str(uuid.uuid4())[:8]}&amp;quot;\r\n            await session_service.create_session(\r\n                app_name=&amp;quot;dev-signal&amp;quot;,\r\n                user_id=user_id,\r\n                session_id=current_session_id\r\n            )\r\n            print(f&amp;quot;\\n--- Fresh Session Started (ID: {current_session_id}) ---&amp;quot;)\r\n            print(&amp;quot;(Local history is empty, retrieval must come from Memory Bank)&amp;quot;)\r\n            continue\r\n\r\n        print(&amp;quot;Agent is thinking...&amp;quot;)\r\n        async for event in runner.run_async(\r\n            user_id=user_id,\r\n            session_id=current_session_id,\r\n            new_message=types.Content(parts=[types.Part(text=user_input)])\r\n        ):\r\n            if event.content and event.content.parts:\r\n                for part in event.content.parts:\r\n                    if part.text:\r\n                        print(f&amp;quot;Agent: {part.text}&amp;quot;)\r\n            \r\n            if event.get_function_calls():\r\n                for fc in event.get_function_calls():\r\n                    print(f&amp;quot;?️  Tool Call: 
{fc.name}&amp;quot;)\r\n\r\nif __name__ == &amp;quot;__main__&amp;quot;:\r\n    asyncio.run(main())&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec339820&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Running the Test&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;First, ensure you have your Application Default Credentials set up:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud auth application-default login&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec339760&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Then run the script:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;uv run test_local.py&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec339790&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
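If the script exits immediately, it is usually a configuration issue. Here is a minimal preflight sketch (stdlib only; the `GOOGLE_CLOUD_PROJECT` variable matches the script above, while the default ADC file path is an assumption about a standard gcloud install):

```python
import os

def preflight() -> list[str]:
    """Return a list of problems that would make test_local.py fail early."""
    problems = []
    # The script reads the project ID from the environment (or a .env file).
    if not os.getenv("GOOGLE_CLOUD_PROJECT"):
        problems.append("GOOGLE_CLOUD_PROJECT is not set")
    # Application Default Credentials are written to this path by
    # `gcloud auth application-default login` on Linux/macOS (assumed default).
    adc = os.path.expanduser(
        "~/.config/gcloud/application_default_credentials.json"
    )
    if not os.path.exists(adc) and not os.getenv("GOOGLE_APPLICATION_CREDENTIALS"):
        problems.append("no Application Default Credentials found")
    return problems

if __name__ == "__main__":
    for p in preflight():
        print(f"warning: {p}")
```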
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Test Scenario&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This scenario validates the full end-to-end lifecycle of the agent: from discovery and research to multimodal content creation and long-term memory retrieval.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Phase &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;1: Teaching &amp;amp; Multimodal Creation (Session 1)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Goal: Establish technical context and set a specific stylistic preference.&lt;/span&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Discovery&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/h4&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Ask the agent to find trending Cloud Run topics.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Input&lt;/span&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;"Find high-engagement questions about AI agents on Cloud Run from the last 21 days."&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test1.max-1000x1000.png"
        
          alt="test1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test2.max-1000x1000.png"
        
          alt="test2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Research&lt;/span&gt;&lt;/h4&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Instruct the agent to perform a deep dive on a specific result.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Input&lt;/span&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;"Use the GCP Expert to research topic #1."&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test3.max-1000x1000.png"
        
          alt="test3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Personalization&lt;/span&gt;&lt;/h4&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Request a blog post and explicitly set your style preference.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Input&lt;/span&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;"Draft a blog post based on this research. From now on, I want all my technical blogs written in the style of a 90s Rap Song."&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test4.max-1000x1000.png"
        
          alt="test4"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Image generation&lt;/span&gt;&lt;/h4&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Ask the agent to generate an image that demonstrates the main ideas in the blog using the Nano Banana Pro tool. The image would be saved to your bucket in Google Cloud and you should get the path to see it which will look like this: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;https://storage.mtls.cloud.google.com/...&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/tokenoptimization.max-1000x1000.png"
        
          alt="tokenoptimization"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Phase &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;2: Long-Term Memory Recall (Session 2)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Goal: Verify the agent recalls preferences across a completely fresh session.&lt;/span&gt;&lt;/em&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Type &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;new&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; in the console to wipe local session history and start a fresh state.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Retrieval) Inquire about your stored preferences to test the Vertex AI memory bank.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Input&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;"What are my current topics of interest and what is my preferred blogging style?"&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Verification: Confirm the agent successfully retrieves your "AI Agents on Cloud Run" interest and "Rap" style from the cloud.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
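The behavior this phase verifies, ephemeral session history versus a persistent per-user memory bank, can be illustrated with a toy model. This is a conceptual sketch only (plain Python, with keyword matching standing in for semantic retrieval); it is not the ADK or Vertex AI API:

```python
class SessionService:
    """Ephemeral per-session history; wiped whenever a new session starts."""
    def __init__(self):
        self.sessions = {}
    def create_session(self, session_id):
        self.sessions[session_id] = []          # fresh, empty history
    def append(self, session_id, message):
        self.sessions[session_id].append(message)

class MemoryBank:
    """Long-term store keyed by user, surviving across sessions."""
    def __init__(self):
        self.facts = {}
    def remember(self, user_id, fact):
        self.facts.setdefault(user_id, []).append(fact)
    def search(self, user_id, query):
        # Trivial keyword match stands in for real semantic retrieval.
        return [f for f in self.facts.get(user_id, [])
                if any(w in f.lower() for w in query.lower().split())]

# Session 1: the user states a preference; the agent persists it to the bank.
sessions, bank = SessionService(), MemoryBank()
sessions.create_session("session-1")
sessions.append("session-1", "write my blogs as a 90s rap song")
bank.remember("local-tester", "Preferred blogging style: 90s rap song")

# Session 2: local history is empty, but the bank still answers.
sessions.create_session("session-2")
assert sessions.sessions["session-2"] == []
print(bank.search("local-tester", "blogging style"))
```

The point mirrors the test scenario: typing `new` only resets the `SessionService` side, while retrieval still succeeds against the `MemoryBank` side.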
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test5.max-1000x1000.png"
        
          alt="test5"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;Final Test&lt;/strong&gt;: Ask for a new blog on a different topic (e.g., "GKE Autopilot") and ensure it is automatically written as a rap song without being prompted.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Summary&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this part of our series we focused on verifying the agent's functionality in a local environment before proceeding to cloud deployment. By configuring local secrets and utilizing environment-aware utilities, we used a dedicated test runner to confirm that the core reasoning and tool logic are properly integrated. We successfully validated the full lifecycle: from Reddit discovery to expert content creation, confirming that the agent correctly retrieves preferences from the cloud-based Vertex AI memory bank even in completely fresh sessions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to run the test scenario yourself? Clone the &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and try the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;test_local.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; script to see 'Dev Signal' retrieve your preferences from the Vertex AI memory bank in real-time. For a deeper dive into the underlying mechanics of memory orchestration, check out this &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/quickstart-adk?content_ref=manage%20long%20term%20memories%20for%20you%20this%20tutorial%20demonstrates%20how%20you%20can%20use%20memory%20bank%20with%20the%20adk%20to%20manage%20long%20term%20memories%20create%20your%20local%20adk%20agent%20and%20runner&amp;amp;utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;quickstart guide&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the final part of this series, we will transition our prototype into production service on Google Cloud Run using Terraform for secure infrastructure and explore the roadmap to production excellence through continuous evaluation and security&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Special thanks to &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;Remigiusz Samborski&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; for the helpful review and feedback on this article.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;For more content like this, Follow me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Linkedin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 10 Apr 2026 08:11:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Create Expert Content: Local Testing of a Multi-Agent System with Memory</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Shir Meir Lador</name><title>Head of AI, Product DevRel</title><department></department><company></company></author></item><item><title>Experimenting with GPUs: GKE managed DRANET and Inference Gateway AI 
Deployment</title><link>https://cloud.google.com/blog/topics/developers-practitioners/experimenting-with-gpus-gke-managed-dranet-and-inference-gateway-ai-deployment/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building and serving models on infrastructure is a strong use case for businesses. In Google Cloud, you can design your AI infrastructure to suit your workloads. Recently, I experimented with Google Kubernetes Engine &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/allocate-network-resources-dra"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;(GKE) managed DRANET&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; while deploying a model for inference with NVIDIA B200 GPUs on GKE. In this blog, we will explore this setup in easy-to-follow steps.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;What is DRANET &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Resource Allocation (DRA)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is a feature that lets you request and share resources among Pods. DRANET allows you to request and allocate networking resources for your Pods, including network interfaces that support TPUs &amp;amp; Remote Direct Memory Access (RDMA). In my case, the use of high-end GPUs.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;How GPU RDMA VPC works &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vpc/docs/rdma-network-profiles#overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;RDMA network&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is set up as an isolated VPC, which is regional and assigned a network profile type. In this case, the network profile type is RoCEv2. This VPC is dedicated for GPU-to-GPU communication. The GPU VM families have RDMA capable NICs that connect to the RDMA VPC. The GPUs communicate between multiple nodes via this low latency, high speed rail aligned setup.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Design pattern example&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our aim was to deploy a LLM model (Deepseek) onto a GKE cluster with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A4 nodes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; that support 8 B200 GPUs and serve it via &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GKE Inference gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; privately. To set up an &lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt; GKE cluster, you can use the Cluster Toolkit, but in my case, I wanted to test the &lt;span style="vertical-align: baseline;"&gt;GKE managed &lt;/span&gt;DRANET dynamic setup of the networking that supports RDMA for the GPU communication.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1-archgpu.max-1000x1000.png"
        
          alt="1-archgpu"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This design utilizes the following services to provide an end-to-end solution:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;VPC:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Total of 3 VPC. One VPC manually created, two created automatically by &lt;span style="vertical-align: baseline;"&gt;GKE managed &lt;/span&gt;DRANET, one standard and one for RDMA.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To deploy the workload.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE Inference gateway:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To expose the workload internally using a regional internal Application Load Balancers type gke-l7-rilb.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;A4 VM’s:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; These support RoCEv2 with NVIDIA B200 GPU.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Putting it together &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get access to the A4 VM a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/consumption-models#comparison"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;future reservation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; was used. This is linked to a specific zone.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Begin:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Set up the environment &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Create a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vpc/docs/create-modify-vpc-networks#create-custom-network"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;standard VPC&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, with firewall rules and subnet in the same zone as the reservation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Create a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/load-balancing/docs/proxy-only-subnets#proxy_only_subnet_create"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;proxy-only subnet&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; this will be used with the Internal regional application load balancer attached to the GKE inference gateway&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
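The two setup steps above can be sketched with gcloud. The network names, region, and IP ranges below are placeholder assumptions, not values from the original walkthrough; pick a region/zone matching your reservation:

```shell
# Custom-mode VPC plus a subnet in the reservation's region.
gcloud compute networks create demo-net --subnet-mode=custom
gcloud compute networks subnets create demo-net-sub \
    --network=demo-net --region=us-central1 --range=10.0.0.0/24

# Proxy-only subnet consumed by the regional internal Application
# Load Balancer that backs the GKE Inference Gateway.
gcloud compute networks subnets create demo-proxy-only-sub \
    --purpose=REGIONAL_MANAGED_PROXY --role=ACTIVE \
    --network=demo-net --region=us-central1 --range=10.129.0.0/23

# Basic firewall rule so cluster-internal traffic is allowed.
gcloud compute firewall-rules create demo-allow-internal \
    --network=demo-net --allow=tcp,udp,icmp --source-ranges=10.0.0.0/8
```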
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Next&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Create a standard GKE cluster node and default node pool.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud container clusters create $CLUSTER_NAME \\\r\n    --location=$ZONE \\\r\n    --num-nodes=1 \\\r\n    --machine-type=e2-standard-16 \\\r\n    --network=${GVNIC_NETWORK_PREFIX}-main \\\r\n    --subnetwork=${GVNIC_NETWORK_PREFIX}-sub \\\r\n    --release-channel rapid \\\r\n    --enable-dataplane-v2 \\\r\n    --enable-ip-alias \\\r\n    --addons=HttpLoadBalancing,RayOperator \\\r\n    --gateway-api=standard \\\r\n    --enable-ray-cluster-logging \\\r\n    --enable-ray-cluster-monitoring \\\r\n    --enable-managed-prometheus \\\r\n    --enable-dataplane-v2-metrics \\\r\n    --monitoring=SYSTEM&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffaaac0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once that is complete you can connect to your cluster:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE --project $PROJECT&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffaa670&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/allocate-network-resources-dra#enable-dra-driver-gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GPU node pool&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (this example uses, A4 VM with reservation) and additionals flags: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;--accelerator-network-profile=auto&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (GKE automatically adds the gke.networks.io/accelerator-network-profile: auto label to the nodes)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;--node-labels=cloud.google.com/gke-networking-dra-driver=true&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (Enables DRA for high-performance networking)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud beta container node-pools create $NODE_POOL_NAME \\\r\n  --cluster $CLUSTER_NAME \\\r\n  --location $ZONE \\\r\n  --node-locations $ZONE \\\r\n  --machine-type a4-highgpu-8g \\\r\n  --accelerator type=nvidia-b200,count=8,gpu-driver-version=latest \\\r\n  --enable-autoscaling --num-nodes=1 --total-min-nodes=1 --total-max-nodes=3 \\\r\n  --reservation-affinity=specific \\\r\n--reservation=projects/$PROJECT/reservations/$RESERVATION_NAME/reservationBlocks/$BLOCK_NAME \\\r\n   --accelerator-network-profile=auto \\\r\n--node-labels=cloud.google.com/gke-networking-dra-driver=true&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffaa8b0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Next:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Create a ResourceClaimTemplate, which will be used to attach the networking resources to your deployments. The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;deviceClassName: mrdma.google.com &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;is used for GPU workloads:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;apiVersion: resource.k8s.io/v1\r\nkind: ResourceClaimTemplate\r\nmetadata:\r\n  name: all-mrdma\r\nspec:\r\n  spec:\r\n    devices:\r\n      requests:\r\n      - name: req-mrdma\r\n        exactly:\r\n          deviceClassName: mrdma.google.com\r\n          allocationMode: All&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffaacd0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Deploy model and inference&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Now that a cluster and node pool is setup,&lt;/span&gt; we can deploy a model and serve it via Inference gateway. In my experiment I used DeepSeek but this could be any model.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Deploy model and services&lt;/span&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt; nodeSelector: gke.networks.io/accelerator-network-profile: auto &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;is used to schedule the workload onto the GPU nodes&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt; resourceClaims: &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;field attaches the networking resource claim we defined earlier&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Create a secret (&lt;/span&gt;&lt;a href="https://huggingface.co/docs/hub/security-tokens#how-to-manage-user-access-tokens" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;I used a Hugging Face&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt; token)&lt;/strong&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;kubectl create secret generic hf-secret \\\r\n  --from-literal=hf_token=${HF_TOKEN}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffaaa60&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Deployment&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;apiVersion: apps/v1\r\nkind: Deployment\r\nmetadata:\r\n  name: deepseek-v3-1-deploy\r\nspec:\r\n  replicas: 1\r\n  selector:\r\n    matchLabels:\r\n      app: deepseek-v3-1\r\n  template:\r\n    metadata:\r\n      labels:\r\n        app: deepseek-v3-1\r\n        ai.gke.io/model: deepseek-v3-1\r\n        ai.gke.io/inference-server: vllm\r\n        examples.ai.gke.io/source: user-guide\r\n    spec:\r\n      containers:\r\n      - name: vllm-inference\r\n        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250819_0916_RC01\r\n        resources:\r\n          requests:\r\n            cpu: &amp;quot;190&amp;quot;\r\n            memory: &amp;quot;1800Gi&amp;quot;\r\n            ephemeral-storage: &amp;quot;1Ti&amp;quot;\r\n            nvidia.com/gpu: &amp;quot;8&amp;quot;\r\n          limits:\r\n            cpu: &amp;quot;190&amp;quot;\r\n            memory: &amp;quot;1800Gi&amp;quot;\r\n            ephemeral-storage: &amp;quot;1Ti&amp;quot;\r\n            nvidia.com/gpu: &amp;quot;8&amp;quot;\r\n          claims:\r\n          - name: rdma-claim\r\n        command: [&amp;quot;python3&amp;quot;, &amp;quot;-m&amp;quot;, &amp;quot;vllm.entrypoints.openai.api_server&amp;quot;]\r\n        args:\r\n        - --model=$(MODEL_ID)\r\n        - --tensor-parallel-size=8\r\n        - --host=0.0.0.0\r\n        - --port=8000\r\n        - --max-model-len=32768\r\n        - --max-num-seqs=32\r\n        - --gpu-memory-utilization=0.90\r\n        - --enable-chunked-prefill\r\n        - --enforce-eager\r\n        - --trust-remote-code\r\n        env:\r\n        - name: MODEL_ID\r\n          value: deepseek-ai/DeepSeek-V3.1\r\n        - name: HUGGING_FACE_HUB_TOKEN\r\n          valueFrom:\r\n            secretKeyRef:\r\n              name: hf-secret\r\n              key: hf_token\r\n        volumeMounts:\r\n        - mountPath: /dev/shm\r\n          name: dshm\r\n        livenessProbe:\r\n          httpGet:\r\n            path: /health\r\n            port: 8000\r\n          initialDelaySeconds: 1800\r\n          periodSeconds: 10\r\n        readinessProbe:\r\n          httpGet:\r\n            path: /health\r\n            port: 8000\r\n          initialDelaySeconds: 1800\r\n          periodSeconds: 5\r\n      volumes:\r\n      - name: dshm\r\n        emptyDir:\r\n            medium: Memory\r\n      nodeSelector:\r\n        gke.networks.io/accelerator-network-profile: auto\r\n      resourceClaims:\r\n      - name: rdma-claim\r\n        resourceClaimTemplateName: all-mrdma\r\n---\r\napiVersion: v1\r\nkind: Service\r\nmetadata:\r\n  name: deepseek-v3-1-service\r\nspec:\r\n  selector:\r\n    app: deepseek-v3-1\r\n  type: ClusterIP\r\n  ports:\r\n    - protocol: TCP\r\n      port: 8000\r\n      targetPort: 8000\r\n---\r\napiVersion: monitoring.googleapis.com/v1\r\nkind: PodMonitoring\r\nmetadata:\r\n  name: deepseek-v3-1-monitoring\r\nspec:\r\n  selector:\r\n    matchLabels:\r\n      app: deepseek-v3-1\r\n  endpoints:\r\n  - port: 8000\r\n    path: /metrics\r\n    interval: 30s&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ecb46b80&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
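Once the deployment is applied, you can check that the pod is running and that the DRA networking claim was actually allocated. This is a quick verification sketch, assuming the default namespace; the claim name is generated from the all-mrdma template, so list the claims first:

```shell
# Confirm the vLLM pod is scheduled on the GPU node and running
kubectl get pods -l app=deepseek-v3-1 -o wide

# List the ResourceClaims generated from the all-mrdma template;
# an allocated claim indicates the mRDMA devices were attached
kubectl get resourceclaims

# Inspect a claim's status for the list of allocated devices
# (substitute a claim name from the previous command's output)
kubectl describe resourceclaim <generated-claim-name>
```

These commands require access to the running cluster; the placeholder claim name must be replaced with the generated name from your environment.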
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Deploy GKE Inference Gateway&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This step &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/deploy-gke-inference-gateway#prepare-environment"&gt;installs the needed Custom Resource Definitions (CRDs) in your GKE cluster:&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For GKE versions &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;1.34.0-gke.1626000&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; or later, install only the alpha &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;InferenceObjective&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; CRD:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ecb466a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Create the InferencePool&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;helm install deepseek-v3-pool \\\r\n  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \\\r\n  --version v1.0.1 \\\r\n  --set inferencePool.modelServers.matchLabels.app=deepseek-v3-1 \\\r\n  --set provider.name=gke \\\r\n  --set inferenceExtension.monitoring.gke.enabled=true&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ecb46970&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Create the Gateway, HTTPRoute and InferenceObjective&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# 1. The Regional Internal Gateway (ILB)\r\napiVersion: gateway.networking.k8s.io/v1\r\nkind: Gateway\r\nmetadata:\r\n  name: deepseek-v3-gateway\r\n  namespace: default\r\nspec:\r\n  gatewayClassName: gke-l7-rilb\r\n  listeners:\r\n  - name: http\r\n    protocol: HTTP\r\n    port: 80\r\n    allowedRoutes:\r\n      namespaces:\r\n        from: Same\r\n---\r\n# 2. The HTTPRoute (Routing to the Pool)\r\napiVersion: gateway.networking.k8s.io/v1\r\nkind: HTTPRoute\r\nmetadata:\r\n  name: deepseek-v3-route\r\n  namespace: default\r\nspec:\r\n  parentRefs:\r\n  - name: deepseek-v3-gateway\r\n  rules:\r\n  - matches:\r\n    - path:\r\n        type: PathPrefix\r\n        value: /\r\n    backendRefs:\r\n    - group: inference.networking.k8s.io\r\n      kind: InferencePool\r\n      name: deepseek-v3-pool\r\n---\r\n# 3. The Inference Objective (Performance Logic)\r\napiVersion: inference.networking.x-k8s.io/v1alpha2\r\nkind: InferenceObjective\r\nmetadata:\r\n  name: deepseek-v3-objective\r\n  namespace: default\r\nspec:\r\n  poolRef:\r\n    name: deepseek-v3-pool&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ecb46490&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
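Before testing, you need the internal IP address that the Gateway was assigned. One way to read it (a sketch assuming the default namespace, and that the regional internal ALB has finished programming) is from the Gateway's status:

```shell
# Read the regional internal ALB address from the Gateway status
# (it can take a few minutes for the address to be populated)
kubectl get gateway deepseek-v3-gateway \
  -o jsonpath='{.status.addresses[0].value}'

# Export it for use in subsequent test requests
export GATEWAY_IP=$(kubectl get gateway deepseek-v3-gateway \
  -o jsonpath='{.status.addresses[0].value}')
echo "Gateway IP: ${GATEWAY_IP}"
```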
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once complete, you can create a test VM in your main VPC and make a call to the IP address of the GKE Inference Gateway:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;curl -N -s -X POST &amp;quot;http://$GATEWAY_IP/v1/chat/completions&amp;quot; \\\r\n  -H &amp;quot;Content-Type: application/json&amp;quot; \\\r\n  -d \&amp;#x27;{\r\n    &amp;quot;model&amp;quot;: &amp;quot;deepseek-ai/DeepSeek-V3.1&amp;quot;,\r\n    &amp;quot;messages&amp;quot;: [{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Box A: red. Box B: blue. Box C: empty. Move A to C, Move B to A, Swap B and C. Where is red?&amp;quot;}],\r\n    &amp;quot;stream&amp;quot;: true\r\n  }\&amp;#x27; | stdbuf -oL grep &amp;quot;data: &amp;quot; | sed -u \&amp;#x27;s/^data: //\&amp;#x27; | grep -v &amp;quot;\\[DONE\\]&amp;quot; | \\\r\n  jq --unbuffered -rj \&amp;#x27;.choices[0].delta | (.reasoning_content // .reasoning // .content // empty)\&amp;#x27;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ecb464c0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Next steps&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To take a deeper dive into GKE managed DRANET and the GKE Inference Gateway, review the following.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Blog: &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-device-management-with-dra-dynamic-resource-allocation?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DRA: A new era of Kubernetes device management with Dynamic Resource Allocation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Document set: &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/config-auto-net-for-accelerators"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DRANET&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Documentation: &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Want to ask a question, find out more, or share a thought? Please connect with me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/ammett/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;LinkedIn&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 08 Apr 2026 10:05:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/experimenting-with-gpus-gke-managed-dranet-and-inference-gateway-ai-deployment/</guid><category>Networking</category><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/0-hero-dranet.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Experimenting with GPUs: GKE managed DRANET and Inference Gateway AI Deployment</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/0-hero-dranet.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/experimenting-with-gpus-gke-managed-dranet-and-inference-gateway-ai-deployment/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ammett Williams</name><title>Developer Relations Engineer</title><department></department><company></company></author></item><item><title>See beyond the IP and secure URLs with Google Cloud NGFW</title><link>https://cloud.google.com/blog/products/identity-security/see-beyond-the-ip-and-secure-urls-with-google-cloud-ngfw/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a cloud-first world, traditional IP-based defenses are no longer enough to protect your perimeter. 
As services migrate to shared infrastructure and content delivery networks, relying on static IP addresses and FQDNs can create security gaps.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Because a single IP address can host multiple services, and IP addresses can change frequently, we are introducing domain filtering with a wildcard capability in Cloud Next Generation Firewall (NGFW) Enterprise. This new capability provides increased security and granular policy controls.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Why domain and SNI filtering matters&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Cloud NGFW URL filtering service performs deep inspections of HTTP payloads to secure workloads against threats from both public and internal networks. This service elevates security controls to the application layer and helps restrict access to malicious domains. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Key use cases include: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Granular egress control&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This capability enables the precise allowing and blocking of connections based on domain names and SNI information found in egress HTTP(S) messages. By inspecting Layer 7 (L7) headers, it offers significantly finer control than traditional filtering based solely on IP addresses and FQDNs, which can be inefficient when a single IP hosts multiple services.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Control access without decrypting&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: For organizations that prefer not to perform full TLS decryption on their traffic, Cloud NGFW can still enforce security policies by controlling traffic based on SNI headers provided during the TLS handshake. This allows for effective domain-level filtering while maintaining end-to-end encryption for privacy or compliance reasons.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Reduced operational overhead&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Implementing domain-based filtering helps reduce the constant maintenance typically required to track frequently changing IP addresses and DNS records. By focusing on stable domain identities rather than dynamic network attributes, security teams can minimize the manual effort involved in updating firewall rulebases.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Flexible matching&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The service utilizes matcher strings within URL lists, supporting limited wildcard domains to define criteria for both domains and subdomains. For example, using a wildcard like &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;*.example.com&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; allows a single filter to cover all associated subdomains, providing a more scalable solution than defining thousands of individual FQDN entries.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Improved security: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;URL filtering significantly enhances the security posture by protecting against sophisticated evasion techniques such as SNI header spoofing. By evaluating L7 headers before allowing access to an application, Cloud NGFW ensures that attackers cannot bypass security controls by simply spoofing lower-layer identifiers. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;How Cloud NGFW URL filtering works&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The URL filtering service functions by inspecting traffic at L7 using a distributed architecture. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_zzP0Xt6.max-1000x1000.png"
        
          alt="image1"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="6nmqq"&gt;Cloud NGFW URL filtering service&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can get started with URL filtering in three simple steps.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Deploy Cloud NGFW endpoints&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The first step is to &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/configure-firewall-endpoints#create-firewall-endpoint"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;create and deploy a Cloud NGFW endpoint&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in a zone. The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/about-firewall-endpoints"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NGFW endpoint&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is an organization-level resource. Please ensure you have the right permissions before deploying the endpoint.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Once the endpoint is deployed, you can &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/configure-firewall-endpoint-associations#create-end-assoc-network"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;associate it with one or more VPCs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; of your choice.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Create security profiles and security profile groups:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/about-security-profiles#url-filtering-profile"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;URL filtering security profile&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; holds the URL filters with matcher strings and an action (allow or deny).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/about-security-profile-groups"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;security profile group&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; acts as a container for these security profiles, which is then referenced by a firewall policy rule. &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/configure-urlf-security-profiles#create-urlf-security-profile"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create URL filtering security profiles&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with the desired URLs and wildcard FQDNs, and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/configure-security-profile-groups#create-security-profile-group"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;add them to a security profile group&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Once the security profile group is created, reference it in your firewall policies.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Policy enforcement:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;You enable the service by configuring a hierarchical or global network firewall policy rule using the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;apply_security_profile_group&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; action, specifying the name of your security profile group. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/ol&gt;
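As a rough sketch, the three steps above map to a sequence of gcloud commands along these lines. The resource names here are placeholders, and the exact URL-filtering profile subcommand and flag spellings are illustrative assumptions; confirm them against the linked documentation before use:

```shell
# 1a. Create a firewall endpoint (an organization-level resource)
gcloud network-security firewall-endpoints create my-endpoint \
  --zone=us-central1-a \
  --organization=ORGANIZATION_ID \
  --billing-project=PROJECT_ID

# 1b. Associate the endpoint with a VPC network in the same zone
gcloud network-security firewall-endpoint-associations create my-assoc \
  --endpoint=my-endpoint \
  --zone=us-central1-a \
  --network=my-vpc \
  --project=PROJECT_ID

# 2. Create a URL filtering security profile and wrap it in a
#    security profile group (subcommand and flags are illustrative)
gcloud network-security security-profiles url-filtering create my-urlf-profile \
  --organization=ORGANIZATION_ID --location=global
gcloud network-security security-profile-groups create my-spg \
  --organization=ORGANIZATION_ID --location=global \
  --url-filtering-profile=my-urlf-profile

# 3. Reference the group from a global network firewall policy rule
#    using the apply_security_profile_group action
gcloud compute network-firewall-policies rules create 1000 \
  --firewall-policy=my-policy --global-firewall-policy \
  --direction=EGRESS --layer4-configs=tcp:443 \
  --dest-ip-ranges=0.0.0.0/0 \
  --action=apply_security_profile_group \
  --security-profile-group="//networksecurity.googleapis.com/organizations/ORGANIZATION_ID/locations/global/securityProfileGroups/my-spg"
```

These commands require organization-level permissions and a billing project, as noted in step 1 above.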
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more information about configuring a firewall policy rule, see the following:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/using-firewall-policies#create-ingress-rule-target-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create an ingress hierarchical firewall policy rule&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/using-firewall-policies#create-egress-rule-target-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create an egress hierarchical firewall policy rule&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/use-network-firewall-policies#create-ingress-rule-target-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create an ingress global network firewall policy rule&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/use-network-firewall-policies#create-egress-rule-target-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create an egress global network firewall policy rule&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Getting started&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Get started with Cloud NGFW URL filtering by visiting our &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/about-url-filtering"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/cloud-ngfw-enterprise-urlf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;codelab&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 07 Apr 2026 17:30:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/identity-security/see-beyond-the-ip-and-secure-urls-with-google-cloud-ngfw/</guid><category>Networking</category><category>Developers &amp; Practitioners</category><category>Security &amp; Identity</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>See beyond the IP and secure URLs with Google Cloud NGFW</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/identity-security/see-beyond-the-ip-and-secure-urls-with-google-cloud-ngfw/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Uttam Ramesh</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Susan Wu</name><title>Outbound Product Manager</title><department></department><company></company></author></item><item><title>Envoy: A future-ready foundation for agentic AI networking</title><link>https://cloud.google.com/blog/products/networking/the-case-for-envoy-networking-in-the-agentic-ai-era/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In today's agentic AI environments, the network has a new 
set of responsibilities.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a traditional application stack, the network mainly moves requests between services. But as discussed in a recent white paper,&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://services.google.com/fh/files/misc/cloud_infrastructure_in_the_agent_native_era.pdf" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Cloud Infrastructure in the Agent-Native Era&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; in an agentic system the network sits in the middle of model calls, tool invocations, agent-to-agent interactions, and policy decisions that can shape what an agent is allowed to do. The rapid proliferation of agents, often built on diverse frameworks, necessitates a consistent enforcement of governance and security across all agentic paths at scale. To achieve this, the enforcement layer must shift from the application level to the underlying infrastructure. That means the network can no longer operate as a blind transport layer. It has to understand more, enforce better, and adapt faster. This shift is precisely where Envoy comes in.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a high-performance distributed proxy and universal data plane, Envoy is built for massive scale. Trusted by demanding enterprise environments, including Google Cloud, it supports everything from single-service deployments to complex service meshes using Ingress, Egress, and Sidecar patterns. Because of its deep extensibility, robust policy integration, and operational maturity, Envoy is uniquely suited for an era where protocols change quickly and the cost of weak control is steep. For teams building agentic AI, Envoy is more than a concept: it's a practical, production-ready foundation.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_xPxMxF4.max-1000x1000.jpg"
        
          alt="1"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic AI changes the networking problem&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic workloads still often use HTTP as a transport, but they break some of the assumptions that traditional HTTP intermediaries rely on. Protocols such as&lt;/span&gt;&lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Model Context Protocol&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (MCP) and&lt;/span&gt;&lt;a href="https://github.com/google/A2A" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent2Agent&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (A2A) use&lt;/span&gt;&lt;a href="https://www.jsonrpc.org/specification" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;JSON-RPC&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or&lt;/span&gt;&lt;a href="https://grpc.io" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;gRPC&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; over HTTP, adding protocol-level phases such as MCP initialization, where client and server exchange their capabilities, on top of standard HTTP request/response semantics. The key aspects of agentic systems that require intermediaries to adapt include:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Diverse enterprise governance imperatives. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The primary challenge is satisfying the wide spectrum of non-negotiable enterprise requirements for safety, security, data privacy, and regulatory compliance. These needs often go beyond standard network policies and require deep integration with internal systems, custom logic, and the ability to rapidly adapt to new organizational rules or external regulations. This demands a highly extensible framework where enterprises can plug in their specific governance models.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Policy attributes live inside message bodies, not headers.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Unlike traditional web traffic where policy inputs like paths and headers are readily accessible, agentic protocols frequently bury critical attributes (e.g., model names, tool calls, resource IDs) deep within JSON-RPC or gRPC payloads. This shift requires intermediaries to possess the ability to parse and understand message contents to apply context-aware policies.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Handling diverse and evolving protocol characteristics. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic protocols are not uniform. Some, like MCP with Streamable HTTP, can introduce stateful interactions requiring session management across distributed proxies (e.g., using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;Mcp-Session-Id&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;). The need to support such varied behaviors, along with future protocol innovations, reinforces the necessity of an inherently adaptable and extensible networking foundation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These factors mean enterprises need more than just connectivity. The network must now serve as a central point for enforcing the crucial governance needs mentioned earlier. This includes providing capabilities like centralized security, comprehensive auditability, fine-grained policy enforcement, and dynamic guardrails, all while keeping pace with the rapid evolution of protocols and agent behaviors. Put simply, agentic AI transforms the network from a mere transit path into a critical control point.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Why Envoy fits this shift&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy is a strong fit for agentic AI networking for three reasons. Envoy is:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Battle-tested.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Enterprises already rely on Envoy in high-scale, security-sensitive environments, making it a credible platform to anchor a new generation of traffic management and policy enforcement.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Extensible.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Envoy can be extended through native filters, Rust modules, WebAssembly (Wasm) modules, and &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;external processing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; patterns. That gives platform teams room to adopt new protocols without having to rebuild their networking layer every time the ecosystem changes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Operationally useful today.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Envoy already acts as a gateway, enforcement point, observability layer, and integration surface for control planes. That makes it a practical choice for organizations that need to move now, not after the standards settle.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on these core strengths, Envoy has introduced specific architectural advancements to meet the unique demands of agentic networking:&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;1. Envoy understands agent traffic&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The first requirement for agentic networking is simple: The gateway needs to understand what the agent is actually trying to do.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That’s harder than it sounds. In protocols such as MCP, A2A, and OpenAI-style APIs, important policy signals may live inside the request body. Traditional HTTP proxies are optimized to treat bodies as opaque byte streams. That design is efficient, but it limits what the proxy can enforce. For protocols that use JSON messages, a proxy may need to buffer the entire request body to locate attribute values needed for policy application — especially when those attributes appear at the end of the JSON message. Business logic specific to gen AI protocols, such as rate limiting based on consumed tokens, may also require parsing server responses.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy addresses this by deframing protocol messages carried over HTTP and exposing useful attributes to the rest of the filter chain. The extensibility model for gen AI protocols was guided by two goals:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Easy reuse of existing HTTP extensions that work with gen AI protocols out of the box, such as RBAC or tracers.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Easy access to deframed messages for gen-AI-specific extensions, so that developers can focus on gen AI business logic without needing to deal with HTTP or JSON envelopes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Based on these goals, new extensions for gen AI protocols are still built as HTTP extensions and configured in the HTTP filter chain. This provides flexibility to mix HTTP-native business logic, such as OAuth or mTLS authorization, with gen AI protocol logic in a single chain. A deframing extension parses the protocol messages carried by HTTP and provides an ambient context with extracted attributes, or even the entirety of parsed messages, to downstream extensions via well-known filter state and metadata values.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of forcing every policy component to parse JSON envelopes or protocol-specific message formats on its own, Envoy makes those attributes available as structured metadata. Once the gateway has deframed protocol messages, existing Envoy extensions such as &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_authz_filter" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ext_authz&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or RBAC can read protocol properties to evaluate policies using protocol-specific attributes such as tool names for MCP, message attributes for A2A, or model names for OpenAI.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Access logs can include message attributes for enhanced monitoring and auditing. The protocol attributes are also available to the &lt;/span&gt;&lt;a href="https://cel.dev/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Common Expression Language&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (CEL) runtime, simplifying creation of complex policy expressions in RBAC or composite extensions.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_t4lf1kG.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Buffering and memory management&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy is designed to use as little memory as possible when proxying HTTP requests. However, parsing agentic protocols may require an arbitrary amount of buffer space, especially when extensions require the entire message to be in memory. The flexibility of allowing extensions to use larger buffers needs to be balanced with adequate protection from memory exhaustion, especially in the presence of untrusted traffic.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To achieve this, Envoy now provides a per-request buffer size limit. Buffers that hold request data are also integrated with the overload manager, enabling a full range of protective actions under memory pressure, such as reducing idle timeouts or resetting requests that consume the most memory for an extended duration. These changes pave the way for Envoy to serve as a gateway and policy-enforcement point for gen AI protocols without compromising its resource efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;2. Envoy enforces policy on things that matter&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Understanding traffic is only useful if the gateway can act on it.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In agentic systems, policy is not just about which service an agent can reach. It’s about which tools an agent can call, which models it can use, what identity it presents, how much it can consume, and what kinds of outputs require additional controls. Those are higher-value decisions than simple layer-4 or path-based controls, and they are exactly the kinds of controls enterprises care about when agents are allowed to take action on their behalf.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy is well-positioned here because it can combine transport-level security with application-aware policy enforcement. Teams can authenticate workloads with mTLS and SPIFFE identities, then enforce protocol-specific rules with RBAC, external authorization, external processing, access logging, and CEL-based policy expressions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This capability is crucial because it lets platform teams decouple agent development from enforcement. Developers can focus on building useful agents, while operators enforce a consistent zero-trust posture at the network layer, even as tools, models, and protocols continue to change.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;A prime example of this zero-trust decoupling is the critical "user-behind-agent" scenario, where an AI agent must execute tasks on a human user's behalf. Traditionally, handing user credentials directly to an application introduces severe security risks — if the agent is compromised or manipulated via prompt injection, an attacker could exfiltrate or misuse those credentials. By offloading identity management to Envoy, the proxy can automatically insert user delegation tokens into outbound requests at the infrastructure layer. Because the agent never directly holds the sensitive credential, the risk of a compromised agent misusing or leaking the token is completely neutralized, ensuring actions remain strictly bound to the user's actual permissions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Case study: Restricting an agent to specific GitHub MCP tools&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Consider an agent that triages GitHub issues.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The GitHub MCP server may expose dozens of tools, but the agent may only need a small read-only subset, such as &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;list_issues&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;get_issue&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;get_issue_comments&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. In most enterprises, that difference matters. A useful agent should not automatically become an unrestricted one.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With Envoy in front of the MCP server, the gateway can verify the agent identity using SPIFFE during the mTLS handshake, parse the MCP message via &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/mcp/v3/mcp.proto#envoy-v3-api-msg-extensions-filters-http-mcp-v3-mcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;the deframing filter&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, extract the requested method and tool name, and enforce a policy that allows only the approved tool calls for that specific agent identity. RBAC uses metadata created by the MCP deframing filter to check the method and tool name in the MCP message:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;envoy.filters.http.rbac:\r\n  &amp;quot;@type&amp;quot;: type.googleapis.com/envoy.extensions.filters.http.rbac.v3.RBACPerRoute\r\n  rbac:\r\n    rules:\r\n      policies:\r\n        github-issue-reader-policy:\r\n          permissions:\r\n            - and_rules:\r\n                rules:\r\n                  - sourced_metadata:\r\n                      metadata_matcher:\r\n                        filter: envoy.http.filters.mcp\r\n                        path: [{ key: &amp;quot;method&amp;quot; }]\r\n                        value: { string_match: { exact: &amp;quot;tools/call&amp;quot; } }\r\n                  - sourced_metadata:\r\n                      metadata_matcher:\r\n                        filter: envoy.http.filters.mcp\r\n                        path: [{ key: &amp;quot;params&amp;quot; }, { key: &amp;quot;name&amp;quot; }]\r\n                        value:\r\n                          or_match:\r\n                            value_matchers:\r\n                              - string_match: { exact: &amp;quot;list_issues&amp;quot; }\r\n                              - string_match: { exact: &amp;quot;get_issue&amp;quot; }\r\n                              - string_match: { exact: &amp;quot;get_issue_comments&amp;quot; }\r\n          principals:\r\n            - authenticated:\r\n                principal_name:\r\n                  exact: &amp;quot;spiffe://cluster.local/ns/github-agents/sa/issue-triage-agent&amp;quot;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3600e814c0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That’s the real value: Policy is enforced centrally, close to the traffic, and in terms that match the agent's actual behavior.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_jtbLCMn.max-1000x1000.png"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Beyond static rules: External authorization&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;A complex compliance policy that can’t be expressed using RBAC rules can be implemented in an external authorization service using the &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_authz_filter" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ext_authz&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; protocol. Envoy provides MCP message attributes along with HTTP headers in the context of the ext_authz RPC. It can also forward the agent's SPIFFE identity from the peer certificate:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;http_filters:\r\n  - name: envoy.filters.http.ext_authz\r\n    typed_config:\r\n      &amp;quot;@type&amp;quot;: type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz\r\n      grpc_service:\r\n        envoy_grpc:\r\n          cluster_name: auth_service_cluster\r\n      include_peer_certificate: true\r\n      metadata_context_namespaces:\r\n        - envoy.http.filters.mcp&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3600675cd0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This allows external services to make authorization decisions based on the full combination of agent identity, MCP method, tool name, and any other protocol attributes, without the agent or the MCP server needing to be aware of the policy layer.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Protocol-native error responses&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;When Envoy denies a request, the error should be meaningful to the calling agent. For MCP traffic, Envoy can use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;local_reply_config&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to map HTTP error codes to appropriate JSON-RPC error responses. For example, a 403 Forbidden can be mapped to a JSON-RPC response with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;isError: true&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and a human-readable message, ensuring the agent receives a protocol-appropriate denial rather than an opaque HTTP status code.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;3. Envoy supports stateful agent interactions at scale&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Not all agent traffic is stateless. Some protocols, including Streamable HTTP for MCP, can rely on session-oriented behavior. That creates a new challenge for intermediaries, especially when traffic flows through multiple gateway instances to achieve scale and resilience. An MCP session effectively binds the agent to the server that established it, and all intermediaries need to know this to direct incoming MCP connections to the correct server.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If a session is established on one backend, later requests in that conversation need to reach the right destination. That sounds straightforward for a single-proxy deployment, but it becomes more complicated in horizontally scaled systems, where multiple Envoy instances may handle different requests from the same agent.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Passthrough gateway&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;In the simpler passthrough mode, Envoy establishes one upstream connection for each downstream connection. Its primary use is enforcing centralized policies, such as client authorization, RBAC, rate limiting, and authentication, for external MCP servers. The session state transferred between intermediaries needs to include only the address of the server that established the session over the initial HTTP connection, so that all session-related requests are directed to that server.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Session state transfer between different Envoy instances is achieved by appending encoded session state to the MCP session ID provided by the MCP server. Envoy removes the session-state suffix from the session ID before forwarding the request to the destination MCP server. This session stickiness is enabled by configuring Envoy's &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/http/stateful_session/envelope/v3/envelope.proto" rel="noopener" target="_blank"&gt;&lt;code style="text-decoration: underline; vertical-align: baseline;"&gt;envoy.http.stateful_session.envelope&lt;/code&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; extension.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_j0wGyAp.max-1000x1000.png"
        
          alt="4"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Aggregating gateway&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;In aggregating mode, Envoy acts as a single MCP server by aggregating the capabilities, tools, and resources of multiple backend MCP servers. In addition to enforcing policies, this simplifies agent configuration and unifies policy application for multiple MCP servers.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Session management in this mode is more complicated because the session state also needs to include mapping from tools and resources to the server addresses and session IDs that advertised them. The session ID that Envoy provides to the agent is created before tools or resources are known, and the mapping has to be established later, after the MCP initialization phases between Envoy and the backend MCP servers are complete.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One approach, currently implemented in Envoy, is to combine the name of a tool or resource with the identifier and session ID of its origin server. The exact tool or resource names are typically not meaningful to the agent and can carry this additional provenance information. If unmodified tool or resource names are desirable, another approach is to use an Envoy instance that does not have the mapping, and then recreate it by issuing a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;tools/list&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; command before calling a specific tool. This trades latency for the complexity of deploying an external global store of MCP sessions, and is currently in planning based on user feedback.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_61xwM79.max-1000x1000.png"
        
          alt="5"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This matters because it moves Envoy beyond simple traffic forwarding. It allows Envoy to serve as a reliable intermediary for real agent workflows, including those spanning multiple requests, tools, and backends.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;4. Envoy supports agent discovery&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy is adding support for the A2A protocol and agent discovery via a well-known AgentCard endpoint. AgentCard, a JSON document with agent capabilities, enables discovery and multi-agent coordination by advertising skills, authentication requirements, and service endpoints. The AgentCard can be provisioned statically via direct response configuration or obtained from a centralized agent registry server via xDS or ext_proc APIs. A more detailed description of A2A implementation and agent discovery will be published in a forthcoming blog post.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;5. Envoy is a complete solution for agentic networking challenges&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on the same foundation that enabled policy application for MCP protocol in demanding deployments, Envoy is adding support for OpenAI and transcoding of agentic protocols into RESTful HTTP APIs. This transcoding capability simplifies the integration of gen AI agents with existing RESTful applications, with out-of-the-box support for OpenAPI-based applications and custom options via dynamic modules or Wasm extensions. In addition to transcoding, Envoy is being strengthened in critical areas for production readiness, such as advanced policy applications like quota management, comprehensive telemetry adhering to&lt;/span&gt;&lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OpenTelemetry semantic conventions for generative AI systems&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and integrated guardrails for secure agent operation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Guardrails for safe agents&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The next significant area of investment is centralized management and application of guardrails for all agentic traffic. Integrating policy enforcement points with external guardrails presently requires bespoke implementation and this problem area is ripe for standardization.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Control planes make this operational&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The gateway is only part of the story. To achieve this policy management and rollout at scale, a separate control plane is required to dynamically configure the data plane using the xDS protocol, also known as the universal data plane API.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That is where control planes become important. Cloud Service Mesh, alongside open-source projects such as &lt;/span&gt;&lt;a href="https://aigateway.envoyproxy.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Envoy AI Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://github.com/kubernetes-sigs/kube-agentic-networking" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;kube-agentic-networking&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, uses Envoy as the data plane while giving operators higher-level ways to define and manage policy for agentic workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This combination is powerful: Envoy provides the enforcement and extensibility in the traffic path, while control planes provide the operating model teams need to deploy that capability consistently.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Why this matters now&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The shift towards agentic systems and gen AI protocols such as MCP, A2A, and OpenAI necessitates an evolution in network intermediaries. The primary complexities Envoy addresses include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Deep protocol inspection.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Protocol deframing extensions extract policy-relevant attributes (tool names, model names, resource paths) from the body of HTTP requests, enabling precise policy enforcement where traditional proxies would only see an opaque byte stream.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Fine-grained policy enforcement.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By exposing these internal attributes, existing Envoy extensions like RBAC and ext_authz can evaluate policies based on protocol-specific criteria. This allows network operators to enforce a unified, zero-trust security posture, ensuring agents comply with access policies for specific tools or resources.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Stateful transport management.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Envoy supports managing session state for the Streamable HTTP transport used by MCP, enabling robust deployments in both passthrough and aggregating gateway modes, even across a fleet of intermediaries.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
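The first two bullets can be made concrete with a small, self-contained Python sketch. To be clear, this is not Envoy code or any real extension API: it only mimics what a protocol-deframing step plus an RBAC-style check do conceptually, using an invented allowlist and invented principal names.

```python
import json

# Invented allowlist policy, keyed by principal -> permitted tool names.
# Real policies would live in Envoy's RBAC / ext_authz configuration.
ALLOWED_TOOLS = {
    "billing-agent": {"get_invoice", "list_invoices"},
    "support-agent": {"search_kb"},
}

def extract_tool_name(body):
    """Deframe an MCP-style JSON-RPC request and pull out the tool name.

    This is the policy-relevant attribute a protocol-aware proxy can read
    from the request body, where a byte-oriented proxy sees only an opaque
    stream.
    """
    msg = json.loads(body)
    if msg.get("method") == "tools/call":
        return msg.get("params", {}).get("name")
    return None

def authorize(principal, body):
    """RBAC-style decision over the extracted attribute."""
    tool = extract_tool_name(body)
    if tool is None:
        return False  # not a tool call; this sketch denies by default
    return tool in ALLOWED_TOOLS.get(principal, set())

request = json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "get_invoice", "arguments": {"id": "INV-42"}},
}).encode()

print(authorize("billing-agent", request))  # True
print(authorize("support-agent", request))  # False
```

The same pattern generalizes to the other attributes mentioned above, such as model names or resource paths, once a deframing extension has surfaced them.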
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic AI protocols are still in their early stages, and the protocol landscape will continue to evolve. That’s exactly why the networking layer needs to be adaptable. Enterprises should not have to rebuild their security and traffic infrastructure every time a new agent framework, transport pattern, or tool protocol gains traction. They need a foundation that can absorb change without sacrificing control.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy brings together three qualities that are hard to get in one place: proven production maturity, deep extensibility, and growing protocol awareness for agentic workloads. By leveraging Envoy as an agent gateway, organizations can decouple security and policy enforcement from agent development code.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That makes Envoy more than just a proxy that happens to handle AI traffic. It makes Envoy a future-ready foundation for agentic AI networking.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Special thanks to the additional co-authors of this blog: Boteng Yao, Software Engineer, Google; Tianyu Xia, Software Engineer, Google; and Sisira Narayana, Sr. Product Manager, Google.&lt;/span&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 03 Apr 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/networking/the-case-for-envoy-networking-in-the-agentic-ai-era/</guid><category>Containers &amp; Kubernetes</category><category>AI &amp; Machine Learning</category><category>GKE</category><category>Developers &amp; Practitioners</category><category>Networking</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Envoy: A future-ready foundation for agentic AI networking</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/networking/the-case-for-envoy-networking-in-the-agentic-ai-era/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yan Avlasov</name><title>Staff Software Engineer, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Erica Hughberg</name><title>Product and Product Marketing Manager, Tetrate</title><department></department><company></company></author></item><item><title>Activating Your Data Layer for Production-Ready AI</title><link>https://cloud.google.com/blog/topics/developers-practitioners/activating-your-data-layer-for-production-ready-ai/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When discussing applications and systems using generative AI and the new opportunities they present, one component of the ecosystem is irreplaceable: data. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Specifically, the data that companies gather, hold, and use daily. 
This data serves as the backbone for applications, analytics, knowledge bases, and much more. We use databases to store and work with this data, and most, if not all, AI-driven initiatives and new applications are going to use that data layer.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But how can we start to use the data in our AI systems? Let me introduce you to some of the labs showing how to prepare and use the data with AI models in Google databases.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Semantic Search: Text Embeddings in Database&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our journey starts by preparing our data for semantic search and running first tests to augment the Gen AI model's response by grounding it with your semantic search results. The grounding data is the basis for RAG (Retrieval Augmented Generation). Then, you can improve the performance of your search by indexing your embeddings using the latest indexing techniques.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One of the options is the &lt;/span&gt;&lt;a href="https://cloud.google.com/products/alloydb?e=48754805"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google AlloyDB database&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which has direct integration with AI models and supports the most demanding workloads. The following lab guides us through all the steps, starting from creating an AlloyDB cluster, loading sample data, and generating embeddings, to using those embeddings to generate an augmented response from the Gen AI model.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the lab!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36023f6550&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI integration is not limited to AlloyDB. All Google Cloud databases have AI integration and are capable of generating and using embeddings for semantic search. For example, if you are using &lt;/span&gt;&lt;a href="https://cloud.google.com/sql"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud SQL&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, you can also generate and use embeddings for semantic search directly within your existing &lt;/span&gt;&lt;a href="https://cloud.google.com/sql/postgresql"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;PostgreSQL&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or &lt;/span&gt;&lt;a href="https://cloud.google.com/sql/mysql"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MySQL&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; instances.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The next two labs are very similar to the previous one, but instead of Google AlloyDB for PostgreSQL, we are using Cloud SQL for PostgreSQL and Cloud SQL for MySQL to use semantic search as the grounding engine for the model's response. Some steps are of course different due to variations in SQL language and different database engines, but the main idea stays the same: use our data to ground the model response and improve output.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the labs!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36023f6d90&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Semantic search using text data is one of the cornerstones and important features making responses much more reliable and useful, but Google Gen AI models can offer much more. Let's talk about multimodal search.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Multimodal Embeddings: Bring Images to the Search&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In real life, of course, we use all our senses, including vision, to evaluate the world around us. The Google multimodal embedding models bring an additional layer of understanding, improving search by using embeddings not only for text but also for images.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the following lab, we use a catalog of products placed in AlloyDB and supplemented by images in Google Cloud Storage. In the lab, we show how we can use both text descriptions and images for semantic search, supplementing and replacing each other, naturally incorporating search based on image input into our response.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the lab!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36023f6790&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Preparing the data and making first steps are important for a general understanding of RAG and tools available for the search, but Google has other cases when direct AI integration can help with your data analysis without any data preparations.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;AlloyDB AI Functions and Reranking&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google AlloyDB database comes with additional AI integrations that help you use some AI capabilities without data preparation. For example, the AI.IF function can perform semantic search on the fly, evaluating sentiment or comparing data in columns with a natural language query, returning results filtered by the query condition. Also, you can apply a ranking function to the search output, improving the final result. You can try some of the new functionality using the following lab and let us know if it can help in your use case.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the lab!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec6fa4c0&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But what if somebody is not particularly savvy with SQL or not familiar with the data structure in your database? The AlloyDB NL2SQL can help you with that.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Generate SQL using AlloyDB AI Natural Language&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The "alloydb_ai_nl" AlloyDB extension allows you not only to generate SQL queries based on default metadata available out-of-the-box but to build either automatic or custom context, helping to make the best of the query generation. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The NL2SQL functions can add a layer describing your data structure, relations between tables, and metadata based on real data samples from your tables without compromising the data itself, providing necessary information helping the AI model to understand how to build the best query. The following lab helps you to start with the new features and generate your first queries based on your data schema.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the lab!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec6fa760&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;From Tests to Production&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Those labs are part of the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;From Data Foundations to Advanced RAG&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; module of  our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/production-ready-ai-with-google-cloud-learning-path"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Production-Ready AI with Google Cloud&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; program. Check the other modules and see if they can help you to adopt the AI capabilities provided by our Google Cloud services and tools. The end game goal is a high quality application using the full potential of modern technologies.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;And stay tuned on release notes for ALloyDB and Cloud SQL - the engineering team is busy working on new features and improvements. Happy testing.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 02 Apr 2026 13:18:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/activating-your-data-layer-for-production-ready-ai/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/hero_new.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Activating Your Data Layer for Production-Ready AI</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/hero_new.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/activating-your-data-layer-for-production-ready-ai/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Gleb Otochkin</name><title>Cloud Advocate, Databases</title><department></department><company></company></author></item><item><title>Create Expert Content: Architect A Personalized Multi-Agent System with Long-Term Memory</title><link>https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In support of our mission to accelerate the developer journey on Google Cloud, we built &lt;strong&gt;Dev Signal&lt;/strong&gt;—a multi-agent system designed to transform raw community signals into reliable technical guidance by automating the path from discovery to expert creation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;first part&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; of this series for the &lt;strong&gt;Dev Signal&lt;/strong&gt;, we laid the essential groundwork for this system by establishing a project environment and equipping core capabilities through the Model Context Protocol (MCP). We standardized our external integrations, connecting to Reddit for trend discovery, Google Cloud Docs for technical grounding, and building a custom Nano Banana Pro MCP server for multimodal image generation. If you missed &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Part 1&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or want to explore the code directly, you can find the complete project implementation in our &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GitHub repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, in Part 2, we focus on building the multi-agent architecture and integrating the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to personalize these capabilities. We will implement a Root Orchestrator that manages three specialist agents: the Reddit Scanner, GCP Expert, and Blog Drafter, to provide a seamless flow from trend discovery to expert content creation. We will also integrate a long-term memory layer that enables the agent to learn from your feedback and persist your stylistic preferences across different conversations. This ensures that Dev Signal doesn't just process data, but actually learns to match your professional voice over time.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Infrastructure and Model Setup&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;First, we initialize the environment and the shared Gemini model.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt; &lt;/code&gt;&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;from google.adk.agents import Agent\r\nfrom google.adk.apps import App\r\nfrom google.adk.models import Gemini\r\nfrom google.adk.tools import google_search, AgentTool, load_memory_tool, preload_memory_tool\r\nfrom google.adk.tools.tool_context import ToolContext\r\nfrom google.genai import types\r\nfrom dev_signal_agent.app_utils.env import init_environment\r\nfrom dev_signal_agent.tools.mcp_config import (\r\n    get_reddit_mcp_toolset, \r\n    get_dk_mcp_toolset, \r\n    get_nano_banana_mcp_toolset\r\n)\r\n\r\nPROJECT_ID, MODEL_LOC, SERVICE_LOC, SECRETS = init_environment()\r\n\r\n\r\nshared_model = Gemini(\r\n    model=&amp;quot;gemini-3-flash-preview&amp;quot;, \r\n    vertexai=True, \r\n    project=PROJECT_ID, \r\n    location=MODEL_LOC,\r\n    retry_options=types.HttpRetryOptions(attempts=3),\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffc7f70&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Memory Ingestion Logic&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We want Dev Signal&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to do more than just follow instructions - we want it to learn from you. By capturing your preferences, such as specific technical interests on Reddit or a preferred blogging style, the agent can personalize its output for future use. To achieve this, we use the &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to persist session history across different conversations.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Long-term Memory&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We automate this through the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;save_session_to_memory_callback&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function. This callback is configured to run automatically after every turn, ensuring that session details are captured and stored in the memory bank without manual intervention.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;How Managed Memory Works:&lt;/span&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Ingestion&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;save_session_to_memory_callback&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; sends the conversation data to Vertex AI.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Embedding&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Vertex AI converts the text into numerical vectors (embeddings) that capture the semantic meaning of your preferences.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Storage&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: These vectors are stored in a managed index, enabling the agent to perform semantic searches and retrieve relevant history in future sessions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Retrieval&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The agent recalls this history using built-in ADK tools. The PreloadMemoryTool proactively brings in context at the start of an interaction, while the LoadMemoryTool allows the agent to fetch specific memories on an as-needed basis.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;async def save_session_to_memory_callback(*args, **kwargs) -&amp;gt; None:\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    Defensive callback to persist session history to the Vertex AI memory bank.\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    ctx = kwargs.get(&amp;quot;callback_context&amp;quot;) or (args[0] if args else None)\r\n    \r\n    # Check connection to Memory Service\r\n    if ctx and hasattr(ctx, &amp;quot;_invocation_context&amp;quot;) and ctx._invocation_context.memory_service:\r\n        # Save the session!\r\n        await ctx._invocation_context.memory_service.add_session_to_memory(\r\n            ctx._invocation_context.session\r\n        )&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffc7d00&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
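The ingestion, embedding, storage, and retrieval cycle listed above can be mimicked with a tiny self-contained stand-in. Real embeddings come from Vertex AI and the real store is the managed memory bank; the bag-of-words vectors and class below exist only to keep the sketch runnable.

```python
import math

# Toy stand-in for the memory bank's ingest -> embed -> store -> retrieve
# cycle. All names and the "embedding" scheme are invented for illustration.
def embed(text):
    """Bag-of-words vector: a crude, local substitute for real embeddings."""
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToyMemoryBank:
    def __init__(self):
        self.entries = []  # list of (vector, original text)

    def ingest(self, text):
        # Ingestion + embedding + storage in one step.
        self.entries.append((embed(text), text))

    def retrieve(self, query, k=1):
        # Semantic retrieval: rank stored memories by similarity to the query.
        qv = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(qv, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

bank = ToyMemoryBank()
bank.ingest("user prefers concise blog posts with code samples")
bank.ingest("user is interested in Cloud Run pricing")
print(bank.retrieve("what blog style does the user prefer?"))
```

Retrieval by meaning rather than exact keywords is what lets the agent surface the right preference in a later, differently worded conversation.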
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Short-term Memory&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;add_info_to_state&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function serves as the agent's short-term working memory, allowing the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gcp_expert&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to reliably hand off its detailed findings to the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;blog_drafter&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; within the same session. This working memory and the conversation transcript are managed by the Vertex AI Session Service to ensure that active context survives server restarts or transient failures.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;The boundary between session-based state and long-term persistence: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;while this service provides stability during an active interaction, this short-term memory does not persist between sessions. Starting a fresh session ID effectively resets this working state, ensuring a clean slate for new tasks. Cross-session continuity, where the agent remembers your stylistic preferences or past feedback, is handled by the Vertex AI Memory Bank.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;def add_info_to_state(tool_context: ToolContext, key: str, data: str) -&amp;gt; dict:\r\n    tool_context.state[key] = data\r\n    return {&amp;quot;status&amp;quot;: &amp;quot;success&amp;quot;, &amp;quot;message&amp;quot;: f&amp;quot;Saved \&amp;#x27;{key}\&amp;#x27; to state.&amp;quot;}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffc7df0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
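The session-versus-memory boundary can be made tangible with a toy model. The plain dictionaries below are not the real Vertex AI Session Service or Memory Bank; they only mimic the two lifetimes: per-session working state disappears with a new session ID, while the long-term store survives.

```python
# Toy stand-in for the two lifetimes described above: `sessions` mimics
# per-session working state, `memory_bank` mimics the cross-session store.
# Names and behavior are illustrative only.
class ToyAgentStores:
    def __init__(self):
        self.sessions = {}      # short-term working state, keyed by session ID
        self.memory_bank = []   # long-term, cross-session store

    def state(self, session_id):
        return self.sessions.setdefault(session_id, {})

    def end_session(self, session_id):
        # Persist a distilled preference, then drop the working state.
        state = self.sessions.pop(session_id, {})
        if "preference" in state:
            self.memory_bank.append(state["preference"])

stores = ToyAgentStores()
stores.state("s1")["preference"] = "casual blog tone"
stores.state("s1")["draft"] = "working draft text"
stores.end_session("s1")

print(stores.state("s2"))   # {} -> a new session ID starts with a clean slate
print(stores.memory_bank)   # ['casual blog tone'] -> the preference persisted
```

Note that the intermediate draft vanished with the session while the distilled preference survived, which is exactly the division of labor between session state and the memory bank.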
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Specialist 1: Reddit Scanner (Discovery)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Reddit scanner is our “Trend Spotter," it identifies high-engagement questions from the last 21 days (3 weeks) to ensure that all research findings remain both timely and relevant.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Memory Usage:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; It leverages &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;load_memory&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to retrieve your past areas of interest and preferred topics from the Vertex AI memory bank. If relevant history exists, the agent prioritizes those specific topics in its search to provide a personalized discovery experience.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond simple retrieval, each sub-agent actively updates its memories by listening for new preferences and explicitly acknowledging them during the chat. This process captures relevant information in the session history, where an automated callback then persists it to the long-term Vertex AI memory bank for future use.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This memory management is supported by two distinct retrieval patterns within the Google Agent Development Kit (ADK). The first is the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;PreloadMemoryTool&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, which proactively brings in historical context at the beginning of every interaction to ensure the agent is fully briefed before addressing the current request. The second is the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;LoadMemoryTool&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, which the agent uses on an as-needed basis, calling upon it only when it decides that deeper past knowledge would be beneficial for the current step in the workflow.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Singleton toolsets\r\nreddit_mcp = get_reddit_mcp_toolset(\r\n    client_id=SECRETS.get(&amp;quot;REDDIT_CLIENT_ID&amp;quot;, &amp;quot;&amp;quot;),\r\n    client_secret=SECRETS.get(&amp;quot;REDDIT_CLIENT_SECRET&amp;quot;, &amp;quot;&amp;quot;),\r\n    user_agent=SECRETS.get(&amp;quot;REDDIT_USER_AGENT&amp;quot;, &amp;quot;&amp;quot;)\r\n)\r\nreddit_scanner = Agent(\r\n    name=&amp;quot;reddit_scanner&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a Reddit research specialist. Your goal is to identify high-engagement questions \r\n    from the last 3 weeks on specific topics of interest, such as AI/agents on Cloud Run.\r\n    \r\n    Follow these steps:\r\n    1. **MEMORY CHECK**: Use `load_memory` to retrieve the user\&amp;#x27;s **past areas of interest** and **preferred topics**. Calibrate your search to align with these interests.\r\n    2. Use the Reddit MCP tools to search for relevant subreddits and posts.\r\n    3. Filter results for posts created within the last 21 days (3 weeks).\r\n    4. Analyze &amp;quot;high-engagement&amp;quot; based on upvote counts and the number of comments.\r\n    5. Recommend the most important and relevant questions for a technical audience.\r\n    6. **CRITICAL**: For each recommended question, provide a direct link to the original thread and a concise summary of the discussion.\r\n    7. **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. 
Explicitly acknowledge them to ensure they are captured in the session history for future personalization.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[reddit_mcp, load_memory_tool.LoadMemoryTool()],\r\n    after_agent_callback=save_session_to_memory_callback,\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffc7b80&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Specialist 2: GCP Expert (Grounding)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The GCP expert is our "The Technical Authority". It triangulates facts by synthesizing official documentation from the Google Cloud Developer Knowledge MCP Server, community sentiment from Reddit, and broader context from Google Search.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;dk_mcp = get_dk_mcp_toolset(api_key=SECRETS.get(&amp;quot;DK_API_KEY&amp;quot;, &amp;quot;&amp;quot;))\r\n\r\n\r\nsearch_agent = Agent(\r\n    name=&amp;quot;search_agent&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;Execute Google Searches and return raw, structured results (Title, Link, Snippet).&amp;quot;,\r\n    tools=[google_search],\r\n)\r\ngcp_expert = Agent(\r\n    name=&amp;quot;gcp_expert&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a Google Cloud Platform (GCP) documentation expert. \r\n    Your goal is to provide accurate, detailed, and cited answers to technical questions by synthesizing official documentation with community insights.\r\n    \r\n    For EVERY technical question, you MUST perform a comprehensive research sweep using ALL available tools:\r\n    \r\n    1. **Official Docs (Grounding)**: Use DeveloperKnowledge MCP (`search_documents`) to find the definitive technical facts.\r\n    2. **Social Media Research (Reddit)**: Use the Reddit MCP to research the question on social media. This allows you to find real-world user discussions, common pain points, or alternative solutions that might not be in official documentation.\r\n    3. 
**Broader Context (Web/Social)**: Use the `search_agent` tool to find recent technical blogs, social media discussions, or tutorials.\r\n    \r\n    Synthesize your answer:\r\n    - Start with the official answer based on GCP docs.\r\n    - Add &amp;quot;Social Media Insights&amp;quot; or &amp;quot;Common Issues&amp;quot; sections derived from Reddit and Web Search findings.\r\n    - **CRITICAL**: After providing your answer, you MUST use the `add_info_to_state` tool to save your full technical response under the key: `technical_research_findings`.\r\n    - Cite your sources specifically at the end of your response, providing **direct links** (URLs) to the official documentation, blog posts, and Reddit threads used.\r\n    - **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[dk_mcp, AgentTool(search_agent), reddit_mcp, add_info_to_state],\r\n    after_agent_callback=save_session_to_memory_callback,\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffc7c10&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt; Specialist 3: Blog Drafter (Creativity)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The blog drafter is our Content Creator. It drafts the blog based on the expert's findings and offers to generate visuals.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Memory Usage&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It checks &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;load_memory&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for the user's &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;preferred writing style&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (e.g. "Witty", "Rap") stored in the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;nano_mcp = get_nano_banana_mcp_toolset()\r\n\r\n\r\nblog_drafter = Agent(\r\n    name=&amp;quot;blog_drafter&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a professional technical blogger specializing in Google Cloud Platform. \r\n    Your goal is to draft high-quality blog posts based on technical research provided by the GDE expert and reliable documentation.\r\n    \r\n    You have access to the research findings from the gcp_expert_agent here:\r\n    {{ technical_research_findings }}\r\n \r\n    Follow these steps:\r\n    1. **MEMORY CHECK**: Use `load_memory` to retrieve past blog posts, **areas of interest**, and user feedback on writing style. Adopt the user\&amp;#x27;s preferred style and depth.\r\n    2. **REVIEW &amp;amp; GROUND**: Review the technical research findings provided above. **CRITICAL**: Use the `dk_mcp` (Developer Knowledge) tool to verify key facts, technical limitations, and API details. Ensure every claim in your blog is grounded in official documentation.\r\n    3. Draft a blog post that is engaging, accurate, and helpful for a technical audience.\r\n    4. Include code snippets or architectural diagrams if relevant.\r\n    5. Provide a &amp;quot;Resources&amp;quot; section with links to the official documentation used.\r\n    6. Ensure the tone is professional yet accessible, while adhering to any style preferences found in memory.\r\n    7. **VISUALS**: After presenting the drafted blog post, explicitly ask the user: &amp;quot;Would you like me to generate an infographic-style header image to illustrate these key points?&amp;quot; If they agree, use the `generate_image` tool (Nano Banana).\r\n    8. **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. 
Explicitly acknowledge them to ensure they are captured in the session history for future personalization.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[dk_mcp, load_memory_tool.LoadMemoryTool(), nano_mcp],\r\n    after_agent_callback=save_session_to_memory_callback,\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3600e56190&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Root Orchestrator&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The root agent serves as the system's strategist, managing a team of specialist agents and orchestrating their actions based on the specific goals provided by the user. At the start of a conversation, the orchestrator retrieves memory to establish context by checking for the user's past areas of interest, preferred topics, or previous projects. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;root_agent = Agent(\r\n    name=&amp;quot;root_orchestrator&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a technical content strategist. You manage three specialists:\r\n    1. reddit_scanner: Finds trending questions and high-engagement topics on Reddit.\r\n    2. gcp_expert: Provides technical answers based on official GCP documentation.\r\n    3. blog_drafter: Writes professional blog posts based on technical research.\r\n \r\n    Your responsibilities:\r\n    - **MEMORY CHECK**: At the start of a conversation, use `load_memory` to check if the user has specific **areas of interest**, preferred topics, or past projects. Tailor your suggestions accordingly.\r\n    - **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.\r\n    - If the user wants to find trending topics or questions from Reddit, delegate to reddit_scanner.\r\n    - If the user has a technical question or wants to research a specific theme, delegate to gcp_expert.\r\n    - **CRITICAL**: After the gcp_expert provides an answer, you MUST ask the user: \r\n      &amp;quot;Would you like me to draft a technical blog post based on this answer?&amp;quot;\r\n    - If the user agrees or asks to write a blog, delegate to blog_drafter.\r\n    - Be proactive in helping the user navigate from discovery (Reddit) to research (Docs) to content creation (Blog).\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[load_memory_tool.LoadMemoryTool(), preload_memory_tool.PreloadMemoryTool()],\r\n    after_agent_callback=save_session_to_memory_callback,\r\n    sub_agents=[reddit_scanner, gcp_expert, blog_drafter]\r\n)\r\n\r\napp = App(root_agent=root_agent, name=&amp;quot;dev_signal_agent&amp;quot;)&amp;#x27;), 
(&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3600e56a00&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Summary&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this part of our series, we built multi-agent architecture and implemented a robust, dual-layered memory system. We established a Root Orchestrator, managing three specialist agents: a Reddit Scanner for trend discovery, a GCP Expert for technical grounding, and a Blog Drafter for creative content creation. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By utilizing short-term state to pass information reliably between specialists and integrating the Vertex AI memory bank for long-term persistence, we’ve enabled the agent to learn from your feedback and remember specific writing styles across different conversations. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory"&gt;part 3&lt;/a&gt;, we will show you how to test the agent locally to verify these components on your workstation, before transitioning to a full production deployment on Google Cloud Run in part 4. Can't wait for Part 3? The full implementation is already available for you to explore on &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GitHub&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To learn more about the underlying technology, explore the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Memory Bank overview&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or dive into the official &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-development-kit/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ADK Documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to see how to orchestrate complex multi-agent workflows.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Special thanks to &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;Remigiusz Samborski&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; for the helpful review and feedback on this article.&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more content like this, Follow me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Linkedin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 31 Mar 2026 09:31:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Create Expert Content: Architect A Personalized Multi-Agent System with Long-Term Memory</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Shir Meir Lador</name><title>Head of AI, Product DevRel</title><department></department><company></company></author></item><item><title>Five techniques to reach the efficient frontier of LLM inference</title><link>https://cloud.google.com/blog/topics/developers-practitioners/five-techniques-to-reach-the-efficient-frontier-of-llm-inference/</link><description>&lt;div 
class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Every dollar that you spend on model inference buys you a position on a graph of latency and throughput. On this plot is a curve of optimal configurations, where you've squeezed the maximum possible performance from your hardware. That curve, borrowed from portfolio theory in finance, is the &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Efficient_frontier" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;efficient frontier&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the assumption that you have a fixed budget for hardware, you can trade latency for throughput. But, you can't improve one aspect without sacrificing the other, unless the frontier curve itself moves. There are two fundamentally different dynamics at play, and this is the central insight for anyone running LLMs in production.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The first dynamic is &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;getting to the frontier&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, which involves applying the full stack of techniques available to you today. This part is within your control. &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tensortllm?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Continuous batching&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/best-practices/machine-learning/inference/llm-optimization#model-memory"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;paged attention&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;intelligent routing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs/blog/posts/from-research-to-production-accelerate-oss-llm-with-eagle-3-on-vertex?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;speculative decoding&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/best-practices/machine-learning/inference/llm-optimization#quantization"&gt;&lt;span style="text-decoration: 
underline; vertical-align: baseline;"&gt;quantization&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; all exist right now. If you're not using these techniques, you're operating below the frontier and leaving performance on the table.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The second dynamic is that &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;the frontier itself is constantly moving outward&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. This part is largely outside of your control. Researchers publish new algorithms. Hardware vendors ship new architectures. Open-source projects mature. Each breakthrough redefines what's physically achievable and expands the curve so that yesterday's optimal configuration is today's inefficiency.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Your job as a platform engineer is to stay as close to the frontier as possible as you build infrastructure that's flexible enough to absorb each new advance as it arrives. This article gives you the tools to do just that.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Why inference has an efficient frontier&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Every LLM request has two computational phases, and they can have bottlenecks for different hardware resources.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Prefill (Compute-Bound)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: In this phase, the GPU processes your entire input prompt at one time to build the &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/best-practices/machine-learning/inference/llm-optimization#attention-layer-optimization?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;key-value (KV) cache&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for the attention mechanism. Because the instructions are batch-processed in parallel, the GPU's compute cores (tensor cores) are highly utilized. This phase is fast and efficient: the processors have all of the data that they need, immediately available, to perform massive matrix multiplications. Longer prompts just mean more computations.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Decode (Memory-Bandwidth-Bound)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This phase generates new tokens, one at a time, &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Autoregressive_model" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;autoregressively&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. To generate only one single token, the GPU can't batch the work. It must fetch the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;entire&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; model's weights and the growing KV cache from &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;High-Bandwidth Memory (HBM)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; into the compute cores. Then, the GPU needs to calculate that one token, and then waits to do it all over again for the next one.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This mismatch is the fundamental reason that the frontier exists. You can't optimize a single system for both phases simultaneously without making some tradeoffs.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/prefill-vs-decode.max-1000x1000.jpg"
        
          alt="prefill-vs-decode"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The two axes of inference&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of risk and return, the efficient frontier of LLM inference measures a different fundamental tradeoff, with the assumption that the hardware budget is fixed:&lt;/span&gt;&lt;/p&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table border="1" style="border-collapse: collapse; width: 99.9748%;"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;strong&gt;Axis&lt;/strong&gt;&lt;/td&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;a href="https://bentoml.com/llm/inference-optimization/llm-inference-metrics" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Key metrics measured&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td style="width: 31.5133%;"&gt;&lt;strong&gt;Hardware constraint&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Latency (the X-Axis)&lt;/strong&gt;&lt;/td&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;span style="vertical-align: baseline;"&gt;Time to First Token (TTFT) + Time Between Tokens (TBT)&lt;/span&gt;&lt;/td&gt;
&lt;td style="width: 31.5133%;"&gt;&lt;span style="vertical-align: baseline;"&gt;Compute (prefill) and memory bandwidth (decode)&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Throughput (the Y-Axis)&lt;/strong&gt;&lt;/td&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;span style="vertical-align: baseline;"&gt;Total tokens per second across all concurrent users&lt;/span&gt;&lt;/td&gt;
&lt;td style="width: 31.5133%;"&gt;&lt;span style="vertical-align: baseline;"&gt;Batch size × memory capacity&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cost is the constraint that buys the graph of latency and throughput itself. If you increase your hardware budget, or the industry invents a new algorithmic breakthrough, the entire frontier curve shifts outward. For a given budget and software stack, you can apply today's best practices to move from a sub-optimal point &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;towards&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that frontier.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Getting to the frontier: Five techniques within your control&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Most production inference systems today operate &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;below&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; the frontier. They're leaving performance on the table, not because better techniques don't exist, but because they haven't adopted them yet. Everything described in this section is available today. If you're not applying these techniques, you're choosing to operate below the curve.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/interventions.max-1000x1000.jpg"
        
          alt="interventions"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;1. Semantic routing across model tiers&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Not every query needs a 400B parameter model. Simple classification, summarization, or formatting tasks can be routed to smaller, quantized models that are orders of magnitude cheaper per token. A lightweight classifier at the gateway edge analyzes query complexity and routes accordingly: frontier-class models for hard reasoning, and small models for everything else.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/how-gke-inference-gateway-improved-latency-for-vertex-ai?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Semantic routing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; pushes your system dramatically closer to its theoretical maximum throughput, and avoids wasted cycles on easy tasks, without sacrificing aggregate output quality.&lt;/span&gt;&lt;/p&gt;
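As a sketch of the idea (not a production router), the snippet below routes on a cheap lexical complexity estimate. The tier names, keyword list, and length threshold are all illustrative assumptions; a real gateway would use a small trained classifier:

```python
# Minimal sketch of semantic routing at the gateway edge. Tier names, the
# keyword heuristic, and the length threshold are illustrative assumptions;
# production routers typically use a small trained classifier here.

HARD_REASONING_HINTS = ("prove", "derive", "debug", "architecture", "trade-off", "why")

def route(query: str) -> str:
    """Pick a model tier from a cheap lexical complexity estimate."""
    q = query.lower()
    looks_hard = any(hint in q for hint in HARD_REASONING_HINTS) or len(q.split()) > 60
    return "frontier-model" if looks_hard else "small-quantized-model"

print(route("Summarize this paragraph in one sentence."))
# -> small-quantized-model
print(route("Why does my multi-region architecture deadlock under failover?"))
# -> frontier-model
```

The point is where the decision lives: a few microseconds of classification at the edge can keep a 400B-class model free for the queries that actually need it.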
&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;2. Prefill and decode disaggregation&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Physically separating prefill and decode phases onto different hardware is one of the most architecturally significant optimizations available today.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The prefill phase needs compute-dense GPUs. The decode phase needs high-bandwidth memory. If you force both phases onto the same GPU, then one resource is always underutilized.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To push both phases toward their theoretical hardware limits independently, run dedicated prefill clusters and decode clusters. Connect these clusters with high-speed networks that transfer only the compressed KV cache state to the same GPU, then one resource is always underutilized.&lt;/span&gt;&lt;/p&gt;
&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;3. Quantization: Trading precision for speed&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When you &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/best-practices/machine-learning/inference/llm-optimization#quantization?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;reduce model weights&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; from FP16 to the INT8 or INT4 formats, you can reduce the memory footprint to half or a quarter. Because the decode phase is memory-bandwidth-bound, 4-bit weights can be read up to 4× faster than 16-bit weights. This approach provides a direct TBT improvement.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The tradeoff is quality because naive quantization degrades model outputs. Modern techniques like Activation-aware Weight Quantization (AWQ) and GPTQ preserve the quality of sensitive weights, but aggressively compress others, to achieve near-FP16 quality at INT4 speeds.&lt;/span&gt;&lt;/p&gt;
&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;4. Context routing: The biggest lever that most teams miss&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a production deployment with dozens of model replicas, the&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; routing layer &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;is where the biggest competitive advantages are won or lost today.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In 2026, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/open-models/model-garden-published-notebooks/model_garden_advanced_features#prefix_caching_"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;prefix caching&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is foundational. If ten users ask questions about the exact same 100-page RAG document, or use the identical massive system prompt, you shouldn't run the compute-heavy prefill phase ten times. You should compute the KV cache once, store it, and then let the other nine users reuse it. This approach slashes TTFT by up to 85% and drastically reduces compute costs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But, there's a catch: a standard L4 load balancer scatters requests randomly. If user 2's request lands on a different GPU than user 1's request, the prefix cache is useless. The system has to recompute the cache from scratch.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is why &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;context-aware L7 routing&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; is the differentiator. An intelligent router inspects the incoming prompt's prefix and intentionally routes the request to the specific pod that &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;already holds that context in its cache&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. You stop wasting compute power on redundant work and instantly push your latency and throughput closer to the physical limits of your hardware.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/prefix-aware-routing.max-1000x1000.jpg" alt="prefix-aware-routing"&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;5. Speculative decoding&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Remember: during the decode phase, tensor cores are mostly idle because there's a bottleneck on memory bandwidth. &lt;/span&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs/blog/posts/from-research-to-production-accelerate-oss-llm-with-eagle-3-on-vertex?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Speculative decoding&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; exploits this wasted computation power.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A small, fast "draft" model generates several candidate tokens cheaply. The large target model then &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;verifies&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; all of the candidates in a single forward pass, which is a parallel compute-bound operation, rather than a sequential memory-bound one. If the draft model predicted the candidates correctly, you've generated 4-5 tokens for the memory cost of one.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This approach directly breaks the TBT floor set by memory bandwidth. If you're not using speculative decoding for latency-sensitive workloads, you're not leveraging one of the most impactful optimizations available.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Although the addition of a draft model can introduce some operational complexity and slightly increase compute costs, the draft model is relatively tiny compared to the main model. This tradeoff for latency is worthwhile.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Note that some newer models have introduced &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;self-speculative decoding&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, which eliminates the overhead of managing a second model. These models use specialized internal layers (often called &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;prediction heads&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;) that are trained to predict extra future tokens simultaneously. These models generally achieve a highly meaningful token hit rate.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Case study: How Vertex AI moved closer to the frontier&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Vertex AI engineering team moved closer to the frontier when they adopted &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/how-gke-inference-gateway-improved-latency-for-vertex-ai?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GKE Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which is built on the standard Kubernetes Gateway API. Inference Gateway intercepted requests at Layer 7 and added two critical layers of intelligence:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Load-aware routing&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It scraped real-time metrics (like KV cache utilization and queue depth) directly from the model server's Prometheus endpoints. This process routes requests to the pod that can serve them the fastest.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Content-aware routing&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Crucially, it inspected request prefixes and routed traffic to the pod that already held that specific context in its KV cache. This process avoids expensive re-computation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
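A minimal sketch of the load-aware selection step described above; the metric field names and weights are illustrative assumptions, not the gateway's actual scoring function:

```python
def pick_pod(metrics: dict) -> str:
    """Pick the pod likely to serve a request fastest, given per-pod
    metrics scraped from the model servers (field names illustrative)."""
    def score(m):
        # Blend KV-cache pressure with queue depth (capped and normalized);
        # a real gateway tunes these weights against observed latency.
        return 0.5 * m["kv_cache_utilization"] + 0.5 * min(m["queue_depth"] / 10, 1.0)
    return min(metrics, key=lambda name: score(metrics[name]))

metrics = {
    "pod-a": {"kv_cache_utilization": 0.9, "queue_depth": 8},
    "pod-b": {"kv_cache_utilization": 0.3, "queue_depth": 1},
}
best = pick_pod(metrics)  # "pod-b": the lightly loaded replica wins
```

In practice this score competes with the content-aware signal: a pod with a warm prefix cache may be worth choosing even when it is slightly busier.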
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When the production workloads were migrated to this intelligent routing architecture, the Vertex AI team proved that optimizing the network layer is key to unlocking performance at scale. Validated on production traffic, the results were stark:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;35% faster TTFT&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for Qwen3-Coder (context-heavy coding agent workloads)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;2x better P95 tail latency&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (52% improvement) for DeepSeek V3.1 (bursty chat workloads)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Doubled prefix cache hit rate&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (optimized from 35% to 70%)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The bottom line&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;LLM inference has an efficient frontier, which represents a hard boundary where latency and throughput are optimally balanced for a given compute budget.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Getting to that frontier is within your control&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. The techniques exist today: continuous batching, paged attention, intelligent L7 routing, speculative decoding, quantization, and prefill and decode disaggregation. The GKE Inference Gateway case study shows that routing alone, without changing hardware, models, or cluster size, cut TTFT by 35% and doubled cache efficiency. If you're not applying the full stack, you're operating below the curve and overpaying for every token.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;The frontier itself keeps moving outward&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This part is outside of your control. Researchers publish new algorithms, hardware vendors ship new architectures, and open-source serving frameworks integrate them. What was a cutting-edge optimization 18 months ago is now table stakes. Your job isn't to predict which breakthrough comes next; it's to build infrastructure flexible enough to absorb it when it arrives.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The organizations that will win on inference economics aren't the ones with the most GPUs. They're the ones that systematically close the gap to today's frontier while they stay ready for tomorrow's.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Have you applied any of these optimization techniques to your own LLM inference workloads? I'd love to hear about your experience! Share what you've built with me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/karlweinmeister/" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;LinkedIn&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://x.com/kweinmeister" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;, or &lt;/span&gt;&lt;a href="https://bsky.app/profile/kweinmeister.bsky.social" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;Bluesky&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;!&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 27 Mar 2026 10:02:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/five-techniques-to-reach-the-efficient-frontier-of-llm-inference/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/hero-image.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Five techniques to reach the efficient frontier of LLM inference</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/hero-image.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/five-techniques-to-reach-the-efficient-frontier-of-llm-inference/</url></og><author 
xmlns:author="http://www.w3.org/2005/Atom"><name>Karl Weinmeister</name><title>Director, Developer Relations</title><department></department><company></company></author></item><item><title>The new AI literacy: Insights from student developers</title><link>https://cloud.google.com/blog/topics/developers-practitioners/how-uc-berkeley-students-use-ai-as-a-learning-partner/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI has made it easier than ever for student developers to work efficiently, tackle harder problems, and pursue ambitious projects. But for students earning technical degrees, these new capabilities also create genuine tensions around learning. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;How much should I use AI? What should I use it for? &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As 90% of technology professionals now use AI in their daily work according to &lt;/span&gt;&lt;a href="https://dora.dev/dora-report-2025/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google's DORA 2025 report&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, understanding how the next generation navigates these tools matters more than ever. Contrary to fears that students use AI to cheat or are becoming intellectually lazy, our research with UC Berkeley students reveals something different. Students treated AI as a learning partner rather than a shortcut, using it strategically for some tasks while deliberately turning it off for others. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As AI becomes foundational to software development, the question isn't whether to adopt these tools but how to work with them thoughtfully. The students at UC Berkeley are showing us one answer: with curiosity, caution, and a commitment to genuine learning that technology can support but never replace.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The research&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our team of four student researchers (Andrew Harlan, Mindy Tsai, Kenny Ly, and Karissa Wong) conducted a mixed methods research project with UC Berkeley students in Computer Science, Electrical Engineering, Design, and Data Science to understand how they're integrating AI into their academic work. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A separate UC Berkeley study (conducted by Edward Fraser, Jessie Deng, and Eileen Thai) used eye-tracking technology to observe how developers with one to five years of experience actually interact with AI coding assistants. Both student teams were supported by dedicated mentors, with Googlers Harini Sampath, Becky Sohn, and Derek DeBellis advising the mixed methods research, and UC Berkeley Professor John Chuang, PhD, advising the eye-tracking study.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Together, these studies reveal three key insights about how students balance AI's capabilities with their need to develop genuine expertise. The patterns emerging among students closely mirror what DORA research has found in professional developers.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Finding #1: The 24/7 office hour&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;AI as a tutor, not a shortcut&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When asked to describe their relationship with AI, every student in our study used educational terms. They referred to AI as a "tutor" or "teacher," not an assistant or productivity tool.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"AI is a teacher...in the sense that it is most helpful for understanding dense content and potentially parts of code that are prewritten in the database to allow for fundamental understanding of the project."&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I use [AI] as my own private tutor...to [cover] any specific topics in the classes or lectures...not just in CS classes but in all classes."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This framing matters because it reveals strategic use rather than dependency. Rather than asking AI to complete assignments, students described using AI metacognitively to identify gaps in their knowledge, clarify confusing concepts, and guide their learning process. They used AI to summarize academic papers mentioned in lectures so they could decide which ones warranted deeper reading. They asked AI to explain why their code produced specific errors.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One student explained their workflow:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"When I don't understand what my professor is explaining, I ask AI to help me understand the concept or what a piece of code is doing. If I don't know how to begin a lab, I give the prompt to AI to figure out where to start, then write the code myself and ask AI to correct my work."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For students with learning disabilities, this constant availability addresses a real access gap:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"As a student with a learning disability, I need more time to understand a problem. AI has helped me a lot—it's like having a 24/7 TA."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By extending access beyond limited office hours, AI allows students to iterate on their understanding without waiting for help. This frees up cognitive space for higher-level thinking:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I spend less time actually coding and more time on big picture ideation. Now, my time is spent thinking through logic, concepts, and coming up with ideas creatively, rather than producing code manually."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These accounts portray AI as a scaffold for exploration rather than a producer of finished work. This mirrors what DORA research found: when AI handles routine toil, developers can focus more energy on delivering user value.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Finding #2: Active resistance to overdependence&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Building guardrails to protect learning&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Despite embracing AI as a learning tool, students expressed genuine anxiety about becoming too dependent on it.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"If AI disappeared, I'd struggle more with figuring out how to solve things on my own."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a recent study using EEG to measure brain activity during essay writing, researchers found that AI users showed weaker cognitive engagement patterns compared to those using search engines or no tools, and frequent AI users who later wrote without assistance remembered less of their content and felt less ownership over it, what the authors termed "cognitive debt”.&lt;sup&gt;1&lt;/sup&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our research revealed a positive signal: rather than passively accepting this risk, students responded by establishing deliberate boundaries.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One mechanical engineering student described how she's developed a competency-based system over years of working with electronics: &lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"When I use basic sensors like a servo or ultrasonic, I can still code that myself. But when I have more complex sensors where I don't necessarily know the exact functions, that's when I'll use AI." She explained her reasoning: "I have the background to understand why things aren't working, but I don't always know the direct language to fix it, so AI is good for helping overcome that."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For a recent project building a tactile storytelling tool, she knew the basic concept but needed help structuring the counting and comparison system. "AI was really useful in setting up that structure, but I still had to code after to fine-tune it." She's clear about the division of labor: "I'm still working with doing the code myself. I wouldn't say that I'm just handing it off like a technical expert. I'm working in tandem with it. I have to be the initiator of what I want it to actually do. If I just give it a blind request, it's not useful at all."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Even when students do engage AI, they often set explicit rules:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"Sometimes I tell AI not to give me the full answer, just to guide me in the right direction."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Students have developed several specific strategies to prevent overreliance:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Limiting access to powerful models:&lt;/strong&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I don't want to pay for AI tools because it could lead me to overuse the models."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Alternating between assisted and unassisted work:&lt;/strong&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I have actually gone back to hand-coding for certain things, like a for-loop for example."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Warning against "vibe coding":&lt;/strong&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"AI tools can definitely be a good companion to boost developer productivity. However, one needs to be very mindful and not get used to vibe coding. It's very important to understand and validate the code AI is generating and use it appropriately."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This anxiety is itself metacognitive awareness. Students recognize that the path of least resistance may not be the path of greatest learning. This mirrors DORA's findings: despite 90% adoption, about 30% of practitioners report little to no trust in AI-generated code. Effective AI use requires mastering critical evaluation and verification, not just adoption.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Finding #3: Knowing when to use AI and when to turn it off&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;What the eye-tracking data reveals&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A separate study using eye-tracking technology provides behavioral validation. When researchers observed developers with one to five years of experience interacting with AI coding assistants, they found stark differences in AI engagement depending on task type:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;During interpretive tasks&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; requiring deep understanding: &amp;lt;1% visual attention on AI&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;During mechanical tasks&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; like boilerplate code: 19% visual attention on AI&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Developers actively ignored AI suggestions during complex work, even when those suggestions were accurate and could save time. AI creates cognitive load during deep understanding work, and experienced developers know when to turn it off.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Strategic selectivity, not blanket adoption&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Students in our interviews echoed this context-dependent approach:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I typically use AI to generate ideas for a starting point."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"Despite knowing AI was allowed, I wanted to go through the friction of learning and failing and having space for creativity."&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Customization matters&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Most AI coding assistants now let developers toggle inline suggestions, enable on-demand only modes, or adjust suggestion frequency. By experimenting with these settings, developers can align AI behavior with the cognitive demands of different tasks, reducing disruption during deep work while maintaining assistance for routine tasks.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;What this means for the industry&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Students are modeling the future of AI-augmented development&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The students in these studies are ahead of the curve. They've developed a literacy that knows when to engage AI, how to verify its output, and when to work manually to preserve understanding. For teams navigating AI adoption, the student experience offers direction:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Experiment with customization&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to find configurations that support rather than disrupt work&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Build verification practices&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; into workflows rather than accepting suggestions uncritically&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Create space for unassisted work&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; on complex problems where understanding matters more than speed&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As AI becomes foundational to software development, the question isn't whether to adopt these tools but how to work with them thoughtfully. The students at UC Berkeley are showing us one answer: with curiosity, caution, and a commitment to genuine learning that technology can support but never replace.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To learn more about how professionals across the industry are navigating AI adoption, &lt;/span&gt;&lt;a href="https://dora.dev/dora-report-2025/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;download the DORA 2025 State of AI-assisted Software Development Report&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. You can also &lt;/span&gt;&lt;a href="https://dora.dev/insights/tags/uc-berkeley/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;read the full research articles from our collaboration with researchers at UC Berkeley.&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;1. Kosmyna, Nataliya, et al. "Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task." &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;arXiv&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, 10 June 2025, doi:10.48550/arXiv.2506.08872. Accessed 28 Jan. 2026.&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 26 Mar 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/how-uc-berkeley-students-use-ai-as-a-learning-partner/</guid><category>AI &amp; Machine Learning</category><category>Developers &amp; Practitioners</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>The new AI literacy: Insights from student developers</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/how-uc-berkeley-students-use-ai-as-a-learning-partner/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Andrew Harlan, Ph.D.</name><title>UX Researcher &amp; Creative Technologist, Independent</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Steve Fadden, Ph.D.</name><title>UX Research Lead, Google</title><department></department><company></company></author></item><item><title>Building Distributed AI Agents</title><link>https://cloud.google.com/blog/topics/developers-practitioners/building-distributed-ai-agents/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let's be honest: building an AI agent that works &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;once&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; is easy. 
Building an AI agent that works &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;reliably&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; in production, integrated with your existing React or Node.js application? That's a whole different ball game.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;(TL;DR: Want to jump straight to the code? Check out the &lt;/span&gt;&lt;a href="https://github.com/amitkmaraj/course-creation-ai-agent-architecture" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;Course Creator Agent Architecture on GitHub&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;.)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We've all been there. You have a complex workflow—maybe it's researching a topic, generating content, and then grading it. You shove it all into one massive Python script or a giant prompt. It works on your machine, but the moment you try to hook it up to your sleek frontend, things get messy. Latency spikes, debugging becomes a nightmare, and scaling is impossible without duplicating the entire monolith.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But what if you didn't have to rewrite your entire application to accommodate AI? What if you could just... plug it in?&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this post, we're going to explore a better way: the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;orchestrator pattern&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. Instead of just one powerful agent that does everything, we'll build a team of specialized, distributed microservices. This approach lets you integrate powerful AI capabilities directly into your existing frontend applications without the headache of a monolithic rewrite.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We'll use Google's &lt;/span&gt;&lt;a href="https://github.com/google/adk-python" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Development Kit (ADK)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to build the agents, the &lt;/span&gt;&lt;a href="https://a2a-protocol.org" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent-to-Agent (A2A)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; protocol to connect them and let them communicate with each other, and deploy them as scalable microservices on &lt;/span&gt;&lt;a href="https://cloud.google.com/run"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Why Distributed Agents? (And Why Your Frontend Team Will Love You)&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Imagine you have a polished Next.js application. You want to add a "Course Creator" feature.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you build a monolithic agent, your frontend has to wait for a single, long-running process to finish everything. If the research part hangs, the whole request times out. Additionally, you won’t have the opportunity to scale separate agents as needed. For example, if your judge agent requires more processing, you’ll have to scale &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;all&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; your agents up, instead of just the judge agent.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By adopting a distributed orchestrator pattern, you gain scalability and flexibility:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Seamless integration:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Your frontend talks to one endpoint (the orchestrator), which manages the chaos behind the scenes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Independent scaling:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Is the judge step slow? Scale just that service to 100 instances. Your research service can stay small.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Modularity:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; You can write the high-performance networking parts in Go and the data science parts in Python. They just speak HTTP.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The Blueprint: Course Creator App&lt;/span&gt;&lt;/h2&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/building-distributed-ai-agents-course-creator.gif"
        
          alt="building-distributed-ai-agents-course-creator"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let's build that course creator system. We'll break it down into three distinct specialists:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The researcher&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A specialist that digs up information.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The judge&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A QA specialist that ensures quality.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The orchestrator&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The manager that coordinates the work and talks to your frontend.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 1: Hiring the Specialist (The Researcher)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;First, we need someone to do the legwork. We'll build a focused agent using ADK whose only job is to use Google Search.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# researcher/app/agent.py\r\nfrom google.adk.agents import Agent\r\nfrom google.adk.tools import google_search\r\n\r\nresearcher = Agent(\r\n    name=&amp;quot;researcher&amp;quot;,\r\n    model=&amp;quot;gemini-2.5-flash&amp;quot;,\r\n    description=&amp;quot;Gathers information on a topic using Google Search.&amp;quot;,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are an expert researcher. Your goal is to find comprehensive information.\r\n    Use the `google_search` tool to find relevant information.\r\n    Summarize your findings clearly.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[google_search],\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f360075e8e0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;See? Simple. It doesn't know about courses or frontends. It just researches.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 2: The Judge (Structured Output)&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-judge.max-1000x1000.png"
        
          alt="building-distributed-ai-agents-judge"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We can't have our agents rambling. We need strict pass or fail grades so our code can make decisions. We use &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Pydantic&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to enforce this contract.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# judge/app/agent.py\r\nfrom pydantic import BaseModel, Field\r\nfrom typing import Literal\r\n\r\nclass JudgeFeedback(BaseModel):\r\n    status: Literal[&amp;quot;pass&amp;quot;, &amp;quot;fail&amp;quot;] = Field(\r\n        description=&amp;quot;Whether the research is sufficient (\&amp;#x27;pass\&amp;#x27;) or needs more work (\&amp;#x27;fail\&amp;#x27;).&amp;quot;\r\n    )\r\n    feedback: str = Field(\r\n        description=&amp;quot;Detailed feedback on what is missing.&amp;quot;\r\n    )\r\n\r\njudge = Agent(\r\n    name=&amp;quot;judge&amp;quot;,\r\n    model=&amp;quot;gemini-2.5-flash&amp;quot;,\r\n    description=&amp;quot;Evaluates research findings.&amp;quot;,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a strict editor. Evaluate the findings.\r\n    If they are missing key info, output status=\&amp;#x27;fail\&amp;#x27; and provide feedback.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    output_schema=JudgeFeedback, # Enforce the contract!\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f360075e1c0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, when the judge speaks, it speaks JSON. Your application logic can trust it.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 3: The Universal Language (A2A Protocol)&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-a2a-protoco.max-1000x1000.png"
        
          alt="building-distributed-ai-agents-a2a-protocol"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here's the magic. We wrap these agents as web services using the &lt;/span&gt;&lt;a href="https://a2a-protocol.org" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A2A Protocol&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Think of it as a universal language for agents. It lets them describe what they do (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;agent.json&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;) and talk over standard HTTP.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# researcher/app/server.py\r\nfrom fastapi import FastAPI\r\nfrom a2a.server.apps import A2AFastAPIApplication\r\nfrom app.agent import app as adk_app\r\n\r\n# ... setup runner ...\r\n\r\n# Create the A2A App wrapper\r\na2a_app = A2AFastAPIApplication(agent_card=agent_card, http_handler=request_handler)\r\n\r\napp = FastAPI(lifespan=lifespan)\r\n\r\n# Register routes: /.well-known/agent.json and /rpc\r\na2a_app.add_routes_to_app(app)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f360075ef70&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, your researcher is a microservice running on port 8000. It's ready to be called by anyone—including your orchestrator.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 4: The Orchestrator Pattern&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-orchestrato.max-1000x1000.png"
        
          alt="building-distributed-ai-agents-orchestrator"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is where it all comes together. The &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;orchestrator&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is the general contractor. It doesn't do the research; it hires the researcher. It doesn't make judgments; it asks the judge.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Crucially, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;this is the only agent your frontend needs to know about&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# orchestrator/app/agent.py\r\nfrom google.adk.agents import LoopAgent, SequentialAgent\r\nfrom google.adk.agents.remote_a2a_agent import RemoteA2aAgent\r\n\r\n# Connect to the remote Researcher service\r\nresearcher = RemoteA2aAgent(\r\n    name=&amp;quot;researcher&amp;quot;,\r\n    agent_card=&amp;quot;http://researcher-service:8000/.well-known/agent.json&amp;quot;,\r\n    description=&amp;quot;Gathers information on a topic.&amp;quot;\r\n)\r\n\r\n# Connect to the remote Judge service\r\njudge = RemoteA2aAgent(\r\n    name=&amp;quot;judge&amp;quot;,\r\n    agent_card=&amp;quot;http://judge-service:8000/.well-known/agent.json&amp;quot;,\r\n    description=&amp;quot;Evaluates research findings.&amp;quot;\r\n)\r\n\r\n# The Orchestrator manages the loop\r\nresearch_loop = LoopAgent(\r\n    name=&amp;quot;research_loop&amp;quot;,\r\n    sub_agents=[researcher, judge, escalation_checker],\r\n    max_iterations=3,\r\n)\r\n\r\n# The full pipeline\r\nroot_agent = SequentialAgent(\r\n    name=&amp;quot;course_creation_pipeline&amp;quot;,\r\n    sub_agents=[research_loop, content_builder],\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f360075ea30&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The orchestrator handles the complexity—retries, loops, state management—so your frontend stays clean and simple.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Deployment: The "Grocery Store" Model&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Deploying this system on Cloud Run gives you what I call the "grocery store" model. If the checkout lines (researcher tasks) get long, you don't build a new store. You just open more registers. Cloud Run scales your researcher service independently to handle the load, while your judge service stays lean.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Caveats &amp;amp; Security Considerations&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Of course, with great power comes great responsibility (and security reviews).&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Authentication&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: In this demo, agents talk over open HTTP. In production, you &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;must&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; lock this down. Use mTLS, OIDC, or API keys to ensure that only your orchestrator can talk to your researcher.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Latency&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Every hop adds time. Use this pattern for coarse-grained tasks (like "research this topic") rather than chatty, low-level interactions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Error handling&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Networks fail. Your orchestrator needs to be robust enough to handle timeouts and retries gracefully.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
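To make the "networks fail" point concrete, here's a minimal retry-with-backoff wrapper of the kind an orchestrator would put around each remote agent call. This is a generic sketch, not an ADK API; names like `call_with_retries` and `flaky_agent` are illustrative:

```python
import time

def call_with_retries(fn, attempts=3, base_delay=0.1):
    """Call fn, retrying on connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            time.sleep(base_delay * 2 ** attempt)

# Simulate a remote agent that fails twice, then succeeds.
calls = {"count": 0}
def flaky_agent():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network blip")
    return "research findings"

print(call_with_retries(flaky_agent))  # → research findings
```

In production you'd likely also cap total elapsed time and add jitter, but the shape is the same: the orchestrator absorbs transient failures so the frontend never sees them.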
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to Build?&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Stop trying to build one giant agent that does it all. By using the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;orchestrator pattern&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and distributed microservices, you can build AI systems that are scalable, maintainable, and—best of all—play nicely with the apps that you already have.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Want to see the code? Check out the full &lt;/span&gt;&lt;a href="https://github.com/amitkmaraj/course-creation-ai-agent-architecture" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Course Creator Agent Architecture on GitHub&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;And if you're ready to deploy, get started with &lt;/span&gt;&lt;a href="https://cloud.google.com/run"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://github.com/google/adk-python" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ADK&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://a2a-protocol.org" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A2A&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to bring your agent team to life.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 18 Mar 2026 19:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/building-distributed-ai-agents/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-hero.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Building Distributed AI Agents</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-hero.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/building-distributed-ai-agents/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amit Maraj</name><title>AI Developer Relations Engineer</title><department></department><company></company></author></item><item><title>Create Expert Content: Building Capabilities for a Multi-Agent System with Google ADK, MCP, and Cloud 
Run</title><link>https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;My team’s mission is to accelerate the developer journey from writing code to running secure AI workloads on Google Cloud. To help developers succeed, we focus on identifying their most pressing questions and building demos that provide straightforward, easy-to-implement solutions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Recently, I was struck with inspiration when the new &lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Developer Knowledge MCP server&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; was released. It led me to build &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;—a multi-agent system designed with &lt;/span&gt;&lt;a href="https://github.com/google/adk-python" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Agent Development Kit (ADK)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;—to identify technical questions from Reddit, research them using official documentation, and draft detailed technical blogs. 
&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; also provides custom visuals using &lt;/span&gt;&lt;a href="https://blog.google/innovation-and-ai/products/nano-banana-pro/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Nano Banana Pro&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. I even integrated a long-term &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;memory&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; layer so the agent remembers my specific preferences and blogging style.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By connecting my coding assistant, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini/docs/codeassist/gemini-cli?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini CLI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, to the developer knowledge MCP server, I built and deployed this entire system to &lt;/span&gt;&lt;a href="https://cloud.google.com/run/docs?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in just two days.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Whether you want to learn how to architect a complex multi-agent system with long term memory, leverage local and remote MCP servers for tool standardization, or write detailed Terraform scripts for secure Cloud Run deployment, I'll show you how!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you’d rather dive straight into the code and explore it at your own pace, you can clone the repository &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=abZxJiXGrJs"
      data-glue-modal-trigger="uni-modal-abZxJiXGrJs-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        &lt;img src="//img.youtube.com/vi/abZxJiXGrJs/maxresdefault.jpg"
             alt="A YouTube video that walks through a demo to set up the Dev Signal system"/&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal-abZxJiXGrJs-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="abZxJiXGrJs"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=abZxJiXGrJs"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;What you'll learn&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;In this four-part blog series, I’ll walk you through the step-by-step process of how I brought this project to life. &lt;/span&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Each blog post captures the journey of building and deploying &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Part 1: Tools for building agent capabilities &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;– You’ll begin by setting up your project environment and equipping your agent with tools using the Model Context Protocol (MCP). You’ll learn how to connect to Reddit for trend discovery, Google Cloud docs for technical grounding, and a custom Nano Banana Pro tool for image generation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Part 2: The Multi-Agent Architecture with long term memory &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;– You’ll build the "brain" of the system by implementing a root orchestrator and a team of specialized agents. You’ll also integrate the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, enabling the agent to learn and persist your preferences across sessions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Part 3: Testing the agent Locally&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; – Before moving to the cloud, you’ll synchronize the agent's components and verify its performance on your workstation. You’ll use a dedicated test runner to simulate the full lifecycle of discovery, research, and multimodal creation, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;with a special focus on validating long-term memory persistence by connecting your local agent directly to the cloud-based Vertex AI memory bank.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Part 4: Deployment to Cloud Run and the Path to Production &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;– Finally, you’ll deploy your service on Google Cloud Run using Terraform for reproducible infrastructure. You’ll also discuss the next steps required for a high quality secure production system.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Getting started with Dev Signal&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is an intelligent monitoring agent designed to filter noise and create value. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; operates in the following ways:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Discovery&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Scouts Reddit for high-engagement technical questions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Grounding&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Researches answers using official Google Cloud documentation to ensure accuracy.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Creation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Drafts professional technical blog posts based on its findings.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Multimodal Generation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Generates custom infographic headers for those posts.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Long-Term Memory&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Uses &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to remember your feedback across different sessions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
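The first four stages can be pictured as a simple pipeline, where each stage hands its output to the next (memory persists across sessions rather than being a pipeline step). The sketch below is purely illustrative: the function and stage placeholders are hypothetical, not the actual ADK orchestration.

```python
# Purely illustrative pipeline sketch: stage names mirror the workflow
# above, but the lambdas are hypothetical placeholders, not ADK code.
def run_pipeline(stages, seed):
    """Feed each stage's output into the next and return the final result."""
    result = seed
    for stage in stages:
        result = stage(result)
    return result

stages = [
    lambda q: {"question": q},                                  # 1. Discovery
    lambda d: {**d, "sources": ["cloud.google.com/docs"]},      # 2. Grounding
    lambda d: {**d, "draft": f"Post about: {d['question']}"},   # 3. Creation
    lambda d: {**d, "header_image": "gs://bucket/header.png"},  # 4. Multimodal
]
post = run_pipeline(stages, "How do I tune Cloud Run concurrency?")
```

In the real system each stage is an agent or tool call, but the shape of the data flow is the same.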
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Prerequisites&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before you begin, verify the following is installed: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Python 3.12+&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;uv&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (Python package manager): &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;curl -LsSf https://astral.sh/uv/install.sh | sh&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/sdk/docs/install?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud SDK&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gcloud&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; CLI) installed and authenticated.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://developer.hashicorp.com/terraform/install" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Terraform&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (for infrastructure as code).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.npmjs.com/downloading-and-installing-node-js-and-npm" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Node.js &amp;amp; npm&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (required for the Reddit MCP tool).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
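Before moving on, it can help to confirm these CLIs are actually on your PATH. This is an optional convenience check, not part of the official setup:

```shell
# Optional sanity check: confirm the required CLIs are on PATH.
missing=""
for tool in python3 uv gcloud terraform node npm; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -n "$missing" ]; then
  echo "Missing tools:$missing" >&2
else
  echo "All prerequisites found."
fi
```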
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You will also need:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;A &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/resource-manager/docs/creating-managing-projects?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Project&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with billing enabled.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/endpoints/docs/openapi/enable-api?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;&lt;strong&gt;APIs Enabled&lt;/strong&gt;&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: Vertex AI, Cloud Run, Secret Manager, Artifact Registry.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Reddit API Credentials&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (Client ID, Secret) - You can get these from the &lt;/span&gt;&lt;a href="https://www.reddit.com/prefs/apps" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Reddit Developer Portal&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Developer Knowledge API Key&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (for Google Cloud docs search) - Instructions on how to get it are &lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
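The required APIs can be enabled from the command line. A sketch (the service IDs below are the standard ones for these products, but verify them against your project's needs; the command requires an authenticated gcloud session):

```shell
# Service IDs for the APIs listed above.
SERVICES="aiplatform.googleapis.com run.googleapis.com secretmanager.googleapis.com artifactregistry.googleapis.com"

# Enable them all in one call; skip gracefully if the SDK is not installed here.
command -v gcloud >/dev/null 2>&1 \
  && gcloud services enable $SERVICES \
  || echo "gcloud not found; run this on a machine with the Google Cloud SDK installed"
```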
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Project Setup&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; system was built by first running the&lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/agent-starter-pack" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; Agent Starter Pack,&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; following the automated architect workflow described in the &lt;/span&gt;&lt;a href="https://www.youtube.com/watch?v=XCGbDx7aSks" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt;Agent Factory episode&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; by &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Remigiusz Samborski&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/vkolesnikov/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vlad Kolesnikov&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This foundation provided the project’s modular directory structure, which is used to separate concerns between Agent Logic, Server Code, Utilities, and Tools.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The starter pack acts as a powerful starting point because it automates the creation of professional infrastructure, CI/CD pipelines, and observability tools in seconds. This allows you to focus entirely on the agent’s unique intelligence while ensuring the underlying platform remains secure and scalable. By building on top of this generated boilerplate with AI assistance from &lt;/span&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemini-cli-open-source-ai-agent/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini CLI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://antigravity.google/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Antigravity&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, the development process is highly accelerated. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The agent starter pack high level architecture:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/agentstarterpack.max-1000x1000.png"
            alt="agentstarterpack"&gt;
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;1. Initialize the Project&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a new directory for your project and initialize it. We'll use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, which is an extremely fast Python package manager.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;uv init dev-signal&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca546a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;2. Folder Structure&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our project will follow this structure. We will populate these files step-by-step.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;dev-signal/\r\n├── dev_signal_agent/\r\n│   ├── __init__.py\r\n│   ├── agent.py           # Agent logic &amp;amp; orchestration\r\n│   ├── fast_api_app.py    # Application server &amp;amp; memory connection\r\n│   ├── app_utils/         # Env Config\r\n│   │   └── env.py\r\n│   └── tools/             # External capabilities\r\n│       ├── __init__.py\r\n│       ├── mcp_config.py  # Tool configuration (Reddit, Docs)\r\n│       └── nano_banana_mcp/# Custom local image generation tool\r\n│           ├── __init__.py\r\n│           ├── main.py\r\n│           ├── nano_banana_pro.py\r\n│           ├── media_models.py\r\n│           ├── storage_utils.py\r\n│           └── requirements.txt\r\n├── deployment/\r\n│   └── terraform/         # Infrastructure as Code\r\n├── .env                   # Local secrets (API keys)\r\n├── Makefile               # Shortcuts for building/deploying\r\n├── Dockerfile             # Container definition\r\n└── pyproject.toml         # Dependencies&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca543a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;3. Define Dependencies&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Update your &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;pyproject.toml&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; with the necessary dependencies. We use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google-adk&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for the agent framework and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google-genai&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for the model interaction.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;[project]\r\nname = &amp;quot;dev-signal&amp;quot;\r\nversion = &amp;quot;0.1.0&amp;quot;\r\ndescription = &amp;quot;A multi-agent system for monitoring and content creation.&amp;quot;\r\nreadme = &amp;quot;README.md&amp;quot;\r\nrequires-python = &amp;quot;&amp;gt;=3.12, &amp;lt;3.14&amp;quot;\r\ndependencies = [\r\n     &amp;quot;google-adk&amp;gt;=0.1.0&amp;quot;,\r\n    \xa0&amp;quot;google-genai&amp;gt;=1.0.0&amp;quot;,\r\n     &amp;quot;mcp&amp;gt;=1.0.0&amp;quot;,\r\n    \xa0&amp;quot;python-dotenv&amp;gt;=1.0.0&amp;quot;,\r\n     &amp;quot;fastapi&amp;gt;=0.110.0&amp;quot;,\r\n     &amp;quot;uvicorn&amp;gt;=0.29.0&amp;quot;,\r\n     &amp;quot;google-cloud-logging&amp;gt;=3.0.0&amp;quot;,\r\n     &amp;quot;google-cloud-aiplatform&amp;gt;=1.38.0&amp;quot;,\r\n    \xa0&amp;quot;fastmcp&amp;gt;=2.13.0&amp;quot;,\r\n     &amp;quot;google-cloud-storage&amp;gt;=3.6.0&amp;quot;,\r\n     &amp;quot;google-auth&amp;gt;=2.0.0&amp;quot;,\r\n     &amp;quot;google-cloud-secret-manager&amp;gt;=2.26.0&amp;quot;,\r\n]&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca54fa0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Run &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv sync&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to install everything.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a new directory for the agent code.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;mkdir dev_signal_agent\r\ncd dev_signal_agent&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca54eb0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Building the agent capabilities: MCP tools &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our agent needs to interact with the outside world. We use the &lt;/span&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Model Context Protocol (MCP)&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to standardize this. The &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Model Context Protocol (MCP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is a universal standard for connecting AI agents to external data and tools. Instead of writing custom API wrappers, we use standard MCP servers. This allows us to connect to APIs (Reddit), Knowledge Bases (Google Cloud Docs), and even local scripts (Image Generation using Nano Banana Pro) using a common interface. Create a new directory for the agent tools.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;mkdir tools\r\ncd tools&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca540d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
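To see what MCP standardization means at the wire level: MCP messages are JSON-RPC 2.0, and a tool invocation travels as a `tools/call` request. The sketch below is illustrative only; the `make_tool_call` helper and the tool name are hypothetical, and the ADK's `McpToolset` performs this framing for you.

```python
import json

# Illustrative only: what an MCP tool invocation looks like on the wire.
# MCP messages are JSON-RPC 2.0; a tool call uses the "tools/call" method.
def make_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    msg = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(msg)

# Hypothetical tool name, for illustration.
wire_message = make_tool_call(1, "search_reddit", {"query": "cloud run cold starts"})
```

Because every tool speaks this same envelope, the agent can swap a remote HTTP server for a local subprocess without changing its own logic.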
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Tools Configuration&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We'll define our toolsets in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This file defines the connection parameters for our three main tools.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Reddit&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Connected via a local stdio subprocess.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Developer Knowledge&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Connected via a remote HTTP endpoint.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Nano Banana&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Connected via a local stdio subprocess (our custom Python script).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Reddit Search (Discovery Tool)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://github.com/Arindam200/reddit-mcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Reddit MCP server &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;acts as a bridge to the Reddit API, allowing your agent to discover trending posts and analyze engagement without you having to write complex API wrappers. To ensure portability, the code uses a "find or fetch" strategy: it first checks for a local installation and, if missing, automatically uses &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;npx&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to download and run the server on demand.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of a network connection, the agent launches the server as a local subprocess and communicates via standard input and output (stdio). Within the Google ADK, the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;McpToolset&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; class acts as a universal wrapper that standardizes these connections, enabling your agent to interact with various tools, from community resources to custom scripts like the Nano Banana image generator, using a common interface. By securely passing API credentials through environment variables, the system ensures these "plug-and-play" modules function as a seamless bridge between the AI and external platforms.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py:&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import os\r\nimport shutil\r\nfrom mcp import StdioServerParameters\r\nfrom google.adk.tools import McpToolset\r\nfrom google.adk.tools.mcp_tool import StreamableHTTPConnectionParams, StdioConnectionParams\r\n\r\ndef get_reddit_mcp_toolset(client_id: str = &amp;quot;&amp;quot;, client_secret: str = &amp;quot;&amp;quot;, user_agent: str = &amp;quot;&amp;quot;):\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    Connects to the Reddit MCP server.\r\n    This server runs as a local subprocess (stdio) and proxies requests to the Reddit API.\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    # Check if \&amp;#x27;reddit-mcp\&amp;#x27; is installed globally, otherwise use npx to run it\r\n    cmd = &amp;quot;reddit-mcp&amp;quot; if shutil.which(&amp;quot;reddit-mcp&amp;quot;) else &amp;quot;npx&amp;quot;\r\n    args = [] if shutil.which(&amp;quot;reddit-mcp&amp;quot;) else [&amp;quot;-y&amp;quot;, &amp;quot;--quiet&amp;quot;, &amp;quot;reddit-mcp&amp;quot;]\r\n    \r\n    # Inject secrets into the environment of the subprocess only\r\n    env = {\r\n        **os.environ, \r\n        &amp;quot;DOTENV_CONFIG_SILENT&amp;quot;: &amp;quot;true&amp;quot;, \r\n        &amp;quot;LANG&amp;quot;: &amp;quot;en_US.UTF-8&amp;quot;\r\n    }\r\n\r\n    if client_id: env[&amp;quot;REDDIT_CLIENT_ID&amp;quot;] = client_id\r\n    if client_secret: env[&amp;quot;REDDIT_CLIENT_SECRET&amp;quot;] = client_secret\r\n    if user_agent: env[&amp;quot;REDDIT_USER_AGENT&amp;quot;] = user_agent\r\n\r\n    return McpToolset(\r\n        connection_params=StdioConnectionParams(\r\n            server_params=StdioServerParameters(\r\n                command=cmd, \r\n                args=args, \r\n                env=env # Pass injected secrets directly to the subprocess\r\n            ),\r\n            timeout=120.0\r\n        )\r\n    )&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), 
(&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca54400&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud Docs (Knowledge Tool)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Developer Knowledge MCP server provides&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; grounding for your agent by allowing it to search the entire corpus of official Google Cloud documentation. Unlike the local Reddit server, this is a managed service hosted by Google and accessed as a remote endpoint over the internet. It exposes specialized tools like &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google_developer_documentation_search&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for semantic queries and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google_developer_documentation_fetch&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to retrieve full markdown content, ensuring that every technical claim the agent makes is supported by definitive, up-to-date facts. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Note:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; you can also connect your coding assistant tools such as &lt;/span&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemini-cli-open-source-ai-agent/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini CLI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or &lt;/span&gt;&lt;a href="https://antigravity.google/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Antigravity&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to the developer knowledge MCP server to empower them with handy up to date Google Cloud documentation. I used it when writing this blog!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To connect, the agent uses the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;McpToolset&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; class with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;StreamableHTTPConnectionParams&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, pointing to a web URL instead of launching a local process. It securely authenticates using a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;DK_API_KEY&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;create your api key&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) passed in the request headers, allowing the agent to perform a "comprehensive research sweep" across official docs, community sentiment, and broader web context through a single standardized interface. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py:&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;def get_dk_mcp_toolset(api_key: str = &amp;quot;&amp;quot;):\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    Connects to Developer Knowledge (Google Cloud Docs).\r\n    This is a remote MCP server accessed via HTTP.\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    headers = {}\r\n    if api_key:\r\n        headers[&amp;quot;X-Goog-Api-Key&amp;quot;] = api_key\r\n    else:\r\n        # Fallback to os.environ for local testing if not passed via API\r\n        headers[&amp;quot;X-Goog-Api-Key&amp;quot;] = os.getenv(&amp;quot;DK_API_KEY&amp;quot;, &amp;quot;&amp;quot;)\r\n\r\n    return McpToolset(\r\n        connection_params=StreamableHTTPConnectionParams(\r\n            url=&amp;quot;https://developerknowledge.googleapis.com/mcp&amp;quot;,\r\n            headers=headers\r\n        )\r\n    )&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca54190&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
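Under the hood, the header-based authentication amounts to attaching `X-Goog-Api-Key` to each HTTP request. A plain-stdlib sketch of such a request (the `build_mcp_request` helper is hypothetical; `StreamableHTTPConnectionParams` manages this internally, and nothing is actually sent here):

```python
import json
import urllib.request

# Hypothetical sketch of the authenticated request the toolset makes.
# We only construct the request object; no network call is performed.
def build_mcp_request(api_key: str, method: str, params: dict) -> urllib.request.Request:
    body = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    ).encode("utf-8")
    return urllib.request.Request(
        "https://developerknowledge.googleapis.com/mcp",
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Goog-Api-Key": api_key,  # the DK_API_KEY from your .env
        },
        method="POST",
    )

req = build_mcp_request("YOUR_DK_API_KEY", "tools/list", {})
```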
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Image Generator (Nano Banana MCP)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While we've used external MCP servers for Reddit and documentation, we can also build our own custom MCP server to wrap specific Python logic. In this case, we are creating an image generation tool powered by Gemini 3 Pro Image (also known as Nano Banana Pro). This demonstrates that any Python function can be standardized into a tool that any agent can understand.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;How the image generation works:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://gofastmcp.com/getting-started/welcome" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;FastMCP&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: We use the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;fastmcp&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; library to drastically simplify server creation, allowing us to register Python functions as tools with just a few lines of code.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Gemini Integration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The server uses the Google GenAI SDK to call the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gemini-3-pro-image-preview&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; model, which converts the agent's descriptive prompts into raw image bytes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GCS Upload &amp;amp; Hosting:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Because agent interfaces typically require a URL to display images, the server automatically uploads the generated bytes to Google Cloud Storage (GCS) and returns a public link.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
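For the hosting step, a public GCS object is addressable at a predictable URL. A minimal sketch of that mapping (the helper, bucket, and object names are placeholders; the real tool first uploads the image bytes via the google-cloud-storage SDK):

```python
# Sketch of the URL the hosting step returns. A public GCS object is
# served from storage.googleapis.com; names below are placeholders.
def make_public_url(bucket: str, blob_name: str) -> str:
    return f"https://storage.googleapis.com/{bucket}/{blob_name}"

url = make_public_url("my-assets-bucket", "headers/post-123.png")
```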
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To connect this local tool, we use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;StdioConnectionParams&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; because the server runs as a local subprocess communicating via standard input and output. This transport method directly matches the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;transport="stdio"&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; configuration we will define in our server entrypoint, ensuring a seamless connection for your custom local scripts.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The following code defines the MCP connection in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. We use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv run&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to ensure the server starts in an isolated environment with all its dependencies correctly installed.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py:&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;def get_nano_banana_mcp_toolset():\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    Connects to our local \&amp;#x27;Nano Banana\&amp;#x27; image generator.\r\n    This demonstrates how to wrap a local Python script as an MCP tool.\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    path = os.path.join(&amp;quot;dev_signal_agent&amp;quot;, &amp;quot;tools&amp;quot;, &amp;quot;nano_banana_mcp&amp;quot;, &amp;quot;main.py&amp;quot;)\r\n    bucket = os.getenv(&amp;quot;AI_ASSETS_BUCKET&amp;quot;)     \r\n    return McpToolset(\r\n        connection_params=StdioConnectionParams(\r\n            server_params=StdioServerParameters(\r\n                command=&amp;quot;uv&amp;quot;, \r\n                args=[&amp;quot;run&amp;quot;, path], \r\n                env={**os.environ, &amp;quot;AI_ASSETS_BUCKET&amp;quot;: bucket}\r\n            ),\r\n            timeout=600.0 # Image generation can take time\r\n        )\r\n    )&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca54e20&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Implementing the Nano Banana Pro Server Logic&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, we will implement the actual logic for this server. This implementation is based on the &lt;/span&gt;&lt;a href="https://www.youtube.com/watch?v=XCGbDx7aSks&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=2" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Factory&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; demo &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/a9a5f64a3394a4b5ecc64061f397bd5ed82927ee/ai-ml/agent-factory-antigravity-nano-banana-pro/mcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;code&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; by Remigiusz Samborski. While Remi's original code provides instructions for deploying the MCP server to Cloud Run, we will run it here as a local subprocess for faster development and testing.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get started, create the directory for our new server:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;mkdir -p dev_signal_agent/tools/nano_banana_mcp\r\ncd dev_signal_agent/tools/nano_banana_mcp&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026c5a60&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;The Server Entrypoint (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;main.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; )&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This file acts as the "brain" that initializes and starts the MCP server.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;FastMCP Initialization: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We use the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;FastMCP&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; library to create a server named "MediaGenerators" and register our &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;generate_image&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function as a tool&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Safe Logging: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;_initialize_console_logging&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function is critical. It forces all logs to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sys.stderr&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. This is because the MCP "stdio" transport uses &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sys.stdout&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for communication between the agent and the tool; standard logs sent to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;stdout&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; would corrupt that protocol.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Execution&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;mcp.run(transport="stdio")&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; line starts the server as a local subprocess, allowing it to listen for requests from your agent via standard input.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/main.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import logging\r\nimport os\r\nimport sys\r\nfrom fastmcp import FastMCP\r\nfrom dotenv import load_dotenv\r\nfrom nano_banana_pro import generate_image\r\n\r\ndef _initialize_console_logging(min_level: int = logging.INFO):\r\n    # Ensure logs go to STDERR so they don\&amp;#x27;t break the MCP stdio protocol\r\n    handler = logging.StreamHandler(sys.stderr)\r\n    logging.basicConfig(level=min_level, handlers=[handler], force=True)\r\n\r\ntools = [generate_image]\r\nmcp = FastMCP(name=&amp;quot;MediaGenerators&amp;quot;, tools=tools)\r\n\r\nif __name__ == &amp;quot;__main__&amp;quot;:\r\n    load_dotenv()\r\n    _initialize_console_logging()\r\n    mcp.run(transport=&amp;quot;stdio&amp;quot;)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026c5580&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;The Generation Logic (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;nano_banana_pro.py)&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is where the actual image generation happens using Gemini.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GenAI Client:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We initialize the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;genai.Client()&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to interact with Google's generative models.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Model Selection:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; It specifically targets the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gemini-3-pro-image-preview&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; model. We set the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;response_modalities&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to "IMAGE" to tell the model we want pixels, not just text.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Robustness&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The code includes a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;MAX_RETRIES&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; loop (set to 5) to handle any transient generation errors, ensuring the agent has multiple attempts to get a valid image.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Byte Processing: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Once the model generates the image, it arrives as raw inline data. We extract these bytes and call our helper to move them to the cloud.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;URI Conversion:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Finally, it replaces the internal &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gs://&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; path with a browser-accessible &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;https://&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; URL so the user can actually see the image.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;nano_banana_pro&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import logging\r\nfrom typing import Literal, Optional\r\nfrom google import genai\r\nfrom google.genai import types\r\nfrom media_models import MediaAsset\r\nfrom storage_utils import upload_data_to_gcs\r\n\r\nAUTHORIZED_URI = &amp;quot;https://storage.mtls.cloud.google.com/&amp;quot;\r\nMAX_RETRIES = 5\r\n\r\nasync def generate_image(\r\n    prompt: str,\r\n    aspect_ratio: Literal[&amp;quot;16:9&amp;quot;, &amp;quot;9:16&amp;quot;] = &amp;quot;16:9&amp;quot;,\r\n) -&amp;gt; MediaAsset:\r\n    &amp;quot;&amp;quot;&amp;quot;Generates an image using Gemini 3 Image model.&amp;quot;&amp;quot;&amp;quot;\r\n    genai_client = genai.Client()\r\n    content = types.Content(parts=[types.Part.from_text(text=prompt)], role=&amp;quot;user&amp;quot;)\r\n    \r\n    logging.info(f&amp;quot;Starting image generation for prompt: {prompt[:50]}...&amp;quot;)\r\n    asset = MediaAsset(uri=&amp;quot;&amp;quot;)\r\n    \r\n    for _ in range(MAX_RETRIES):\r\n        response = genai_client.models.generate_content(\r\n            model=&amp;quot;gemini-3-pro-image-preview&amp;quot;,\r\n            contents=[content],\r\n            config=types.GenerateContentConfig(\r\n                response_modalities=[&amp;quot;IMAGE&amp;quot;],\r\n                image_config=types.ImageConfig(aspect_ratio=aspect_ratio)\r\n            )\r\n        )\r\n        if response and response.parts:\r\n            for part in response.parts:\r\n                if part.inline_data and part.inline_data.data:\r\n                    # Upload the raw bytes to GCS\r\n                    gcs_uri = await upload_data_to_gcs(\r\n                        &amp;quot;mcp-tools&amp;quot;,\r\n                        part.inline_data.data,\r\n                        part.inline_data.mime_type\r\n                    )\r\n                    asset = MediaAsset(uri=gcs_uri)\r\n                    break\r\n        if asset.uri: break\r\n\r\n    if not asset.uri:\r\n        asset.error = &amp;quot;No image was generated.&amp;quot;\r\n    else:\r\n        # Convert gs:// URI to an HTTP accessible URL if needed\r\n        asset.uri = asset.uri.replace(\&amp;#x27;gs://\&amp;#x27;, AUTHORIZED_URI)\r\n        logging.info(f&amp;quot;Image URL: {asset.uri}&amp;quot;)\r\n        \r\n    return asset&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026c54c0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;GCS Upload Helper (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;storage_utils.py)&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Since agents need a web link to display images, this utility handles the hosting on Google Cloud Storage (GCS).&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Dynamic Bucket Selection&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It looks for a bucket name in your environment variables, falling back from &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;AI_ASSETS_BUCKET&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;LOGS_BUCKET_NAME&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to ensure it always has a place to save data.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Unique Filenames:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We use an MD5 hash of the raw image data to create a unique filename. This prevents filename collisions and acts as a simple way to avoid duplicate uploads of the same image.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Upload: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;blob.upload_from_string&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; method pushes the raw image bytes directly to your GCS bucket.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;storage_utils&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import hashlib\r\nimport mimetypes\r\nimport os\r\nfrom google.cloud.storage import Client, Blob\r\nfrom dotenv import load_dotenv\r\n\r\nload_dotenv()\r\nstorage_client = Client()\r\nai_bucket_name = os.environ.get(&amp;quot;AI_ASSETS_BUCKET&amp;quot;) or os.environ.get(&amp;quot;LOGS_BUCKET_NAME&amp;quot;)\r\nai_bucket = storage_client.bucket(ai_bucket_name)\r\n\r\nasync def upload_data_to_gcs(agent_id: str, data: bytes, mime_type: str) -&amp;gt; str:\r\n    file_name = hashlib.md5(data).hexdigest()\r\n    ext = mimetypes.guess_extension(mime_type) or &amp;quot;&amp;quot;\r\n    blob_name = f&amp;quot;assets/{agent_id}/{file_name}{ext}&amp;quot;\r\n    blob = Blob(bucket=ai_bucket, name=blob_name)\r\n    blob.upload_from_string(data, content_type=mime_type, client=storage_client)\r\n    return f&amp;quot;gs://{ai_bucket_name}/{blob_name}&amp;quot;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026c5490&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Data Model (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;media_models.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;)&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This file ensures that our data follows a strict structure (Schema).&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Structured Output:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By using a Pydantic &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;BaseModel&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, we guarantee that the tool always returns a consistent JSON object containing a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uri&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (the link) and an optional &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;error&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; message. This makes it much easier for the AI agent to understand and process the tool's result.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;media_models&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;from typing import Optional\r\nfrom pydantic import BaseModel\r\n\r\nclass MediaAsset(BaseModel):\r\n    uri: str\r\n    error: Optional[str] = None&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026c58b0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Tool Dependencies (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;requirements.txt)&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While we use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to run our code, a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;requirements.txt&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; file remains essential because it defines the specific dependencies &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; needs to install for the Nano Banana server to function. This provides the necessary "ingredients" to set up the isolated environment before the server starts.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This file lists the three core libraries required for this tool:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;google-cloud-storage&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Used for hosting the generated images on the cloud.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;google-genai&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Provides the logic for the Gemini 3 Pro image generation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;fastmcp&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The framework that turns our Python script into a standardized MCP tool.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;requirements&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;.txt&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;google-cloud-storage==3.6.*\r\ngoogle-genai==1.52.*\r\nfastmcp==2.13.*&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026c5c40&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Summary&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this first part of our series, we focused on establishing the agent's core capabilities by standardizing its external integrations through the Model Context Protocol (MCP). We initialized the project using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for high-speed dependency management and successfully configured three critical toolsets: Reddit for trend discovery, Google Cloud Docs for technical grounding, and a custom Nano Banana MCP server for multimodal image generation. By utilizing the Google ADK’s &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;McpToolset&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, we’ve abstracted away complex API logic into simple, plug-and-play modules, ensuring that our tools share a common interface that decouples integration from intelligence.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For a deeper look into our technical foundation, you can explore the &lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Developer Knowledge MCP server&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to learn more about knowledge grounding or visit the &lt;/span&gt;&lt;a href="https://github.com/google/adk-python" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt;Google ADK GitHub repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to explore the framework's core capabilities&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With our toolset fully configured and ready for action, we can now move to &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run"&gt;Part 2&lt;/a&gt;, where we will build the multi-agent architecture and integrate the Vertex AI memory bank to orchestrate these capabilities. You can also jump ahead to &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory"&gt;Part 3&lt;/a&gt;, where we will show you how to test the agent locally to verify these components on your workstation. If you’d like to dive ahead, you can explore the complete code for the entire series in our &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GitHub repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Special thanks to&lt;/span&gt;&lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt; Remigiusz Samborski &lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;  for the helpful review and feedback on this article.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more content like this, follow me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Linkedin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 18 Mar 2026 09:18:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Create Expert Content: Building Capabilities for a Multi-Agent System with Google ADK, MCP, and Cloud Run</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Shir Meir Lador</name><title>Head of AI, Product DevRel</title><department></department><company></company></author></item><item><title>Introducing multi-cluster GKE Inference Gateway: Scale AI workloads around the 
world</title><link>https://cloud.google.com/blog/products/containers-kubernetes/multi-cluster-gke-inference-gateway-helps-scale-ai-workloads/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The world of artificial intelligence is moving fast, and so is the need to serve models reliably and at scale. Today, we're thrilled to announce the preview of &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;multi-cluster GKE Inference Gateway&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to enhance the scalability, resilience, and efficiency of your AI/ML inference workloads across multiple Google Kubernetes Engine (GKE) clusters — even those spanning different Google Cloud regions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Built as an extension of the&lt;/span&gt; &lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GKE Gateway API&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, the multi-cluster Inference Gateway leverages the power of &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/multi-cluster-gateways"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;multi-cluster Gateways&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to provide intelligent, model-aware load balancing for your most demanding AI applications.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_gRilinA.max-1000x1000.jpg"
        
          alt="1"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Why multi-cluster for AI inference?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As AI models grow in complexity and users become more global, single-cluster deployments can face limitations:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Availability risks:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Regional outages or cluster maintenance can impact service.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Scalability caps:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Hitting hardware limits (GPUs/TPUs) within a single cluster or region.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Resource silos:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Underutilized accelerator capacity in one cluster can’t be used by another&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Latency:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Users far from your serving cluster may experience higher latency&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The multi-cluster GKE Inference Gateway addresses these challenges head-on, providing a variety of features and benefits:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Enhanced high reliability and fault tolerance:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Intelligently route traffic across multiple GKE clusters, including across different regions. If one cluster or region experiences issues, traffic is automatically re-routed, minimizing downtime.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Improved scalability and optimized resource usage:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Pool and leverage GPU/TPU resources from various clusters. Handle demand spikes by bursting beyond the capacity of a single cluster and efficiently utilize available accelerators across your entire fleet.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Globally optimized, model-aware routing:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The Inference Gateway can make smart routing decisions using advanced signals. With &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GCPBackendPolicy&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, you can configure load balancing based on real-time custom metrics, such as the model server's KV cache utilization metric, so that requests are sent to the best-equipped backend instance. Other modes like in-flight request limits are also supported.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Simplified operations:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Manage traffic to a globally distributed AI service through a single Inference Gateway configuration in a dedicated GKE "config cluster," while your models run in multiple "target clusters."&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;How it works&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In GKE Inference Gateway there are two foundational resources,&lt;/span&gt; &lt;code style="vertical-align: baseline;"&gt;InferencePool&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;InferenceObjective&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. An&lt;/span&gt; &lt;code style="vertical-align: baseline;"&gt;InferencePool&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; acts as a resource group for pods that share the same compute hardware (like GPUs or TPUs) and model configuration, helping to ensure scalable and high-availability serving. An&lt;/span&gt; &lt;code style="vertical-align: baseline;"&gt;InferenceObjective&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; defines the specific model names and assigns serving priorities, allowing Inference Gateway to intelligently route traffic and multiplex latency-sensitive tasks alongside less urgent workloads.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_ek1kPQE.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With this release, the system uses Kubernetes Custom Resources to manage your distributed inference service. &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;InferencePool&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; resources in each "target cluster" group model-server backends. These backends are exported and become visible as &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GCPInferencePoolImport&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; resources in the "config cluster." Standard &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;Gateway&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;HTTPRoute&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; resources in the config cluster define the entry point and routing rules, directing traffic to these imported pools. Fine-grained load-balancing behaviors, such as using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;CUSTOM_METRICS&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; or &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;IN_FLIGHT&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; requests, are configured using the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GCPBackendPolicy&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; resource attached to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GCPInferencePoolImport&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This architecture enables use cases like global low-latency serving, disaster recovery, capacity bursting, and efficient use of heterogeneous hardware.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more information about GKE Inference Gateway core concepts check out our &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway#understand_key_concepts"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;guide&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started today&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As you scale your AI inference serving workloads to more users in more places, we're excited for you to try multi-cluster GKE Inference Gateway. To learn more and get started, check out the documentation:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-multi-cluster-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;About multi-cluster GKE Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/setup-multicluster-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Set up multi-cluster GKE Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/customize-backend-multicluster-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Customize backend configurations with GCPBackendPolicy&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><pubDate>Tue, 17 Mar 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/containers-kubernetes/multi-cluster-gke-inference-gateway-helps-scale-ai-workloads/</guid><category>AI &amp; Machine Learning</category><category>GKE</category><category>Networking</category><category>Developers &amp; Practitioners</category><category>Containers &amp; Kubernetes</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Introducing multi-cluster GKE Inference Gateway: Scale AI workloads around the world</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/containers-kubernetes/multi-cluster-gke-inference-gateway-helps-scale-ai-workloads/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Arman Rye</name><title>Senior Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Andres Guedez</name><title>Senior Staff Software Engineer</title><department></department><company></company></author></item><item><title>Build Resilient LLM Applications on Vertex AI and Reduce 429 Errors</title><link>https://cloud.google.com/blog/products/ai-machine-learning/reduce-429-errors-on-vertex-ai/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building applications powered by Large Language Models (LLMs) on Vertex AI is exciting, but hitting a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/provisioned-throughput/error-code-429"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;429 error&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; can be a frustrating roadblock. These errors signal that your requests are coming in faster than the service can handle them at that moment.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Last year, we &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/learn-how-to-handle-429-resource-exhaustion-errors-in-your-llms?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;published a guide&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to handling these 429 errors. In this article, we’ll dig deeper into Vertex AI’s consumption models and dives into architectural best practices for managing request flows. This way, you can build smooth, resilient, and truly scalable AI applications.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Choosing the right consumption option &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Vertex AI provides a range of consumption models designed to accommodate various API traffic types and volumes. Your primary strategy for minimizing 429 errors is selecting the consumption model that best aligns with your application’s unique traffic patterns. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Build_Resilient_LLM_Applications_on_Vertex.max-1000x1000.jpg"
        
          alt="Build Resilient LLM Applications on Vertex AI and Reduce 429 Errors"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Default options: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The default option with Gemini on Vertex AI is Standard Pay-as-you-go (Paygo). For &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/standard-paygo"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Standard Pay-as-you-go&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (Paygo) traffic, Vertex AI uses a system with Usage Tiers. This dynamic approach allocates resources from a shared pool, where your organization’s historical spend determines your Usage Tier and baseline throughput (TPM). This baseline provides a predictable performance floor for typical workloads, while still allowing your application to burst beyond it on a best-effort basis.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If your application generates critical, user-facing traffic that can be unpredictable and require higher reliability than Standard Paygo, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/priority-paygo"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Priority Paygo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is designed for you. By adding the priority header to your requests, you signal that this traffic should be prioritized, reducing the likelihood of being throttled. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For applications with consistently high volumes of real-time traffic, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/provisioned-throughput/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Provisioned Throughput&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (PT) is the only consumption option that provides isolation from the shared PayGo pool, offering a stable experience even during heavy contention on PayGo. With PT, you reserve and pay for a guaranteed throughput, ensuring your important traffic flows smoothly. &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/provisioned-throughput-on-vertex-ai?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;To learn more about PT on Vertex AI, visit our guide here.&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Cost-effective options: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For traffic that isn't latency sensitive, Vertex AI offers more cost-effective options. The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/flex-paygo"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Flex PayGo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is suited for latency-tolerant traffic, processing requests at a lower price. Large-scale, asynchronous jobs, such as offline analysis or bulk data enrichment, are best handled by &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Batch&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This service manages the entire workflow, including scaling and retries, over a longer period (around 24 hours), insulating your main application from this heavy load.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Complex applications and hybrid approaches: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Complex applications often leverage a hybrid approach: PT for essential real-time traffic, Priority Paygo for fluctuating traffic, Standard Paygo for general requests, and Batch/Flex for latency-tolerant and offline request flows. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Five ways to reduce 429 errors on Vertex AI&lt;/span&gt;&lt;/h3&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Implement smart retries&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When your application encounters a temporary overload error like a 429 (Resource Exhausted) or 503 (Service Unavailable), an immediate retry is not recommended. The best practice is to implement a retry strategy called Exponential Backoff with Jitter. Exponential backoff means that the delay between retry attempts increases exponentially usually up to a predefined maximum delay. This gives the service time to recover from the overload condition. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;SDK &amp;amp; libraries:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/retry-strategy#configuring-retries"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Gen AI SDK&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;  includes native retry behavior that can be configured via HttpRetryOptions in client parameters. However, you can also leverage specialized libraries like &lt;/span&gt;&lt;a href="https://github.com/jd/tenacity" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Tenacity&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (for Python) or build a custom solution. For a deeper dive, refer to this &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/learn-how-to-handle-429-resource-exhaustion-errors-in-your-llms"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;blog post&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Agentic workflows:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; For developing agents, the &lt;/span&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Development Kit (ADK)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;offers a&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://google.github.io/adk-docs/integrations/reflect-and-retry/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Reflect and Retry plugin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; that builds resilience into AI workflows by automatically intercepting 429 errors. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure &amp;amp; Gateway:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Another robust option for building resilience is &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/apigee-samples/tree/main/llm-circuit-breaking" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;circuit breaking with Apigee&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which enables you to manage traffic distribution and implement graceful failure handling. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Leverage global model routing&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Vertex AI's infrastructure is distributed across multiple regions. By default, if you target a specific regional endpoint, your request is served from that region. This means your application's availability is tied to the capacity of that single region. This is where the global endpoint becomes an effective tool for enhancing availability and resilience. Instead of being locked into one region, the global endpoint routes your traffic across a fleet of regions where there may be more availability, reducing the potential error rate.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Reduce payload via context caching&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;An effective way to reduce the load on Vertex AI is to avoid making calls for repetitive queries. Many production applications, especially chatbots and support systems, see similar questions asked frequently. Instead of re-processing these, you can implement &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/vertex-ai-context-caching?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;context caching&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. With Context Caching, Gemini reuses precomputed cached tokens, allowing you to reduce your API traffic and throughput. This not only saves costs but also reduces latency for repeated content within your prompts. &lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;4. Optimize prompts&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Reducing the token count in each request directly lowers your TPM consumption and costs.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Summarization with Flash-Lite: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Before sending a long conversation history to a model like Gemini Pro, use a lightweight model like &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini 2.5 Flash-Lite&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to summarize the context.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent memory optimization: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic workloads you can leverage Vertex AI &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Engine Memory Bank&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Features like Memory Extraction and Consolidation allow you to distill meaningful facts from a conversation, ensuring your agent remains context-aware without raw chat history.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Prompt hygiene:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Review your prompts and reduce overly verbose JSON schema descriptions (if the model is already familiar) and stripping excessive whitespace or redundant formatting.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;5. Shape traffic&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Sudden bursts of requests are a primary cause of 429 errors. Even if your average traffic rate is low, sharp spikes can strain resources. The goal is to smoothen traffic, spreading requests out over time. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Get started &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to put these patterns into practice? Explore the&lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; Vertex AI samples on GitHub&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, or jumpstart your next project with the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/learn/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Beginner’s Guide&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/start"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI quickstart&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or start building your next AI agent with the  &lt;/span&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Development Kit (ADK)&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 12 Mar 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/ai-machine-learning/reduce-429-errors-on-vertex-ai/</guid><category>Developers &amp; Practitioners</category><category>AI &amp; Machine Learning</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Build Resilient LLM Applications on Vertex AI and Reduce 429 Errors</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/ai-machine-learning/reduce-429-errors-on-vertex-ai/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Richard Liu</name><title>Senior Product Manager, Google 
Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Pedro Melendez</name><title>Cloud AI Technical Evangelist</title><department></department><company></company></author></item><item><title>Calling all devs: Build the future of Multimodal AI in the Gemini Live Agent Challenge</title><link>https://cloud.google.com/blog/topics/training-certifications/join-the-gemini-live-agent-challenge/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Hey builders! Stop typing, and start interacting! We are moving beyond the text box. The future of AI is all about immersive, real-time experiences. To celebrate multimodal AI, we’re challenging you to build the next generation of agents that can help you see, hear, speak, and create in the &lt;/span&gt;&lt;a href="https://geminiliveagentchallenge.devpost.com" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Live Agent Challenge&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=-AAwoj4qN8M"
      data-glue-modal-trigger="uni-modal--AAwoj4qN8M-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        

        &lt;div class="article-video__aspect-image"
          style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/maxresdefault_n8MQKZ2.max-1000x1000.jpg);"&gt;
          &lt;span class="h-u-visually-hidden"&gt;Build multimodal AI agents in the Gemini Live Agent Challenge&lt;/span&gt;
        &lt;/div&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal--AAwoj4qN8M-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="-AAwoj4qN8M"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=-AAwoj4qN8M"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Build Multimodal AI agents in the Gemini Live Agent Challenge&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Why join?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hands-on learning with Gemini Live API:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This is your shot to build the future of immersive AI agents on Google Cloud. We have everything you need to get started: Quickstarts, tutorials, access to the Agent Development Kit (ADK), and webinars hosted by our experts.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Showcase your skills:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; You’ll have the opportunity to break out of the traditional "text box" paradigm. Choose from three exciting categories—&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The Live Agent&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The Creative Storyteller&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, or &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The UI Navigator&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;—to demonstrate the power of your solution .&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Think you have what it takes to win?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Build a solution to showcase your multimodal agent and you could potentially win a share of &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;$80,000 in prizes&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Overall grand prize: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;A trip to Google Cloud Next ’26 in Las Vegas (includes tickets, a travel stipend, and a chance to present on stage), $25,000 in USD, $3,000 in Google Cloud Credits for use with a Cloud Billing Account, virtual coffee with a Google Cloud team member, and the potential opportunity to be featured on our social channels.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Category winners:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A trip to Google Cloud Next ’26 in Las Vegas (includes tickets), $10,000 in USD, $1,000 in Google Cloud Credits for use with a Cloud Billing Account, virtual coffee with a Google Cloud team member, and the potential opportunity to be featured on our social channels.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Subcategory winners: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;$5,000 in USD and $500 in Google Cloud Credits for use with a Cloud Billing Account&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Honorable mentions:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; $2,000 in USD and $500 in Google Cloud Credits for use with a Cloud Billing Account&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Dig into Multimodal AI&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Your mission is to build and deploy an AI agent on Google Cloud that utilizes multimodal inputs and outputs. We want you to go beyond the traditional text-in/text-out approach.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Whether you are building a real-time translator or a visual web navigator, your agent should interpret the world around it. Here is some inspiration:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The live agent:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Build an agent we can talk to naturally that handles interruptions gracefully. Think real-time translators, vision-enabled tutors, and more.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The creative storyteller:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Blend text, images, audio, and video into one seamless experience. Imagine building an interactive storybook or a full marketing asset generator in a single workflow.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The UI navigator:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Create a helping hand that interprets visual screens. Maybe you want to create a universal web navigator or a visual QA tester that performs actions based on user intent.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Crucial note:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Your project &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;must&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; use a Gemini model (like Gemini 3 or Nano Banana) and the Gen AI SDK or Agent Development Kit (ADK). Lastly, you must use at least one Google Cloud service, such as Firestore, CloudSQL, Cloud Run, or Vertex AI.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to start building?&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Head over to our hackathon website to register, watch the kickoff &lt;/span&gt;&lt;a href="https://www.youtube.com/watch?v=-AAwoj4qN8M" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;video&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and review the official rules. Submissions are open &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;until March 16, 2026&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://geminiliveagentchallenge.devpost.com" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Register for the Gemini Live Agent Challenge&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 06 Mar 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/training-certifications/join-the-gemini-live-agent-challenge/</guid><category>Developers &amp; Practitioners</category><category>Training and Certifications</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Landscape_16x9_6kmmGy3.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Calling all devs: Build the future of Multimodal AI in the Gemini Live Agent Challenge</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Landscape_16x9_6kmmGy3.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/training-certifications/join-the-gemini-live-agent-challenge/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Dilasha Panigrahi</name><title>Product Marketing Manager</title><department></department><company></company></author></item><item><title>Cost-Effective AI with Ollama, GKE GPU Sharing, and vCluster</title><link>https://cloud.google.com/blog/topics/developers-practitioners/cost-effective-ai-with-ollama-gke-gpu-sharing-and-vcluster/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As organizations scale their AI workloads, two major challenges often emerge: the high cost of underutilized GPUs and the operational complexity of managing isolated environments for multiple teams. 
Dedicating a whole GPU to a single pod is often inefficient, while managing separate clusters for every team is operationally heavy.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this post, we'll demonstrate how to solve both problems by combining Google Kubernetes Engine (GKE) &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/timesharing-gpus#gpu_time-sharing_or_nvidia_mps"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GPU time-sharing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with &lt;/span&gt;&lt;a href="https://www.vcluster.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;vCluster&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for multi-tenancy. We'll deploy &lt;/span&gt;&lt;a href="https://ollama.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ollama&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to serve open models (like Mistral) in isolated virtual environments that share the same physical GPU infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The Architecture: Virtual Clusters on Shared Hardware&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The architecture leverages GKE Autopilot to abstract away the physical infrastructure. Instead of managing nodes, you simply deploy workloads, and Autopilot provisions the necessary hardware on demand, including GPUs, drivers, etc.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This setup lets teams have their own isolated environments, APIs, and Ollama instances, and potentially different models, while running on the same cost-effective, shared GPU nodes. For example, Team A (e.g., Legal Research) and Team B (e.g., Customer Support) can work in separate environments while they share GPU resources.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/cost-effective-ai-ollama-gke-vcluster-shar.max-1000x1000.png"
        
          alt="cost-effective-ai-ollama-gke-vcluster-shared-nodes"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;vCluster lets you create virtual Kubernetes clusters on top of an existing Kubernetes cluster. It supports various tenancy modes, including the shared nodes model that's shown in the diagram, where each virtual cluster gets its own isolated control plane while sharing the underlying worker nodes. Each virtual cluster can be accessed independently by teams who get full admin access to their cluster without interfering with others. This model also lets you leverage host cluster features when needed, and you have the ability to deploy your own controllers and CRDs inside each virtual cluster.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When you use vCluster, you can use any of these tenancy modes:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Shared nodes&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The shared nodes mode allows multiple virtual clusters to run workloads on the same physical Kubernetes nodes. This configuration is ideal for scenarios where maximizing resource utilization is a top priority—especially for internal developer environments, CI/CD pipelines, and cost-sensitive use cases.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Private nodes&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Using private nodes is a mode for vCluster where, instead of sharing the host cluster's worker nodes, individual worker nodes are joined to a vCluster. These private nodes act as the vCluster's worker nodes and they aren't shared with other vClusters on the same host cluster.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Auto nodes&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: You can configure vCluster to automatically provision and join worker nodes based on the node and resource requirements. To use auto nodes, you need vCluster Platform installed and vCluster needs to be connected to it.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Standalone&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: vCluster Standalone is a different architecture mode for vCluster for the control plane and node. The standalone mode doesn't require a host cluster. vCluster is deployed directly onto nodes like other Kubernetes distributions. vCluster Standalone can run on any type of node, whether it's a bare-metal node or a VM. It provides the strictest isolation for workloads because there's no shared host cluster for the control plane or worker nodes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Deployment&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To follow along on the deployment steps, make sure that you have the following installed:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/sdk/docs/install-sdk"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;gcloud CLI&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="http://vcluster.com/install" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;vcluster CLI&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://kubernetes.io/docs/reference/kubectl/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;kubectl&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://kubectx.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;kubectx&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 1: Set up and Create the GKE Autopilot Cluster&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Unlike GKE Standard, we don't need to calculate node counts or configure node pools manually. We create the cluster, fetch credentials, and let Autopilot provision nodes on demand.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Set environment variables and create a GKE Autopilot cluster:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;export PROJECT_ID=YOUR_PROJECT_ID
export REGION=YOUR_REGION_ID
# Create GKE Autopilot cluster
gcloud container clusters create-auto vcluster-gpu-sharing \
  --region=$REGION --project $PROJECT_ID&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Replace &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;YOUR_PROJECT_ID&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;YOUR_REGION_ID&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; with the Google Cloud project and region that you want to use.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Get the credentials to configure your local kubectl:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;gcloud container clusters get-credentials vcluster-gpu-sharing \
  --region $REGION --project $PROJECT_ID&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 2: Create Virtual Clusters (vClusters)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the Autopilot cluster running, we can now create isolated environments for our tenants. We'll create two vClusters, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;demo1&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;demo2&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. You'll need a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;vcluster.yaml&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; manifest file for configuration.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When you use GKE Autopilot, it might take a few minutes to create the first vCluster. This is because vCluster waits for its own control plane pods to be up and running. Because Autopilot provisions the underlying nodes dynamically in response to this new workload, there's a brief delay while the infrastructure is initialized.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Create the vcluster configuration file\r\ncat &amp;lt;&amp;lt;EOF &amp;gt; vcluster.yaml\r\n# Place your vCluster configuration here. \r\n# For GPU workloads on GKE Autopilot, this typically involves \r\n# enabling node synchronization so the vCluster can see the \r\n# underlying GPU nodes provided by Autopilot.\r\nsync:\r\n fromHost:\r\n   ingressClasses:\r\n     enabled: true\r\n   nodes:\r\n     enabled: true\r\n toHost:\r\n   ingresses:\r\n     enabled: true\r\nEOF\r\n\r\n# Create the first virtual cluster\r\nvcluster create demo1 -n demo1 -f vcluster.yaml\r\n\r\n# Create the second virtual cluster\r\nvcluster create demo2 -n demo2 -f vcluster.yaml&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601e8a370&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;Note&lt;/strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;: If you receive an error warning that you're trying to create a vCluster inside another, select &lt;code&gt;no&lt;/code&gt; and then switch back to the correct host context.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 3: Deploy Ollama to the Virtual Cluster&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Next, we deploy Ollama inside the first virtual cluster.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Create the deployment manifest for Ollama. This manifest deploys Ollama and it uses a Kubernetes Service to expose it on port 11434. Nodes are selected that use &lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/timesharing-gpus"&gt;GPU time-sharing&lt;/a&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Create Ollama deployment manifest
cat &amp;lt;&amp;lt;EOF &amp;gt; ollama.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
 name: ollama
 namespace: default
spec:
 replicas: 1
 selector:
   matchLabels:
     app: ollama
 template:
   metadata:
     labels:
       app: ollama
   spec:
     nodeSelector:
        # Select nodes that use GPU time-sharing, allow up to five
        # containers to share the underlying GPU, and have NVIDIA L4 GPUs.
       cloud.google.com/gke-gpu-sharing-strategy: "time-sharing"
       cloud.google.com/gke-max-shared-clients-per-gpu: "5"
       cloud.google.com/gke-accelerator: nvidia-l4
     containers:
     - name: ollama
       image: ollama/ollama:latest
       ports:
       - containerPort: 11434
       resources:
         limits:
           nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
 name: ollama
 namespace: default
spec:
 selector:
   app: ollama
 ports:
 - port: 11434
   targetPort: 11434
 type: ClusterIP
EOF&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;When the vCluster is active, switch contexts to work inside demo1:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Connect to the virtual cluster demo1
vcluster connect demo1 -n demo1&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Deploy Ollama in the virtual environment:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Apply your deployment manifest
kubectl apply -f ollama.yaml&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Even though we're in a virtual cluster, when we create pods that request GPUs, the request is synced to the host. GKE Autopilot detects this requirement and automatically attaches the necessary GPU hardware to the nodes that are running your workloads.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
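Step 4 refers to the Ollama pod as `<pod-name>`. One way to wait for the pod and capture its name is sketched below; this is a minimal sketch that assumes the `app: ollama` label from the manifest above, not a step from the original walkthrough:

```shell
# Wait for the Ollama pod to become Ready (the first start can take a while,
# because Autopilot provisions the GPU node on demand)
kubectl wait --for=condition=Ready pod -l app=ollama --timeout=15m

# Capture the pod name to use in place of <pod-name> in the next step
POD_NAME=$(kubectl get pods -l app=ollama -o jsonpath='{.items[0].metadata.name}')
echo "$POD_NAME"
```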
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 4: Pulling and Testing the Model&lt;/span&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;With the server running, perform the model pull and test entirely within the virtual cluster context:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Execute the pull command inside the pod
kubectl exec -it &amp;lt;pod-name&amp;gt; -- ollama pull mistral&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Verify the API:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Port forward the Ollama service
kubectl port-forward svc/ollama 8080:11434
# Send a chat request in a new window
curl -s http://localhost:8080/api/chat \
 -H "Content-Type: application/json" \
 -d '{ "model": "mistral", "stream": false, "messages": [ {"role": "user", "content": "Explain GKE Autopilot"} ] }' | jq -r '.message.content'&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
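The `jq -r '.message.content'` filter in the verification step pulls just the assistant text out of the JSON that Ollama's `/api/chat` endpoint returns. A minimal offline sketch of that extraction; the JSON below only illustrates the response shape, and the content value is made up:

```shell
# Illustrative non-streaming /api/chat response body (shape only; made-up content)
response='{ "model": "mistral", "message": { "role": "assistant", "content": "GKE Autopilot manages nodes for you." }, "done": true }'

# Extract just the assistant text, as the verification step above does
echo "$response" | jq -r '.message.content'
# prints: GKE Autopilot manages nodes for you.
```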
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 5: Deploy Ollama to vCluster demo2&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Repeat the steps to deploy Ollama and pull the model to the second virtual cluster:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Connect to the virtual cluster\r\nvcluster connect demo2 -n demo2\r\n\r\n# Apply your deployment manifest\r\nkubectl apply -f ollama.yaml\r\n\r\n# Execute the pull command inside the pod\r\nkubectl exec -it &amp;lt;pod-name&amp;gt; -- ollama pull mistral\r\n\r\n# Port forward the Ollama service\r\nkubectl port-forward svc/ollama 8080:11434\r\n\r\n# Send a chat request in a new window\r\ncurl -s http://localhost:8080/api/chat \\\r\n -H &amp;quot;Content-Type: application/json&amp;quot; \\\r\n -d \&amp;#x27;{ &amp;quot;model&amp;quot;: &amp;quot;mistral&amp;quot;, &amp;quot;stream&amp;quot;: false, &amp;quot;messages&amp;quot;: [ {&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Explain GKE Autopilot&amp;quot;} ] }\&amp;#x27; | jq -r \&amp;#x27;.message.content\&amp;#x27;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601e8a250&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Verify the Underlying Infrastructure&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now let's switch back to the host cluster context and see what's going on.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Check how many nodes have been provisioned and where are the Ollama pods running:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# List the available contexts
kubectx
# Switch to the host cluster context
kubectx gke_$PROJECT_ID_$REGION_vcluster-gpu-sharing
# List nodes
Kubectl nodes&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You should see two nodes. One is running the vCluster components. The other runs the Ollama instances with L4 GPUs. Your output should look like this (node names will be different):&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ kubectl get nodes
NAME                                                  STATUS   ROLES    AGE    VERSION
gk3-vcluster-gpu-sharing-nap-1w88cyly-895203e4-xbqk   Ready    &amp;lt;none&amp;gt;   7h8m   v1.33.5-gke.2072000
gk3-vcluster-gpu-sharing-pool-2-0a984fed-7mff         Ready    &amp;lt;none&amp;gt;   4d     v1.33.5-gke.2072000&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;C&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;heck where the Ollama pods are running:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Check the Nodes running the Ollama pods
kubectl get pods -n demo1 -o wide
kubectl get pods -n demo2 -o wide&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Notice that both Ollama pods are running on the same node. This node was provisioned by GKE Autopilot with L4 GPUs and GPU time-sharing configured.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
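To double-check that Autopilot honored the sharing configuration, you can inspect the GPU node's labels from the host context. This is a minimal sketch, not part of the original steps; replace `NODE_NAME` with the GPU node name from the previous output:

```shell
# Show the GPU-sharing related labels on the node that runs the Ollama pods
kubectl describe node NODE_NAME \
  | grep -E 'gke-accelerator|gke-gpu-sharing-strategy|gke-max-shared-clients'
```

If time-sharing is active, the labels mirror the nodeSelector from the manifest (strategy, max shared clients, and accelerator type).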
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Conclusion&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By using GKE Autopilot, we've removed the need to manually configure GPU node pools or time-sharing strategies. Autopilot provides resources dynamically, while vCluster ensures that Team A's Legal Research data and Team B's Customer Support bots remain completely isolated. This implementation provides a robust, low-maintenance platform for scaling AI workloads.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 06 Mar 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/cost-effective-ai-with-ollama-gke-gpu-sharing-and-vcluster/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/cost-effective-ai-ollama-gke-vcluster-hero.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Cost-Effective AI with Ollama, GKE GPU Sharing, and vCluster</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/original_images/cost-effective-ai-ollama-gke-vcluster-hero.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/cost-effective-ai-with-ollama-gke-gpu-sharing-and-vcluster/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Abdel Sghiouar</name><title>Senior Cloud Developer Advocate</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Saiyam Pathak</name><title>DevRel</title><department></department><company>vCluster</company></author></item><item><title>Data Strategy = AI Strategy Series: Transforming Developers into AI Architects with Google Cloud</title><link>https://cloud.google.com/blog/topics/developers-practitioners/data-strategy-ai-strategy-series-transforming-developers-into-ai-architects-with-google-cloud/</link><description>&lt;div 
class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Your agent is only as good as your data grounding.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; If your data is messy, your agent will be highly confident and still prone to hallucination. In 2026, your Data Strategy and your AI Strategy are the same thing; you cannot have one without the other.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This series, "Data Strategy = AI Strategy," explores the various aspects of that strategy and shows how to architect workflows that are more deterministic while still building autonomous agents. This first episode focuses on the convergence of data and AI architecture.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The industry is reaching a critical inflection point. Although 2024 and 2025 were defined by the "API era"—where developers learned to integrate LLMs through apps and endpoints—2026 demands a shift towards &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;enterprise architecture&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. To build production-ready applications, developers must transition from writing prompts to designing &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;intelligent end-to-end stacks&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The challenge isn't just about the AI model that you use; it's about the infrastructure around it. For an application to be enterprise-grade, it must meet the requirements of three critical pillars: &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;speed&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;scale&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;security&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. This article focuses on these pillars: moving away from building agents focused on AI adoption, and toward architecting agents that are grounded in well-strategized context. More specifically, this blog and the linked codelabs provide a hands-on learning path that shows you how to build such an architecture by using Google's data cloud, with relational databases that are fully PostgreSQL compatible.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The Strategic Pivot: The Database as the Context Engine&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a modern AI stack, the database is no longer just a storage layer; it has become the &lt;em&gt;context engine&lt;/em&gt;. Our strategy centers on using fully PostgreSQL-compatible services like AlloyDB for PostgreSQL and &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud SQL &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;to eliminate the primary bottlenecks of AI in production: latency, AI capabilities, and retrieval accuracy.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To enable this transition from developer to architect, the learning path focuses on eliminating infrastructure friction and prioritizing high-level architectural design.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Eliminating the Infrastructure Tax&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Historically, the transition from local prototyping to cloud-scale deployment was hindered by the &lt;em&gt;infrastructure tax&lt;/em&gt;—the hours spent on configuring clusters, instances, and VPC network peering. By introducing automated setup utilities, we let developers bypass these configuration hurdles.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The result is a shift in focus: instead of managing infrastructure, developers can spend their time on designing secure data flows and high-throughput vector pipelines. In our recent instructor-led Code Vipassana sessions, each participating developer saved over an hour of time in each lab because of this shift. This approach effectively accelerates the path to production.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Building for Scale: One Million Vectors, Zero Loops&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To build enterprise architecture, you need to move beyond small-scale demos. We focus on &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;batch processing for embeddings&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to expedite vector search processes. AlloyDB can generate embeddings at scale directly within the database layer. By using this capability, we can eliminate the latency of traditional loops, which allows us to do real-time analytics on massive datasets.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Sovereign Intelligence and Row-Level Security (RLS)&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Security in AI is more than only a firewall; it's about data governance. We emphasize the use of &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;row-level security (RLS) &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;to help ensure that AI agents can access only the specific data they're authorized to see. This &lt;em&gt;private vault&lt;/em&gt; architecture is essential for regulated industries where data isolation is a non-negotiable requirement. Imagine your user talking to your agent and learning about another user or a benchmark. Baking the data level security into the database is not an option anymore. We cannot rely on agents to make the call on who should be informed of what.&lt;/span&gt;&lt;/p&gt;
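As a hedged sketch of what baking RLS into the database can look like in PostgreSQL: the `documents` table, `tenant_id` column, and `app.tenant_id` session setting below are hypothetical names for illustration, not objects from the labs:

```shell
# Hypothetical PostgreSQL RLS setup, applied with psql.
# Assumes a "documents" table with a "tenant_id" column; adjust names to your schema.
psql "$DATABASE_URL" <<'SQL'
-- Turn on row-level security for the table the agent reads from
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

-- Only rows whose tenant_id matches the session's app.tenant_id are visible
CREATE POLICY tenant_isolation ON documents
  USING (tenant_id = current_setting('app.tenant_id'));
SQL
```

With a policy like this in place, the database filters rows before the agent ever sees them, so isolation doesn't depend on the agent's judgment.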
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The Architectural Learning Path&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We have curated a series of hands-on technical labs that form a complete narrative of enterprise AI development. Each lab represents a specific layer of the intelligent stack&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Most of the labs use AlloyDB. However, the momentum of this architectural strategy has also extended to the Cloud SQL ecosystem. Our learning path includes a couple of alternative labs for Cloud SQL for PostgreSQL users.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our recommended learning path includes the following core architectural labs:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://codelabs.developers.google.com/quick-alloydb-setup" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AlloyDB Quick Setup Lab&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This lab serves as the entry point for architects, by demonstrating how to provision a high-performance AlloyDB cluster with the required VPC and network settings in minutes. It focuses on the &lt;em&gt;day 0&lt;/em&gt; operations that help to ensure a secure and scalable foundation for all subsequent AI logic.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://codelabs.developers.google.com/connect-to-alloydb-on-cloudrun" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Connect your app to AlloyDB data and deploy on Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (or &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/connect-to-cloudsql-on-cloudrun" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Connect to Cloud SQL and deploy on Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Moving into deployment, this lab explores the architecture of serverless applications. Developers learn how to connect Cloud Run services to AlloyDB (or Cloud SQL), with a focus on using managed identities and connection strings to improve security. This approach helps ensure that the application layer is as robust as the data layer. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://codelabs.developers.google.com/gemini-3-flash-on-alloydb-sustainability-app" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Build a Real-Time Surplus Engine: Gemini 3 Flash &amp;amp; AlloyDB&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (or &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/gemini-3-on-cloudsql-sustainability-app" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini 3 Flash &amp;amp; Cloud SQL&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This lab addresses the pillar for &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;speed&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; by building an end-to-end, data-driven AI app. It demonstrates how to use the high-efficiency Gemini 3 Flash model to process streaming data and generate real-time insights. This approach creates a responsive feedback loop between the database and the end user.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://codelabs.developers.google.com/embeddings-at-scale-with-alloydb" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;One Million Vectors, Zero Loops: Scale with AlloyDB&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Focused on &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;scale&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, this lab dives deeply into the vector search process. Architects learn how to implement batch processing for embeddings directly within the database. This approach bypasses application-layer bottlenecks, which enables the ingestion and search of millions of vectors with enterprise-grade performance.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://codelabs.developers.google.com/zero-trust-agents-with-alloydb" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;The Private Vault: Zero Trust Intelligence with RLS&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The final piece of the architectural puzzle focuses on improved &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;security&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. This lab guides developers through building &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;zero trust&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; agents. By implementing RLS in PostgreSQL, developers can help ensure that their AI agents respect user-specific data boundaries. This approach provides a blueprint for compliant AI systems with enhanced security.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Designing the Future&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By removing the friction from infrastructure and focusing on the core principles of speed, scale, and security, we can empower a new generation of AI architects. This strategic shift can help ensure that the applications built today are ready for the production demands of tomorrow.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To join our upcoming instructor-led, hands-on sessions and begin your transformation from developer to architect, &lt;/span&gt;&lt;a href="https://codevipassana.dev" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;sign up for Code Vipassana&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 03 Mar 2026 19:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/data-strategy-ai-strategy-series-transforming-developers-into-ai-architects-with-google-cloud/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/ai-strategy-transform-devs-ai-architects-her.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Data Strategy = AI Strategy Series: Transforming Developers into AI Architects with Google Cloud</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/ai-strategy-transform-devs-ai-architects-her.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/data-strategy-ai-strategy-series-transforming-developers-into-ai-architects-with-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Abirami Sukumaran</name><title>Staff Developer Advocate, Google</title><department></department><company></company></author></item><item><title>Announcing the MCP Toolbox Java SDK</title><link>https://cloud.google.com/blog/topics/developers-practitioners/announcing-the-mcp-toolbox-java-sdk/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Engineering teams are moving beyond simple chatbots to build agentic systems 
that interact directly with mission-critical databases. However, building these enterprise agents often means hitting an integration wall of custom glue code, brittle APIs, and complex database logic.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To replace these hardcoded bottlenecks with a secure, unified control plane, we are thrilled to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;announce the&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://github.com/googleapis/mcp-toolbox-sdk-java" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Java SDK&lt;/strong&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; for the Model Context Protocol (MCP) Toolbox for Databases&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This release brings first-class, typesafe agent orchestration to the world’s most widely adopted enterprise ecosystem. Java's mature architecture is purpose-built for these rigorous demands, providing the high concurrency, strict transactional integrity, and robust state management required to safely scale mission-critical AI agents in production.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;MCP: The USB Type-C for AI Agents&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Think of the Model Context Protocol (MCP) as a universal translator for AI.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Created to standardize how AI models connect to external tools and datasets, MCP replaces custom, fragmented integration scripts with a secure, universal protocol. Whether your agent needs to execute a transactional SQL query, search through thousands of policy documents, or trigger a REST API, MCP provides a single, unified interface. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With &lt;/span&gt;&lt;a href="https://googleapis.github.io/genai-toolbox/getting-started/introduction" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;MCP Toolbox&lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; for &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;databases&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, we’ve made implementing this protocol effortless.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;MCP Toolbox for Databases&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://googleapis.github.io/genai-toolbox/getting-started/introduction" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;MCP Toolbox for Databases&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is an open source MCP server for databases. It natively supports 42 different data sources spanning AlloyDB, Cloud SQL, Cloud Spanner, and many more including third party data sources as well. Crucially, it gives you the ability to define &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;custom tools&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; that safely map an AI agent's natural language intents directly to specific database operations. It enables you to develop tools easier, faster, and more securely by handling the complexities such as connection pooling, authentication, and more.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We already provide robust, production-ready SDKs for &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Python, JavaScript, TypeScript, and Go&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. But when it comes to "Day 2" production workloads—where high concurrency, transactional integrity, and conversational state management are non-negotiable—&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Java and Spring Boot&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; remain the undisputed heavyweights.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the new Java SDK, you can natively build stateful, highly concurrent multi-agent systems without ever leaving your preferred tech stack. This SDK brings first-class, type-safe orchestration to Java, which is potentially a major priority for enterprise architects.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Get Started with the Java SDK&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’ve designed the MCP Toolbox Java SDK to be frictionless for enterprise teams. You can start building your agents today.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Add the Dependency&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To bring the MCP Toolbox into your Java or Spring Boot project, simply add the following dependency to your &lt;/span&gt;&lt;code&gt;&lt;span style="vertical-align: baseline;"&gt;pom.xml&lt;/span&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;&amp;lt;dependency&amp;gt;\r\n   &amp;lt;groupId&amp;gt;com.google.cloud.mcp&amp;lt;/groupId&amp;gt;\r\n   &amp;lt;artifactId&amp;gt;mcp-toolbox-sdk-java&amp;lt;/artifactId&amp;gt;\r\n   &amp;lt;version&amp;gt;0.2.0&amp;lt;/version&amp;gt;\r\n&amp;lt;/dependency&amp;gt;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee0310&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That’s it!!! You’d be good to go, just like we did in this enterprise grade use case sample below!!&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Real World Example: The Autonomous Transit Concierge&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To demonstrate the power of the Java SDK for MCP Toolbox combined with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;AlloyDB&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, let’s look at a use case from the transportation sector.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Meet &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/agents/cymbal-transit" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cymbal Transit&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a fictitious intercity bus network. Customers don't want to click through 15 dropdown menus to plan a trip. They want to ask:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I need to get from New York to Boston tomorrow morning. Can I bring my Golden Retriever? If so, book me the fastest trip."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To answer this, an AI agent must seamlessly cross-reference unstructured data (pet policies) with structured data (schedules, seat availability) and execute a transaction (booking)—all while remembering the context of the conversation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here is how we build this using the Java SDK for MCP Toolbox.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Foundation: Database! (AlloyDB Schema with Native Embeddings)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AlloyDB is the perfect engine for this because it handles relational data and high-dimensional AI vectors in a single query engine. Even better, AlloyDB can generate embeddings &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;natively&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; using the &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;google_ml_integration&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/alloydb/docs/ai/configure-vertex-ai#verify-installed-extension"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;extension&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, meaning your Java app doesn't have to shuffle text back and forth to an embedding API.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;First, set up AlloyDB cluster and instance by following this quick &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/quick-alloydb-setup" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;one-click deploy lab&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Then, set up database objects using the SQL statements below:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;quot;-- Enable necessary extensions for AI semantic search and embedding generation\r\nCREATE EXTENSION IF NOT EXISTS vector;\r\nCREATE EXTENSION IF NOT EXISTS google_ml_integration;\r\n\r\n-- Table 1: Transit Policies (Unstructured Data for RAG)\r\nCREATE TABLE transit_policies (\r\n    policy_id SERIAL PRIMARY KEY,\r\n    category VARCHAR(50),\r\n    policy_text TEXT,\r\n    policy_embedding vector(768) \r\n);\r\n\r\n-- Table 2: Intercity Bus Schedules (Structured Data)\r\nCREATE TABLE bus_schedules (\r\n    trip_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),\r\n    origin_city VARCHAR(100),\r\n    destination_city VARCHAR(100),\r\n    departure_time TIMESTAMP,\r\n    arrival_time TIMESTAMP,\r\n    available_seats INT DEFAULT 50,\r\n    ticket_price DECIMAL(6,2)\r\n);\r\n\r\n-- Table 3: Booking Ledger (Transactional Action Data)\r\nCREATE TABLE bookings (\r\n    booking_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),\r\n    trip_id UUID REFERENCES bus_schedules(trip_id),\r\n    passenger_id VARCHAR(100),\r\n    status VARCHAR(20) DEFAULT &amp;#x27;CONFIRMED&amp;#x27;,\r\n    booking_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP\r\n);&amp;quot;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee0970&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With our tables defined and vector support enabled, AlloyDB is now ready to serve as the unified brain for both our structured transactional data and semantic knowledge base.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Ingesting Records and Generating Real Embeddings&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We need a robust dataset to ensure our agent's context window has real options to reason over. Using PostgreSQL's powerful &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;generate_series &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;in AlloyDB, we can instantly seed our database with over 200 realistic bus trips for tomorrow. We have taken this approach to ingesting mock data for this demo application.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;-- 1. Insert Unstructured Policies and GENERATE REAL EMBEDDINGS natively in AlloyDB\r\nINSERT INTO transit_policies (category, policy_text, policy_embedding) \r\n... (refer repo for full statement)\r\n\r\n-- 2. Generate 200+ Realistic Schedules for the Next 7 Days using generate_series\r\nINSERT INTO bus_schedules (origin_city, destination_city, departure_time, arrival_time, ticket_price, available_seats)\r\n... (refer repo for full statement)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee0a00&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
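&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To illustrate the &lt;code&gt;generate_series&lt;/code&gt; technique itself (this is a minimal sketch with made-up routes and prices, not the repo's actual statement), seeding one departure per hour for tomorrow could look like this:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;-- Seed an hourly New York-to-Boston departure for tomorrow, 6 AM to 10 PM
INSERT INTO bus_schedules
    (origin_city, destination_city, departure_time, arrival_time, ticket_price, available_seats)
SELECT
    'New York',
    'Boston',
    ts,
    ts + INTERVAL '4 hours',
    49.99,
    50
FROM generate_series(
    date_trunc('day', now()) + INTERVAL '1 day 6 hours',
    date_trunc('day', now()) + INTERVAL '1 day 22 hours',
    INTERVAL '1 hour'
) AS ts;&lt;/pre&gt;&lt;/div&gt;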
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Refer to the &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/agents/cymbal-transit" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;repo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for the full code.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Just like that, our database is dynamically populated with realistic schedules and natively generated embeddings, giving our AI agent immediate access to a rich, queryable environment.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Stateful Agent Architecture in Spring Boot&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The hardest part of building a conversational UI is &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;session management&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. If a user asks, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"What times are available?"&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and then says, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"Book the 8 AM one,"&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; the agent needs to remember the context.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Using the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Java MCP Toolbox SDK&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Spring Boot&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;LangChain4j&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, we can seamlessly maintain conversational memory in the HTTP Session and inject it into the agent's thought process. By pairing a modern frontend with this stateful backend, your enterprise application becomes a continuous, intelligent workspace.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of writing massive if/else blocks to parse user intent, we simply define a declarative AI interface and bind our MCP tools to it. How simple it is to bring orchestration logic to the code without having to hard-code detailed queries or add blocks of static conditional code:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;interface TransitAgent {\r\n    @SystemMessage({\r\n        &amp;quot;You are the Cymbal Transit Concierge.&amp;quot;,\r\n        &amp;quot;Use the \&amp;#x27;querySchedules\&amp;#x27; tool for finding schedules.&amp;quot;,\r\n        &amp;quot;Use \&amp;#x27;bookTicket\&amp;#x27; to execute transactions.&amp;quot;,\r\n        &amp;quot;Use \&amp;#x27;searchPolicies\&amp;#x27; to look up luggage and pet rules.&amp;quot;\r\n    })\r\n    String chat(@MemoryId String sessionId, @UserMessage String userMessage);\r\n}\r\n\r\n@Service\r\nclass TransitAgentTools {\r\n    // These methods automatically call our MCP Toolbox for Databases server!\r\n    @Tool(&amp;quot;Query specific schedules between an origin and destination city.&amp;quot;)\r\n    public String querySchedules(String origin, String destination) { ... }\r\n\r\n    @Tool(&amp;quot;Book a ticket for a passenger.&amp;quot;)\r\n    public String bookTicket(String tripId, String passengerName) { ... }\r\n}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee09a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By cleanly separating the LLM prompt logic from the actual tool execution, LangChain4j ensures our agent remains focused, predictable, and remarkably easy to maintain over time. By pairing a modern frontend with this stateful Spring Boot backend, your enterprise application becomes a continuous, intelligent workspace rather than a series of disconnected prompts.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Mapping Intents to SQL: The tools.yaml&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The true magic of MCP Toolbox for Databases is how you define these custom tools. Your Java application doesn't need direct SQL access, nor does the LLM need to hallucinate table schemas. Instead, you provide the MCP server with a clean tools.yaml configuration. This file securely maps the agent’s tool calls directly to parameterized SQL statements in AlloyDB.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;tools:\r\n  query-schedules:\r\n    kind: postgres-sql\r\n    source: alloydb\r\n    description: Find available bus schedules between cities.\r\n    parameters:\r\n      - name: origin\r\n        type: string\r\n      - name: destination\r\n        type: string\r\n    statement: |\r\n      SELECT CAST(trip_id AS TEXT) AS trip_id, departure_time, ticket_price \r\n      FROM bus_schedules \r\n      WHERE lower(origin_city) = lower($1) AND lower(destination_city) = lower($2)\r\n\r\n  search-policies:\r\n    ... refer to the repo for the full tools.yaml&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee0eb0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With this simple declarative configuration, you've bridged the gap between natural language intents and complex SQL queries—without ever exposing your database schema directly to the LLM.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Connecting the Dots: Listing, Invoking, and Executing Tools in Java&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now that our database and tools are configured, the MCP Toolbox Java SDK handles the heavy lifting of interacting with them. The SDK provides an intuitive, type-safe API to securely discover, query, and execute transactions from your Spring Boot service.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;// 1. Initialize the Client\r\nMcpToolboxClient mcpClient = McpToolboxClient.builder()\r\n    .baseUrl(&amp;quot;https://toolbox-my-project-uc.a.run.app&amp;quot;)\r\n    .apiKey(myIdToken) \r\n    .build();\r\n\r\n// 2. Listing Discoverable Tools\r\nmcpClient.listTools().thenAccept(tools -&amp;gt; {\r\n    System.out.println(&amp;quot;Successfully discovered &amp;quot; + tools.size() + &amp;quot; tools.&amp;quot;);\r\n});\r\n\r\n// 3. Invoking a Tool (Read-Only Data)\r\nString schedules = mcpClient.invokeTool(&amp;quot;query-schedules&amp;quot;, Map.of(\r\n    &amp;quot;origin&amp;quot;, &amp;quot;New York&amp;quot;,\r\n    &amp;quot;destination&amp;quot;, &amp;quot;Boston&amp;quot;\r\n)).join().content().get(0).text();\r\n\r\n// 4. Executing a Transactional Tool (Requires Bound Authentication)\r\nAuthTokenGetter toolAuthGetter = () -&amp;gt; CompletableFuture.completedFuture(myIdToken);\r\n\r\nString bookingConfirmation = mcpClient.loadTool(&amp;quot;book-ticket&amp;quot;, Map.of(&amp;quot;google_auth&amp;quot;, toolAuthGetter))\r\n    .thenCompose(tool -&amp;gt; {\r\n        // Bind the authenticated user context securely\r\n        tool.bindParam(&amp;quot;passenger_name&amp;quot;, &amp;quot;Jane Doe&amp;quot;);\r\n        // Execute the mutable transaction\r\n        return tool.execute(Map.of(&amp;quot;trip_id&amp;quot;, &amp;quot;123e4567-e89b-12d3-a456-426614174000&amp;quot;));\r\n    }).join().content().get(0).text()&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee0ac0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;And there you have it—in just a few lines of type-safe Java, your Spring Boot application is securely discovering and executing remote tools as if they were local methods. Notice the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;bindParam&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; method above? This powerful feature allows you to securely inject application-level context (like the authenticated user's identity) directly into the database transaction, bypassing the LLM entirely. You can learn more about this in the MCP Toolbox Java SDK &lt;/span&gt;&lt;a href="https://github.com/googleapis/mcp-toolbox-sdk-java?tab=readme-ov-file#why-bind-parameters" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Zero-Config Security with Application Default Credentials (ADC)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Notice there are no hardcoded secrets or JSON keys to manage! By fetching myIdToken using Google's Application Default Credentials (ADC) under the hood, your Java app automatically inherits its secure identity directly from the environment. Whether you are developing locally via the gcloud CLI or running in production, your application stays secure by default with zero manual credential configuration.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Deploying the Fleet to Cloud Run&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Remember those "Day 2" enterprise requirements we mentioned earlier? To successfully handle high concurrency, transactional integrity, and maintain stateful conversations at scale, your architecture needs to be robust. Because the MCP Toolbox for Databases and our Spring Boot Agent are fully decoupled, they can scale independently on Google Cloud Run to meet those exact demands.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;First, you deploy the open-source MCP Toolbox for Databases as its own secure service:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Download the toolbox CLI\r\nexport VERSION=0.27.0\r\ncurl -L -o toolbox https://storage.googleapis.com/genai-toolbox/v$VERSION/linux/amd64/toolbox\r\nchmod +x toolbox&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee0190&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Next, deploy the local toolbox server to Cloud Run:&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Follow instructions &lt;/span&gt;&lt;a href="https://googleapis.github.io/genai-toolbox/how-to/deploy_toolbox" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the official documentation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once the Toolbox is running, it acts as the secure, highly scalable bridge between your database and the outside world.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Next, deploy your Java Spring Boot Agent, injecting the dynamically generated MCP Toolbox URL and your Vertex AI settings as environment variables:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Set up the Cymbal Transit Agent App:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Clone the &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/agents/cymbal-transit" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;repo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Then, you deploy your Java Spring Boot Agent, injecting the dynamically generated MCP Toolbox URL and your Vertex AI settings as environment variables:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;gcloud run deploy cymbal-transit \
  --source . \
  --allow-unauthenticated \
  --set-env-vars GCP_PROJECT_ID=my-project,GCP_REGION=us-central1,GEMINI_MODEL_NAME=gemini-2.5-flash,MCP_TOOLBOX_URL=https://toolbox-...&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With a single command, your stateful, multi-agent enterprise application is live on Cloud Run, ready to securely orchestrate workflows on behalf of your users!&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The Era of Hardcoded Integrations is Over&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The transition from stateless chatbots to autonomous, transactional agents is the defining technological shift of this decade. But agents are only as powerful as the systems they can securely interact with.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the MCP Toolbox for Databases Java SDK, enterprise developers finally have a native, elegant, and highly scalable way to give their AI agents read-and-write access to the mission-critical systems of record that run their business.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to build your own stateful enterprise agents? &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Explore the official &lt;/strong&gt;&lt;a href="https://github.com/googleapis/mcp-toolbox-sdk-java" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MCP Toolbox Java SDK GitHub Repository&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to get started, and check out the demo application &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/agents/cymbal-transit" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cymbal Bus Agent GitHub Repository&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to explore the complete source code and try it out today!&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 03 Mar 2026 09:29:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/announcing-the-mcp-toolbox-java-sdk/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/MCP_Toolbox_Java_SDK_Launch.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Announcing the MCP Toolbox Java SDK</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/MCP_Toolbox_Java_SDK_Launch.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/announcing-the-mcp-toolbox-java-sdk/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Abirami Sukumaran</name><title>Staff Developer Advocate, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Anubhav 
Dhawan</name><title>Software Engineer, Google</title><department></department><company></company></author></item><item><title>Designing private network connectivity for RAG-capable gen AI apps</title><link>https://cloud.google.com/blog/products/networking/design-private-connectivity-for-rag-ai-apps/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The flexibility of Google Cloud allows enterprises to build secure and reliable architecture for their AI workloads. In this blog we will look at a reference architecture for &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/architecture/private-connectivity-rag-capable-gen-ai"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;private connectivity for retrieval-augmented generation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (RAG)-capable generative AI applications. This architecture is for scenarios where communications of the overall system must use private IP addresses and must not traverse the internet.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The power of RAG&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;RAG is a powerful technique used to optimize the output of large language models (LLMs) by grounding them in specific, authoritative knowledge bases outside of their original training data. RAG allows an application to retrieve relevant information from your documents, datasources, or databases in real time. This retrieved context is then provided to the model alongside the user’s query, helping to ensure that the AI’s responses are accurate, verifiable, and highly relevant to your business. This improves the quality of responses and reduces hallucinations. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This approach is helpful because it allows you to direct generative AI to use a designated source of truth, rather than relying solely on the model's pre-existing knowledge, and without needing to retrain or fine-tune the model itself. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Design pattern example&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To understand how to think about setting up your network for private connectivity for a RAG application in a regional design, let's look at the design pattern.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The setup comprises an &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;external network&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (on-prem and other clouds) and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud environments&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; consisting of a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;routing project&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Shared VPC host project for RAG&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, and three specialized service projects: &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;data ingestion&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;serving&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;frontend&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This design utilizes the following services to provide an end-to-end solution:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/network-connectivity/docs/interconnect/concepts/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Interconnect&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; or &lt;/strong&gt;&lt;a href="https://docs.cloud.google.com/network-connectivity/docs/vpn/concepts/topologies#vpn-overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud VPN&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To securely connect from your on-premises or other clouds to the routing VPC network&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/network-connectivity/docs/network-connectivity-center/concepts/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Network Connectivity Center&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Used as an orchestration framework to manage connectivity between the routing VPC network and the RAG VPC network via VPC spokes and hybrid spokes&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/network-connectivity/docs/router/concepts/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Router&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; In the routing project, facilitates dynamic BGP route exchange between the external network and Google Cloud&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/vpc/docs/private-service-connect"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Private Service Connect&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Provides a private endpoint in the routing VPC network to reach the Cloud Storage bucket for data ingestion without traversing the public internet&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/vpc/docs/shared-vpc"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Shared VPC&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Host project architecture that allows multiple service projects to use a common, centralized VPC network&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Google &lt;/strong&gt;&lt;a href="https://docs.cloud.google.com/armor/docs/cloud-armor-overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Armor&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; and Application &lt;/strong&gt;&lt;a href="https://docs.cloud.google.com/load-balancing/docs/application-load-balancer"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Load Balancer&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Placed in the frontend service project to provide security and traffic management for user interaction&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/security/vpc-service-controls"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;VPC Service Controls&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Creates a managed security perimeter around all resources to mitigate data exfiltration risks&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
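&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the Private Service Connect piece concrete, the sketch below reserves an internal IP address and creates an endpoint for Google APIs (which fronts Cloud Storage) in the routing VPC; the network name and address are placeholders.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;

```shell
# Sketch: reserve an internal IP and create a PSC endpoint for Google
# APIs in the routing VPC. Network name and address are placeholders.
gcloud compute addresses create psc-googleapis-ip \
  --global \
  --purpose=PRIVATE_SERVICE_CONNECT \
  --addresses=10.0.0.5 \
  --network=routing-vpc
gcloud compute forwarding-rules create pscgoogleapis \
  --global \
  --network=routing-vpc \
  --address=psc-googleapis-ip \
  --target-google-apis-bundle=all-apis
```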
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1-rag-gen-ai.max-1000x1000.png"
        
          alt="1-rag-gen-ai"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The traffic flow &lt;/strong&gt;&lt;/h3&gt;
&lt;h4&gt;&lt;strong style="vertical-align: baseline;"&gt;RAG population flow&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the diagram, the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;green dashed line&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; shows the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;RAG population flow&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, which describes how data travels from data engineers to vector storage.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;From the external network, data travels over Cloud Interconnect or Cloud VPN.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;In the routing projects it uses the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Private Service Connect endpoint&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to get to the Cloud Storage bucket.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;From the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Storage bucket&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in the Data Ingestion service project, the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;data ingestion subsystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; processes the raw data. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The AI model creates vectors from the chunks, returns them to the data ingestion subsystem, which writes them to the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;RAG datastore&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in the serving service project.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong style="vertical-align: baseline;"&gt;Inference flow&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the diagram, the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;orange dashed line&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; shows the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;inference flow&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, which describes customer or user requests.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The request travels over Cloud Interconnect or Cloud VPN to the routing VPC network and then over the VPC spoke to the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;RAG VPC network&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The request reaches the Application Load&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Balancer&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;protected by&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; Cloud Armor&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;; once allowed, it passes it to the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;frontend subsystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The frontend subsystem forwards the request to the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;serving subsystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, which augments the prompt with data from the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;RAG datastore&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and generates a response via the AI model.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The system generates a response via the AI model, and the grounded response is returned along the same path to the requestor.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong style="vertical-align: baseline;"&gt;Management and routing&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the diagram, the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;blue dotted lines&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; represent the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Network Connectivity Center hybrid and VPC spokes&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; that manage the control plane and route orchestration between the routing network and the RAG VPC network. This ensures that routes learned from the external network are appropriately propagated across the environment.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Please read the entire architecture document &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/architecture/private-connectivity-rag-capable-gen-ai"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Private connectivity for RAG-capable generative AI applications&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to understand the specific including IAM permissions, VPC Service Controls, and deployment considerations.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Next steps&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Take a deeper dive into the Cross-Cloud Network, and other guides about generative AI with RAG:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Document set: &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/architecture/rag-reference-architectures"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Generative AI with RAG&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Document: &lt;/span&gt;&lt;a href="https://cloud.google.com/architecture/ccn-distributed-apps-design"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cross-Cloud Network for distributed applications &lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Blog: &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/build-your-first-adk-agent-workforce?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Build Your First ADK Agent Workforce&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Want to ask a question, find out more or share a thought? Please connect with me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/ammett/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Linkedin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 02 Mar 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/networking/design-private-connectivity-for-rag-ai-apps/</guid><category>AI &amp; Machine Learning</category><category>Hybrid &amp; Multicloud</category><category>Developers &amp; Practitioners</category><category>Networking</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/0-rag-hero.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Designing private network connectivity for RAG-capable gen AI apps</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/0-rag-hero.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/networking/design-private-connectivity-for-rag-ai-apps/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ammett Williams</name><title>Developer Relations Engineer</title><department></department><company></company></author></item><item><title>From "Vibe Checks" to Continuous Evaluation: Engineering Reliable AI Agents</title><link>https://cloud.google.com/blog/topics/developers-practitioners/from-vibe-checks-to-continuous-evaluation-engineering-reliable-ai-agents/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;I live through the same story with every single AI agent. After weeks of experiments and tests, it works like a charm. 
Then someone asks a question that the agent fails to answer properly. I rush to fix it by tweaking one of the prompts. After a handful of tweaks, the failing prompt produces good results. I try a few of my favorite prompts, and they still work. Another new question, another perfect hit. I push it to production.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Less than 24 hours later, user reports start trickling in. The agent is hallucinating dates. It fails to cite sources for obscure topics. A little change that felt so solid ended up sabotaging dozens of other use cases that I haven't bothered to verify.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is the &lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;vibe check trap&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The Vibe Check Trap&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the classical software world, if you change a line of code, you run unit tests. The predicate &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;assert 2 + 2 == 4&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; will never statistically drift. Integration tests are more complex and flaky, but they're still largely stable in well-maintained projects. But in the world of Generative AI, we're building software on top of &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;probabilistic&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; foundations. A prompt that works 99% of the time today might work 92% of the time tomorrow just because the underlying model's weight distribution shifted slightly, or because the temperature parameter introduced a new token sequence. A minor change in the prompt or grounding data format might trigger a significant regression in the model's answers.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Relying on &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;vibe checks&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;—manually chatting with the agent to see if it feels right—is a recipe for disaster in production. It's subjective, unscalable, and susceptible to confirmation bias.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This guide is for software engineers who are ready to graduate from building demos to building production-grade AI systems. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this post, we'll explore how to apply the engineering discipline of &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;continuous evaluation&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (CE) for AI agents. With CE, you refine your agent's prompts, tools, and logic by using a combination of production monitoring, automated LLM-as-a-judge scoring, and human feedback. We'll show you how to apply CE using specific tools from the Google Cloud ecosystem: &lt;/span&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Development Kit (ADK)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Gen AI evaluation service&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;1. The Engineering Mindset: Discovery vs. Defense&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To organize our work effectively, we must distinguish between two fundamental modes of AI engineering. In traditional DevOps, these map roughly to Development and QA/Ops, but the distinction is sharper here due to the stochastic nature of large language models (LLMs). In AI engineering, these modes translate to &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;discovery mode&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;defense mode&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Discovery Mode (The Lab)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is the creative phase. You're an explorer.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Activities&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Prompt engineering, tool selection, model selection.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Goal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Raise the ceiling. You want to see if the model is &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;capable&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; of solving a complex reasoning task at least once.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Methodology&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Few-shot iteration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Providing examples in the prompt to guide the model's behavior.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Red teaming&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Actively trying to break the model with adversarial inputs to find edge cases.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Vibe checks&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Yes, here they're useful! They help you build intuition about the model's personality and latency.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Outcome&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;golden prompt&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that works perfectly for your specific reference examples.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Defense Mode (The Factory)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is the industrialization phase. You're a reliability engineer.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Activities&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Regression testing, shadow traffic, monitoring.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Goal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Protect the floor. You want to ensure that the average performance across 10,000 requests meets your Service Level Objectives (SLOs).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Methodology&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Dataset-driven evaluation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Running the prompt against hundreds of diverse examples, not just the three you memorize.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Strict gating&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Automatically failing a build if the grounding score drops below 0.85.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Automated metrics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: No humans involved in the loop.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Outcome&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A deployed system that you can sleep through the night with.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
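A strict gate like the one above can be sketched as a tiny CI check. The 0.85 threshold matches the grounding bar mentioned earlier; the score list is a placeholder for whatever your automated evaluators emit:

```python
# Sketch of a defense-mode quality gate: the build fails when the
# aggregate grounding score drops below the threshold. Scores are
# placeholders a real pipeline would get from automated evaluators.
GROUNDING_THRESHOLD = 0.85

def gate(scores: list[float], threshold: float = GROUNDING_THRESHOLD) -> bool:
    """Passes only when the mean grounding score clears the bar."""
    if not scores:
        return False  # no data is a failure, not a pass
    return sum(scores) / len(scores) >= threshold
```

In CI, a `False` result would typically translate into a non-zero exit code that blocks the deploy.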
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Comparison Table&lt;/span&gt;&lt;/h3&gt;
&lt;div align="left"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table&gt;&lt;colgroup&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Feature&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Discovery mode&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Defense mode&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Primary goal&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Innovation (new capabilities)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Stability (reliability)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Sample size&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;1 to 10 inputs&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;50 to 10,000 inputs&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Evaluation method&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Human eye (vibe check)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Automated evaluators (LLMs/code)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Latency tolerance&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;High (waiting for reasoning)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Low (SLO enforcement)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Cost sensitivity&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Low (development environments)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;High (production scale)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;The failure mode&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Most teams stay in discovery mode forever. They treat every bug report as a reason to tweak the prompt, push it live, and pray. This creates a game of whack-a-mole where fixing one hallucination causes two more. To exit this trap, we need &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;regression testing&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, which we'll show you how to implement next.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;2. The Reference System: Architecture of a Course Creator&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To demonstrate the defense mode principles concretely, we'll analyze a &lt;/span&gt;&lt;a href="https://github.com/vladkol/agent-evaluation-lab" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Course Creator System&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This isn't a single all-in-one agent, or a monolithic prompt trying to do everything: it's a distributed multi-agent system that's composed of multiple specialized agents. This architecture follows the principle of &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Separation_of_concerns" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;separation of concerns&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Check out the Agent Evaluation Lab repository&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026429a0&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;https://github.com/vladkol/agent-evaluation-lab&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The system is built on &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for serverless scalability and it uses the&lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://a2a-protocol.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent2Agent (A2A) Protocol&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for standardized inter-agent communication.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Agent Roster&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Each agent does its specific piece of work. The researcher collects information, the judge evaluates the collected data, the content builder composes it into a well-structured course, and the orchestrator controls this mighty team! &lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;1. The Researcher (The Hunter)&lt;/span&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Role&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Information retrieval.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tools&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Custom &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;wikipedia_search&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Personality&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Objective, fact-focused.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Input&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A query string (e.g., "history of neural networks").&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Output&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Text of the most relevant Wikipedia page.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Wikipedia Search Tool&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
&lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# agents/researcher/agent.py\r\nfrom wikipedia import page, search\r\n\r\ndef wikipedia_search(query: str) -&amp;gt; str:\r\n    &amp;quot;&amp;quot;&amp;quot;Searches Wikipedia for a given query.&amp;quot;&amp;quot;&amp;quot;\r\n    pages = search(query, results=1)\r\n    if pages:\r\n        return page(pages[0], auto_suggest=False).content\r\n    return &amp;quot;&amp;quot;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026426a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;2. The Judge (The Critic)&lt;/span&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Role&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Quality assurance.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tools&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: None.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Personality&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Strict, pedantic.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Mechanism&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It uses &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;structured output&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;a href="https://docs.pydantic.dev/1.10/usage/models/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Pydantic objects&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) to return a formal verification result.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Why is the judge a separate agent?&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; An agent detecting its own hallucinations is notoriously unreliable. A separate judge agent provides a necessary adversarial check.&lt;/span&gt;&lt;/p&gt;
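The judge's contract can be sketched as a small Pydantic model. The field names below are assumptions for illustration, not the repository's actual schema:

```python
# Illustrative Pydantic schema for the judge's structured verdict.
# Field names are assumptions, not the repository's actual model.
import json
from pydantic import BaseModel

class Verdict(BaseModel):
    passed: bool              # did the research pass verification?
    grounding_score: float    # 0.0 to 1.0, assessed by the judge
    feedback: str = ""        # guidance sent back to the researcher

# Parsing the judge's raw JSON output into a validated object:
raw = '{"passed": false, "grounding_score": 0.6, "feedback": "Claim not found in source."}'
verdict = Verdict(**json.loads(raw))
```

Because the output is a typed object rather than free text, the orchestrator can branch on `verdict.passed` without any string parsing.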
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;3. The Content Builder (The Writer)&lt;/span&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Role&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Synthesis and formatting.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tools&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: None.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Personality&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Creative, educational.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Responsibility&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It takes the raw, verified facts from the researcher and it structures them into a cohesive course module, such as "Introduction", "Chapter 1", etc.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;4. The Orchestrator (The Manager)&lt;/span&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Role&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Workflow management.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Mechanism&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It implements a &lt;/span&gt;&lt;a href="https://google.github.io/adk-docs/agents/workflow-agents/sequential-agents/" rel="noopener" target="_blank"&gt;&lt;code style="text-decoration: underline; vertical-align: baseline;"&gt;SequentialAgent&lt;/code&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Logic&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Call the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;research loop&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="3" style="list-style-type: lower-roman; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Ask the researcher to gather data.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: lower-roman; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Ask the judge to evaluate data.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: lower-roman; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;If judge says "Fail"&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: Send feedback to the researcher (restart the loop).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: lower-roman; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;If the judge says "Pass"&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: Break the loop and continue to the next step.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;li aria-level="2" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Call the content builder to build the comprehensive course content.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Why is the orchestrator a separate agent?&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Using a separate orchestrator agent isolates the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;control flow logic&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; from the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;generation logic&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
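The loop above can be sketched in plain Python. The three callables are hypothetical stand-ins for the A2A calls the real SequentialAgent makes, and the retry limit is an assumption:

```python
# Plain-Python sketch of the orchestrator's control flow. The agent
# callables are hypothetical stand-ins for A2A requests; the round
# limit is an assumption for the sketch.
MAX_RESEARCH_ROUNDS = 3

def run_course_pipeline(topic, researcher, judge, builder):
    feedback = ""
    for _ in range(MAX_RESEARCH_ROUNDS):
        facts = researcher(topic, feedback)   # 1a. gather data
        verdict = judge(facts)                # 1b. evaluate data
        if verdict["passed"]:                 # 1d. "Pass": exit the loop
            return builder(facts)             # 2. build the course
        feedback = verdict["feedback"]        # 1c. "Fail": retry with feedback
    raise RuntimeError("Research loop exhausted without a passing verdict.")
```

Keeping this flow outside the generating agents means you can test the control logic with cheap stubs, no LLM calls required.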
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;The Course-Building Multi-Agent System&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The architecture of the multi-agent system is set up like this:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-multi-agent-architecture.max-1000x1000.png"
        
          alt="vibe-check-multi-agent-architecture"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This multi-agent system has a nice web app (in the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;app&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; folder of the repository). A little service exposes a frontend and calls the orchestrator agent service by using our ADK FastAPI integration. The user request flow looks like this:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--medium
      
      
        h-c-grid__col
        
        h-c-grid__col--4 h-c-grid__col--offset-4
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-user-request-flow.max-1000x1000.jpg"
        
          alt="vibe-check-user-request-flow"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The A2A Protocol Benefits&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://a2a-protocol.org/" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt;Agent2Agent (A2A) Protocol&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; standardizes how these agents communicate with &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;each other&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. Instead of wrapping the researcher as a function call or a generic tool within the orchestrator's prompt (which limits its capabilities), A2A lets the orchestrator interact with the researcher as a full peer service.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This protocol solves the "N × N" integration problem. All agents speak the same language (HTTP + JSON schemas), making the system modular and easy to extend. If we want to replace the researcher with a different implementation, the orchestrator doesn't need to change a single line of code.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This modularity is also the key to our evaluation strategy. Because the agents are loosely coupled services, we don't have to evaluate the entire system at once. Instead, we can target individual components.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Shared Architecture Components&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make this distributed system reliable and observable, we use a set of shared utility components across all of our agents:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;shared/adk_app.py&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: This is the backbone of every agent service. It builds on top of &lt;/span&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ADK &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;and &lt;/span&gt;&lt;a href="https://fastapi.tiangolo.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;FastAPI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. It automatically configures these components:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;A2A middleware&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Handles the exchange of &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;agent cards&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; or self-description, dynamically rewriting URLs to match the current deployment. A2A middleware is useful for &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;shadow revisions&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, where you deploy a new version of your agent to handle simulated traffic from an evaluation pipeline.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://opentelemetry.io/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;OpenTelemetry&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; middleware&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Automatically captures every incoming request as a &lt;/span&gt;&lt;a href="https://opentelemetry.io/docs/specs/otel/trace/api/#span" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Trace Span&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;shared/traced_authenticated_httpx.py&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: A hardened HTTP client for inter-agent communication.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Authentication&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Handles the complexities of Google Cloud service-to-service authentication (OIDC tokens), ensuring zero-trust security between agents.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Trace propagation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Injects the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;traceparent&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; header into every outgoing request. The header context lets Cloud Trace stitch together the graph that shows how the orchestrator called the researcher. We'll discuss that more later, when we take a look at distributed tracing.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;shared/a2a_utils.py&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: Provides the logic for dynamic &lt;/span&gt;&lt;a href="https://a2a-protocol.org/latest/tutorials/python/3-agent-skills-and-card/#agent-card" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;agent cards&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. In Cloud Run, a service might be accessed through a public URL or through a revision-specific URL. This utility ensures that the agent always tells its peers the correct address to call back.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
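For illustration, here's what the propagated header looks like at the wire level. This stdlib-only sketch builds and parses a W3C `traceparent` value; the real client delegates this to OpenTelemetry's propagator rather than constructing the header by hand:

```python
# Stdlib-only sketch of the W3C traceparent header the shared httpx
# client propagates. The real client uses OpenTelemetry's propagator
# instead of building the header manually.
import re
import secrets

def make_traceparent(trace_id: str = "") -> str:
    """Builds a traceparent value: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"          # 01 = sampled flag

def parse_trace_id(traceparent: str) -> str:
    """Extracts the trace ID so a child span joins the same trace."""
    match = re.fullmatch(
        r"00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}", traceparent
    )
    if match is None:
        raise ValueError("malformed traceparent header")
    return match.group(1)
```

Because every hop reuses the same trace ID while minting a fresh span ID, Cloud Trace can stitch the orchestrator-to-researcher call graph back together.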
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;3. The Evaluation Taxonomy: A Deep Dive&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before we write code, we must define our units of measurement. "Is this agent good?" isn't a valid engineering question. We need to define "good" as testable dimensions. To do that, we can categorize evaluation metrics into a hierarchy of sophistication.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Level 1: Computation-Based Metrics&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These are deterministic checks against a &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Ground_truth" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ground truth&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or rigid rules. They're the closest to traditional software unit and integration tests.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Classic NLP metrics like &lt;/strong&gt;&lt;a href="https://en.wikipedia.org/wiki/ROUGE_(metric)" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;ROUGE&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; and &lt;/strong&gt;&lt;a href="https://en.wikipedia.org/wiki/BLEU" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;BLEU&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Is the result sufficiently similar to the reference answer?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;JSON validity&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: If your agent must output JSON, does it parse?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Prohibited phrases&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Does the output contain "I am an AI language model"?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Exact match or Regex&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: For extraction tasks (e.g., getting a date &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;YYYY-MM-DD&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;), does the output match the pattern?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Execution trajectory (including agent tool trajectory)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Did the agent call certain tools in a particular order with specific parameters? &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
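&lt;p&gt;These checks can be sketched as plain Python functions. The helpers below are illustrative, not part of any SDK, and the trajectory entries are assumed to be dicts with a &lt;code&gt;tool_name&lt;/code&gt; key:&lt;/p&gt;

```python
import json
import re

def check_json_validity(output: str) -> bool:
    """Level 1 check: does the agent's output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def check_prohibited_phrases(output: str, phrases=("I am an AI language model",)) -> bool:
    """Level 1 check: passes only if no prohibited phrase appears."""
    return not any(p.lower() in output.lower() for p in phrases)

def check_date_format(output: str) -> bool:
    """Level 1 check: does an extraction match the YYYY-MM-DD pattern?"""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", output.strip()) is not None

def check_tool_order(trajectory, expected_tools) -> bool:
    """Level 1 check: were the expected tools called in this exact order?"""
    called = [step["tool_name"] for step in trajectory]
    return called == list(expected_tools)
```

Because these checks are deterministic, they slot directly into an ordinary unit-test suite and run without any model calls.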
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Reference-based or Reference-free metrics&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Some metrics require a ground truth reference answer to compare the result to. &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/ROUGE_(metric)" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ROUGE&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/BLEU" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;BLEU&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; are perfect examples of that. Other metrics might evaluate the result on different criteria, such as output format ("Did the agent produce correct JSON?") or prohibited words—they don't need a ground truth answer for comparison.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Level 2: Rubric-Based Metrics&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is the standard for semantic evaluation. We use a powerful LLM-as-a-judge model (like &lt;/span&gt;&lt;a href="https://deepmind.google/models/gemini/pro/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Pro&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) to grade the agent's output. To evaluate agent answers, these metrics use sophisticated, battle-tested prompts that are rigorously maintained.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These are the core concepts that are related to LLM-based evaluation metrics:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Rubrics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The criteria for how to rate the response of an LLM model or application. Basically, it's a composite prompt that can be pre-defined or dynamically generated.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Metrics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A score that measures the model output against the rating rubrics.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Rubric-based metrics&lt;/span&gt;&lt;/em&gt;&lt;span style="vertical-align: baseline;"&gt; incorporate LLMs into these kinds of evaluation workflows:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Adaptive rubrics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Rubrics are dynamically generated for each prompt. Responses are evaluated with granular, explainable pass or fail feedback that's specific to the prompt.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Static rubrics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Rubrics are defined explicitly and the same rubric applies to all prompts. Responses are evaluated with the same set of numerical scoring-based evaluators, with a single numerical score (such as 1 to 5) per prompt. Static rubrics are used when the exact same criteria is required across all prompts (e.g., "Check whether the agent answered the user's question").&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Just like with computation-based metrics, rubric-based metrics might or might not require a ground truth reference. It's critical to choose the right one for your task.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;A. Reference-Free Metrics&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You don't have a specific correct answer, but you rely on general principles of quality. Use this approach for open-ended generation like emails, poems, or generic advice. Examples of reference-free metrics include these:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Response quality&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A comprehensive and adaptive rubrics metric that evaluates the overall quality of an agent's response as follows:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;It automatically generates a broad range of criteria based on the agent configuration (developer instruction and declarations for tools that are available to the agent) and the user's prompt.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Then it assesses the generated criteria based on tool usage in intermediate events and the final answer by the agent.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Coherence&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Evaluates whether the text is logical and grammatically correct.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Safety&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Evaluates whether the text violates safety policies, such as by inclusion of hate speech or personally identifiable information (PII).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Instruction-following&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A targeted and adaptive rubrics metric that measures how well the response adheres to the specific constraints and instructions that are given in the prompt.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;B. Reference-Based Metrics&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You have a &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;golden answer&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and you want to ensure that the agent's response conveys the same meaning, even if it's phrased differently. If the reference is "Paris" and the agent says "Capital of France, which is Paris", a regex might fail, but an LLM judge will pass it. A reference-based metric checks for a &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;response match&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to determine whether the answer matches the reference response or ground truth.&lt;/span&gt;&lt;/p&gt;
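&lt;p&gt;As a sketch, a reference-based judge boils down to a grading prompt that pairs the reference with the candidate answer. The template below is purely illustrative; a managed metric uses its own calibrated prompt:&lt;/p&gt;

```python
def build_response_match_prompt(prompt: str, reference: str, response: str) -> str:
    """Assembles a hypothetical LLM-judge prompt for reference-based matching.

    The wording here is an illustration only; production judges use
    carefully calibrated, benchmarked templates.
    """
    return (
        "You are an impartial grader. Decide whether the candidate answer "
        "conveys the same meaning as the reference answer, even if it is "
        "worded differently. Reply with PASS or FAIL and a one-sentence reason.\n\n"
        f"Question: {prompt}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {response}\n"
    )
```

The resulting string would be sent to the judge model; the semantic comparison is what lets "Capital of France, which is Paris" pass against the reference "Paris".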
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Level 3: Vertex AI Managed Metrics&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google's &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Gen AI evaluation service&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides pre-built, calibrated models called &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;autoraters&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; for these dimensions. Autoraters are superior to creating your own judge prompt because they're benchmarked against human raters and they're maintained by Google. These are a few examples of autoraters:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;GROUNDING&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: The most critical metric for RAG. It takes &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;context&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; + &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;response&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and checks whether the response is fully supported by the context. It assigns a score from 0 to 1.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;SAFETY&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: Automatically flags hate speech, harassment, and dangerous content.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;TOOL_USE_QUALITY&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: Specifically tailored for agents to evaluate whether the agent made an appropriate tool call and whether the argument was correct. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;This metric doesn't require comparison to a tool call reference that's considered correct&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. Instead, the evaluator makes a judgment based on the tool description, the agent description, and the context of the conversation. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more information, see the complete list of &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/rubric-metric-details"&gt;&lt;span style="vertical-align: baseline;"&gt;managed rubric-based metrics in Vertex AI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Static and Adaptive Rubrics&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Static rubrics like "Rate helpfulness 1-5" suffer from high variance. &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Adaptive rubrics&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; solve that issue by dynamically generating a test case for &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;each prompt&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Rubric generation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The system analyzes the user prompt and reference.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Prompt&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: "Compare the battery life of Pixel 9 and iPhone 16."&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;System generates criteria&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Criteria 1: Mentions Pixel 9 mAh?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Criteria 2: Mentions iPhone 16 video playback hours?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Criteria 3: Is neutral?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Rubric grading&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The judge checks these boolean conditions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
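&lt;p&gt;The two-step workflow can be approximated in a few lines. Here simple keyword predicates stand in for the LLM judge, and the criteria names are hypothetical:&lt;/p&gt;

```python
# Illustrative sketch of adaptive-rubric grading: each generated criterion
# becomes a boolean check, and grading yields a per-criterion report card.
# A real system generates the criteria with an LLM; keyword predicates
# stand in for the judge here.
CRITERIA = {
    "mentions_pixel_9_mah": lambda r: "mAh" in r and "Pixel 9" in r,
    "mentions_iphone_16_playback": lambda r: "iPhone 16" in r and "hours" in r,
}

def grade_with_rubric(response: str, criteria=CRITERIA) -> dict:
    """Returns a pass/fail verdict per criterion plus an overall score."""
    report = {name: check(response) for name, check in criteria.items()}
    report["score"] = sum(report.values()) / len(criteria)
    return report
```

The per-criterion booleans are what make the result an explainable report card rather than a single opaque number.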
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This turns a subjective vibe check into an objective report card that explains &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;exactly&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; what was missing. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gen AI evaluation service&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides a comprehensive set of metrics that are based on static and adaptive rubrics. You can also create your own adaptive rubrics by using the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GenAI Client in Vertex AI SDK&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;4. The Fuel: Designing Your Evaluation Dataset&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Garbage in, garbage out. Your evaluation is only as good as your dataset. A proper evaluation dataset is a collection of examples (rows). In our system, we use a JSON-based format where columns represent different inputs and expected outputs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The following is an actual example from our &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;evaluator/eval_data_researcher.json&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; dataset. It's structured as columns for efficient &lt;/span&gt;&lt;a href="https://pandas.pydata.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;pandas&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; loading:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;{
    "prompt": {
        "0": "History of Rome",
        "1": "Pythagorean theorem"
    },
    "reference": {
        "0": "# The History of Rome\n\nThe history of Rome spans over 2,500 years...",
        "1": "## The Pythagorean Theorem: A Cornerstone of Geometry..."
    },
    "reference_trajectory": {
        "0": [
            {
                "tool_name": "wikipedia_search",
                "tool_input": { "query": "History of Rome" }
            }
        ],
        "1": [
            {
                "tool_name": "wikipedia_search",
                "tool_input": { "query": "Pythagorean theorem" }
            }
        ]
    }
}&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
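&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;For illustration, the column-to-row pivot that pandas performs on this layout can be sketched with the standard library alone (the function name is ours, not part of the repository):&lt;/p&gt;&lt;/div&gt;

```python
import json

def load_eval_rows(path: str) -> list:
    """Pivots a column-oriented eval dataset ({"column": {"row_id": value}})
    into a list of row dicts, mirroring what pandas.read_json builds."""
    with open(path) as f:
        columns = json.load(f)
    # Row IDs come from the first column; every column shares the same IDs.
    row_ids = sorted(next(iter(columns.values())))
    return [
        {col: values.get(row_id) for col, values in columns.items()}
        for row_id in row_ids
    ]
```

Each returned row then carries a `prompt`, `reference`, and `reference_trajectory`, ready to feed into the evaluation runner.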
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Components of the Dataset&lt;/span&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;prompt&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The input to the agent.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Example&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: "What is the return policy for item #123?"&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;reference&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt; (optional)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The ideal answer (for reference-based metrics).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Example&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: "Item #123 can be returned within 30 days."&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;reference_trajectory&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt; (optional)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The simple gold standard for tool usage.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;This option lets us verify whether our agent is thinking correctly. If the prompt asks for "Population of Tokyo" and the trajectory shows a call to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;get_weather("Tokyo")&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, the agent has failed fundamentally, even if it hallucinates the correct population number.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Best practice&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Store this dataset in BigQuery or in a JSON file in Cloud Storage. Treat it like source code. Version it.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Just before we call the agent with these inputs, we'll add one more column with the same value for every input prompt: &lt;/span&gt;&lt;code&gt;&lt;span style="vertical-align: baseline;"&gt;session_inputs&lt;/span&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, a simple structure with the evaluation user ID, the agent name, and an empty state dictionary.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;5. The Implementation: Building the Evaluation Engine&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When we have a dataset and metrics, we need an engine to drive the tests. A simple Python script isn't enough; we need to replicate the scale of production. To accomplish that, we build an evaluation runner (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;evaluate_agent.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;) that uses the GenAI Client in Vertex AI SDK.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;A. Parallel Inference&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agent operations are slow relative to typical API calls. A multi-step reasoning task might take 15 seconds. If our &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;golden dataset&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; has 500 examples, running them sequentially would take 2 hours. We use Python's &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;asyncio&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to run many concurrent requests against a shadow revision.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;This approach reveals another benefit of evaluating agents that are deployed to Cloud Run: unlike your developer machine, it can scale to serve parallel requests. Faster evaluation enables faster iterations.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;shared/evaluation/evaluate.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; module, we implement a throttled parallel runner:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;async def run_parallel_inference(client, prompts, shadow_url):
    tasks = []
    # Semaphore to prevent DDOSing our own service or hitting Rate Limits
    # We limit to 10 concurrent requests to match our Cloud Run capacity
    sem = asyncio.Semaphore(10)

    for prompt in prompts:
        # Each task runs the full HTTP Post -&gt; SSE Stream -&gt; Events Capture
        tasks.append(_run_inference(sem, client, shadow_url, prompt))

    return await asyncio.gather(*tasks)&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;code style="vertical-align: baseline;"&gt;_run_inference&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; calls the &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;ADK server API endpoint&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;POST /run_sse&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; of the target agent. This call initiates a streaming session where the agent pushes events while it thinks. The events are captured and processed to store the final answer and a list of the intermediate events—the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;reasoning trace&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;B. Reasoning Trace Capture&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If the agent fails, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;why&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; did it fail? In standard evaluation, you only see the final answer. In agentic evaluation, we need the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;reasoning trace&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; or execution history. It includes a list of events that occurred during the agent's execution. Each event has a type and a payload. The payload contains the event's content and metadata. The most interesting events are the tool calls. They include tool call requests from the LLM (with parameter values), and tool call responses from the tools (with return values).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We pass this entire trace to the Gen AI evaluation service. This allows for questions like: "&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Did the agent hallucinate the number 14 million, or did the tool actually return it?&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;"&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We also use this trace for tool trajectory evaluation, which we describe later in this post.&lt;/span&gt;&lt;/p&gt;
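&lt;p&gt;A simple grounding probe over the captured trace might look like the sketch below; the event shape (a &lt;code&gt;type&lt;/code&gt; and a &lt;code&gt;payload&lt;/code&gt; key) is an assumption for illustration:&lt;/p&gt;

```python
def value_is_grounded(intermediate_events, claim: str) -> bool:
    """Checks whether a claimed value appears in any tool response
    captured in the reasoning trace.

    If the claim never shows up in a tool response, the agent likely
    hallucinated it rather than reading it from a tool.
    """
    for event in intermediate_events:
        if event.get("type") == "tool_response" and claim in str(event.get("payload", "")):
            return True
    return False
```
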
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;C. Final Evaluation Dataset&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;After we run the agent for every prompt, we add two more columns to the evaluation dataset. We already have &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;prompt&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;reference&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (optional ground truth answer), &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;reference_trajectory&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (optional ground truth for tool calls and their parameters), and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;session_inputs&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. After inference, we add these columns: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;response&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: The actual final response of the agent.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;intermediate_events&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: All events that preceded the final response, including tool calls.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;D. Runtime Schema Integration&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To verify whether the agent used tools correctly, the evaluation scorer can leverage the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;tool definition&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. If we hardcode this definition in our test suite, it will drift from the actual code. Instead, we expose an &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;/agent-info&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; endpoint on every agent, and the evaluator fetches it at runtime.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;# Fetch live schema from the running service
agent_info_response = await httpx_client.get(f"{agent_api_server}/apps/{agent_name}/agent-info")
agent_info = types.evals.AgentInfo.model_validate_json(agent_info_response.content)

# Create the Evaluation Run in Vertex AI
evaluation_run = client.evals.create_evaluation_run(
    dataset=agent_dataset_with_inference,
    agent_info=agent_info,  # Contains the LIVE tool definitions/schema
    metrics=metrics,
    dest=evaluation_storage_uri
)&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This implementation ensures that if you add a new tool parameter in your code, the evaluation automatically knows about it without manual updates to the test suite.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;6. Custom Function Metrics Deep Dive: Tool Trajectory Evaluation&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For many agents, tool usage isn't optional; it's a mandatory part of their flow. Evaluating general tool usage quality isn't enough. You need strictly defined business rules.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Rule 1&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: "Wikipedia tool must always be called."&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Rule 2&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: "It must be called with the correct search request."&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We can enforce these rules with custom metrics in the Gen AI evaluation service. We write a Python function, and the service executes it in a secure sandbox against every row of our evaluation dataset.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Implementing Tool Trajectory Metrics&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In our &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;shared/evaluation/tool_metrics.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; module, we implemented multiple custom metrics for tool trajectory evaluation. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Trajectory precision&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The agent called 5 tools. 3 were useful, 2 were noise. Precision = 3/5.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Trajectory recall&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The task required checking Database &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;and&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; Wiki. The agent only checked Wiki. Recall = 0.5.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Exact order match&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The agent called the right tools in the specified order, without calling any other tools.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;In-order match&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The agent called the right tools in the specified order, even if other tools were called in between.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Any-order match&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The agent called the right tools in any order.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The core logic relies on comparing the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;predicted trajectory&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (what the agent did) against the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;reference trajectory&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (what we wanted it to do).&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;1. The Reference (From Dataset)&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;reference_trajectory&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; looks like a clean list of expected calls:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;[\r\n  {\r\n    &amp;quot;tool_name&amp;quot;: &amp;quot;wikipedia_search&amp;quot;,\r\n    &amp;quot;tool_input&amp;quot;: { &amp;quot;query&amp;quot;: &amp;quot;History of Rome&amp;quot; }\r\n  }\r\n]&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3602642730&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;2. The Event Trace (From Agent)&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We use captured &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;intermediate_events&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to extract the actual function calls so that we can compare them to the reference trajectory.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;// One event in the stream\r\n{\r\n  &amp;quot;content&amp;quot;: {\r\n    &amp;quot;parts&amp;quot;: [{ &amp;quot;function_call&amp;quot;: {\r\n        &amp;quot;name&amp;quot;: &amp;quot;wikipedia_search&amp;quot;,\r\n        &amp;quot;args&amp;quot;: { &amp;quot;query&amp;quot;: &amp;quot;History of Rome&amp;quot; }\r\n    }}]\r\n  }\r\n}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026428e0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The helper function &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;_get_tool_calls&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; extracts the list from the trace and compares it to the reference.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Executing Custom Function Metrics in Vertex AI&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These custom metrics require running Python code. Where does the code run? Aren't we using Vertex AI for the evaluation?&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Yes, Gen AI evaluation service takes care of running that code. The service expects Python code with an &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;evaluate&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;def evaluate(\r\n    instance: dict\r\n) -&amp;gt; float:\r\n    ...&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3602642580&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our functions have other functions that they depend on, so we package the module's source code with them and we construct an extra &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;evaluate&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function to make a call. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import inspect\r\nmodule_source = inspect.getsource(\r\n    inspect.getmodule(metrics_function)\r\n)\r\nmodule_source += (\r\n    &amp;quot;\\n\\ndef evaluate(instance: dict) -&amp;gt; float:\\n&amp;quot;\r\n    f&amp;quot;    return {metrics_function.__name__}(instance)\\n&amp;quot;\r\n)\r\nreturn types.EvaluationRunMetric(\r\n    metric=metric_name,\r\n    metric_config=types.UnifiedMetric(\r\n        custom_code_execution_spec=types.CustomCodeExecutionSpec(\r\n            remote_custom_function=module_source\r\n        )\r\n    )\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026420d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We package these functions using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;CustomCodeExecutionSpec&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and send them to Vertex AI for sandboxed execution. This approach lets us combine the flexibility of custom Python code with the massive scale of managed evaluation.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;7. Strategy: Shadow Deployments &amp;amp; Safe Rollouts&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The most common fear in deploying AI agents is: "If I change the prompt, will it break for 10% of users?" This fear can paralyze teams. To solve this issue, we borrow a technique from standard microservice engineering: &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;shadow deployments&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; or &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;dark canaries&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Concept&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of replacing the live version of your agent, you deploy a new version alongside it.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Live revision&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Serves 100% of user traffic.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Shadow revision&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Serves 0% of user traffic but handles simulated traffic from your evaluation pipeline.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This decoupling of &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;deployment&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (code on server) from &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;release&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (users see code) lets you test in the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;exact&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; production environment—same network, same secrets, same latency characteristics—without risk.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Run Implementation&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Run makes implementation trivial. Every deployment creates a revision. We can assign a tag to a revision to give it a unique URL. We use the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Git commit's short SHA&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; as the tag (e.g., &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sha-a1b2c3d&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;). This tag creates an immutable link between your source code and your running service. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In our &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;deploy.sh&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, we use the following logic to deploy a shadow revision:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# 1. Capture the commit SHA\r\nexport COMMIT_SHA=$(git rev-parse --short HEAD)\r\nexport REVISION_TAG=&amp;quot;sha-${COMMIT_SHA}&amp;quot;\r\n# 2. Deploy with --no-traffic\r\n# This tells Cloud Run: &amp;quot;Start the container, but don\&amp;#x27;t route public requests here.&amp;quot;\r\ngcloud run deploy researcher \\\r\n  --image gcr.io/${GOOGLE_CLOUD_PROJECT}/researcher:latest \\\r\n  --region us-central1 \\\r\n  --no-traffic \\\r\n  --tag &amp;quot;${REVISION_TAG}&amp;quot;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3602642a60&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Result&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Public URL&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;https://researcher-xyz.run.app&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (unchanged, safe).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Shadow URL&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;https://&lt;strong style="vertical-align: baseline;"&gt;sha-a1b2c3d---&lt;/strong&gt;researcher-xyz.run.app&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (new, testing ground). Three dashes &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;---&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;separate the revision part.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When a user hits the service's public URL, Cloud Run distributes traffic among the revisions configured to serve it. The new shadow revision serves no requests unless it's called through its revision-specific URL. You can view and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/managing/revisions"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;manage revisions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the Google Cloud console.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
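Because tagged revision URLs follow a fixed pattern, the evaluation pipeline can derive the shadow URL from the service URL and the commit tag. A small sketch (the function name is ours, not from the repository):

```python
def shadow_url(service_url: str, revision_tag: str) -> str:
    """Build a Cloud Run tagged-revision URL: tag, three dashes, then the host."""
    scheme, host = service_url.split("://", 1)
    return f"{scheme}://{revision_tag}---{host}"
```

For example, `shadow_url("https://researcher-xyz.run.app", "sha-a1b2c3d")` yields the shadow URL shown above.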
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-revisions.max-1000x1000.jpg"
        
          alt="vibe-check-revisions"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Your continuous evaluation pipeline then targets this shadow URL. If the shadow revision metrics pass, we can run a promotion command that makes the successful revision serve the traffic:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Promotion command (only run after evaluation passes)\r\ngcloud run services update-traffic researcher --to-tags ${REVISION_TAG}=100&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3602642d90&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It doesn't have to always be 100% to a single revision. No matter how much we test and evaluate our code, mistakes happen. Instead of switching all at once, you might want to &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;gradually migrate traffic&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; between revisions.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;8. Analyzing Evaluation Results&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When the pipeline breaks, developers don't dig through text logs. Using the Run ID from the build log, they can pull the full report.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;from google.genai import types as genai_types\r\nfrom vertexai import Client\r\n# Initialize SDK\r\nclient = Client(\r\n    project=GOOGLE_CLOUD_PROJECT,\r\n    location=GOOGLE_CLOUD_REGION,\r\n    http_options=genai_types.HttpOptions(api_version=&amp;quot;v1beta1&amp;quot;),\r\n)\r\n\r\nevaluation_run = client.evals.get_evaluation_run(\r\n    name=EVAL_RUN_ID,\r\n    include_evaluation_items=True\r\n)\r\nevaluation_run.show()&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36006bc730&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Acting on Evaluation Results&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If the evaluation fails, the build fails. The build fails the pipeline. The pipeline fails the commit. The commit fails the PR. The PR fails the merge. The merge fails the release. The release fails the deployment. The users don't use the failed code.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, how can we understand why it failed? Let's take a closer look at an example of the evaluation run.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The request was &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;History of Rome&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. The researcher provided the content of the Wikipedia page &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/History_of_Rome" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt;History of Rome&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. The content seemed good to the judge, but the final &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;hallucination&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; metric was too low.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-eval-01.max-1000x1000.jpg"
        
          alt="vibe-check-eval-01"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The reason is because the final course that was built by the content builder contained facts that weren't present in the Wikipedia page. By looking at the reasoning trace, we can see that the researcher used the Wikipedia page as a source of information.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;However, the content builder was too creative about the course content. The content builder used a Gemini model that certainly knows a lot about Rome, so it enhanced the course with facts that weren't present in the Wikipedia page.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;How do we fix that? Let's tell the content builder to stick to the facts that are provided by the researcher.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-eval-02.max-1000x1000.jpg"
        
          alt="vibe-check-eval-02"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;And voilà! The very next run produced a perfect evaluation score.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;What's important is that the rest of the metrics are still good. We found a problem and we made changes to fix it, but the whole system stayed intact.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;9. Automating the Loop: The CI/CD Pipeline&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Finally, we operationalize this solution by using &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/build/docs"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Build&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. The goal is a &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;quality firewall&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that ensures bad code can't reach production users. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Our &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;.cloudbuild/cloudbuild.yaml&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; orchestrates the lifecycle:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Build&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Docker builds the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;researcher&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; image.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Push to Artifact Registry.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Deploy shadow&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;gcloud run deploy ... --tag=sha-${COMMIT_SHA} --no-traffic&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Evaluate&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Run &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;python -m evaluator.evaluate_agent&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The script targets the shadow URL.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;It uploads results to Vertex AI.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The gate&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It checks &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;if metric_score &amp;lt; THRESHOLD&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. If &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;true&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, it exits with error code 1, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;failing the build&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Promote &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;(only runs if evaluate passes):&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;gcloud run services update-traffic ... --to-tags sha-${COMMIT_SHA}=100&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/ol&gt;&lt;/div&gt;
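The gate in the Evaluate step reduces to a few lines. A hypothetical sketch (threshold and metric names are illustrative, not the actual evaluator script):

```python
import sys

def gate(metric_scores: dict, threshold: float = 0.8) -> bool:
    """Return True only if every metric meets the threshold."""
    failed = {name: score for name, score in metric_scores.items()
              if score < threshold}
    for name, score in failed.items():
        print(f"FAIL: {name} = {score:.2f} < {threshold}")
    return not failed

if __name__ == "__main__":
    scores = {"hallucination": 0.95, "tool_trajectory_in_order_match": 1.0}
    # Exit code 1 fails the Cloud Build step, which blocks the Promote step.
    sys.exit(0 if gate(scores) else 1)
```

Because Cloud Build runs steps sequentially and stops on a non-zero exit code, this single exit status is all that is needed to keep a failing revision in the shadow.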
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-continuous-eval.max-1000x1000.jpg"
        
          alt="vibe-check-continuous-eval"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;10. Distributed Tracing with OpenTelemetry&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Debugging a single monolithic LLM call might be easy. Debugging a distributed system of multiple agents, each making its own LLM calls and tool executions, is exponentially harder. If the orchestrator gives a wrong answer, was it bad logic in the orchestrator? Did the researcher return bad data? Or did a network timeout cause a fallback? To answer these questions, logging isn't enough. We need &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;distributed tracing&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We use &lt;/span&gt;&lt;a href="https://opentelemetry.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OpenTelemetry (OTel)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to instrument every part of the stack, capturing the entire lifecycle of a request as a Trace graph.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Waterfall View in Cloud Trace&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By integrating with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/trace/docs"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Trace&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, we get a visual waterfall of every operation.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Root span&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The initial request to the web app's backend and to the orchestrator.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Child spans&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Cross-service A2A requests to other agents, LLM invocations, and tool executions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Trace graph lets us see the system's physical execution alongside the logical reasoning.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-trace-graph.max-1000x1000.png"
        
          alt="vibe-check-trace-graph"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Enabling End-to-End Tracing with Shared Components&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;ADK comes with built-in OpenTelemetry support. However, to get a truly unified view across our microservices, we enhanced it with our shared components:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;shared/adk_app.py&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Wraps the standard ADK &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;FastAPI&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; app.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Adds &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;OpenTelemetryMiddleware&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to automatically start a trace span for every incoming HTTP request.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Correctly extracts the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;traceparent&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; context from incoming headers, connecting this agent's work to the caller's trace.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;shared/traced_authenticated_httpx.py&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;When an agent calls another agent (e.g., orchestrator -&amp;gt; researcher), we must propagate the trace ID.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;This custom client injects the OTel &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;traceparent&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; header into outgoing requests.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/ol&gt;
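The shared components above handle context propagation automatically via OpenTelemetry. To make the mechanics concrete, here is a minimal, stdlib-only sketch of what propagating a W3C <code>traceparent</code> header involves; the field layout follows the W3C Trace Context format, while the helper names are ours, not part of ADK or OTel:

```python
import secrets

def parse_traceparent(header):
    """Split a W3C traceparent header ("version-traceid-spanid-flags")
    into its four fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "parent_span_id": parent_span_id, "flags": flags}

def child_traceparent(incoming):
    """Build the header an agent sends downstream: same trace_id
    (so all spans join one trace), fresh span_id for the new hop."""
    ctx = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex chars
    return "-".join([ctx["version"], ctx["trace_id"], new_span_id, ctx["flags"]])

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

In the real system, <code>OpenTelemetryMiddleware</code> does the extraction and the traced HTTPX client does the injection; this sketch only shows why the same <code>trace_id</code> must survive every hop.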
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;System Traces vs. Reasoning Traces&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It's crucial to distinguish between the two types of traces that we discuss in this post:&lt;/span&gt;&lt;/p&gt;
&lt;div align="left"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table&gt;&lt;colgroup&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Feature&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Reasoning trace (intermediate events)&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;OpenTelemetry trace (system trace)&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Source&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The agent's thought process (SSE stream).&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The code's execution path (FastAPI, HTTPX, ADK).&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Content&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;"I should search for ...", tool definition, tool output.&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Latency, HTTP status codes, service errors.&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Goal&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent observability&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Did the agent make the right plan? What parameters did it use?&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;System observability&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: What service was called? Which service failed? Was it slow?&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Storage&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Vertex AI Gen AI evaluation service (JSON).&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Trace (waterfall UI).&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Reasoning traces give you visibility into the cognitive process of your agent, while system traces show how API requests flow through your system.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Debugging Non-Deterministic Systems&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The combination of these types of traces is your superpower. When an evaluation fails (e.g., "grounding score &amp;lt; 0.5"), you look at the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;reasoning trace&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to see &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;what&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; the model thought. If the reasoning looks correct but the result is wrong (e.g., a tool error), you switch to the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;OpenTelemetry trace&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; in Cloud Trace. You might find that the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;wikipedia_search&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; timed out after 5000ms, causing the model to hallucinate an answer because it lacked data.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Without this x-ray vision insight into both the cognitive and physical layers of your system, you're debugging in the dark.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Conclusion&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Stop messing with the vibe checks. Instead, use the power of &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;evaluated intelligence&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. Building reliable AI agents requires a shift in mindset from discovery to defense. By implementing &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;continuous evaluation (CE)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, we treat agentic systems with the rigor that they deserve.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;This post explores concepts from the codelab &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/codelabs/production-ready-ai-roadshow/2-evaluating-multi-agent-systems/evaluating-multi-agent-systems" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;From "vibe checks" to data-driven Agent Evaluation&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;. To run the code yourself, check out the codelab.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Resources and Links&lt;/span&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/run/docs"&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; runs and scales your AI agents, isolates failure domains, and enables zero-risk shadow deployments.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview"&gt;&lt;span style="vertical-align: baseline;"&gt;Vertex AI Evaluation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides the managed metrics, the adaptive rubrics, and the compute scaling to run them without managing infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/build/docs"&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Build for CI/CD&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; creates a quality firewall, which helps guarantee that no regression goes unnoticed.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/trace/docs"&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Trace&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides the ability to capture and visualize the entire request lifecycle, from the initial HTTP request through a cascade of cross-services calls, sub-agent invocations, and LLM calls, to the final response.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Connect with Us&lt;/span&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Vlad Kolesnikov → &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/vkolesnikov/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Linkedin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://x.com/vladkol" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><pubDate>Fri, 27 Feb 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/from-vibe-checks-to-continuous-evaluation-engineering-reliable-ai-agents/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-hero.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>From "Vibe Checks" to Continuous Evaluation: Engineering Reliable AI Agents</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-hero.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/from-vibe-checks-to-continuous-evaluation-engineering-reliable-ai-agents/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Vlad Kolesnikov</name><title>Developer Relations Engineer</title><department></department><company></company></author></item><item><title>Give your agentic chatbots a fast and reliable long-term memory</title><link>https://cloud.google.com/blog/topics/developers-practitioners/improve-chatbot-memory-using-google-cloud/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;When scaling conversational agents, the data layer design often determines success or failure. To support millions of users, agents need conversational continuity — the ability to maintain responsive chats while preserving the context backend models need.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;This article covers how to use Google Cloud solutions to solve two data challenges in AI: fast context updates for real-time chat, and efficient retrieval for long-term history. We’ll share a polyglot approach using Redis, Bigtable, and BigQuery that ensures your agent retains detail and continuity, from recent interactions to months-old archives.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Polyglot storage approach for short, mid, and long-term history&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_-_Polyglot_Persistence_Layer.max-1000x1000.png"
        
          alt="1 - Polyglot Persistence Layer"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;What is a polyglot approach?&lt;/span&gt;&lt;/h3&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;A polyglot approach uses a multi-tiered storage strategy that leverages several specialized data services rather than a single database to manage different data lifecycles. This allows an application to use the specific strengths of various tools—such as in-memory caches for speed, NoSQL databases for scale, blob storage for unstructured artifacts, and data warehousing for analytics—to handle the "temperature" and volume of data effectively.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Define a polyglot approach on Google Cloud for short, mid, and long-term memory&lt;/span&gt;&lt;/h3&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;To maintain conversational continuity, you can implement this polyglot approach using Memorystore for Redis for sub-millisecond "hot" context retrieval, Cloud Bigtable as a petabyte-scale system of record for durable history, and BigQuery for long-term archival and analytical insights, with Cloud Storage handling unstructured multimedia and an asynchronous pipeline built using Pub/Sub and Dataflow.&lt;/span&gt;&lt;/p&gt;
&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;1. Short-term memory: Memorystore for Redis&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Users expect chat histories to load instantaneously, whether they are initiating a new chat or continuing a previous conversation. For context of a conversation, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Memorystore for Redis&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; serves as the primary cache. As a fully managed in-memory data store, it provides the sub-millisecond latency required to maintain a natural conversational flow. Since chat sessions are incrementally growing lists of messages, we store history using &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Redis Lists&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. By using the native &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;RPUSH&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; command, the application transmits only the newest message, avoiding the network-heavy "read-modify-write" cycles found in simpler stores like Memcached.&lt;/span&gt;&lt;/p&gt;
&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;2. Mid-term memory: Cloud Bigtable&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;As the conversation grows over time, the agentic applications need to account for larger and longer term storage of a growing chat history. This is where &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Bigtable&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; acts as the durable mid-term store and the definitive system of record for all chat history. Bigtable is a petabyte-scale NoSQL database designed specifically for high-velocity, write-heavy workloads, making it perfect for capturing millions of simultaneous chat interactions. While it handles massive data volumes, teams can keep the active cluster lean by implementing garbage collection policies&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;— retaining, for example, only the last 60 days of data in the high-performance tier. To make lookups fast, we use a key strategy with a &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;user_id#session_id#reverse_timestamp&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; pattern. This co-locates all messages from a single session, allowing for efficient range scans to retrieve the most recent messages for history reloads.&lt;/span&gt;&lt;/p&gt;
&lt;h4 role="presentation" style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;3. Long-term memory and analytics: BigQuery&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;For archival and analytics, data moves to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;BigQuery&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, representing the long-term memory of the system. While Bigtable is optimized for serving the live application, BigQuery is Google's premier serverless data warehouse designed for complex SQL queries at scale. This allows teams to go beyond simple logging and derive analytical insights. Ultimately, this operational data becomes a feedback loop for improving the agent and user experience without impacting the performance of the user-facing components.&lt;/span&gt;&lt;/p&gt;
&lt;h4 role="presentation" style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;4. Artifact storage: Cloud Storage (GCS)&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Unstructured data such as multimedia files — whether uploaded by a user for analysis or generated by a generative model — live in &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Storage&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, which is purpose built for unstructured artifacts. We utilize a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;pointer strategy&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; where Redis and Bigtable records contain a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;URI pointer&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (e.g., &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;gs://bucket/file&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;) to the object. To maintain security, the application serves these files using &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;signed URLs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, providing the client with time-limited access without exposing the bucket publicly.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;A hybrid sync-async strategy for optimal flow of data&lt;/span&gt;&lt;/h3&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;As shown in the sequence diagrams below, the hybrid sync-async strategy utilizes the abovementioned storage solutions to balance high-speed consistency with durable data persistence.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;The diagram below shows how a user message and corresponding agent response traverse through the architecture:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_-_Sequence_Diagram.max-1000x1000.png"
        
          alt="2 - Sequence Diagram"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The diagram below shows how data flows across the architecture when a user decides to retrieve chat history for a particular session:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_-_History_Seq_Diagram.max-1000x1000.png"
        
          alt="3 - History Seq Diagram"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Start building now&lt;/span&gt;&lt;/h3&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to build an agent with a robust persistence layer?&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Build agents quickly&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Start prototyping your agentic workflows on&lt;/span&gt; &lt;a href="https://cloud.google.com/products/agent-builder"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Agent Builder&lt;/span&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Configure your cache&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Determine which&lt;/span&gt; &lt;a href="https://cloud.google.com/memorystore/docs/redis/redis-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Memorystore for Redis configuration&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; best suits your latency and availability needs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Design a robust BigTable schema&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Review the schema design &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/bigtable/docs/schema-design"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;best practices&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Bridge to analytics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Use the&lt;/span&gt; &lt;a href="https://docs.cloud.google.com/bigtable/docs/change-streams-to-bigquery-quickstart"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Bigtable change stream to BigQuery template&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to ready your live chat logs for actionable business insights.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;Bring data to life with analytics: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Use&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;a href="https://docs.cloud.google.com/looker/docs/conversational-analytics-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Looker Conversational Analytics&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to drive product decisions through business intelligence.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><pubDate>Fri, 27 Feb 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/improve-chatbot-memory-using-google-cloud/</guid><category>Developers &amp; Practitioners</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Give your agentic chatbots a fast and reliable long-term memory</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/improve-chatbot-memory-using-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Aishwarya Prabhat</name><title>AI Solutions Acceleration Architect</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yun Pang</name><title>Principal Architect</title><department></department><company></company></author></item></channel></rss>