<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Developers &amp; Practitioners</title><link>https://cloud.google.com/blog/topics/developers-practitioners/</link><description>Developers &amp; Practitioners</description><atom:link href="https://cloudblog.withgoogle.com/blog/topics/developers-practitioners/rss/" rel="self"></atom:link><language>en</language><lastBuildDate>Fri, 10 Apr 2026 16:00:09 +0000</lastBuildDate><image><url>https://cloud.google.com/blog/topics/developers-practitioners/static/blog/images/google.a51985becaa6.png</url><title>Developers &amp; Practitioners</title><link>https://cloud.google.com/blog/topics/developers-practitioners/</link></image><item><title>Migrating to Google Cloud’s Application Load Balancer: A practical guide</title><link>https://cloud.google.com/blog/products/networking/migrate-on-prem-application-load-balancing-to-google-cloud/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Migrating your existing application load balancer infrastructure from an on-premises hardware solution to Cloud Load Balancing offers substantial advantages in scalability, cost-efficiency, and tight integration within the Google Cloud ecosystem. Yet, a fundamental question often arises: "What about our current load balancer configurations?"&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Existing on-premises load balancer configurations often contain years of business-critical logic for traffic manipulation. The good news is that not only can you fully migrate existing functionalities, but this migration also presents a significant opportunity to modernize and simplify your traffic management.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;This guide outlines a practical approach for migrating your existing load balancer to Google Cloud’s Application Load Balancer. It addresses common functionalities, leveraging both its declarative configurations and the innovative, event-driven Service Extensions edge compute capability.&lt;/span&gt;&lt;/p&gt;
&lt;h3 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;A simple, phased approach to migration&lt;/span&gt;&lt;/h3&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Transitioning from an imperative, script-based system to a cloud-native, declarative-first model requires a structured plan. We recommend a straightforward, four-phase approach.&lt;/span&gt;&lt;/p&gt;
&lt;h4 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Phase 1: Discovery and mapping&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Before commencing any migration, you must understand what you have. Analyze and categorize your current load balancer configurations. What is each rule's intent? Is it performing a simple HTTP-to-HTTPS redirect? Is it engaged in HTTP header manipulation (addition or removal)? Or is it handling complex, custom authentication logic? &lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Most configurations typically fall into two primary categories:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Common patterns:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Logic that is common to most web applications, such as redirects, URL rewrites, basic header manipulation, and IP-based access control lists (ACLs).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Bespoke business logic:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Complex logic unique to your application, like custom proprietary token authentication, advanced header extraction / replacement, dynamic backend selection based on HTTP attributes, or HTTP response body manipulation. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
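&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;As a concrete illustration of this triage, the short sketch below sorts an inventory of rules into the two buckets. It is a toy example: the rule type names and the &lt;code&gt;categorize_rules&lt;/code&gt; helper are hypothetical and not part of any Google Cloud API.&lt;/span&gt;&lt;/p&gt;

```python
# Hypothetical Phase 1 triage helper: classify each load balancer rule as a
# "common pattern" (declarative candidate) or "bespoke logic" (programmatic
# candidate). The rule type names below are illustrative only.
COMMON_PATTERNS = {
    "redirect",        # e.g., HTTP-to-HTTPS redirects
    "url_rewrite",     # path or host rewrites
    "header_add",      # basic header manipulation
    "header_remove",
    "ip_acl",          # IP-based access control lists
}

def categorize_rules(rules):
    """Split parsed rules into (common, bespoke) lists by rule type."""
    common, bespoke = [], []
    for rule in rules:
        (common if rule["type"] in COMMON_PATTERNS else bespoke).append(rule)
    return common, bespoke

rules = [
    {"name": "force-https", "type": "redirect"},
    {"name": "strip-server-header", "type": "header_remove"},
    {"name": "legacy-token-auth", "type": "custom_auth"},
]
common, bespoke = categorize_rules(rules)
print([r["name"] for r in common])   # ['force-https', 'strip-server-header']
print([r["name"] for r in bespoke])  # ['legacy-token-auth']
```

A categorized inventory like this makes the Phase 2 mapping decision mechanical: everything in the common bucket goes to declarative features first.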
&lt;h4 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Phase 2: Choose your Google Cloud equivalent&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Once your rules are categorized, the next step involves mapping them to the appropriate Google Cloud feature. This is not a one-to-one replacement; it's a strategic choice.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Option 1: the declarative path (for ~80% of rules)&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For the majority of common patterns, leveraging the Application Load Balancer's built-in declarative features is usually the best approach. Instead of a script, you define the desired state in a configuration file. This is simpler to manage, version-control, and scale.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Common patterns to declarative feature mapping:  &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Redirects/rewrites&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; -&amp;gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Application Load Balancer URL maps&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;ACLs/throttling&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; -&amp;gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud Armor security policies&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Session persistence&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; -&amp;gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;backend service configuration&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
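&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;For instance, an on-premises HTTP-to-HTTPS redirect rule collapses into a few declarative lines in a URL map. The sketch below uses a placeholder resource name; a file like this can typically be applied with &lt;code&gt;gcloud compute url-maps import&lt;/code&gt;.&lt;/span&gt;&lt;/p&gt;

```yaml
# URL map sketch: redirect all HTTP requests to HTTPS.
# "http-redirect-map" is a placeholder name.
kind: compute#urlMap
name: http-redirect-map
defaultUrlRedirect:
  httpsRedirect: true
  redirectResponseCode: MOVED_PERMANENTLY_DEFAULT
  stripQuery: false
```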
&lt;p style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Option 2: The programmatic path (for complex, bespoke rules)&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;When dealing with complex, bespoke business logic, you have a programmatic equivalent: &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/service-extensions/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Service Extensions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a powerful edge compute capability that allows you to inject custom code (written in Rust, C++ or Go) directly into the load balancer's data path. This approach gives you flexibility in a modern, managed, and high-performance framework.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_bkebSe1.max-1000x1000.jpg" alt="image1"&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="s1mli"&gt;This flowchart helps you decide the appropriate Google Cloud feature for each configuration&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Phase 3: Test and validate&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Once you’ve chosen the appropriate path for your configurations, you are ready to &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;deploy your new Application Load Balancer configuration in a staging environment that mirrors your production setup. Thoroughly test all application functionality, paying close attention to the migrated logic. Use a combination of automated testing and manual QA to validate the redirects, security policies, and that the custom Service Extensions logic are behaving as expected.&lt;/span&gt;&lt;/p&gt;
&lt;h4 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Phase 4: Phased cutover (canary deployment)&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Don't flip a single switch for all your traffic; instead, implement a phased migration strategy. Start the transitioning process by routing a small percentage of production traffic (e.g., 5-10%) to your new Google Cloud load balancer. During this initial period, be sure to monitor key metrics like latency, error rates, and application performance. As you gain confidence, you can progressively increase the percentage of traffic routed to the Application Load Balancer. Always have a clear rollback plan to revert back to the legacy infrastructure in the event you encounter critical issues.&lt;/span&gt;&lt;/p&gt;
&lt;h3 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Best practices for a smooth migration&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Drawing from our practical experience, we have compiled the following recommendations to assist you in planning your load balancer migrations. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Analyze first, migrate second:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A thorough analysis of your existing configurations is the most critical step. Don't "lift and shift" logic that is no longer needed.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Prefer declarative:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Always default to Google Cloud's managed, declarative features (URL Maps, Cloud Armor) first. They are simpler, more scalable, and require less maintenance.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Use Service Extensions strategically:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Reserve Service Extensions for the complex, bespoke business logic that declarative features cannot handle.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Monitor everything:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Continuously monitor both your existing load balancers and Google Cloud load balancers during the migration. Watch key metrics like traffic volume, latency, and error rates to detect and address issues instantly.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Train your team:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Ensure your team is trained on Cloud Load Balancing concepts. This will empower them to effectively operate and maintain the new infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Migrating from the existing on-premises load balancer infrastructure is more than just a technical task, it's an opportunity to modernize your application delivery. By thoughtfully mapping your current load balancing configurations and capabilities to either declarative Application Load Balancer features or programmatic Service Extensions, you can build a more scalable, resilient, and cost-effective infrastructure destined for future demands.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get started, review the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/load-balancing/docs/application-load-balancer"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Application Load Balancer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/service-extensions/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Service Extensions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; features and advanced capabilities to come up with the right design for your application. For more guidance and complex use cases, contact your &lt;/span&gt;&lt;a href="https://cloud.google.com/contact"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud team&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 10 Apr 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/networking/migrate-on-prem-application-load-balancing-to-google-cloud/</guid><category>Cloud Migration</category><category>Developers &amp; Practitioners</category><category>Networking</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Migrating to Google Cloud’s Application Load Balancer: A practical guide</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/networking/migrate-on-prem-application-load-balancing-to-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Gopinath Balakrishnan</name><title>Customer Engineer, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Xiaozang Li</name><title>Customer Engineer, Google 
Cloud</title><department></department><company></company></author></item><item><title>Create Expert Content: Local Testing of a Multi-Agent System with Memory</title><link>https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In support of our mission to accelerate the developer journey on Google Cloud, we built Dev Signal: a multi-agent system designed to transform raw community signals into reliable technical guidance by automating the path from discovery to expert creation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1"&gt;part 1&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run?utm_campaign=CDR_0x91b1edb5_default_b8022895&amp;amp;utm_medium=external&amp;amp;utm_source=social"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;part 2&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; of this series, we established the essential groundwork by standardizing the core capabilities through the Model Context Protocol (MCP) and constructing a multi-agent architecture integrated with the Vertex AI memory bank to provide long-term intelligence and persistence. Now, we'll explore how to test your multi-agent system locally!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you’d like to dive straight into the code and explore it at your own pace, you can clone the repository &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Testing the agent locally&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before transitioning your agentic system to Google Cloud Run, it is essential to ensure that its specialized components work seamlessly together on your workstation. This testing phase allows you to validate trend discovery, technical grounding, and creative drafting within a local feedback loop, saving time and resources during the development process.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this section, you will configure your local secrets, implement environment-aware utilities, and use a dedicated test runner to verify that Dev Signal can correctly retrieve user preferences from the Vertex AI memory bank on the cloud. This local verification ensures that your agent's "brain" and "hands" are properly synchronized before moving to deployment.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Environment setup&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;.env&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; file in your project root. These variables are used for local development and will be replaced by Terraform/Secret Manager in production.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev-signal&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/.env &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;and update with your own details.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Note&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GOOGLE_CLOUD_LOCATION &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;is set as global because that is where Gemini-3-flash-preview is supported. We will use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GOOGLE_CLOUD_LOCATION &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;for the model location.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Google Cloud Configuration\r\nGOOGLE_CLOUD_PROJECT=your-project-id\r\nGOOGLE_CLOUD_LOCATION=global\r\nGOOGLE_CLOUD_REGION=us-central1\r\nGOOGLE_GENAI_USE_VERTEXAI=True\r\nAI_ASSETS_BUCKET=your_bucket_name\r\n\r\n# Reddit API Credentials\r\nREDDIT_CLIENT_ID=your_client_id\r\nREDDIT_CLIENT_SECRET=your_client_secret\r\nREDDIT_USER_AGENT=my-agent/0.1\r\n\r\n# Developer Knowledge API Key\r\nDK_API_KEY=your_api_key&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3602243e80&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Helper Utilities&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a new directory for your application utils.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;cd dev_signal_agent\r\nmkdir app_utils\r\ncd app_utils&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3602243f10&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Environment configuration &lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This module standardizes how the agent discovers the active Google Cloud Project and Region, ensuring a seamless transition between development environments. Using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;load_dotenv()&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, the script first checks for local configurations before falling back to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google.auth.default()&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; or environment variables to retrieve the Project ID. This automated approach ensures your agent is properly authenticated and grounded in the correct cloud context without requiring manual configuration changes.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond basic project discovery, the script provides a robust &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Secret Management&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; layer. It attempts to resolve sensitive credentials, such as Reddit API keys, first from the local environment (for rapid development) and then dynamically from the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/secret-manager/docs/reference/rest?rep_location=me-central2&amp;amp;utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Secret Manager API&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for production security. By returning these as a dictionary rather than injecting them into environment variables, the module maintains a clean security posture.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The script further calibrates the environment by distinguishing between global and regional requirements for different AI services. It specifically assigns the "global" location for models to access cutting-edge preview features while designating a regional location, such as &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;us-central1&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, for infrastructure like the Vertex AI Agent Engine. By finalizing this setup with a global SDK initialization, the module integrates these settings into the session, allowing the rest of your application to interact with models and memory banks without having to repeatedly pass project or location parameters.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/app_utils/env.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import os\r\nimport google.auth\r\nimport vertexai\r\nfrom google.cloud import secretmanager\r\nfrom dotenv import load_dotenv\r\n\r\ndef _fetch_secrets(project_id: str):\r\n    &amp;quot;&amp;quot;&amp;quot;Fetch secrets from Secret Manager and return them as a dictionary.&amp;quot;&amp;quot;&amp;quot;\r\n    secrets_to_fetch = [&amp;quot;REDDIT_CLIENT_ID&amp;quot;, &amp;quot;REDDIT_CLIENT_SECRET&amp;quot;, &amp;quot;REDDIT_USER_AGENT&amp;quot;, &amp;quot;DK_API_KEY&amp;quot;]\r\n    fetched_secrets = {}\r\n\r\n    # First, check local environment (for local development via .env)\r\n    for s in secrets_to_fetch:\r\n        val = os.getenv(s)\r\n        if val:\r\n            fetched_secrets[s] = val\r\n\r\n    # If keys are missing (common in production), fetch from Secret Manager API\r\n    if len(fetched_secrets) &amp;lt; len(secrets_to_fetch):\r\n        client = secretmanager.SecretManagerServiceClient()\r\n        for secret_id in secrets_to_fetch:\r\n            if secret_id not in fetched_secrets:\r\n                name = f&amp;quot;projects/{project_id}/secrets/{secret_id}/versions/latest&amp;quot;\r\n                try:\r\n                    response = client.access_secret_version(request={&amp;quot;name&amp;quot;: name})\r\n                    # DO NOT set os.environ[secret_id] here. 
\r\n                    # Keep it in this dictionary only.\r\n                    fetched_secrets[secret_id] = response.payload.data.decode(&amp;quot;UTF-8&amp;quot;)\r\n                except Exception as e:\r\n                    print(f&amp;quot;Warning: Could not fetch {secret_id} from Secret Manager: {e}&amp;quot;)\r\n\r\n    return fetched_secrets\r\n\r\ndef init_environment():\r\n    &amp;quot;&amp;quot;&amp;quot;Consolidated environment discovery.&amp;quot;&amp;quot;&amp;quot;\r\n    load_dotenv()\r\n    try:\r\n        _, project_id = google.auth.default()\r\n    except Exception:\r\n        project_id = os.getenv(&amp;quot;GOOGLE_CLOUD_PROJECT&amp;quot;)\r\n    \r\n    model_location = os.getenv(&amp;quot;GOOGLE_CLOUD_LOCATION&amp;quot;, &amp;quot;global&amp;quot;)\r\n    service_location = os.getenv(&amp;quot;GOOGLE_CLOUD_REGION&amp;quot;, &amp;quot;us-central1&amp;quot;)\r\n    \r\n    secrets = {}\r\n    if project_id:\r\n        vertexai.init(project=project_id, location=service_location)\r\n        # Fetch secrets into a local variable\r\n        secrets = _fetch_secrets(project_id)\r\n        \r\n    return project_id, model_location, service_location, secrets&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffceeb0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Local testing script&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Google ADK comes with a built-in Web UI, This UI is excellent for visualizing agent logic and tool composition.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can launch it by running in the project root:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;uv run adk web&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec3398e0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;However, the default Web UI will not test the long-term memory integration described in this tutorial because it is not pre-connected to a Vertex AI memory session. By default, the generic UI often relies on in-memory services that do not persist data across sessions. Therefore, we use the dedicated &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;test_local.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; script to explicitly initialize the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;VertexAiMemoryBankService&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. This ensures that even in a local environment, your agent is communicating with the real cloud-based memory bank to validate preference persistence.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;test_local.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; script:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Connects to the real &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Agent Engine&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the cloud for memory storage.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Uses an in-memory session service for local chat history (so you can wipe it easily).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Run a chat loop where you can talk to your agent.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Go back to the root folder  &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev-signal&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;:&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;cd ../..&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec339640&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev-signal&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/test_local.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import asyncio\r\nimport os\r\nimport google.auth\r\nimport vertexai\r\nimport uuid\r\nfrom dotenv import load_dotenv\r\nfrom google.adk.runners import Runner\r\nfrom google.adk.memory.vertex_ai_memory_bank_service import VertexAiMemoryBankService\r\nfrom google.adk.sessions import InMemorySessionService\r\nfrom vertexai import agent_engines\r\nfrom google.genai import types\r\nfrom dev_signal_agent.agent import root_agent\r\n\r\n# Load environment variables\r\nload_dotenv()\r\n\r\nasync def main():\r\n    # 1. Setup Configuration\r\n    project_id = os.getenv(&amp;quot;GOOGLE_CLOUD_PROJECT&amp;quot;)\r\n    # Agent Engine (Memory) MUST use a regional endpoint\r\n    resource_location = &amp;quot;us-central1&amp;quot;\r\n    agent_name = &amp;quot;dev-signal&amp;quot;\r\n    \r\n    print(f&amp;quot;--- Initializing Vertex AI in {resource_location} ---&amp;quot;)\r\n    vertexai.init(project=project_id, location=resource_location)\r\n\r\n    # 2. Find the Agent Engine Resource for Memory\r\n    existing_agents = list(agent_engines.list(filter=f&amp;quot;display_name={agent_name}&amp;quot;))\r\n    if existing_agents:\r\n        agent_engine = existing_agents[0]\r\n        agent_engine_id = agent_engine.resource_name.split(&amp;quot;/&amp;quot;)[-1]\r\n        print(f&amp;quot;✅ Using persistent Memory Bank from Agent: {agent_engine_id}&amp;quot;)\r\n    else:\r\n        print(f&amp;quot;❌ Error: Agent Engine \&amp;#x27;{agent_name}\&amp;#x27; not found. Please deploy with Terraform first.&amp;quot;)\r\n        return\r\n\r\n    # 3. 
Initialize Services\r\n    # We use InMemorySessionService for easier local testing (IDs are flexible)\r\n    # BUT we use VertexAiMemoryBankService for REAL cloud persistence\r\n    session_service = InMemorySessionService()\r\n    \r\n    memory_service = VertexAiMemoryBankService(\r\n        project=project_id,\r\n        location=resource_location,\r\n        agent_engine_id=agent_engine_id\r\n    )\r\n\r\n    # 4. Create a Runner\r\n    runner = Runner(\r\n        agent=root_agent,\r\n        app_name=&amp;quot;dev-signal&amp;quot;,\r\n        session_service=session_service,\r\n        memory_service=memory_service \r\n    )\r\n\r\n    # 5. Run a Test Loop\r\n    user_id = &amp;quot;local-tester&amp;quot;\r\n    \r\n    print(&amp;quot;\\n--- TEST SCENARIO ---&amp;quot;)\r\n    print(&amp;quot;1. Start a session, tell the agent your preference (e.g., \&amp;#x27;write in rhymes\&amp;#x27;).&amp;quot;)\r\n    print(&amp;quot;2. Type \&amp;#x27;new\&amp;#x27; to start a FRESH session (local state wiped).&amp;quot;)\r\n    print(&amp;quot;3. Ask for a blog post. 
The agent should retrieve your preference from the CLOUD memory.&amp;quot;)\r\n    \r\n    current_session_id = f&amp;quot;session-{str(uuid.uuid4())[:8]}&amp;quot;\r\n    await session_service.create_session(\r\n        app_name=&amp;quot;dev-signal&amp;quot;,\r\n        user_id=user_id,\r\n        session_id=current_session_id\r\n    )\r\n    print(f&amp;quot;\\n--- Chat Session (ID: {current_session_id}) ---&amp;quot;)\r\n\r\n    while True:\r\n        user_input = input(&amp;quot;\\nYou: &amp;quot;)\r\n        \r\n        if user_input.lower() in [&amp;quot;exit&amp;quot;, &amp;quot;quit&amp;quot;]:\r\n            break\r\n            \r\n        if user_input.lower() == &amp;quot;new&amp;quot;:\r\n            # Simulate starting a completely fresh session\r\n            current_session_id = f&amp;quot;session-{str(uuid.uuid4())[:8]}&amp;quot;\r\n            await session_service.create_session(\r\n                app_name=&amp;quot;dev-signal&amp;quot;,\r\n                user_id=user_id,\r\n                session_id=current_session_id\r\n            )\r\n            print(f&amp;quot;\\n--- Fresh Session Started (ID: {current_session_id}) ---&amp;quot;)\r\n            print(&amp;quot;(Local history is empty, retrieval must come from Memory Bank)&amp;quot;)\r\n            continue\r\n\r\n        print(&amp;quot;Agent is thinking...&amp;quot;)\r\n        async for event in runner.run_async(\r\n            user_id=user_id,\r\n            session_id=current_session_id,\r\n            new_message=types.Content(parts=[types.Part(text=user_input)])\r\n        ):\r\n            if event.content and event.content.parts:\r\n                for part in event.content.parts:\r\n                    if part.text:\r\n                        print(f&amp;quot;Agent: {part.text}&amp;quot;)\r\n            \r\n            if event.get_function_calls():\r\n                for fc in event.get_function_calls():\r\n                    print(f&amp;quot;?️  Tool Call: 
{fc.name}&amp;quot;)\r\n\r\nif __name__ == &amp;quot;__main__&amp;quot;:\r\n    asyncio.run(main())&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec339820&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Running the Test&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;First, ensure you have your Application Default Credentials set up:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud auth application-default login&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec339760&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Then run the script:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;uv run test_local.py&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec339790&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
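If the script exits immediately, it is usually a configuration issue. Here is a minimal preflight sketch (stdlib only; the `GOOGLE_CLOUD_PROJECT` variable matches the script above, while the default ADC file path is an assumption about a standard gcloud install):

```python
import os

def preflight() -> list[str]:
    """Return a list of problems that would make test_local.py fail early."""
    problems = []
    # The script reads the project ID from the environment (or a .env file).
    if not os.getenv("GOOGLE_CLOUD_PROJECT"):
        problems.append("GOOGLE_CLOUD_PROJECT is not set")
    # Application Default Credentials are written to this path by
    # `gcloud auth application-default login` on Linux/macOS (assumed default).
    adc = os.path.expanduser(
        "~/.config/gcloud/application_default_credentials.json"
    )
    if not os.path.exists(adc) and not os.getenv("GOOGLE_APPLICATION_CREDENTIALS"):
        problems.append("no Application Default Credentials found")
    return problems

if __name__ == "__main__":
    for p in preflight():
        print(f"warning: {p}")
```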
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Test Scenario&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This scenario validates the full end-to-end lifecycle of the agent: from discovery and research to multimodal content creation and long-term memory retrieval.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Phase &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;1: Teaching &amp;amp; Multimodal Creation (Session 1)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Goal: Establish technical context and set a specific stylistic preference.&lt;/span&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Discovery&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/h4&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Ask the agent to find trending Cloud Run topics.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Input&lt;/span&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;"Find high-engagement questions about AI agents on Cloud Run from the last 21 days."&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test1.max-1000x1000.png"
        
          alt="test1"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test2.max-1000x1000.png"
        
          alt="test2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Research&lt;/span&gt;&lt;/h4&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Instruct the agent to perform a deep dive on a specific result.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Input&lt;/span&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;"Use the GCP Expert to research topic #1."&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test3.max-1000x1000.png"
        
          alt="test3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Personalization&lt;/span&gt;&lt;/h4&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Request a blog post and explicitly set your style preference.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Input&lt;/span&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;"Draft a blog post based on this research. From now on, I want all my technical blogs written in the style of a 90s Rap Song."&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test4.max-1000x1000.png"
        
          alt="test4"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Image generation&lt;/span&gt;&lt;/h4&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Ask the agent to generate an image that demonstrates the main ideas in the blog using the Nano Banana Pro tool. The image would be saved to your bucket in Google Cloud and you should get the path to see it which will look like this: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;https://storage.mtls.cloud.google.com/...&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/tokenoptimization.max-1000x1000.png"
        
          alt="tokenoptimization"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Phase &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;2: Long-Term Memory Recall (Session 2)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Goal: Verify the agent recalls preferences across a completely fresh session.&lt;/span&gt;&lt;/em&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Type &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;new&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; in the console to wipe local session history and start a fresh state.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Retrieval) Inquire about your stored preferences to test the Vertex AI memory bank.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Input&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;"What are my current topics of interest and what is my preferred blogging style?"&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Verification: Confirm the agent successfully retrieves your "AI Agents on Cloud Run" interest and "Rap" style from the cloud.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
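The behavior this phase verifies, ephemeral session history versus a persistent per-user memory bank, can be illustrated with a toy model. This is a conceptual sketch only (plain Python, with keyword matching standing in for semantic retrieval); it is not the ADK or Vertex AI API:

```python
class SessionService:
    """Ephemeral per-session history; wiped whenever a new session starts."""
    def __init__(self):
        self.sessions = {}
    def create_session(self, session_id):
        self.sessions[session_id] = []          # fresh, empty history
    def append(self, session_id, message):
        self.sessions[session_id].append(message)

class MemoryBank:
    """Long-term store keyed by user, surviving across sessions."""
    def __init__(self):
        self.facts = {}
    def remember(self, user_id, fact):
        self.facts.setdefault(user_id, []).append(fact)
    def search(self, user_id, query):
        # Trivial keyword match stands in for real semantic retrieval.
        return [f for f in self.facts.get(user_id, [])
                if any(w in f.lower() for w in query.lower().split())]

# Session 1: the user states a preference; the agent persists it to the bank.
sessions, bank = SessionService(), MemoryBank()
sessions.create_session("session-1")
sessions.append("session-1", "write my blogs as a 90s rap song")
bank.remember("local-tester", "Preferred blogging style: 90s rap song")

# Session 2: local history is empty, but the bank still answers.
sessions.create_session("session-2")
assert sessions.sessions["session-2"] == []
print(bank.search("local-tester", "blogging style"))
```

The point mirrors the test scenario: typing `new` only resets the `SessionService` side, while retrieval still succeeds against the `MemoryBank` side.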
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test5.max-1000x1000.png"
        
          alt="test5"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;Final Test&lt;/strong&gt;: Ask for a new blog on a different topic (e.g., "GKE Autopilot") and ensure it is automatically written as a rap song without being prompted.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Summary&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this part of our series we focused on verifying the agent's functionality in a local environment before proceeding to cloud deployment. By configuring local secrets and utilizing environment-aware utilities, we used a dedicated test runner to confirm that the core reasoning and tool logic are properly integrated. We successfully validated the full lifecycle: from Reddit discovery to expert content creation, confirming that the agent correctly retrieves preferences from the cloud-based Vertex AI memory bank even in completely fresh sessions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to run the test scenario yourself? Clone the &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and try the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;test_local.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; script to see 'Dev Signal' retrieve your preferences from the Vertex AI memory bank in real-time. For a deeper dive into the underlying mechanics of memory orchestration, check out this &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/quickstart-adk?content_ref=manage%20long%20term%20memories%20for%20you%20this%20tutorial%20demonstrates%20how%20you%20can%20use%20memory%20bank%20with%20the%20adk%20to%20manage%20long%20term%20memories%20create%20your%20local%20adk%20agent%20and%20runner&amp;amp;utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;quickstart guide&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the final part of this series, we will transition our prototype into production service on Google Cloud Run using Terraform for secure infrastructure and explore the roadmap to production excellence through continuous evaluation and security&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Special thanks to &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;Remigiusz Samborski&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; for the helpful review and feedback on this article.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;For more content like this, Follow me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Linkedin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 10 Apr 2026 08:11:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Create Expert Content: Local Testing of a Multi-Agent System with Memory</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Shir Meir Lador</name><title>Head of AI, Product DevRel</title><department></department><company></company></author></item><item><title>Experimenting with GPUs: GKE managed DRANET and Inference Gateway AI 
Deployment</title><link>https://cloud.google.com/blog/topics/developers-practitioners/experimenting-with-gpus-gke-managed-dranet-and-inference-gateway-ai-deployment/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building and serving models on infrastructure is a strong use case for businesses. In Google Cloud, you can design your AI infrastructure to suit your workloads. Recently, I experimented with Google Kubernetes Engine &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/allocate-network-resources-dra"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;(GKE) managed DRANET&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; while deploying a model for inference with NVIDIA B200 GPUs on GKE. In this blog, we will explore this setup in easy-to-follow steps.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;What is DRANET &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Resource Allocation (DRA)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is a feature that lets you request and share resources among Pods. DRANET allows you to request and allocate networking resources for your Pods, including network interfaces that support TPUs &amp;amp; Remote Direct Memory Access (RDMA). In my case, the use of high-end GPUs.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;How GPU RDMA VPC works &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vpc/docs/rdma-network-profiles#overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;RDMA network&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is set up as an isolated VPC, which is regional and assigned a network profile type. In this case, the network profile type is RoCEv2. This VPC is dedicated for GPU-to-GPU communication. The GPU VM families have RDMA capable NICs that connect to the RDMA VPC. The GPUs communicate between multiple nodes via this low latency, high speed rail aligned setup.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Design pattern example&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our aim was to deploy a LLM model (Deepseek) onto a GKE cluster with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A4 nodes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; that support 8 B200 GPUs and serve it via &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GKE Inference gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; privately. To set up an &lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt; GKE cluster, you can use the Cluster Toolkit, but in my case, I wanted to test the &lt;span style="vertical-align: baseline;"&gt;GKE managed &lt;/span&gt;DRANET dynamic setup of the networking that supports RDMA for the GPU communication.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1-archgpu.max-1000x1000.png"
        
          alt="1-archgpu"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This design utilizes the following services to provide an end-to-end solution:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;VPC:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Total of 3 VPC. One VPC manually created, two created automatically by &lt;span style="vertical-align: baseline;"&gt;GKE managed &lt;/span&gt;DRANET, one standard and one for RDMA.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To deploy the workload.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE Inference gateway:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To expose the workload internally using a regional internal Application Load Balancers type gke-l7-rilb.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;A4 VM’s:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; These support RoCEv2 with NVIDIA B200 GPU.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Putting it together &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get access to the A4 VM a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/consumption-models#comparison"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;future reservation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; was used. This is linked to a specific zone.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Begin:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Set up the environment &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Create a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vpc/docs/create-modify-vpc-networks#create-custom-network"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;standard VPC&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, with firewall rules and subnet in the same zone as the reservation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Create a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/load-balancing/docs/proxy-only-subnets#proxy_only_subnet_create"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;proxy-only subnet&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; this will be used with the Internal regional application load balancer attached to the GKE inference gateway&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
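The two setup steps above can be sketched with gcloud. The network names, region, and IP ranges below are placeholder assumptions, not values from the original walkthrough; pick a region/zone matching your reservation:

```shell
# Custom-mode VPC plus a subnet in the reservation's region.
gcloud compute networks create demo-net --subnet-mode=custom
gcloud compute networks subnets create demo-net-sub \
    --network=demo-net --region=us-central1 --range=10.0.0.0/24

# Proxy-only subnet consumed by the regional internal Application
# Load Balancer that backs the GKE Inference Gateway.
gcloud compute networks subnets create demo-proxy-only-sub \
    --purpose=REGIONAL_MANAGED_PROXY --role=ACTIVE \
    --network=demo-net --region=us-central1 --range=10.129.0.0/23

# Basic firewall rule so cluster-internal traffic is allowed.
gcloud compute firewall-rules create demo-allow-internal \
    --network=demo-net --allow=tcp,udp,icmp --source-ranges=10.0.0.0/8
```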
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Next&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Create a standard GKE cluster node and default node pool.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud container clusters create $CLUSTER_NAME \\\r\n    --location=$ZONE \\\r\n    --num-nodes=1 \\\r\n    --machine-type=e2-standard-16 \\\r\n    --network=${GVNIC_NETWORK_PREFIX}-main \\\r\n    --subnetwork=${GVNIC_NETWORK_PREFIX}-sub \\\r\n    --release-channel rapid \\\r\n    --enable-dataplane-v2 \\\r\n    --enable-ip-alias \\\r\n    --addons=HttpLoadBalancing,RayOperator \\\r\n    --gateway-api=standard \\\r\n    --enable-ray-cluster-logging \\\r\n    --enable-ray-cluster-monitoring \\\r\n    --enable-managed-prometheus \\\r\n    --enable-dataplane-v2-metrics \\\r\n    --monitoring=SYSTEM&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffaaac0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once that is complete you can connect to your cluster:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE --project $PROJECT&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffaa670&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/allocate-network-resources-dra#enable-dra-driver-gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GPU node pool&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (this example uses, A4 VM with reservation) and additionals flags: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;--accelerator-network-profile=auto&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (GKE automatically adds the gke.networks.io/accelerator-network-profile: auto label to the nodes)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;--node-labels=cloud.google.com/gke-networking-dra-driver=true&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (Enables DRA for high-performance networking)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud beta container node-pools create $NODE_POOL_NAME \\\r\n  --cluster $CLUSTER_NAME \\\r\n  --location $ZONE \\\r\n  --node-locations $ZONE \\\r\n  --machine-type a4-highgpu-8g \\\r\n  --accelerator type=nvidia-b200,count=8,gpu-driver-version=latest \\\r\n  --enable-autoscaling --num-nodes=1 --total-min-nodes=1 --total-max-nodes=3 \\\r\n  --reservation-affinity=specific \\\r\n--reservation=projects/$PROJECT/reservations/$RESERVATION_NAME/reservationBlocks/$BLOCK_NAME \\\r\n   --accelerator-network-profile=auto \\\r\n--node-labels=cloud.google.com/gke-networking-dra-driver=true&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffaa8b0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Next:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Create a ResourceClaimTemplate, which will be used to attach the networking resources to your deployments. The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;deviceClassName: mrdma.google.com &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;is used for GPU workloads:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;apiVersion: resource.k8s.io/v1\r\nkind: ResourceClaimTemplate\r\nmetadata:\r\n  name: all-mrdma\r\nspec:\r\n  spec:\r\n    devices:\r\n      requests:\r\n      - name: req-mrdma\r\n        exactly:\r\n          deviceClassName: mrdma.google.com\r\n          allocationMode: All&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffaacd0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Deploy model and inference&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Now that a cluster and node pool is setup,&lt;/span&gt; we can deploy a model and serve it via Inference gateway. In my experiment I used DeepSeek but this could be any model.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Deploy model and services&lt;/span&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt; nodeSelector: gke.networks.io/accelerator-network-profile: auto &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;is used to schedule the workload onto the GPU nodes&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt; resourceClaims: &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;field attaches the networking resource claim we defined earlier&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Create a secret (&lt;/span&gt;&lt;a href="https://huggingface.co/docs/hub/security-tokens#how-to-manage-user-access-tokens" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;I used a Hugging Face&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt; token)&lt;/strong&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;kubectl create secret generic hf-secret \\\r\n  --from-literal=hf_token=${HF_TOKEN}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffaaa60&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Deployment&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;apiVersion: apps/v1\r\nkind: Deployment\r\nmetadata:\r\n  name: deepseek-v3-1-deploy\r\nspec:\r\n  replicas: 1\r\n  selector:\r\n    matchLabels:\r\n      app: deepseek-v3-1\r\n  template:\r\n    metadata:\r\n      labels:\r\n        app: deepseek-v3-1\r\n        ai.gke.io/model: deepseek-v3-1\r\n        ai.gke.io/inference-server: vllm\r\n        examples.ai.gke.io/source: user-guide\r\n    spec:\r\n      containers:\r\n      - name: vllm-inference\r\n        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250819_0916_RC01\r\n        resources:\r\n          requests:\r\n            cpu: &amp;quot;190&amp;quot;\r\n            memory: &amp;quot;1800Gi&amp;quot;\r\n            ephemeral-storage: &amp;quot;1Ti&amp;quot;\r\n            nvidia.com/gpu: &amp;quot;8&amp;quot;\r\n          limits:\r\n            cpu: &amp;quot;190&amp;quot;\r\n            memory: &amp;quot;1800Gi&amp;quot;\r\n            ephemeral-storage: &amp;quot;1Ti&amp;quot;\r\n            nvidia.com/gpu: &amp;quot;8&amp;quot;\r\n          claims:\r\n          - name: rdma-claim\r\n        command: [&amp;quot;python3&amp;quot;, &amp;quot;-m&amp;quot;, &amp;quot;vllm.entrypoints.openai.api_server&amp;quot;]\r\n        args:\r\n        - --model=$(MODEL_ID)\r\n        - --tensor-parallel-size=8\r\n        - --host=0.0.0.0\r\n        - --port=8000\r\n        - --max-model-len=32768\r\n        - --max-num-seqs=32\r\n        - --gpu-memory-utilization=0.90\r\n        - --enable-chunked-prefill\r\n        - --enforce-eager\r\n        - --trust-remote-code\r\n        env:\r\n        - name: MODEL_ID\r\n          value: deepseek-ai/DeepSeek-V3.1\r\n        - name: HUGGING_FACE_HUB_TOKEN\r\n          valueFrom:\r\n            secretKeyRef:\r\n              name: hf-secret\r\n              key: hf_token\r\n        volumeMounts:\r\n        - mountPath: /dev/shm\r\n          name: dshm\r\n        livenessProbe:\r\n          httpGet:\r\n            path: /health\r\n            port: 8000\r\n          initialDelaySeconds: 1800\r\n          periodSeconds: 10\r\n        readinessProbe:\r\n          httpGet:\r\n            path: /health\r\n            port: 8000\r\n          initialDelaySeconds: 1800\r\n          periodSeconds: 5\r\n      volumes:\r\n      - name: dshm\r\n        emptyDir:\r\n            medium: Memory\r\n      nodeSelector:\r\n        gke.networks.io/accelerator-network-profile: auto\r\n      resourceClaims:\r\n      - name: rdma-claim\r\n        resourceClaimTemplateName: all-mrdma\r\n---\r\napiVersion: v1\r\nkind: Service\r\nmetadata:\r\n  name: deepseek-v3-1-service\r\nspec:\r\n  selector:\r\n    app: deepseek-v3-1\r\n  type: ClusterIP\r\n  ports:\r\n    - protocol: TCP\r\n      port: 8000\r\n      targetPort: 8000\r\n---\r\napiVersion: monitoring.googleapis.com/v1\r\nkind: PodMonitoring\r\nmetadata:\r\n  name: deepseek-v3-1-monitoring\r\nspec:\r\n  selector:\r\n    matchLabels:\r\n      app: deepseek-v3-1\r\n  endpoints:\r\n  - port: 8000\r\n    path: /metrics\r\n    interval: 30s&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ecb46b80&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
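Once the deployment is applied, you can check that the pod is running and that the DRA networking claim was actually allocated. This is a quick verification sketch, assuming the default namespace; the claim name is generated from the all-mrdma template, so list the claims first:

```shell
# Confirm the vLLM pod is scheduled on the GPU node and running
kubectl get pods -l app=deepseek-v3-1 -o wide

# List the ResourceClaims generated from the all-mrdma template;
# an allocated claim indicates the mRDMA devices were attached
kubectl get resourceclaims

# Inspect a claim's status for the list of allocated devices
# (substitute a claim name from the previous command's output)
kubectl describe resourceclaim <generated-claim-name>
```

These commands require access to the running cluster; the placeholder claim name must be replaced with the generated name from your environment.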
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Deploy GKE Inference Gateway&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This step &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/deploy-gke-inference-gateway#prepare-environment"&gt;installs the needed Custom Resource Definitions (CRDs) in your GKE cluster:&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For GKE versions &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;1.34.0-gke.1626000&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; or later, install only the alpha &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;InferenceObjective&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; CRD:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ecb466a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Create the InferencePool&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;helm install deepseek-v3-pool \\\r\n  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \\\r\n  --version v1.0.1 \\\r\n  --set inferencePool.modelServers.matchLabels.app=deepseek-v3-1 \\\r\n  --set provider.name=gke \\\r\n  --set inferenceExtension.monitoring.gke.enabled=true&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ecb46970&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Create the Gateway, HTTPRoute and InferenceObjective&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# 1. The Regional Internal Gateway (ILB)\r\napiVersion: gateway.networking.k8s.io/v1\r\nkind: Gateway\r\nmetadata:\r\n  name: deepseek-v3-gateway\r\n  namespace: default\r\nspec:\r\n  gatewayClassName: gke-l7-rilb\r\n  listeners:\r\n  - name: http\r\n    protocol: HTTP\r\n    port: 80\r\n    allowedRoutes:\r\n      namespaces:\r\n        from: Same\r\n---\r\n# 2. The HTTPRoute (Routing to the Pool)\r\napiVersion: gateway.networking.k8s.io/v1\r\nkind: HTTPRoute\r\nmetadata:\r\n  name: deepseek-v3-route\r\n  namespace: default\r\nspec:\r\n  parentRefs:\r\n  - name: deepseek-v3-gateway\r\n  rules:\r\n  - matches:\r\n    - path:\r\n        type: PathPrefix\r\n        value: /\r\n    backendRefs:\r\n    - group: inference.networking.k8s.io\r\n      kind: InferencePool\r\n      name: deepseek-v3-pool\r\n---\r\n# 3. The Inference Objective (Performance Logic)\r\napiVersion: inference.networking.x-k8s.io/v1alpha2\r\nkind: InferenceObjective\r\nmetadata:\r\n  name: deepseek-v3-objective\r\n  namespace: default\r\nspec:\r\n  poolRef:\r\n    name: deepseek-v3-pool&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ecb46490&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
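Before testing, you need the internal IP address that the Gateway was assigned. One way to read it (a sketch assuming the default namespace, and that the regional internal ALB has finished programming) is from the Gateway's status:

```shell
# Read the regional internal ALB address from the Gateway status
# (it can take a few minutes for the address to be populated)
kubectl get gateway deepseek-v3-gateway \
  -o jsonpath='{.status.addresses[0].value}'

# Export it for use in subsequent test requests
export GATEWAY_IP=$(kubectl get gateway deepseek-v3-gateway \
  -o jsonpath='{.status.addresses[0].value}')
echo "Gateway IP: ${GATEWAY_IP}"
```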
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once complete, you can create a test VM in your main VPC and make a call to the IP address of the GKE Inference Gateway:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;curl -N -s -X POST &amp;quot;http://$GATEWAY_IP/v1/chat/completions&amp;quot; \\\r\n  -H &amp;quot;Content-Type: application/json&amp;quot; \\\r\n  -d \&amp;#x27;{\r\n    &amp;quot;model&amp;quot;: &amp;quot;deepseek-ai/DeepSeek-V3.1&amp;quot;,\r\n    &amp;quot;messages&amp;quot;: [{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Box A: red. Box B: blue. Box C: empty. Move A to C, Move B to A, Swap B and C. Where is red?&amp;quot;}],\r\n    &amp;quot;stream&amp;quot;: true\r\n  }\&amp;#x27; | stdbuf -oL grep &amp;quot;data: &amp;quot; | sed -u \&amp;#x27;s/^data: //\&amp;#x27; | grep -v &amp;quot;\\[DONE\\]&amp;quot; | \\\r\n  jq --unbuffered -rj \&amp;#x27;.choices[0].delta | (.reasoning_content // .reasoning // .content // empty)\&amp;#x27;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ecb464c0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Next steps&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To take a deeper dive into GKE managed DRANET and the GKE Inference Gateway, review the following.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Blog: &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-device-management-with-dra-dynamic-resource-allocation?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DRA: A new era of Kubernetes device management with Dynamic Resource Allocation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Document set: &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/config-auto-net-for-accelerators"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DRANET&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Documentation: &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Want to ask a question, find out more, or share a thought? Please connect with me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/ammett/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;LinkedIn&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 08 Apr 2026 10:05:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/experimenting-with-gpus-gke-managed-dranet-and-inference-gateway-ai-deployment/</guid><category>Networking</category><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/0-hero-dranet.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Experimenting with GPUs: GKE managed DRANET and Inference Gateway AI Deployment</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/0-hero-dranet.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/experimenting-with-gpus-gke-managed-dranet-and-inference-gateway-ai-deployment/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ammett Williams</name><title>Developer Relations Engineer</title><department></department><company></company></author></item><item><title>See beyond the IP and secure URLs with Google Cloud NGFW</title><link>https://cloud.google.com/blog/products/identity-security/see-beyond-the-ip-and-secure-urls-with-google-cloud-ngfw/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a cloud-first world, traditional IP-based defenses are no longer enough to protect your perimeter. 
As services migrate to shared infrastructure and content delivery networks, relying on static IP addresses and FQDNs can create security gaps.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Because a single IP address can host multiple services, and IP addresses can change frequently, we are introducing domain filtering with a wildcard capability in Cloud Next Generation Firewall (NGFW) Enterprise. This new capability provides increased security and granular policy controls.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Why domain and SNI filtering matters&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Cloud NGFW URL filtering service performs deep inspections of HTTP payloads to secure workloads against threats from both public and internal networks. This service elevates security controls to the application layer and helps restrict access to malicious domains. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Key use cases include: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Granular egress control&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This capability enables the precise allowing and blocking of connections based on domain names and SNI information found in egress HTTP(S) messages. By inspecting Layer 7 (L7) headers, it offers significantly finer control than traditional filtering based solely on IP addresses and FQDNs, which can be inefficient when a single IP hosts multiple services.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Control access without decrypting&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: For organizations that prefer not to perform full TLS decryption on their traffic, Cloud NGFW can still enforce security policies by controlling traffic based on SNI headers provided during the TLS handshake. This allows for effective domain-level filtering while maintaining end-to-end encryption for privacy or compliance reasons.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Reduced operational overhead&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Implementing domain-based filtering helps reduce the constant maintenance typically required to track frequently changing IP addresses and DNS records. By focusing on stable domain identities rather than dynamic network attributes, security teams can minimize the manual effort involved in updating firewall rulebases.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Flexible matching&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The service utilizes matcher strings within URL lists, supporting limited wildcard domains to define criteria for both domains and subdomains. For example, using a wildcard like &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;*.example.com&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; allows a single filter to cover all associated subdomains, providing a more scalable solution than defining thousands of individual FQDN entries.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Improved security: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;URL filtering significantly enhances the security posture by protecting against sophisticated evasion techniques such as SNI header spoofing. By evaluating L7 headers before allowing access to an application, Cloud NGFW ensures that attackers cannot bypass security controls by simply spoofing lower-layer identifiers. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;How Cloud NGFW URL filtering works&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The URL filtering service functions by inspecting traffic at L7 using a distributed architecture. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_zzP0Xt6.max-1000x1000.png"
        
          alt="image1"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="6nmqq"&gt;Cloud NGFW URL filtering service&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can get started with URL filtering in three simple steps.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Deploy Cloud NGFW endpoints&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The first step is to &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/configure-firewall-endpoints#create-firewall-endpoint"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;create and deploy a Cloud NGFW endpoint&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in a zone. The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/about-firewall-endpoints"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NGFW endpoint&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is an organization-level resource. Please ensure you have the right permissions before deploying the endpoint.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Once the endpoint is deployed, you can &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/configure-firewall-endpoint-associations#create-end-assoc-network"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;associate it with one or more VPCs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; of your choice.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Create security profiles and security profile groups:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/about-security-profiles#url-filtering-profile"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;URL filtering security profile&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; holds the URL filters with matcher strings and an action (allow or deny).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/about-security-profile-groups"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;security profile group&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; acts as a container for these security profiles, which is then referenced by a firewall policy rule. &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/configure-urlf-security-profiles#create-urlf-security-profile"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create URL filtering security profiles&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with the desired URLs and wildcard FQDNs, and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/configure-security-profile-groups#create-security-profile-group"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;add them to a security profile group&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Once the security profile group is created, reference it in your firewall policies.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Policy enforcement:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;You enable the service by configuring a hierarchical or global network firewall policy rule using the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;apply_security_profile_group&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; action, specifying the name of your security profile group. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/ol&gt;
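As a rough sketch, the three steps above map to a sequence of gcloud commands along these lines. The resource names here are placeholders, and the exact URL-filtering profile subcommand and flag spellings are illustrative assumptions; confirm them against the linked documentation before use:

```shell
# 1a. Create a firewall endpoint (an organization-level resource)
gcloud network-security firewall-endpoints create my-endpoint \
  --zone=us-central1-a \
  --organization=ORGANIZATION_ID \
  --billing-project=PROJECT_ID

# 1b. Associate the endpoint with a VPC network in the same zone
gcloud network-security firewall-endpoint-associations create my-assoc \
  --endpoint=my-endpoint \
  --zone=us-central1-a \
  --network=my-vpc \
  --project=PROJECT_ID

# 2. Create a URL filtering security profile and wrap it in a
#    security profile group (subcommand and flags are illustrative)
gcloud network-security security-profiles url-filtering create my-urlf-profile \
  --organization=ORGANIZATION_ID --location=global
gcloud network-security security-profile-groups create my-spg \
  --organization=ORGANIZATION_ID --location=global \
  --url-filtering-profile=my-urlf-profile

# 3. Reference the group from a global network firewall policy rule
#    using the apply_security_profile_group action
gcloud compute network-firewall-policies rules create 1000 \
  --firewall-policy=my-policy --global-firewall-policy \
  --direction=EGRESS --layer4-configs=tcp:443 \
  --dest-ip-ranges=0.0.0.0/0 \
  --action=apply_security_profile_group \
  --security-profile-group="//networksecurity.googleapis.com/organizations/ORGANIZATION_ID/locations/global/securityProfileGroups/my-spg"
```

These commands require organization-level permissions and a billing project, as noted in step 1 above.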
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more information about configuring a firewall policy rule, see the following:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/using-firewall-policies#create-ingress-rule-target-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create an ingress hierarchical firewall policy rule&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/using-firewall-policies#create-egress-rule-target-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create an egress hierarchical firewall policy rule&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/use-network-firewall-policies#create-ingress-rule-target-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create an ingress global network firewall policy rule&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/use-network-firewall-policies#create-egress-rule-target-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create an egress global network firewall policy rule&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Getting started&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Get started with Cloud NGFW URL filtering by visiting our &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/about-url-filtering"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/cloud-ngfw-enterprise-urlf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;codelab&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 07 Apr 2026 17:30:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/identity-security/see-beyond-the-ip-and-secure-urls-with-google-cloud-ngfw/</guid><category>Networking</category><category>Developers &amp; Practitioners</category><category>Security &amp; Identity</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>See beyond the IP and secure URLs with Google Cloud NGFW</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/identity-security/see-beyond-the-ip-and-secure-urls-with-google-cloud-ngfw/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Uttam Ramesh</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Susan Wu</name><title>Outbound Product Manager</title><department></department><company></company></author></item><item><title>Envoy: A future-ready foundation for agentic AI networking</title><link>https://cloud.google.com/blog/products/networking/the-case-for-envoy-networking-in-the-agentic-ai-era/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In today's agentic AI environments, the network has a new 
set of responsibilities.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a traditional application stack, the network mainly moves requests between services. But as discussed in a recent white paper,&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://services.google.com/fh/files/misc/cloud_infrastructure_in_the_agent_native_era.pdf" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Cloud Infrastructure in the Agent-Native Era&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; in an agentic system the network sits in the middle of model calls, tool invocations, agent-to-agent interactions, and policy decisions that can shape what an agent is allowed to do. The rapid proliferation of agents, often built on diverse frameworks, necessitates a consistent enforcement of governance and security across all agentic paths at scale. To achieve this, the enforcement layer must shift from the application level to the underlying infrastructure. That means the network can no longer operate as a blind transport layer. It has to understand more, enforce better, and adapt faster. This shift is precisely where Envoy comes in.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a high-performance distributed proxy and universal data plane, Envoy is built for massive scale. Trusted by demanding enterprise environments, including Google Cloud, it supports everything from single-service deployments to complex service meshes using Ingress, Egress, and Sidecar patterns. Because of its deep extensibility, robust policy integration, and operational maturity, Envoy is uniquely suited for an era where protocols change quickly and the cost of weak control is steep. For teams building agentic AI, Envoy is more than a concept: it's a practical, production-ready foundation.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_xPxMxF4.max-1000x1000.jpg"
        
          alt="1"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic AI changes the networking problem&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic workloads still often use HTTP as a transport, but they break some of the assumptions that traditional HTTP intermediaries rely on. Protocols such as&lt;/span&gt;&lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Model Context Protocol&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (MCP) and&lt;/span&gt;&lt;a href="https://github.com/google/A2A" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent2Agent&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (A2A) use&lt;/span&gt;&lt;a href="https://www.jsonrpc.org/specification" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;JSON-RPC&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or&lt;/span&gt;&lt;a href="https://grpc.io" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;gRPC&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; over HTTP, adding protocol-level phases such as MCP initialization, where client and server exchange their capabilities, on top of standard HTTP request/response semantics. The key aspects of agentic systems that require intermediaries to adapt include:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Diverse enterprise governance imperatives. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The primary challenge is satisfying the wide spectrum of non-negotiable enterprise requirements for safety, security, data privacy, and regulatory compliance. These needs often go beyond standard network policies and require deep integration with internal systems, custom logic, and the ability to rapidly adapt to new organizational rules or external regulations. This demands a highly extensible framework where enterprises can plug in their specific governance models.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Policy attributes live inside message bodies, not headers.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Unlike traditional web traffic where policy inputs like paths and headers are readily accessible, agentic protocols frequently bury critical attributes (e.g., model names, tool calls, resource IDs) deep within JSON-RPC or gRPC payloads. This shift requires intermediaries to possess the ability to parse and understand message contents to apply context-aware policies.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Handling diverse and evolving protocol characteristics. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic protocols are not uniform. Some, like MCP with Streamable HTTP, can introduce stateful interactions requiring session management across distributed proxies (e.g., using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;Mcp-Session-Id&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;). The need to support such varied behaviors, along with future protocol innovations, reinforces the necessity of an inherently adaptable and extensible networking foundation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These factors mean enterprises need more than just connectivity. The network must now serve as a central point for enforcing the crucial governance needs mentioned earlier. This includes providing capabilities like centralized security, comprehensive auditability, fine-grained policy enforcement, and dynamic guardrails, all while keeping pace with the rapid evolution of protocols and agent behaviors. Put simply, agentic AI transforms the network from a mere transit path into a critical control point.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Why Envoy fits this shift&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy is a strong fit for agentic AI networking for three reasons. Envoy is:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Battle-tested.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Enterprises already rely on Envoy in high-scale, security-sensitive environments, making it a credible platform to anchor a new generation of traffic management and policy enforcement.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Extensible.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Envoy can be extended through native filters, Rust modules, WebAssembly (Wasm) modules, and &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;external processing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; patterns. That gives platform teams room to adopt new protocols without having to rebuild their networking layer every time the ecosystem changes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Operationally useful today.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Envoy already acts as a gateway, enforcement point, observability layer, and integration surface for control planes. That makes it a practical choice for organizations that need to move now, not after the standards settle.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on these core strengths, Envoy has introduced specific architectural advancements to meet the unique demands of agentic networking:&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;1. Envoy understands agent traffic&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The first requirement for agentic networking is simple: The gateway needs to understand what the agent is actually trying to do.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That’s harder than it sounds. In protocols such as MCP, A2A, and OpenAI-style APIs, important policy signals may live inside the request body. Traditional HTTP proxies are optimized to treat bodies as opaque byte streams. That design is efficient, but it limits what the proxy can enforce. For protocols that use JSON messages, a proxy may need to buffer the entire request body to locate attribute values needed for policy application — especially when those attributes appear at the end of the JSON message. Business logic specific to gen AI protocols, such as rate limiting based on consumed tokens, may also require parsing server responses.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy addresses this by deframing protocol messages carried over HTTP and exposing useful attributes to the rest of the filter chain. The extensibility model for gen AI protocols was guided by two goals:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Easy reuse of existing HTTP extensions that work with gen AI protocols out of the box, such as RBAC or tracers.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Easy access to deframed messages for gen-AI-specific extensions, so that developers can focus on gen AI business logic without needing to deal with HTTP or JSON envelopes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Based on these goals, new extensions for gen AI protocols are still built as HTTP extensions and configured in the HTTP filter chain. This provides flexibility to mix HTTP-native business logic, such as OAuth or mTLS authorization, with gen AI protocol logic in a single chain. A deframing extension parses the protocol messages carried by HTTP and provides an ambient context with extracted attributes, or even the entirety of parsed messages, to downstream extensions via well-known filter state and metadata values.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of forcing every policy component to parse JSON envelopes or protocol-specific message formats on its own, Envoy makes those attributes available as structured metadata. Once the gateway has deframed protocol messages, existing Envoy extensions such as &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_authz_filter" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ext_authz&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or RBAC can read protocol properties to evaluate policies using protocol-specific attributes such as tool names for MCP, message attributes for A2A, or model names for OpenAI.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Access logs can include message attributes for enhanced monitoring and auditing. The protocol attributes are also available to the &lt;/span&gt;&lt;a href="https://cel.dev/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Common Expression Language&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (CEL) runtime, simplifying creation of complex policy expressions in RBAC or composite extensions.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_t4lf1kG.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Buffering and memory management&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy is designed to use as little memory as possible when proxying HTTP requests. However, parsing agentic protocols may require an arbitrary amount of buffer space, especially when extensions require the entire message to be in memory. The flexibility of allowing extensions to use larger buffers needs to be balanced with adequate protection from memory exhaustion, especially in the presence of untrusted traffic.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To achieve this, Envoy now provides a per-request buffer size limit. Buffers that hold request data are also integrated with the overload manager, enabling a full range of protective actions under memory pressure, such as reducing idle timeouts or resetting requests that consume the most memory for an extended duration. These changes pave the way for Envoy to serve as a gateway and policy-enforcement point for gen AI protocols without compromising its resource efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;2. Envoy enforces policy on things that matter&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Understanding traffic is only useful if the gateway can act on it.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In agentic systems, policy is not just about which service an agent can reach. It’s about which tools an agent can call, which models it can use, what identity it presents, how much it can consume, and what kinds of outputs require additional controls. Those are higher-value decisions than simple layer-4 or path-based controls, and they are exactly the kinds of controls enterprises care about when agents are allowed to take action on their behalf.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy is well-positioned here because it can combine transport-level security with application-aware policy enforcement. Teams can authenticate workloads with mTLS and SPIFFE identities, then enforce protocol-specific rules with RBAC, external authorization, external processing, access logging, and CEL-based policy expressions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This capability is crucial because it lets platform teams decouple agent development from enforcement. Developers can focus on building useful agents, while operators enforce a consistent zero-trust posture at the network layer, even as tools, models, and protocols continue to change.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;A prime example of this zero-trust decoupling is the critical "user-behind-agent" scenario, where an AI agent must execute tasks on a human user's behalf. Traditionally, handing user credentials directly to an application introduces severe security risks — if the agent is compromised or manipulated via prompt injection, an attacker could exfiltrate or misuse those credentials. By offloading identity management to Envoy, the proxy can automatically insert user delegation tokens into outbound requests at the infrastructure layer. Because the agent never directly holds the sensitive credential, the risk of a compromised agent misusing or leaking the token is completely neutralized, ensuring actions remain strictly bound to the user's actual permissions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Case study: Restricting an agent to specific GitHub MCP tools&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Consider an agent that triages GitHub issues.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The GitHub MCP server may expose dozens of tools, but the agent may only need a small read-only subset, such as &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;list_issues&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;get_issue&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;get_issue_comments&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. In most enterprises, that difference matters. A useful agent should not automatically become an unrestricted one.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With Envoy in front of the MCP server, the gateway can verify the agent identity using SPIFFE during the mTLS handshake, parse the MCP message via &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/mcp/v3/mcp.proto#envoy-v3-api-msg-extensions-filters-http-mcp-v3-mcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;the deframing filter&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, extract the requested method and tool name, and enforce a policy that allows only the approved tool calls for that specific agent identity. RBAC uses metadata created by the MCP deframing filter to check the method and tool name in the MCP message:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;envoy.filters.http.rbac:\r\n  &amp;quot;@type&amp;quot;: type.googleapis.com/envoy.extensions.filters.http.rbac.v3.RBACPerRoute\r\n  rbac:\r\n    rules:\r\n      policies:\r\n        github-issue-reader-policy:\r\n          permissions:\r\n            - and_rules:\r\n                rules:\r\n                  - sourced_metadata:\r\n                      metadata_matcher:\r\n                        filter: envoy.http.filters.mcp\r\n                        path: [{ key: &amp;quot;method&amp;quot; }]\r\n                        value: { string_match: { exact: &amp;quot;tools/call&amp;quot; } }\r\n                  - sourced_metadata:\r\n                      metadata_matcher:\r\n                        filter: envoy.http.filters.mcp\r\n                        path: [{ key: &amp;quot;params&amp;quot; }, { key: &amp;quot;name&amp;quot; }]\r\n                        value:\r\n                          or_match:\r\n                            value_matchers:\r\n                              - string_match: { exact: &amp;quot;list_issues&amp;quot; }\r\n                              - string_match: { exact: &amp;quot;get_issue&amp;quot; }\r\n                              - string_match: { exact: &amp;quot;get_issue_comments&amp;quot; }\r\n          principals:\r\n            - authenticated:\r\n                principal_name:\r\n                  exact: &amp;quot;spiffe://cluster.local/ns/github-agents/sa/issue-triage-agent&amp;quot;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3600e814c0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That’s the real value: Policy is enforced centrally, close to the traffic, and in terms that match the agent's actual behavior.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_jtbLCMn.max-1000x1000.png"
        
          alt="3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Beyond static rules: External authorization&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;A complex compliance policy that can’t be expressed using RBAC rules can be implemented in an external authorization service using the &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_authz_filter" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ext_authz&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; protocol. Envoy provides MCP message attributes along with HTTP headers in the context of the ext_authz RPC. It can also forward the agent's SPIFFE identity from the peer certificate:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;http_filters:\r\n  - name: envoy.filters.http.ext_authz\r\n    typed_config:\r\n      &amp;quot;@type&amp;quot;: type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz\r\n      grpc_service:\r\n        envoy_grpc:\r\n          cluster_name: auth_service_cluster\r\n      include_peer_certificate: true\r\n      metadata_context_namespaces:\r\n        - envoy.http.filters.mcp&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3600675cd0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This allows external services to make authorization decisions based on the full combination of agent identity, MCP method, tool name, and any other protocol attributes, without the agent or the MCP server needing to be aware of the policy layer.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Protocol-native error responses&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;When Envoy denies a request, the error should be meaningful to the calling agent. For MCP traffic, Envoy can use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;local_reply_config&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to map HTTP error codes to appropriate JSON-RPC error responses. For example, a 403 Forbidden can be mapped to a JSON-RPC response with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;isError: true&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and a human-readable message, ensuring the agent receives a protocol-appropriate denial rather than an opaque HTTP status code.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;3. Envoy supports stateful agent interactions at scale&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Not all agent traffic is stateless. Some protocols, including Streamable HTTP for MCP, can rely on session-oriented behavior. That creates a new challenge for intermediaries, especially when traffic flows through multiple gateway instances to achieve scale and resilience. An MCP session effectively binds the agent to the server that established it, and all intermediaries need to know this to direct incoming MCP connections to the correct server.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If a session is established on one backend, later requests in that conversation need to reach the right destination. That sounds straightforward for a single-proxy deployment, but it becomes more complicated in horizontally scaled systems, where multiple Envoy instances may handle different requests from the same agent.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Passthrough gateway&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;In the simpler passthrough mode, Envoy establishes one upstream connection for each downstream connection. Its primary use is enforcing centralized policies, such as client authorization, RBAC, rate limiting, and authentication, for external MCP servers. The session state transferred between intermediaries needs to include only the address of the server that established the session over the initial HTTP connection, so that all session-related requests are directed to that server.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Session state transfer between different Envoy instances is achieved by appending encoded session state to the MCP session ID provided by the MCP server. Envoy removes the session-state suffix from the session ID before forwarding the request to the destination MCP server. This session stickiness is enabled by configuring Envoy's &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/http/stateful_session/envelope/v3/envelope.proto" rel="noopener" target="_blank"&gt;&lt;code style="text-decoration: underline; vertical-align: baseline;"&gt;envoy.http.stateful_session.envelope&lt;/code&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; extension.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_j0wGyAp.max-1000x1000.png"
        
          alt="4"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Aggregating gateway&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;In aggregating mode, Envoy acts as a single MCP server by aggregating the capabilities, tools, and resources of multiple backend MCP servers. In addition to enforcing policies, this simplifies agent configuration and unifies policy application for multiple MCP servers.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Session management in this mode is more complicated because the session state also needs to include mapping from tools and resources to the server addresses and session IDs that advertised them. The session ID that Envoy provides to the agent is created before tools or resources are known, and the mapping has to be established later, after the MCP initialization phases between Envoy and the backend MCP servers are complete.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One approach, currently implemented in Envoy, is to combine the name of a tool or resource with the identifier and session ID of its origin server. The exact tool or resource names are typically not meaningful to the agent and can carry this additional provenance information. If unmodified tool or resource names are desirable, another approach is to use an Envoy instance that does not have the mapping, and then recreate it by issuing a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;tools/list&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; command before calling a specific tool. This trades latency for the complexity of deploying an external global store of MCP sessions, and is currently in planning based on user feedback.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_61xwM79.max-1000x1000.png"
        
          alt="5"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This matters because it moves Envoy beyond simple traffic forwarding. It allows Envoy to serve as a reliable intermediary for real agent workflows, including those spanning multiple requests, tools, and backends.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;4. Envoy supports agent discovery&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy is adding support for the A2A protocol and agent discovery via a well-known AgentCard endpoint. AgentCard, a JSON document with agent capabilities, enables discovery and multi-agent coordination by advertising skills, authentication requirements, and service endpoints. The AgentCard can be provisioned statically via direct response configuration or obtained from a centralized agent registry server via xDS or ext_proc APIs. A more detailed description of A2A implementation and agent discovery will be published in a forthcoming blog post.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;5. Envoy is a complete solution for agentic networking challenges&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on the same foundation that enabled policy application for MCP protocol in demanding deployments, Envoy is adding support for OpenAI and transcoding of agentic protocols into RESTful HTTP APIs. This transcoding capability simplifies the integration of gen AI agents with existing RESTful applications, with out-of-the-box support for OpenAPI-based applications and custom options via dynamic modules or Wasm extensions. In addition to transcoding, Envoy is being strengthened in critical areas for production readiness, such as advanced policy applications like quota management, comprehensive telemetry adhering to&lt;/span&gt;&lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OpenTelemetry semantic conventions for generative AI systems&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and integrated guardrails for secure agent operation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Guardrails for safe agents&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The next significant area of investment is centralized management and application of guardrails for all agentic traffic. Integrating policy enforcement points with external guardrails presently requires bespoke implementation and this problem area is ripe for standardization.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Control planes make this operational&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The gateway is only part of the story. To achieve this policy management and rollout at scale, a separate control plane is required to dynamically configure the data plane using the xDS protocol, also known as the universal data plane API.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That is where control planes become important. Cloud Service Mesh, alongside open-source projects such as &lt;/span&gt;&lt;a href="https://aigateway.envoyproxy.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Envoy AI Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://github.com/kubernetes-sigs/kube-agentic-networking" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;kube-agentic-networking&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, uses Envoy as the data plane while giving operators higher-level ways to define and manage policy for agentic workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This combination is powerful: Envoy provides the enforcement and extensibility in the traffic path, while control planes provide the operating model teams need to deploy that capability consistently.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Why this matters now&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The shift towards agentic systems and gen AI protocols such as MCP, A2A, and OpenAI necessitates an evolution in network intermediaries. The primary complexities Envoy addresses include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Deep protocol inspection.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Protocol deframing extensions extract policy-relevant attributes (tool names, model names, resource paths) from the body of HTTP requests, enabling precise policy enforcement where traditional proxies would only see an opaque byte stream.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Fine-grained policy enforcement.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By exposing these internal attributes, existing Envoy extensions like RBAC and ext_authz can evaluate policies based on protocol-specific criteria. This allows network operators to enforce a unified, zero-trust security posture, ensuring agents comply with access policies for specific tools or resources.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Stateful transport management.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Envoy supports managing session state for the Streamable HTTP transport used by MCP, enabling robust deployments in both passthrough and aggregating gateway modes, even across a fleet of intermediaries.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
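The first two bullets can be made concrete with a small, self-contained Python sketch. To be clear, this is not Envoy code or any real extension API: it only mimics what a protocol-deframing step plus an RBAC-style check do conceptually, using an invented allowlist and invented principal names.

```python
import json

# Invented allowlist policy, keyed by principal -> permitted tool names.
# Real policies would live in Envoy's RBAC / ext_authz configuration.
ALLOWED_TOOLS = {
    "billing-agent": {"get_invoice", "list_invoices"},
    "support-agent": {"search_kb"},
}

def extract_tool_name(body):
    """Deframe an MCP-style JSON-RPC request and pull out the tool name.

    This is the policy-relevant attribute a protocol-aware proxy can read
    from the request body, where a byte-oriented proxy sees only an opaque
    stream.
    """
    msg = json.loads(body)
    if msg.get("method") == "tools/call":
        return msg.get("params", {}).get("name")
    return None

def authorize(principal, body):
    """RBAC-style decision over the extracted attribute."""
    tool = extract_tool_name(body)
    if tool is None:
        return False  # not a tool call; this sketch denies by default
    return tool in ALLOWED_TOOLS.get(principal, set())

request = json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "get_invoice", "arguments": {"id": "INV-42"}},
}).encode()

print(authorize("billing-agent", request))  # True
print(authorize("support-agent", request))  # False
```

The same pattern generalizes to the other attributes mentioned above, such as model names or resource paths, once a deframing extension has surfaced them.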
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic AI protocols are still in their early stages, and the protocol landscape will continue to evolve. That’s exactly why the networking layer needs to be adaptable. Enterprises should not have to rebuild their security and traffic infrastructure every time a new agent framework, transport pattern, or tool protocol gains traction. They need a foundation that can absorb change without sacrificing control.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy brings together three qualities that are hard to get in one place: proven production maturity, deep extensibility, and growing protocol awareness for agentic workloads. By leveraging Envoy as an agent gateway, organizations can decouple security and policy enforcement from agent development code.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That makes Envoy more than just a proxy that happens to handle AI traffic. It makes Envoy a future-ready foundation for agentic AI networking.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Special thanks to the additional co-authors of this blog: Boteng Yao, Software Engineer, Google; Tianyu Xia, Software Engineer, Google; and Sisira Narayana, Sr. Product Manager, Google.&lt;/span&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 03 Apr 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/networking/the-case-for-envoy-networking-in-the-agentic-ai-era/</guid><category>Containers &amp; Kubernetes</category><category>AI &amp; Machine Learning</category><category>GKE</category><category>Developers &amp; Practitioners</category><category>Networking</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Envoy: A future-ready foundation for agentic AI networking</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/networking/the-case-for-envoy-networking-in-the-agentic-ai-era/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yan Avlasov</name><title>Staff Software Engineer, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Erica Hughberg</name><title>Product and Product Marketing Manager, Tetrate</title><department></department><company></company></author></item><item><title>Activating Your Data Layer for Production-Ready AI</title><link>https://cloud.google.com/blog/topics/developers-practitioners/activating-your-data-layer-for-production-ready-ai/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When discussing applications and systems using generative AI and the new opportunities they present, one component of the ecosystem is irreplaceable: data. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Specifically, the data that companies gather, hold, and use daily. 
This data serves as the backbone for applications, analytics, knowledge bases, and much more. We use databases to store and work with this data, and most, if not all, AI-driven initiatives and new applications are going to use that data layer.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But how can we start to use the data in our AI systems? Let me introduce you to some of the labs showing how to prepare and use the data with AI models in Google databases.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Semantic Search: Text Embeddings in Database&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our journey starts by preparing our data for semantic search and running first tests to augment the Gen AI model's response by grounding it with your semantic search results. The grounding data is the basis for RAG (Retrieval Augmented Generation). Then, you can improve the performance of your search by indexing your embeddings using the latest indexing techniques.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One of the options is the &lt;/span&gt;&lt;a href="https://cloud.google.com/products/alloydb?e=48754805"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google AlloyDB database&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which has direct integration with AI models and supports the most demanding workloads. The following lab guides us through all the steps, starting from creating an AlloyDB cluster, loading sample data, and generating embeddings, to using those embeddings to generate an augmented response from the Gen AI model.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the lab!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36023f6550&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI integration is not limited to AlloyDB. All Google Cloud databases have AI integration and are capable of generating and using embeddings for semantic search. For example, if you are using &lt;/span&gt;&lt;a href="https://cloud.google.com/sql"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud SQL&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, you can also generate and use embeddings for semantic search directly within your existing &lt;/span&gt;&lt;a href="https://cloud.google.com/sql/postgresql"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;PostgreSQL&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or &lt;/span&gt;&lt;a href="https://cloud.google.com/sql/mysql"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MySQL&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; instances.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The next two labs are very similar to the previous one, but instead of Google AlloyDB for PostgreSQL, we are using Cloud SQL for PostgreSQL and Cloud SQL for MySQL to use semantic search as the grounding engine for the model's response. Some steps are of course different due to variations in SQL language and different database engines, but the main idea stays the same: use our data to ground the model response and improve output.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the labs!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36023f6d90&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Semantic search using text data is one of the cornerstones and important features making responses much more reliable and useful, but Google Gen AI models can offer much more. Let's talk about multimodal search.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Multimodal Embeddings: Bring Images to the Search&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In real life, of course, we use all our senses, including vision, to evaluate the world around us. The Google multimodal embedding models bring an additional layer of understanding, improving search by using embeddings not only for text but also for images.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the following lab, we use a catalog of products placed in AlloyDB and supplemented by images in Google Cloud Storage. In the lab, we show how we can use both text descriptions and images for semantic search, supplementing and replacing each other, naturally incorporating search based on image input into our response.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the lab!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36023f6790&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Preparing the data and making first steps are important for a general understanding of RAG and tools available for the search, but Google has other cases when direct AI integration can help with your data analysis without any data preparations.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;AlloyDB AI Functions and Reranking&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google AlloyDB database comes with additional AI integrations that help you use some AI capabilities without data preparation. For example, the AI.IF function can perform semantic search on the fly, evaluating sentiment or comparing data in columns with a natural language query, returning results filtered by the query condition. Also, you can apply a ranking function to the search output, improving the final result. You can try some of the new functionality using the following lab and let us know if it can help in your use case.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the lab!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec6fa4c0&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But what if somebody is not particularly savvy with SQL or not familiar with the data structure in your database? The AlloyDB NL2SQL can help you with that.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Generate SQL using AlloyDB AI Natural Language&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The "alloydb_ai_nl" AlloyDB extension allows you not only to generate SQL queries based on default metadata available out-of-the-box but to build either automatic or custom context, helping to make the best of the query generation. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The NL2SQL functions can add a layer describing your data structure, relations between tables, and metadata based on real data samples from your tables without compromising the data itself, providing necessary information helping the AI model to understand how to build the best query. The following lab helps you to start with the new features and generate your first queries based on your data schema.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the lab!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35ec6fa760&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;From Tests to Production&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Those labs are part of the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;From Data Foundations to Advanced RAG&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; module of  our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/production-ready-ai-with-google-cloud-learning-path"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Production-Ready AI with Google Cloud&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; program. Check the other modules and see if they can help you to adopt the AI capabilities provided by our Google Cloud services and tools. The end game goal is a high quality application using the full potential of modern technologies.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;And stay tuned on release notes for ALloyDB and Cloud SQL - the engineering team is busy working on new features and improvements. Happy testing.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 02 Apr 2026 13:18:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/activating-your-data-layer-for-production-ready-ai/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/hero_new.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Activating Your Data Layer for Production-Ready AI</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/hero_new.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/activating-your-data-layer-for-production-ready-ai/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Gleb Otochkin</name><title>Cloud Advocate, Databases</title><department></department><company></company></author></item><item><title>Create Expert Content: Architect A Personalized Multi-Agent System with Long-Term Memory</title><link>https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In support of our mission to accelerate the developer journey on Google Cloud, we built &lt;strong&gt;Dev Signal&lt;/strong&gt;—a multi-agent system designed to transform raw community signals into reliable technical guidance by automating the path from discovery to expert creation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;first part&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; of this series for the &lt;strong&gt;Dev Signal&lt;/strong&gt;, we laid the essential groundwork for this system by establishing a project environment and equipping core capabilities through the Model Context Protocol (MCP). We standardized our external integrations, connecting to Reddit for trend discovery, Google Cloud Docs for technical grounding, and building a custom Nano Banana Pro MCP server for multimodal image generation. If you missed &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Part 1&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or want to explore the code directly, you can find the complete project implementation in our &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GitHub repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, in Part 2, we focus on building the multi-agent architecture and integrating the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to personalize these capabilities. We will implement a Root Orchestrator that manages three specialist agents: the Reddit Scanner, GCP Expert, and Blog Drafter, to provide a seamless flow from trend discovery to expert content creation. We will also integrate a long-term memory layer that enables the agent to learn from your feedback and persist your stylistic preferences across different conversations. This ensures that Dev Signal doesn't just process data, but actually learns to match your professional voice over time.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Infrastructure and Model Setup&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;First, we initialize the environment and the shared Gemini model.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt; &lt;/code&gt;&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;from google.adk.agents import Agent\r\nfrom google.adk.apps import App\r\nfrom google.adk.models import Gemini\r\nfrom google.adk.tools import google_search, AgentTool, load_memory_tool, preload_memory_tool\r\nfrom google.adk.tools.tool_context import ToolContext\r\nfrom google.genai import types\r\nfrom dev_signal_agent.app_utils.env import init_environment\r\nfrom dev_signal_agent.tools.mcp_config import (\r\n    get_reddit_mcp_toolset, \r\n    get_dk_mcp_toolset, \r\n    get_nano_banana_mcp_toolset\r\n)\r\n\r\nPROJECT_ID, MODEL_LOC, SERVICE_LOC, SECRETS = init_environment()\r\n\r\n\r\nshared_model = Gemini(\r\n    model=&amp;quot;gemini-3-flash-preview&amp;quot;, \r\n    vertexai=True, \r\n    project=PROJECT_ID, \r\n    location=MODEL_LOC,\r\n    retry_options=types.HttpRetryOptions(attempts=3),\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffc7f70&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Memory Ingestion Logic&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We want Dev Signal&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to do more than just follow instructions - we want it to learn from you. By capturing your preferences, such as specific technical interests on Reddit or a preferred blogging style, the agent can personalize its output for future use. To achieve this, we use the &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to persist session history across different conversations.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Long-term Memory&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We automate this through the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;save_session_to_memory_callback&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function. This callback is configured to run automatically after every turn, ensuring that session details are captured and stored in the memory bank without manual intervention.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;How Managed Memory Works:&lt;/span&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Ingestion&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;save_session_to_memory_callback&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; sends the conversation data to Vertex AI.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Embedding&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Vertex AI converts the text into numerical vectors (embeddings) that capture the semantic meaning of your preferences.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Storage&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: These vectors are stored in a managed index, enabling the agent to perform semantic searches and retrieve relevant history in future sessions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Retrieval&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The agent recalls this history using built-in ADK tools. The PreloadMemoryTool proactively brings in context at the start of an interaction, while the LoadMemoryTool allows the agent to fetch specific memories on an as-needed basis.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;async def save_session_to_memory_callback(*args, **kwargs) -&amp;gt; None:\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    Defensive callback to persist session history to the Vertex AI memory bank.\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    ctx = kwargs.get(&amp;quot;callback_context&amp;quot;) or (args[0] if args else None)\r\n    \r\n    # Check connection to Memory Service\r\n    if ctx and hasattr(ctx, &amp;quot;_invocation_context&amp;quot;) and ctx._invocation_context.memory_service:\r\n        # Save the session!\r\n        await ctx._invocation_context.memory_service.add_session_to_memory(\r\n            ctx._invocation_context.session\r\n        )&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffc7d00&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
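The ingestion, embedding, storage, and retrieval cycle listed above can be mimicked with a tiny self-contained stand-in. Real embeddings come from Vertex AI and the real store is the managed memory bank; the bag-of-words vectors and class below exist only to keep the sketch runnable.

```python
import math

# Toy stand-in for the memory bank's ingest -> embed -> store -> retrieve
# cycle. All names and the "embedding" scheme are invented for illustration.
def embed(text):
    """Bag-of-words vector: a crude, local substitute for real embeddings."""
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToyMemoryBank:
    def __init__(self):
        self.entries = []  # list of (vector, original text)

    def ingest(self, text):
        # Ingestion + embedding + storage in one step.
        self.entries.append((embed(text), text))

    def retrieve(self, query, k=1):
        # Semantic retrieval: rank stored memories by similarity to the query.
        qv = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(qv, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

bank = ToyMemoryBank()
bank.ingest("user prefers concise blog posts with code samples")
bank.ingest("user is interested in Cloud Run pricing")
print(bank.retrieve("what blog style does the user prefer?"))
```

Retrieval by meaning rather than exact keywords is what lets the agent surface the right preference in a later, differently worded conversation.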
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Short-term Memory&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;add_info_to_state&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function serves as the agent's short-term working memory, allowing the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gcp_expert&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to reliably hand off its detailed findings to the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;blog_drafter&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; within the same session. This working memory and the conversation transcript are managed by the Vertex AI Session Service to ensure that active context survives server restarts or transient failures.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;The boundary between session-based state and long-term persistence: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;while this service provides stability during an active interaction, this short-term memory does not persist between sessions. Starting a fresh session ID effectively resets this working state, ensuring a clean slate for new tasks. Cross-session continuity, where the agent remembers your stylistic preferences or past feedback, is handled by the Vertex AI Memory Bank.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;def add_info_to_state(tool_context: ToolContext, key: str, data: str) -&amp;gt; dict:\r\n    tool_context.state[key] = data\r\n    return {&amp;quot;status&amp;quot;: &amp;quot;success&amp;quot;, &amp;quot;message&amp;quot;: f&amp;quot;Saved \&amp;#x27;{key}\&amp;#x27; to state.&amp;quot;}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffc7df0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
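The session-versus-memory boundary can be made tangible with a toy model. The plain dictionaries below are not the real Vertex AI Session Service or Memory Bank; they only mimic the two lifetimes: per-session working state disappears with a new session ID, while the long-term store survives.

```python
# Toy stand-in for the two lifetimes described above: `sessions` mimics
# per-session working state, `memory_bank` mimics the cross-session store.
# Names and behavior are illustrative only.
class ToyAgentStores:
    def __init__(self):
        self.sessions = {}      # short-term working state, keyed by session ID
        self.memory_bank = []   # long-term, cross-session store

    def state(self, session_id):
        return self.sessions.setdefault(session_id, {})

    def end_session(self, session_id):
        # Persist a distilled preference, then drop the working state.
        state = self.sessions.pop(session_id, {})
        if "preference" in state:
            self.memory_bank.append(state["preference"])

stores = ToyAgentStores()
stores.state("s1")["preference"] = "casual blog tone"
stores.state("s1")["draft"] = "working draft text"
stores.end_session("s1")

print(stores.state("s2"))   # {} -> a new session ID starts with a clean slate
print(stores.memory_bank)   # ['casual blog tone'] -> the preference persisted
```

Note that the intermediate draft vanished with the session while the distilled preference survived, which is exactly the division of labor between session state and the memory bank.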
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Specialist 1: Reddit Scanner (Discovery)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Reddit scanner is our “Trend Spotter," it identifies high-engagement questions from the last 21 days (3 weeks) to ensure that all research findings remain both timely and relevant.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Memory Usage:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; It leverages &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;load_memory&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to retrieve your past areas of interest and preferred topics from the Vertex AI memory bank. If relevant history exists, the agent prioritizes those specific topics in its search to provide a personalized discovery experience.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond simple retrieval, each sub-agent actively updates its memories by listening for new preferences and explicitly acknowledging them during the chat. This process captures relevant information in the session history, where an automated callback then persists it to the long-term Vertex AI memory bank for future use.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This memory management is supported by two distinct retrieval patterns within the Google Agent Development Kit (ADK). The first is the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;PreloadMemoryTool&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, which proactively brings in historical context at the beginning of every interaction to ensure the agent is fully briefed before addressing the current request. The second is the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;LoadMemoryTool&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, which the agent uses on an as-needed basis, calling upon it only when it decides that deeper past knowledge would be beneficial for the current step in the workflow.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Singleton toolsets\r\nreddit_mcp = get_reddit_mcp_toolset(\r\n    client_id=SECRETS.get(&amp;quot;REDDIT_CLIENT_ID&amp;quot;, &amp;quot;&amp;quot;),\r\n    client_secret=SECRETS.get(&amp;quot;REDDIT_CLIENT_SECRET&amp;quot;, &amp;quot;&amp;quot;),\r\n    user_agent=SECRETS.get(&amp;quot;REDDIT_USER_AGENT&amp;quot;, &amp;quot;&amp;quot;)\r\n)\r\nreddit_scanner = Agent(\r\n    name=&amp;quot;reddit_scanner&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a Reddit research specialist. Your goal is to identify high-engagement questions \r\n    from the last 3 weeks on specific topics of interest, such as AI/agents on Cloud Run.\r\n    \r\n    Follow these steps:\r\n    1. **MEMORY CHECK**: Use `load_memory` to retrieve the user\&amp;#x27;s **past areas of interest** and **preferred topics**. Calibrate your search to align with these interests.\r\n    2. Use the Reddit MCP tools to search for relevant subreddits and posts.\r\n    3. Filter results for posts created within the last 21 days (3 weeks).\r\n    4. Analyze &amp;quot;high-engagement&amp;quot; based on upvote counts and the number of comments.\r\n    5. Recommend the most important and relevant questions for a technical audience.\r\n    6. **CRITICAL**: For each recommended question, provide a direct link to the original thread and a concise summary of the discussion.\r\n    7. **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. 
Explicitly acknowledge them to ensure they are captured in the session history for future personalization.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[reddit_mcp, load_memory_tool.LoadMemoryTool()],\r\n    after_agent_callback=save_session_to_memory_callback,\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffc7b80&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Specialist 2: GCP Expert (Grounding)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The GCP expert is our "The Technical Authority". It triangulates facts by synthesizing official documentation from the Google Cloud Developer Knowledge MCP Server, community sentiment from Reddit, and broader context from Google Search.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;dk_mcp = get_dk_mcp_toolset(api_key=SECRETS.get(&amp;quot;DK_API_KEY&amp;quot;, &amp;quot;&amp;quot;))\r\n\r\n\r\nsearch_agent = Agent(\r\n    name=&amp;quot;search_agent&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;Execute Google Searches and return raw, structured results (Title, Link, Snippet).&amp;quot;,\r\n    tools=[google_search],\r\n)\r\ngcp_expert = Agent(\r\n    name=&amp;quot;gcp_expert&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a Google Cloud Platform (GCP) documentation expert. \r\n    Your goal is to provide accurate, detailed, and cited answers to technical questions by synthesizing official documentation with community insights.\r\n    \r\n    For EVERY technical question, you MUST perform a comprehensive research sweep using ALL available tools:\r\n    \r\n    1. **Official Docs (Grounding)**: Use DeveloperKnowledge MCP (`search_documents`) to find the definitive technical facts.\r\n    2. **Social Media Research (Reddit)**: Use the Reddit MCP to research the question on social media. This allows you to find real-world user discussions, common pain points, or alternative solutions that might not be in official documentation.\r\n    3. 
**Broader Context (Web/Social)**: Use the `search_agent` tool to find recent technical blogs, social media discussions, or tutorials.\r\n    \r\n    Synthesize your answer:\r\n    - Start with the official answer based on GCP docs.\r\n    - Add &amp;quot;Social Media Insights&amp;quot; or &amp;quot;Common Issues&amp;quot; sections derived from Reddit and Web Search findings.\r\n    - **CRITICAL**: After providing your answer, you MUST use the `add_info_to_state` tool to save your full technical response under the key: `technical_research_findings`.\r\n    - Cite your sources specifically at the end of your response, providing **direct links** (URLs) to the official documentation, blog posts, and Reddit threads used.\r\n    - **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[dk_mcp, AgentTool(search_agent), reddit_mcp, add_info_to_state],\r\n    after_agent_callback=save_session_to_memory_callback,\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35fffc7c10&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt; Specialist 3: Blog Drafter (Creativity)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The blog drafter is our Content Creator. It drafts the blog based on the expert's findings and offers to generate visuals.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Memory Usage&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It checks &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;load_memory&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for the user's &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;preferred writing style&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (e.g. "Witty", "Rap") stored in the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;nano_mcp = get_nano_banana_mcp_toolset()\r\n\r\n\r\nblog_drafter = Agent(\r\n    name=&amp;quot;blog_drafter&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a professional technical blogger specializing in Google Cloud Platform. \r\n    Your goal is to draft high-quality blog posts based on technical research provided by the GDE expert and reliable documentation.\r\n    \r\n    You have access to the research findings from the gcp_expert_agent here:\r\n    {{ technical_research_findings }}\r\n \r\n    Follow these steps:\r\n    1. **MEMORY CHECK**: Use `load_memory` to retrieve past blog posts, **areas of interest**, and user feedback on writing style. Adopt the user\&amp;#x27;s preferred style and depth.\r\n    2. **REVIEW &amp;amp; GROUND**: Review the technical research findings provided above. **CRITICAL**: Use the `dk_mcp` (Developer Knowledge) tool to verify key facts, technical limitations, and API details. Ensure every claim in your blog is grounded in official documentation.\r\n    3. Draft a blog post that is engaging, accurate, and helpful for a technical audience.\r\n    4. Include code snippets or architectural diagrams if relevant.\r\n    5. Provide a &amp;quot;Resources&amp;quot; section with links to the official documentation used.\r\n    6. Ensure the tone is professional yet accessible, while adhering to any style preferences found in memory.\r\n    7. **VISUALS**: After presenting the drafted blog post, explicitly ask the user: &amp;quot;Would you like me to generate an infographic-style header image to illustrate these key points?&amp;quot; If they agree, use the `generate_image` tool (Nano Banana).\r\n    8. **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. 
Explicitly acknowledge them to ensure they are captured in the session history for future personalization.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[dk_mcp, load_memory_tool.LoadMemoryTool(), nano_mcp],\r\n    after_agent_callback=save_session_to_memory_callback,\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3600e56190&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Root Orchestrator&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The root agent serves as the system's strategist, managing a team of specialist agents and orchestrating their actions based on the specific goals provided by the user. At the start of a conversation, the orchestrator retrieves memory to establish context by checking for the user's past areas of interest, preferred topics, or previous projects. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;root_agent = Agent(\r\n    name=&amp;quot;root_orchestrator&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a technical content strategist. You manage three specialists:\r\n    1. reddit_scanner: Finds trending questions and high-engagement topics on Reddit.\r\n    2. gcp_expert: Provides technical answers based on official GCP documentation.\r\n    3. blog_drafter: Writes professional blog posts based on technical research.\r\n \r\n    Your responsibilities:\r\n    - **MEMORY CHECK**: At the start of a conversation, use `load_memory` to check if the user has specific **areas of interest**, preferred topics, or past projects. Tailor your suggestions accordingly.\r\n    - **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.\r\n    - If the user wants to find trending topics or questions from Reddit, delegate to reddit_scanner.\r\n    - If the user has a technical question or wants to research a specific theme, delegate to gcp_expert.\r\n    - **CRITICAL**: After the gcp_expert provides an answer, you MUST ask the user: \r\n      &amp;quot;Would you like me to draft a technical blog post based on this answer?&amp;quot;\r\n    - If the user agrees or asks to write a blog, delegate to blog_drafter.\r\n    - Be proactive in helping the user navigate from discovery (Reddit) to research (Docs) to content creation (Blog).\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[load_memory_tool.LoadMemoryTool(), preload_memory_tool.PreloadMemoryTool()],\r\n    after_agent_callback=save_session_to_memory_callback,\r\n    sub_agents=[reddit_scanner, gcp_expert, blog_drafter]\r\n)\r\n\r\napp = App(root_agent=root_agent, name=&amp;quot;dev_signal_agent&amp;quot;)&amp;#x27;), 
(&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3600e56a00&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Summary&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this part of our series, we built multi-agent architecture and implemented a robust, dual-layered memory system. We established a Root Orchestrator, managing three specialist agents: a Reddit Scanner for trend discovery, a GCP Expert for technical grounding, and a Blog Drafter for creative content creation. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By utilizing short-term state to pass information reliably between specialists and integrating the Vertex AI memory bank for long-term persistence, we’ve enabled the agent to learn from your feedback and remember specific writing styles across different conversations. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory"&gt;part 3&lt;/a&gt;, we will show you how to test the agent locally to verify these components on your workstation, before transitioning to a full production deployment on Google Cloud Run in part 4. Can't wait for Part 3? The full implementation is already available for you to explore on &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GitHub&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To learn more about the underlying technology, explore the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Memory Bank overview&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or dive into the official &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-development-kit/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ADK Documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to see how to orchestrate complex multi-agent workflows.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Special thanks to &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;Remigiusz Samborski&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; for the helpful review and feedback on this article.&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more content like this, Follow me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Linkedin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 31 Mar 2026 09:31:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Create Expert Content: Architect A Personalized Multi-Agent System with Long-Term Memory</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Shir Meir Lador</name><title>Head of AI, Product DevRel</title><department></department><company></company></author></item><item><title>Five techniques to reach the efficient frontier of LLM inference</title><link>https://cloud.google.com/blog/topics/developers-practitioners/five-techniques-to-reach-the-efficient-frontier-of-llm-inference/</link><description>&lt;div 
class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Every dollar that you spend on model inference buys you a position on a graph of latency and throughput. On this plot is a curve of optimal configurations, where you've squeezed the maximum possible performance from your hardware. That curve, borrowed from portfolio theory in finance, is the &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Efficient_frontier" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;efficient frontier&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the assumption that you have a fixed budget for hardware, you can trade latency for throughput. But, you can't improve one aspect without sacrificing the other, unless the frontier curve itself moves. There are two fundamentally different dynamics at play, and this is the central insight for anyone running LLMs in production.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The first dynamic is &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;getting to the frontier&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, which involves applying the full stack of techniques available to you today. This part is within your control. &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tensortllm?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Continuous batching&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/best-practices/machine-learning/inference/llm-optimization#model-memory"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;paged attention&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;intelligent routing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs/blog/posts/from-research-to-production-accelerate-oss-llm-with-eagle-3-on-vertex?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;speculative decoding&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/best-practices/machine-learning/inference/llm-optimization#quantization"&gt;&lt;span style="text-decoration: 
underline; vertical-align: baseline;"&gt;quantization&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; all exist right now. If you're not using these techniques, you're operating below the frontier and leaving performance on the table.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The second dynamic is that &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;the frontier itself is constantly moving outward&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. This part is largely outside of your control. Researchers publish new algorithms. Hardware vendors ship new architectures. Open-source projects mature. Each breakthrough redefines what's physically achievable and expands the curve so that yesterday's optimal configuration is today's inefficiency.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Your job as a platform engineer is to stay as close to the frontier as possible as you build infrastructure that's flexible enough to absorb each new advance as it arrives. This article gives you the tools to do just that.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Why inference has an efficient frontier&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Every LLM request has two computational phases, and they can have bottlenecks for different hardware resources.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Prefill (Compute-Bound)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: In this phase, the GPU processes your entire input prompt at one time to build the &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/best-practices/machine-learning/inference/llm-optimization#attention-layer-optimization?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;key-value (KV) cache&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for the attention mechanism. Because the instructions are batch-processed in parallel, the GPU's compute cores (tensor cores) are highly utilized. This phase is fast and efficient: the processors have all of the data that they need, immediately available, to perform massive matrix multiplications. Longer prompts just mean more computations.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Decode (Memory-Bandwidth-Bound)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This phase generates new tokens, one at a time, &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Autoregressive_model" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;autoregressively&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. To generate only one single token, the GPU can't batch the work. It must fetch the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;entire&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; model's weights and the growing KV cache from &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;High-Bandwidth Memory (HBM)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; into the compute cores. Then, the GPU needs to calculate that one token, and then waits to do it all over again for the next one.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This mismatch is the fundamental reason that the frontier exists. You can't optimize a single system for both phases simultaneously without making some tradeoffs.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/prefill-vs-decode.max-1000x1000.jpg"
        
          alt="prefill-vs-decode"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The two axes of inference&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of risk and return, the efficient frontier of LLM inference measures a different fundamental tradeoff, with the assumption that the hardware budget is fixed:&lt;/span&gt;&lt;/p&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table border="1" style="border-collapse: collapse; width: 99.9748%;"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;strong&gt;Axis&lt;/strong&gt;&lt;/td&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;a href="https://bentoml.com/llm/inference-optimization/llm-inference-metrics" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Key metrics measured&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td style="width: 31.5133%;"&gt;&lt;strong&gt;Hardware constraint&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Latency (the X-Axis)&lt;/strong&gt;&lt;/td&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;span style="vertical-align: baseline;"&gt;Time to First Token (TTFT) + Time Between Tokens (TBT)&lt;/span&gt;&lt;/td&gt;
&lt;td style="width: 31.5133%;"&gt;&lt;span style="vertical-align: baseline;"&gt;Compute (prefill) and memory bandwidth (decode)&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Throughput (the Y-Axis)&lt;/strong&gt;&lt;/td&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;span style="vertical-align: baseline;"&gt;Total tokens per second across all concurrent users&lt;/span&gt;&lt;/td&gt;
&lt;td style="width: 31.5133%;"&gt;&lt;span style="vertical-align: baseline;"&gt;Batch size × memory capacity&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cost is the constraint that buys the graph of latency and throughput itself. If you increase your hardware budget, or the industry invents a new algorithmic breakthrough, the entire frontier curve shifts outward. For a given budget and software stack, you can apply today's best practices to move from a sub-optimal point &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;towards&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that frontier.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Getting to the frontier: Five techniques within your control&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Most production inference systems today operate &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;below&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; the frontier. They're leaving performance on the table, not because better techniques don't exist, but because they haven't adopted them yet. Everything described in this section is available today. If you're not applying these techniques, you're choosing to operate below the curve.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/interventions.max-1000x1000.jpg"
        
          alt="interventions"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;1. Semantic routing across model tiers&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Not every query needs a 400B parameter model. Simple classification, summarization, or formatting tasks can be routed to smaller, quantized models that are orders of magnitude cheaper per token. A lightweight classifier at the gateway edge analyzes query complexity and routes accordingly: frontier-class models for hard reasoning, and small models for everything else.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/how-gke-inference-gateway-improved-latency-for-vertex-ai?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Semantic routing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; pushes your system dramatically closer to its theoretical maximum throughput, and avoids wasted cycles on easy tasks, without sacrificing aggregate output quality.&lt;/span&gt;&lt;/p&gt;
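As a sketch of the idea (not a production router), the snippet below routes on a cheap lexical complexity estimate. The tier names, keyword list, and length threshold are all illustrative assumptions; a real gateway would use a small trained classifier:

```python
# Minimal sketch of semantic routing at the gateway edge. Tier names, the
# keyword heuristic, and the length threshold are illustrative assumptions;
# production routers typically use a small trained classifier here.

HARD_REASONING_HINTS = ("prove", "derive", "debug", "architecture", "trade-off", "why")

def route(query: str) -> str:
    """Pick a model tier from a cheap lexical complexity estimate."""
    q = query.lower()
    looks_hard = any(hint in q for hint in HARD_REASONING_HINTS) or len(q.split()) > 60
    return "frontier-model" if looks_hard else "small-quantized-model"

print(route("Summarize this paragraph in one sentence."))
# -> small-quantized-model
print(route("Why does my multi-region architecture deadlock under failover?"))
# -> frontier-model
```

The point is where the decision lives: a few microseconds of classification at the edge can keep a 400B-class model free for the queries that actually need it.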
&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;2. Prefill and decode disaggregation&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Physically separating prefill and decode phases onto different hardware is one of the most architecturally significant optimizations available today.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The prefill phase needs compute-dense GPUs. The decode phase needs high-bandwidth memory. If you force both phases onto the same GPU, then one resource is always underutilized.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To push both phases toward their theoretical hardware limits independently, run dedicated prefill clusters and decode clusters. Connect these clusters with high-speed networks that transfer only the compressed KV cache state to the same GPU, then one resource is always underutilized.&lt;/span&gt;&lt;/p&gt;
&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;3. Quantization: Trading precision for speed&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When you &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/best-practices/machine-learning/inference/llm-optimization#quantization?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;reduce model weights&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; from FP16 to the INT8 or INT4 formats, you can reduce the memory footprint to half or a quarter. Because the decode phase is memory-bandwidth-bound, 4-bit weights can be read up to 4× faster than 16-bit weights. This approach provides a direct TBT improvement.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The tradeoff is quality because naive quantization degrades model outputs. Modern techniques like Activation-aware Weight Quantization (AWQ) and GPTQ preserve the quality of sensitive weights, but aggressively compress others, to achieve near-FP16 quality at INT4 speeds.&lt;/span&gt;&lt;/p&gt;
&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;4. Context routing: The biggest lever that most teams miss&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a production deployment with dozens of model replicas, the&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; routing layer &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;is where the biggest competitive advantages are won or lost today.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In 2026, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/open-models/model-garden-published-notebooks/model_garden_advanced_features#prefix_caching_"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;prefix caching&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is foundational. If ten users ask questions about the exact same 100-page RAG document, or use the identical massive system prompt, you shouldn't run the compute-heavy prefill phase ten times. You should compute the KV cache once, store it, and then let the other nine users reuse it. This approach slashes TTFT by up to 85% and drastically reduces compute costs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But, there's a catch: a standard L4 load balancer scatters requests randomly. If user 2's request lands on a different GPU than user 1's request, the prefix cache is useless. The system has to recompute the cache from scratch.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is why &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;context-aware L7 routing&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; is the differentiator. An intelligent router inspects the incoming prompt's prefix and intentionally routes the request to the specific pod that &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;already holds that context in its cache&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. You stop wasting compute power on redundant work and instantly push your latency and throughput closer to the physical limits of your hardware.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/prefix-aware-routing.max-1000x1000.jpg" alt="prefix-aware-routing"&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;5. Speculative decoding&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Remember: during the decode phase, tensor cores are mostly idle because there's a bottleneck on memory bandwidth. &lt;/span&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs/blog/posts/from-research-to-production-accelerate-oss-llm-with-eagle-3-on-vertex?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Speculative decoding&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; exploits this wasted computation power.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A small, fast "draft" model generates several candidate tokens cheaply. The large target model then &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;verifies&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; all of the candidates in a single forward pass, which is a parallel compute-bound operation, rather than a sequential memory-bound one. If the draft model predicted the candidates correctly, you've generated 4-5 tokens for the memory cost of one.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This approach directly breaks the TBT floor set by memory bandwidth. If you're not using speculative decoding for latency-sensitive workloads, you're not leveraging one of the most impactful optimizations available.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Although the addition of a draft model can introduce some operational complexity and slightly increase compute costs, the draft model is relatively tiny compared to the main model. This tradeoff for latency is worthwhile.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Note that some newer models have introduced &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;self-speculative decoding&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, which eliminates the overhead of managing a second model. These models use specialized internal layers (often called &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;prediction heads&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;) that are trained to predict extra future tokens simultaneously. These models generally achieve a highly meaningful token hit rate.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Case study: How Vertex AI moved closer to the frontier&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Vertex AI engineering team moved closer to the frontier when they adopted &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/how-gke-inference-gateway-improved-latency-for-vertex-ai?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GKE Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which is built on the standard Kubernetes Gateway API. Inference Gateway intercepted requests at Layer 7 and added two critical layers of intelligence:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Load-aware routing&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It scraped real-time metrics (like KV cache utilization and queue depth) directly from the model server's Prometheus endpoints. This process routes requests to the pod that can serve them the fastest.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Content-aware routing&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Crucially, it inspected request prefixes and routed traffic to the pod that already held that specific context in its KV cache. This process avoids expensive re-computation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
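A minimal sketch of the load-aware selection step described above; the metric field names and weights are illustrative assumptions, not the gateway's actual scoring function:

```python
def pick_pod(metrics: dict) -> str:
    """Pick the pod likely to serve a request fastest, given per-pod
    metrics scraped from the model servers (field names illustrative)."""
    def score(m):
        # Blend KV-cache pressure with queue depth (capped and normalized);
        # a real gateway tunes these weights against observed latency.
        return 0.5 * m["kv_cache_utilization"] + 0.5 * min(m["queue_depth"] / 10, 1.0)
    return min(metrics, key=lambda name: score(metrics[name]))

metrics = {
    "pod-a": {"kv_cache_utilization": 0.9, "queue_depth": 8},
    "pod-b": {"kv_cache_utilization": 0.3, "queue_depth": 1},
}
best = pick_pod(metrics)  # "pod-b": the lightly loaded replica wins
```

In practice this score competes with the content-aware signal: a pod with a warm prefix cache may be worth choosing even when it is slightly busier.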
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When the production workloads were migrated to this intelligent routing architecture, the Vertex AI team proved that optimizing the network layer is key to unlocking performance at scale. Validated on production traffic, the results were stark:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;35% faster TTFT&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for Qwen3-Coder (context-heavy coding agent workloads)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;2x better P95 tail latency&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (52% improvement) for DeepSeek V3.1 (bursty chat workloads)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Doubled prefix cache hit rate&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (optimized from 35% to 70%)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The bottom line&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;LLM inference has an efficient frontier, which represents a hard boundary where latency and throughput are optimally balanced for a given compute budget.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Getting to that frontier is within your control&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. The techniques exist today: continuous batching, paged attention, intelligent L7 routing, speculative decoding, quantization, and prefill and decode disaggregation. The GKE Inference Gateway case study shows that routing alone, without changing hardware, models, or cluster size, cut TTFT by 35% and doubled cache efficiency. If you're not applying the full stack, you're operating below the curve and overpaying for every token.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;The frontier itself keeps moving outward&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This part is outside of your control. Researchers publish new algorithms, hardware vendors ship new architectures, and open-source serving frameworks integrate them. What was a cutting-edge optimization 18 months ago is now table stakes. Your job isn't to predict which breakthrough comes next; it's to build infrastructure flexible enough to absorb it when it arrives.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The organizations that will win on inference economics aren't the ones with the most GPUs. They're the ones that systematically close the gap to today's frontier while they stay ready for tomorrow's.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Have you applied any of these optimization techniques to your own LLM inference workloads? I'd love to hear about your experience! Share what you've built with me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/karlweinmeister/" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;LinkedIn&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://x.com/kweinmeister" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;, or &lt;/span&gt;&lt;a href="https://bsky.app/profile/kweinmeister.bsky.social" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;Bluesky&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;!&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 27 Mar 2026 10:02:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/five-techniques-to-reach-the-efficient-frontier-of-llm-inference/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/hero-image.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Five techniques to reach the efficient frontier of LLM inference</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/hero-image.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/five-techniques-to-reach-the-efficient-frontier-of-llm-inference/</url></og><author 
xmlns:author="http://www.w3.org/2005/Atom"><name>Karl Weinmeister</name><title>Director, Developer Relations</title><department></department><company></company></author></item><item><title>The new AI literacy: Insights from student developers</title><link>https://cloud.google.com/blog/topics/developers-practitioners/how-uc-berkeley-students-use-ai-as-a-learning-partner/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI has made it easier than ever for student developers to work efficiently, tackle harder problems, and pursue ambitious projects. But for students earning technical degrees, these new capabilities also create genuine tensions around learning. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;How much should I use AI? What should I use it for? &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As 90% of technology professionals now use AI in their daily work according to &lt;/span&gt;&lt;a href="https://dora.dev/dora-report-2025/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google's DORA 2025 report&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, understanding how the next generation navigates these tools matters more than ever. Contrary to fears that students use AI to cheat or are becoming intellectually lazy, our research with UC Berkeley students reveals something different. Students treated AI as a learning partner rather than a shortcut, using it strategically for some tasks while deliberately turning it off for others. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As AI becomes foundational to software development, the question isn't whether to adopt these tools but how to work with them thoughtfully. The students at UC Berkeley are showing us one answer: with curiosity, caution, and a commitment to genuine learning that technology can support but never replace.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The research&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our team of four student researchers (Andrew Harlan, Mindy Tsai, Kenny Ly, and Karissa Wong) conducted a mixed methods research project with UC Berkeley students in Computer Science, Electrical Engineering, Design, and Data Science to understand how they're integrating AI into their academic work. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A separate UC Berkeley study (conducted by Edward Fraser, Jessie Deng, and Eileen Thai) used eye-tracking technology to observe how developers with one to five years of experience actually interact with AI coding assistants. Both student teams were supported by dedicated mentors, with Googlers Harini Sampath, Becky Sohn, and Derek DeBellis advising the mixed methods research, and UC Berkeley Professor John Chuang, PhD, advising the eye-tracking study.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Together, these studies reveal three key insights about how students balance AI's capabilities with their need to develop genuine expertise. The patterns emerging among students closely mirror what DORA research has found in professional developers.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Finding #1: The 24/7 office hour&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;AI as a tutor, not a shortcut&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When asked to describe their relationship with AI, every student in our study used educational terms. They referred to AI as a "tutor" or "teacher," not an assistant or productivity tool.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"AI is a teacher...in the sense that it is most helpful for understanding dense content and potentially parts of code that are prewritten in the database to allow for fundamental understanding of the project."&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I use [AI] as my own private tutor...to [cover] any specific topics in the classes or lectures...not just in CS classes but in all classes."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This framing matters because it reveals strategic use rather than dependency. Rather than asking AI to complete assignments, students described using AI metacognitively to identify gaps in their knowledge, clarify confusing concepts, and guide their learning process. They used AI to summarize academic papers mentioned in lectures so they could decide which ones warranted deeper reading. They asked AI to explain why their code produced specific errors.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One student explained their workflow:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"When I don't understand what my professor is explaining, I ask AI to help me understand the concept or what a piece of code is doing. If I don't know how to begin a lab, I give the prompt to AI to figure out where to start, then write the code myself and ask AI to correct my work."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For students with learning disabilities, this constant availability addresses a real access gap:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"As a student with a learning disability, I need more time to understand a problem. AI has helped me a lot—it's like having a 24/7 TA."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By extending access beyond limited office hours, AI allows students to iterate on their understanding without waiting for help. This frees up cognitive space for higher-level thinking:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I spend less time actually coding and more time on big picture ideation. Now, my time is spent thinking through logic, concepts, and coming up with ideas creatively, rather than producing code manually."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These accounts portray AI as a scaffold for exploration rather than a producer of finished work. This mirrors what DORA research found: when AI handles routine toil, developers can focus more energy on delivering user value.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Finding #2: Active resistance to overdependence&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Building guardrails to protect learning&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Despite embracing AI as a learning tool, students expressed genuine anxiety about becoming too dependent on it.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"If AI disappeared, I'd struggle more with figuring out how to solve things on my own."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a recent study using EEG to measure brain activity during essay writing, researchers found that AI users showed weaker cognitive engagement patterns compared to those using search engines or no tools, and frequent AI users who later wrote without assistance remembered less of their content and felt less ownership over it, what the authors termed "cognitive debt”.&lt;sup&gt;1&lt;/sup&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our research revealed a positive signal: rather than passively accepting this risk, students responded by establishing deliberate boundaries.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One mechanical engineering student described how she's developed a competency-based system over years of working with electronics: &lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"When I use basic sensors like a servo or ultrasonic, I can still code that myself. But when I have more complex sensors where I don't necessarily know the exact functions, that's when I'll use AI." She explained her reasoning: "I have the background to understand why things aren't working, but I don't always know the direct language to fix it, so AI is good for helping overcome that."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For a recent project building a tactile storytelling tool, she knew the basic concept but needed help structuring the counting and comparison system. "AI was really useful in setting up that structure, but I still had to code after to fine-tune it." She's clear about the division of labor: "I'm still working with doing the code myself. I wouldn't say that I'm just handing it off like a technical expert. I'm working in tandem with it. I have to be the initiator of what I want it to actually do. If I just give it a blind request, it's not useful at all."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Even when students do engage AI, they often set explicit rules:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"Sometimes I tell AI not to give me the full answer, just to guide me in the right direction."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Students have developed several specific strategies to prevent overreliance:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Limiting access to powerful models:&lt;/strong&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I don't want to pay for AI tools because it could lead me to overuse the models."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Alternating between assisted and unassisted work:&lt;/strong&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I have actually gone back to hand-coding for certain things, like a for-loop for example."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Warning against "vibe coding":&lt;/strong&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"AI tools can definitely be a good companion to boost developer productivity. However, one needs to be very mindful and not get used to vibe coding. It's very important to understand and validate the code AI is generating and use it appropriately."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This anxiety is itself metacognitive awareness. Students recognize that the path of least resistance may not be the path of greatest learning. This mirrors DORA's findings: despite 90% adoption, about 30% of practitioners report little to no trust in AI-generated code. Effective AI use requires mastering critical evaluation and verification, not just adoption.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Finding #3: Knowing when to use AI and when to turn it off&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;What the eye-tracking data reveals&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A separate study using eye-tracking technology provides behavioral validation. When researchers observed developers with one to five years of experience interacting with AI coding assistants, they found stark differences in AI engagement depending on task type:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;During interpretive tasks&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; requiring deep understanding: &amp;lt;1% visual attention on AI&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;During mechanical tasks&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; like boilerplate code: 19% visual attention on AI&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Developers actively ignored AI suggestions during complex work, even when those suggestions were accurate and could save time. AI creates cognitive load during deep understanding work, and experienced developers know when to turn it off.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Strategic selectivity, not blanket adoption&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Students in our interviews echoed this context-dependent approach:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I typically use AI to generate ideas for a starting point."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"Despite knowing AI was allowed, I wanted to go through the friction of learning and failing and having space for creativity."&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Customization matters&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Most AI coding assistants now let developers toggle inline suggestions, enable on-demand only modes, or adjust suggestion frequency. By experimenting with these settings, developers can align AI behavior with the cognitive demands of different tasks, reducing disruption during deep work while maintaining assistance for routine tasks.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;What this means for the industry&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Students are modeling the future of AI-augmented development&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The students in these studies are ahead of the curve. They've developed a literacy that knows when to engage AI, how to verify its output, and when to work manually to preserve understanding. For teams navigating AI adoption, the student experience offers direction:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Experiment with customization&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to find configurations that support rather than disrupt work&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Build verification practices&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; into workflows rather than accepting suggestions uncritically&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Create space for unassisted work&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; on complex problems where understanding matters more than speed&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As AI becomes foundational to software development, the question isn't whether to adopt these tools but how to work with them thoughtfully. The students at UC Berkeley are showing us one answer: with curiosity, caution, and a commitment to genuine learning that technology can support but never replace.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To learn more about how professionals across the industry are navigating AI adoption, &lt;/span&gt;&lt;a href="https://dora.dev/dora-report-2025/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;download the DORA 2025 State of AI-assisted Software Development Report&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. You can also &lt;/span&gt;&lt;a href="https://dora.dev/insights/tags/uc-berkeley/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;read the full research articles from our collaboration with researchers at UC Berkeley.&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;1. Kosmyna, Nataliya, et al. "Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task." &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;arXiv&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, 10 June 2025, doi:10.48550/arXiv.2506.08872. Accessed 28 Jan. 2026.&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 26 Mar 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/how-uc-berkeley-students-use-ai-as-a-learning-partner/</guid><category>AI &amp; Machine Learning</category><category>Developers &amp; Practitioners</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>The new AI literacy: Insights from student developers</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/how-uc-berkeley-students-use-ai-as-a-learning-partner/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Andrew Harlan, Ph.D.</name><title>UX Researcher &amp; Creative Technologist, Independent</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Steve Fadden, Ph.D.</name><title>UX Research Lead, Google</title><department></department><company></company></author></item><item><title>Building Distributed AI Agents</title><link>https://cloud.google.com/blog/topics/developers-practitioners/building-distributed-ai-agents/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let's be honest: building an AI agent that works &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;once&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; is easy. 
Building an AI agent that works &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;reliably&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; in production, integrated with your existing React or Node.js application? That's a whole different ball game.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;(TL;DR: Want to jump straight to the code? Check out the &lt;/span&gt;&lt;a href="https://github.com/amitkmaraj/course-creation-ai-agent-architecture" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;Course Creator Agent Architecture on GitHub&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;.)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We've all been there. You have a complex workflow—maybe it's researching a topic, generating content, and then grading it. You shove it all into one massive Python script or a giant prompt. It works on your machine, but the moment you try to hook it up to your sleek frontend, things get messy. Latency spikes, debugging becomes a nightmare, and scaling is impossible without duplicating the entire monolith.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But what if you didn't have to rewrite your entire application to accommodate AI? What if you could just... plug it in?&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this post, we're going to explore a better way: the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;orchestrator pattern&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. Instead of just one powerful agent that does everything, we'll build a team of specialized, distributed microservices. This approach lets you integrate powerful AI capabilities directly into your existing frontend applications without the headache of a monolithic rewrite.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We'll use Google's &lt;/span&gt;&lt;a href="https://github.com/google/adk-python" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Development Kit (ADK)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to build the agents, the &lt;/span&gt;&lt;a href="https://a2a-protocol.org" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent-to-Agent (A2A)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; protocol to connect them and let them communicate with each other, and deploy them as scalable microservices on &lt;/span&gt;&lt;a href="https://cloud.google.com/run"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Why Distributed Agents? (And Why Your Frontend Team Will Love You)&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Imagine you have a polished Next.js application. You want to add a "Course Creator" feature.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you build a monolithic agent, your frontend has to wait for a single, long-running process to finish everything. If the research part hangs, the whole request times out. Additionally, you won’t have the opportunity to scale separate agents as needed. For example, if your judge agent requires more processing, you’ll have to scale &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;all&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; your agents up, instead of just the judge agent.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By adopting a distributed orchestrator pattern, you gain scalability and flexibility:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Seamless integration:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Your frontend talks to one endpoint (the orchestrator), which manages the chaos behind the scenes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Independent scaling:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Is the judge step slow? Scale just that service to 100 instances. Your research service can stay small.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Modularity:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; You can write the high-performance networking parts in Go and the data science parts in Python. They just speak HTTP.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The Blueprint: Course Creator App&lt;/span&gt;&lt;/h2&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/building-distributed-ai-agents-course-creator.gif"
        
          alt="building-distributed-ai-agents-course-creator"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let's build that course creator system. We'll break it down into three distinct specialists:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The researcher&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A specialist that digs up information.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The judge&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A QA specialist that ensures quality.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The orchestrator&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The manager that coordinates the work and talks to your frontend.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 1: Hiring the Specialist (The Researcher)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;First, we need someone to do the legwork. We'll build a focused agent using ADK whose only job is to use Google Search.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# researcher/app/agent.py\r\nfrom google.adk.agents import Agent\r\nfrom google.adk.tools import google_search\r\n\r\nresearcher = Agent(\r\n    name=&amp;quot;researcher&amp;quot;,\r\n    model=&amp;quot;gemini-2.5-flash&amp;quot;,\r\n    description=&amp;quot;Gathers information on a topic using Google Search.&amp;quot;,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are an expert researcher. Your goal is to find comprehensive information.\r\n    Use the `google_search` tool to find relevant information.\r\n    Summarize your findings clearly.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[google_search],\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f360075e8e0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;See? Simple. It doesn't know about courses or frontends. It just researches.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 2: The Judge (Structured Output)&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-judge.max-1000x1000.png"
        
          alt="building-distributed-ai-agents-judge"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We can't have our agents rambling. We need strict pass or fail grades so our code can make decisions. We use &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Pydantic&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to enforce this contract.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# judge/app/agent.py\r\nfrom pydantic import BaseModel, Field\r\nfrom typing import Literal\r\n\r\nclass JudgeFeedback(BaseModel):\r\n    status: Literal[&amp;quot;pass&amp;quot;, &amp;quot;fail&amp;quot;] = Field(\r\n        description=&amp;quot;Whether the research is sufficient (\&amp;#x27;pass\&amp;#x27;) or needs more work (\&amp;#x27;fail\&amp;#x27;).&amp;quot;\r\n    )\r\n    feedback: str = Field(\r\n        description=&amp;quot;Detailed feedback on what is missing.&amp;quot;\r\n    )\r\n\r\njudge = Agent(\r\n    name=&amp;quot;judge&amp;quot;,\r\n    model=&amp;quot;gemini-2.5-flash&amp;quot;,\r\n    description=&amp;quot;Evaluates research findings.&amp;quot;,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a strict editor. Evaluate the findings.\r\n    If they are missing key info, output status=\&amp;#x27;fail\&amp;#x27; and provide feedback.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    output_schema=JudgeFeedback, # Enforce the contract!\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f360075e1c0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, when the judge speaks, it speaks JSON. Your application logic can trust it.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 3: The Universal Language (A2A Protocol)&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-a2a-protoco.max-1000x1000.png"
        
          alt="building-distributed-ai-agents-a2a-protocol"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here's the magic. We wrap these agents as web services using the &lt;/span&gt;&lt;a href="https://a2a-protocol.org" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A2A Protocol&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Think of it as a universal language for agents. It lets them describe what they do (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;agent.json&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;) and talk over standard HTTP.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# researcher/app/server.py\r\nfrom fastapi import FastAPI\r\nfrom a2a.server.apps import A2AFastAPIApplication\r\nfrom app.agent import app as adk_app\r\n\r\n# ... setup runner ...\r\n\r\n# Create the A2A App wrapper\r\na2a_app = A2AFastAPIApplication(agent_card=agent_card, http_handler=request_handler)\r\n\r\napp = FastAPI(lifespan=lifespan)\r\n\r\n# Register routes: /.well-known/agent.json and /rpc\r\na2a_app.add_routes_to_app(app)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f360075ef70&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, your researcher is a microservice running on port 8000. It's ready to be called by anyone—including your orchestrator.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 4: The Orchestrator Pattern&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-orchestrato.max-1000x1000.png"
        
          alt="building-distributed-ai-agents-orchestrator"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is where it all comes together. The &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;orchestrator&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is the general contractor. It doesn't do the research; it hires the researcher. It doesn't make judgments; it asks the judge.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Crucially, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;this is the only agent your frontend needs to know about&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# orchestrator/app/agent.py\r\nfrom google.adk.agents import LoopAgent, SequentialAgent\r\nfrom google.adk.agents.remote_a2a_agent import RemoteA2aAgent\r\n\r\n# Connect to the remote Researcher service\r\nresearcher = RemoteA2aAgent(\r\n    name=&amp;quot;researcher&amp;quot;,\r\n    agent_card=&amp;quot;http://researcher-service:8000/.well-known/agent.json&amp;quot;,\r\n    description=&amp;quot;Gathers information on a topic.&amp;quot;\r\n)\r\n\r\n# Connect to the remote Judge service\r\njudge = RemoteA2aAgent(\r\n    name=&amp;quot;judge&amp;quot;,\r\n    agent_card=&amp;quot;http://judge-service:8000/.well-known/agent.json&amp;quot;,\r\n    description=&amp;quot;Evaluates research findings.&amp;quot;\r\n)\r\n\r\n# The Orchestrator manages the loop\r\nresearch_loop = LoopAgent(\r\n    name=&amp;quot;research_loop&amp;quot;,\r\n    sub_agents=[researcher, judge, escalation_checker],\r\n    max_iterations=3,\r\n)\r\n\r\n# The full pipeline\r\nroot_agent = SequentialAgent(\r\n    name=&amp;quot;course_creation_pipeline&amp;quot;,\r\n    sub_agents=[research_loop, content_builder],\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f360075ea30&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The orchestrator handles the complexity—retries, loops, state management—so your frontend stays clean and simple.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Deployment: The "Grocery Store" Model&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Deploying this system on Cloud Run gives you what I call the "grocery store" model. If the checkout lines (researcher tasks) get long, you don't build a new store. You just open more registers. Cloud Run scales your researcher service independently to handle the load, while your judge service stays lean.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Caveats &amp;amp; Security Considerations&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Of course, with great power comes great responsibility (and security reviews).&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Authentication&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: In this demo, agents talk over open HTTP. In production, you &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;must&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; lock this down. Use mTLS, OIDC, or API keys to ensure that only your orchestrator can talk to your researcher.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Latency&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Every hop adds time. Use this pattern for coarse-grained tasks (like "research this topic") rather than chatty, low-level interactions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Error handling&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Networks fail. Your orchestrator needs to be robust enough to handle timeouts and retries gracefully.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
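To make the "networks fail" point concrete, here's a minimal retry-with-backoff wrapper of the kind an orchestrator would put around each remote agent call. This is a generic sketch, not an ADK API; names like `call_with_retries` and `flaky_agent` are illustrative:

```python
import time

def call_with_retries(fn, attempts=3, base_delay=0.1):
    """Call fn, retrying on connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            time.sleep(base_delay * 2 ** attempt)

# Simulate a remote agent that fails twice, then succeeds.
calls = {"count": 0}
def flaky_agent():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network blip")
    return "research findings"

print(call_with_retries(flaky_agent))  # → research findings
```

In production you'd likely also cap total elapsed time and add jitter, but the shape is the same: the orchestrator absorbs transient failures so the frontend never sees them.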
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to Build?&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Stop trying to build one giant agent that does it all. By using the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;orchestrator pattern&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and distributed microservices, you can build AI systems that are scalable, maintainable, and—best of all—play nicely with the apps that you already have.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Want to see the code? Check out the full &lt;/span&gt;&lt;a href="https://github.com/amitkmaraj/course-creation-ai-agent-architecture" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Course Creator Agent Architecture on GitHub&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;And if you're ready to deploy, get started with &lt;/span&gt;&lt;a href="https://cloud.google.com/run"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://github.com/google/adk-python" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ADK&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://a2a-protocol.org" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A2A&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to bring your agent team to life.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 18 Mar 2026 19:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/building-distributed-ai-agents/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-hero.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Building Distributed AI Agents</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-hero.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/building-distributed-ai-agents/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amit Maraj</name><title>AI Developer Relations Engineer</title><department></department><company></company></author></item><item><title>Create Expert Content: Building Capabilities for a Multi-Agent System with Google ADK, MCP, and Cloud 
Run</title><link>https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;My team’s mission is to accelerate the developer journey from writing code to running secure AI workloads on Google Cloud. To help developers succeed, we focus on identifying their most pressing questions and building demos that provide straightforward, easy-to-implement solutions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Recently, I was struck with inspiration when the new &lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Developer Knowledge MCP server&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; was released. It led me to build &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;—a multi-agent system designed with &lt;/span&gt;&lt;a href="https://github.com/google/adk-python" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Agent Development Kit (ADK)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;—to identify technical questions from Reddit, research them using official documentation, and draft detailed technical blogs. 
&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; also provides custom visuals using &lt;/span&gt;&lt;a href="https://blog.google/innovation-and-ai/products/nano-banana-pro/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Nano Banana Pro&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. I even integrated a long-term &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;memory&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; layer so the agent remembers my specific preferences and blogging style.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By connecting my coding assistant, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini/docs/codeassist/gemini-cli?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini CLI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, to the developer knowledge MCP server, I built and deployed this entire system to &lt;/span&gt;&lt;a href="https://cloud.google.com/run/docs?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in just two days.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Whether you want to learn how to architect a complex multi-agent system with long term memory, leverage local and remote MCP servers for tool standardization, or write detailed Terraform scripts for secure Cloud Run deployment, I'll show you how!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you’d rather dive straight into the code and explore it at your own pace, you can clone the repository &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=abZxJiXGrJs"
      data-glue-modal-trigger="uni-modal-abZxJiXGrJs-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        &lt;img src="//img.youtube.com/vi/abZxJiXGrJs/maxresdefault.jpg"
             alt="A YouTube video that walks through a demo to set up the Dev Signal system"/&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal-abZxJiXGrJs-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="abZxJiXGrJs"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=abZxJiXGrJs"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;What you'll learn&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;In this four-part blog series, I’ll walk you through the step-by-step process of how I brought this project to life. &lt;/span&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Each blog post captures the journey of building and deploying &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Part 1: Tools for building agent capabilities &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;– You’ll begin by setting up your project environment and equipping your agent with tools using the Model Context Protocol (MCP). You’ll learn how to connect to Reddit for trend discovery, Google Cloud docs for technical grounding, and a custom Nano Banana Pro tool for image generation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Part 2: The Multi-Agent Architecture with long term memory &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;– You’ll build the "brain" of the system by implementing a root orchestrator and a team of specialized agents. You’ll also integrate the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, enabling the agent to learn and persist your preferences across sessions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Part 3: Testing the agent Locally&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; – Before moving to the cloud, you’ll synchronize the agent's components and verify its performance on your workstation. You’ll use a dedicated test runner to simulate the full lifecycle of discovery, research, and multimodal creation, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;with a special focus on validating long-term memory persistence by connecting your local agent directly to the cloud-based Vertex AI memory bank.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Part 4: Deployment to Cloud Run and the Path to Production &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;– Finally, you’ll deploy your service on Google Cloud Run using Terraform for reproducible infrastructure. You’ll also discuss the next steps required for a high quality secure production system.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Getting started with Dev Signal&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is an intelligent monitoring agent designed to filter noise and create value. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; operates in the following ways:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Discovery&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Scouts Reddit for high-engagement technical questions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Grounding&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Researches answers using official Google Cloud documentation to ensure accuracy.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Creation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Drafts professional technical blog posts based on its findings.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Multimodal Generation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Generates custom infographic headers for those posts.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Long-Term Memory&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Uses &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to remember your feedback across different sessions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
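The first four stages can be pictured as a simple pipeline, where each stage hands its output to the next (memory persists across sessions rather than being a pipeline step). The sketch below is purely illustrative: the function and stage placeholders are hypothetical, not the actual ADK orchestration.

```python
# Purely illustrative pipeline sketch: stage names mirror the workflow
# above, but the lambdas are hypothetical placeholders, not ADK code.
def run_pipeline(stages, seed):
    """Feed each stage's output into the next and return the final result."""
    result = seed
    for stage in stages:
        result = stage(result)
    return result

stages = [
    lambda q: {"question": q},                                  # 1. Discovery
    lambda d: {**d, "sources": ["cloud.google.com/docs"]},      # 2. Grounding
    lambda d: {**d, "draft": f"Post about: {d['question']}"},   # 3. Creation
    lambda d: {**d, "header_image": "gs://bucket/header.png"},  # 4. Multimodal
]
post = run_pipeline(stages, "How do I tune Cloud Run concurrency?")
```

In the real system each stage is an agent or tool call, but the shape of the data flow is the same.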
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Prerequisites&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before you begin, verify the following is installed: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Python 3.12+&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;uv&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (Python package manager): &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;curl -LsSf https://astral.sh/uv/install.sh | sh&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/sdk/docs/install?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud SDK&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gcloud&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; CLI) installed and authenticated.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://developer.hashicorp.com/terraform/install" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Terraform&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (for infrastructure as code).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.npmjs.com/downloading-and-installing-node-js-and-npm" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Node.js &amp;amp; npm&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (required for the Reddit MCP tool).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
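Before moving on, it can help to confirm these CLIs are actually on your PATH. This is an optional convenience check, not part of the official setup:

```shell
# Optional sanity check: confirm the required CLIs are on PATH.
missing=""
for tool in python3 uv gcloud terraform node npm; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -n "$missing" ]; then
  echo "Missing tools:$missing" >&2
else
  echo "All prerequisites found."
fi
```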
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You will also need:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;A &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/resource-manager/docs/creating-managing-projects?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Project&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with billing enabled.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/endpoints/docs/openapi/enable-api?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;&lt;strong&gt;APIs Enabled&lt;/strong&gt;&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: Vertex AI, Cloud Run, Secret Manager, Artifact Registry.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Reddit API Credentials&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (Client ID, Secret) - You can get these from the &lt;/span&gt;&lt;a href="https://www.reddit.com/prefs/apps" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Reddit Developer Portal&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Developer Knowledge API Key&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (for Google Cloud docs search) - Instructions on how to get it are &lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
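The required APIs can be enabled from the command line. A sketch (the service IDs below are the standard ones for these products, but verify them against your project's needs; the command requires an authenticated gcloud session):

```shell
# Service IDs for the APIs listed above.
SERVICES="aiplatform.googleapis.com run.googleapis.com secretmanager.googleapis.com artifactregistry.googleapis.com"

# Enable them all in one call; skip gracefully if the SDK is not installed here.
command -v gcloud >/dev/null 2>&1 \
  && gcloud services enable $SERVICES \
  || echo "gcloud not found; run this on a machine with the Google Cloud SDK installed"
```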
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Project Setup&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; system was built by first running the&lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/agent-starter-pack" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; Agent Starter Pack,&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; following the automated architect workflow described in the &lt;/span&gt;&lt;a href="https://www.youtube.com/watch?v=XCGbDx7aSks" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt;Agent Factory episode&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; by &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Remigiusz Samborski&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/vkolesnikov/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vlad Kolesnikov&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This foundation provided the project’s modular directory structure, which is used to separate concerns between Agent Logic, Server Code, Utilities, and Tools.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The starter pack acts as a powerful starting point because it automates the creation of professional infrastructure, CI/CD pipelines, and observability tools in seconds. This allows you to focus entirely on the agent’s unique intelligence while ensuring the underlying platform remains secure and scalable. By building on top of this generated boilerplate with AI assistance from &lt;/span&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemini-cli-open-source-ai-agent/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini CLI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://antigravity.google/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Antigravity&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, the development process is highly accelerated. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The agent starter pack high level architecture:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/agentstarterpack.max-1000x1000.png"
            alt="agentstarterpack"&gt;
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;1. Initialize the Project&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a new directory for your project and initialize it. We'll use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, which is an extremely fast Python package manager.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;uv init dev-signal&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca546a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;2. Folder Structure&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our project will follow this structure. We will populate these files step-by-step.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;dev-signal/\r\n├── dev_signal_agent/\r\n│   ├── __init__.py\r\n│   ├── agent.py           # Agent logic &amp;amp; orchestration\r\n│   ├── fast_api_app.py    # Application server &amp;amp; memory connection\r\n│   ├── app_utils/         # Env Config\r\n│   │   └── env.py\r\n│   └── tools/             # External capabilities\r\n│       ├── __init__.py\r\n│       ├── mcp_config.py  # Tool configuration (Reddit, Docs)\r\n│       └── nano_banana_mcp/# Custom local image generation tool\r\n│           ├── __init__.py\r\n│           ├── main.py\r\n│           ├── nano_banana_pro.py\r\n│           ├── media_models.py\r\n│           ├── storage_utils.py\r\n│           └── requirements.txt\r\n├── deployment/\r\n│   └── terraform/         # Infrastructure as Code\r\n├── .env                   # Local secrets (API keys)\r\n├── Makefile               # Shortcuts for building/deploying\r\n├── Dockerfile             # Container definition\r\n└── pyproject.toml         # Dependencies&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca543a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;3. Define Dependencies&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Update your &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;pyproject.toml&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; with the necessary dependencies. We use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google-adk&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for the agent framework and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google-genai&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for the model interaction.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;[project]\r\nname = &amp;quot;dev-signal&amp;quot;\r\nversion = &amp;quot;0.1.0&amp;quot;\r\ndescription = &amp;quot;A multi-agent system for monitoring and content creation.&amp;quot;\r\nreadme = &amp;quot;README.md&amp;quot;\r\nrequires-python = &amp;quot;&amp;gt;=3.12, &amp;lt;3.14&amp;quot;\r\ndependencies = [\r\n     &amp;quot;google-adk&amp;gt;=0.1.0&amp;quot;,\r\n    \xa0&amp;quot;google-genai&amp;gt;=1.0.0&amp;quot;,\r\n     &amp;quot;mcp&amp;gt;=1.0.0&amp;quot;,\r\n    \xa0&amp;quot;python-dotenv&amp;gt;=1.0.0&amp;quot;,\r\n     &amp;quot;fastapi&amp;gt;=0.110.0&amp;quot;,\r\n     &amp;quot;uvicorn&amp;gt;=0.29.0&amp;quot;,\r\n     &amp;quot;google-cloud-logging&amp;gt;=3.0.0&amp;quot;,\r\n     &amp;quot;google-cloud-aiplatform&amp;gt;=1.38.0&amp;quot;,\r\n    \xa0&amp;quot;fastmcp&amp;gt;=2.13.0&amp;quot;,\r\n     &amp;quot;google-cloud-storage&amp;gt;=3.6.0&amp;quot;,\r\n     &amp;quot;google-auth&amp;gt;=2.0.0&amp;quot;,\r\n     &amp;quot;google-cloud-secret-manager&amp;gt;=2.26.0&amp;quot;,\r\n]&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca54fa0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Run &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv sync&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to install everything.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a new directory for the agent code.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;mkdir dev_signal_agent\r\ncd dev_signal_agent&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca54eb0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Building the agent capabilities: MCP tools &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our agent needs to interact with the outside world. We use the &lt;/span&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Model Context Protocol (MCP)&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to standardize this. The &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Model Context Protocol (MCP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is a universal standard for connecting AI agents to external data and tools. Instead of writing custom API wrappers, we use standard MCP servers. This allows us to connect to APIs (Reddit), Knowledge Bases (Google Cloud Docs), and even local scripts (Image Generation using Nano Banana Pro) using a common interface. Create a new directory for the agent tools.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;mkdir tools\r\ncd tools&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca540d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
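To see what MCP standardization means at the wire level: MCP messages are JSON-RPC 2.0, and a tool invocation travels as a `tools/call` request. The sketch below is illustrative only; the `make_tool_call` helper and the tool name are hypothetical, and the ADK's `McpToolset` performs this framing for you.

```python
import json

# Illustrative only: what an MCP tool invocation looks like on the wire.
# MCP messages are JSON-RPC 2.0; a tool call uses the "tools/call" method.
def make_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    msg = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(msg)

# Hypothetical tool name, for illustration.
wire_message = make_tool_call(1, "search_reddit", {"query": "cloud run cold starts"})
```

Because every tool speaks this same envelope, the agent can swap a remote HTTP server for a local subprocess without changing its own logic.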
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Tools Configuration&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We'll define our toolsets in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This file defines the connection parameters for our three main tools.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Reddit&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Connected via a local stdio subprocess.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Developer Knowledge&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Connected via a remote HTTP endpoint.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Nano Banana&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Connected via a local stdio subprocess (our custom Python script).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Reddit Search (Discovery Tool)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://github.com/Arindam200/reddit-mcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Reddit MCP server &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;acts as a bridge to the Reddit API, allowing your agent to discover trending posts and analyze engagement without you having to write complex API wrappers. To ensure portability, the code uses a "find or fetch" strategy: it first checks for a local installation and, if missing, automatically uses &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;npx&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to download and run the server on demand.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of a network connection, the agent launches the server as a local subprocess and communicates via standard input and output (stdio). Within the Google ADK, the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;McpToolset&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; class acts as a universal wrapper that standardizes these connections, enabling your agent to interact with various tools, from community resources to custom scripts like the Nano Banana image generator, using a common interface. By securely passing API credentials through environment variables, the system ensures these "plug-and-play" modules function as a seamless bridge between the AI and external platforms.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py:&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import os\r\nimport shutil\r\nfrom mcp import StdioServerParameters\r\nfrom google.adk.tools import McpToolset\r\nfrom google.adk.tools.mcp_tool import StreamableHTTPConnectionParams, StdioConnectionParams\r\n\r\ndef get_reddit_mcp_toolset(client_id: str = &amp;quot;&amp;quot;, client_secret: str = &amp;quot;&amp;quot;, user_agent: str = &amp;quot;&amp;quot;):\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    Connects to the Reddit MCP server.\r\n    This server runs as a local subprocess (stdio) and proxies requests to the Reddit API.\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    # Check if \&amp;#x27;reddit-mcp\&amp;#x27; is installed globally, otherwise use npx to run it\r\n    cmd = &amp;quot;reddit-mcp&amp;quot; if shutil.which(&amp;quot;reddit-mcp&amp;quot;) else &amp;quot;npx&amp;quot;\r\n    args = [] if shutil.which(&amp;quot;reddit-mcp&amp;quot;) else [&amp;quot;-y&amp;quot;, &amp;quot;--quiet&amp;quot;, &amp;quot;reddit-mcp&amp;quot;]\r\n    \r\n    # Inject secrets into the environment of the subprocess only\r\n    env = {\r\n        **os.environ, \r\n        &amp;quot;DOTENV_CONFIG_SILENT&amp;quot;: &amp;quot;true&amp;quot;, \r\n        &amp;quot;LANG&amp;quot;: &amp;quot;en_US.UTF-8&amp;quot;\r\n    }\r\n\r\n    if client_id: env[&amp;quot;REDDIT_CLIENT_ID&amp;quot;] = client_id\r\n    if client_secret: env[&amp;quot;REDDIT_CLIENT_SECRET&amp;quot;] = client_secret\r\n    if user_agent: env[&amp;quot;REDDIT_USER_AGENT&amp;quot;] = user_agent\r\n\r\n    return McpToolset(\r\n        connection_params=StdioConnectionParams(\r\n            server_params=StdioServerParameters(\r\n                command=cmd, \r\n                args=args, \r\n                env=env # Pass injected secrets directly to the subprocess\r\n            ),\r\n            timeout=120.0\r\n        )\r\n    )&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), 
(&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca54400&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud Docs (Knowledge Tool)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Developer Knowledge MCP server provides&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; grounding for your agent by allowing it to search the entire corpus of official Google Cloud documentation. Unlike the local Reddit server, this is a managed service hosted by Google and accessed as a remote endpoint over the internet. It exposes specialized tools like &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google_developer_documentation_search&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for semantic queries and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google_developer_documentation_fetch&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to retrieve full markdown content, ensuring that every technical claim the agent makes is supported by definitive, up-to-date facts. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Note:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; you can also connect your coding assistant tools such as &lt;/span&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemini-cli-open-source-ai-agent/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini CLI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or &lt;/span&gt;&lt;a href="https://antigravity.google/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Antigravity&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to the developer knowledge MCP server to empower them with handy up to date Google Cloud documentation. I used it when writing this blog!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To connect, the agent uses the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;McpToolset&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; class with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;StreamableHTTPConnectionParams&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, pointing to a web URL instead of launching a local process. It securely authenticates using a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;DK_API_KEY&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;create your api key&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) passed in the request headers, allowing the agent to perform a "comprehensive research sweep" across official docs, community sentiment, and broader web context through a single standardized interface. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py:&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;def get_dk_mcp_toolset(api_key: str = &amp;quot;&amp;quot;):\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    Connects to Developer Knowledge (Google Cloud Docs).\r\n    This is a remote MCP server accessed via HTTP.\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    headers = {}\r\n    if api_key:\r\n        headers[&amp;quot;X-Goog-Api-Key&amp;quot;] = api_key\r\n    else:\r\n        # Fallback to os.environ for local testing if not passed via API\r\n        headers[&amp;quot;X-Goog-Api-Key&amp;quot;] = os.getenv(&amp;quot;DK_API_KEY&amp;quot;, &amp;quot;&amp;quot;)\r\n\r\n    return McpToolset(\r\n        connection_params=StreamableHTTPConnectionParams(\r\n            url=&amp;quot;https://developerknowledge.googleapis.com/mcp&amp;quot;,\r\n            headers=headers\r\n        )\r\n    )&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca54190&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
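Under the hood, the header-based authentication amounts to attaching `X-Goog-Api-Key` to each HTTP request. A plain-stdlib sketch of such a request (the `build_mcp_request` helper is hypothetical; `StreamableHTTPConnectionParams` manages this internally, and nothing is actually sent here):

```python
import json
import urllib.request

# Hypothetical sketch of the authenticated request the toolset makes.
# We only construct the request object; no network call is performed.
def build_mcp_request(api_key: str, method: str, params: dict) -> urllib.request.Request:
    body = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    ).encode("utf-8")
    return urllib.request.Request(
        "https://developerknowledge.googleapis.com/mcp",
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Goog-Api-Key": api_key,  # the DK_API_KEY from your .env
        },
        method="POST",
    )

req = build_mcp_request("YOUR_DK_API_KEY", "tools/list", {})
```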
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Image Generator (Nano Banana MCP)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While we've used external MCP servers for Reddit and documentation, we can also build our own custom MCP server to wrap specific Python logic. In this case, we are creating an image generation tool powered by Gemini 3 Pro Image (also known as Nano Banana Pro). This demonstrates that any Python function can be standardized into a tool that any agent can understand.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;How the image generation works:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://gofastmcp.com/getting-started/welcome" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;FastMCP&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: We use the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;fastmcp&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; library to drastically simplify server creation, allowing us to register Python functions as tools with just a few lines of code.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Gemini Integration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The server uses the Google GenAI SDK to call the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gemini-3-pro-image-preview&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; model, which converts the agent's descriptive prompts into raw image bytes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GCS Upload &amp;amp; Hosting:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Because agent interfaces typically require a URL to display images, the server automatically uploads the generated bytes to Google Cloud Storage (GCS) and returns a public link.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
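For the hosting step, a public GCS object is addressable at a predictable URL. A minimal sketch of that mapping (the helper, bucket, and object names are placeholders; the real tool first uploads the image bytes via the google-cloud-storage SDK):

```python
# Sketch of the URL the hosting step returns. A public GCS object is
# served from storage.googleapis.com; names below are placeholders.
def make_public_url(bucket: str, blob_name: str) -> str:
    return f"https://storage.googleapis.com/{bucket}/{blob_name}"

url = make_public_url("my-assets-bucket", "headers/post-123.png")
```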
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To connect this local tool, we use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;StdioConnectionParams&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; because the server runs as a local subprocess communicating via standard input and output. This transport method directly matches the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;transport="stdio"&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; configuration we will define in our server entrypoint, ensuring a seamless connection for your custom local scripts.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The following code defines the MCP connection in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. We use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv run&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to ensure the server starts in an isolated environment with all its dependencies correctly installed.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py:&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;def get_nano_banana_mcp_toolset():\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    Connects to our local \&amp;#x27;Nano Banana\&amp;#x27; image generator.\r\n    This demonstrates how to wrap a local Python script as an MCP tool.\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    path = os.path.join(&amp;quot;dev_signal_agent&amp;quot;, &amp;quot;tools&amp;quot;, &amp;quot;nano_banana_mcp&amp;quot;, &amp;quot;main.py&amp;quot;)\r\n    bucket = os.getenv(&amp;quot;AI_ASSETS_BUCKET&amp;quot;)     \r\n    return McpToolset(\r\n        connection_params=StdioConnectionParams(\r\n            server_params=StdioServerParameters(\r\n                command=&amp;quot;uv&amp;quot;, \r\n                args=[&amp;quot;run&amp;quot;, path], \r\n                env={**os.environ, &amp;quot;AI_ASSETS_BUCKET&amp;quot;: bucket}\r\n            ),\r\n            timeout=600.0 # Image generation can take time\r\n        )\r\n    )&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f35eca54e20&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Implementing the Nano Banana Pro Server Logic&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, we will implement the actual logic for this server. This implementation is based on the &lt;/span&gt;&lt;a href="https://www.youtube.com/watch?v=XCGbDx7aSks&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=2" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Factory&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; demo &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/a9a5f64a3394a4b5ecc64061f397bd5ed82927ee/ai-ml/agent-factory-antigravity-nano-banana-pro/mcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;code&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; by Remigiusz Samborski. While Remi's original code provides instructions for deploying the MCP server to Cloud Run, we will run it here as a local subprocess for faster development and testing.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get started, create the directory for our new server:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;mkdir -p dev_signal_agent/tools/nano_banana_mcp\r\ncd dev_signal_agent/tools/nano_banana_mcp&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026c5a60&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;The Server Entrypoint (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;main.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; )&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This file acts as the "brain" that initializes and starts the MCP server.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;FastMCP Initialization: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We use the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;FastMCP&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; library to create a server named "MediaGenerators" and register our &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;generate_image&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function as a tool&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Safe Logging: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;_initialize_console_logging&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function is critical. It forces all logs to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sys.stderr&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. This is because the MCP "stdio" transport uses &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sys.stdout&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for communication between the agent and the tool; standard logs sent to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;stdout&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; would corrupt that protocol.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Execution&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;mcp.run(transport="stdio")&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; line starts the server as a local subprocess, allowing it to listen for requests from your agent via standard input.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/main.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import logging\r\nimport os\r\nimport sys\r\nfrom fastmcp import FastMCP\r\nfrom dotenv import load_dotenv\r\nfrom nano_banana_pro import generate_image\r\n\r\ndef _initialize_console_logging(min_level: int = logging.INFO):\r\n    # Ensure logs go to STDERR so they don\&amp;#x27;t break the MCP stdio protocol\r\n    handler = logging.StreamHandler(sys.stderr)\r\n    logging.basicConfig(level=min_level, handlers=[handler], force=True)\r\n\r\ntools = [generate_image]\r\nmcp = FastMCP(name=&amp;quot;MediaGenerators&amp;quot;, tools=tools)\r\n\r\nif __name__ == &amp;quot;__main__&amp;quot;:\r\n    load_dotenv()\r\n    _initialize_console_logging()\r\n    mcp.run(transport=&amp;quot;stdio&amp;quot;)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026c5580&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;The Generation Logic (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;nano_banana_pro.py)&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is where the actual image generation happens using Gemini.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GenAI Client:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We initialize the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;genai.Client()&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to interact with Google's generative models.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Model Selection:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; It specifically targets the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gemini-3-pro-image-preview&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; model. We set the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;response_modalities&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to "IMAGE" to tell the model we want pixels, not just text.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Robustness&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The code includes a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;MAX_RETRIES&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; loop (set to 5) to handle any transient generation errors, ensuring the agent has multiple attempts to get a valid image.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Byte Processing: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Once the model generates the image, it arrives as raw inline data. We extract these bytes and call our helper to move them to the cloud.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;URI Conversion:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Finally, it replaces the internal &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gs://&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; path with a browser-accessible &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;https://&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; URL so the user can actually see the image.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;nano_banana_pro&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import logging\r\nfrom typing import Literal, Optional\r\nfrom google import genai\r\nfrom google.genai import types\r\nfrom media_models import MediaAsset\r\nfrom storage_utils import upload_data_to_gcs\r\n\r\nAUTHORIZED_URI = &amp;quot;https://storage.mtls.cloud.google.com/&amp;quot;\r\nMAX_RETRIES = 5\r\n\r\nasync def generate_image(\r\n    prompt: str,\r\n    aspect_ratio: Literal[&amp;quot;16:9&amp;quot;, &amp;quot;9:16&amp;quot;] = &amp;quot;16:9&amp;quot;,\r\n) -&amp;gt; MediaAsset:\r\n    &amp;quot;&amp;quot;&amp;quot;Generates an image using Gemini 3 Image model.&amp;quot;&amp;quot;&amp;quot;\r\n    genai_client = genai.Client()\r\n    content = types.Content(parts=[types.Part.from_text(text=prompt)], role=&amp;quot;user&amp;quot;)\r\n    \r\n    logging.info(f&amp;quot;Starting image generation for prompt: {prompt[:50]}...&amp;quot;)\r\n    asset = MediaAsset(uri=&amp;quot;&amp;quot;)\r\n    \r\n    for _ in range(MAX_RETRIES):\r\n        response = genai_client.models.generate_content(\r\n            model=&amp;quot;gemini-3-pro-image-preview&amp;quot;,\r\n            contents=[content],\r\n            config=types.GenerateContentConfig(\r\n                response_modalities=[&amp;quot;IMAGE&amp;quot;],\r\n                image_config=types.ImageConfig(aspect_ratio=aspect_ratio)\r\n            )\r\n        )\r\n        if response and response.parts:\r\n            for part in response.parts:\r\n                if part.inline_data and part.inline_data.data:\r\n                    # Upload the raw bytes to GCS\r\n                    gcs_uri = await upload_data_to_gcs(\r\n                        &amp;quot;mcp-tools&amp;quot;,\r\n                        part.inline_data.data,\r\n                        part.inline_data.mime_type\r\n                    )\r\n                    asset = MediaAsset(uri=gcs_uri)\r\n                    break\r\n        if asset.uri: break\r\n\r\n    if not asset.uri:\r\n        asset.error = &amp;quot;No image was generated.&amp;quot;\r\n    else:\r\n        # Convert gs:// URI to an HTTP accessible URL if needed\r\n        asset.uri = asset.uri.replace(\&amp;#x27;gs://\&amp;#x27;, AUTHORIZED_URI)\r\n        logging.info(f&amp;quot;Image URL: {asset.uri}&amp;quot;)\r\n        \r\n    return asset&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026c54c0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;GCS Upload Helper (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;storage_utils.py)&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Since agents need a web link to display images, this utility handles the hosting on Google Cloud Storage (GCS).&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Dynamic Bucket Selection&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It looks for a bucket name in your environment variables, falling back from &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;AI_ASSETS_BUCKET&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;LOGS_BUCKET_NAME&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to ensure it always has a place to save data.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Unique Filenames:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We use an MD5 hash of the raw image data to create a unique filename. This prevents filename collisions and acts as a simple way to avoid duplicate uploads of the same image.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Upload: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;blob.upload_from_string&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; method pushes the raw image bytes directly to your GCS bucket.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;storage_utils&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import hashlib\r\nimport mimetypes\r\nimport os\r\nfrom google.cloud.storage import Client, Blob\r\nfrom dotenv import load_dotenv\r\n\r\nload_dotenv()\r\nstorage_client = Client()\r\nai_bucket_name = os.environ.get(&amp;quot;AI_ASSETS_BUCKET&amp;quot;) or os.environ.get(&amp;quot;LOGS_BUCKET_NAME&amp;quot;)\r\nai_bucket = storage_client.bucket(ai_bucket_name)\r\n\r\nasync def upload_data_to_gcs(agent_id: str, data: bytes, mime_type: str) -&amp;gt; str:\r\n    file_name = hashlib.md5(data).hexdigest()\r\n    ext = mimetypes.guess_extension(mime_type) or &amp;quot;&amp;quot;\r\n    blob_name = f&amp;quot;assets/{agent_id}/{file_name}{ext}&amp;quot;\r\n    blob = Blob(bucket=ai_bucket, name=blob_name)\r\n    blob.upload_from_string(data, content_type=mime_type, client=storage_client)\r\n    return f&amp;quot;gs://{ai_bucket_name}/{blob_name}&amp;quot;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026c5490&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Data Model (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;media_models.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;)&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This file ensures that our data follows a strict structure (Schema).&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Structured Output:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By using a Pydantic &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;BaseModel&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, we guarantee that the tool always returns a consistent JSON object containing a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uri&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (the link) and an optional &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;error&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; message. This makes it much easier for the AI agent to understand and process the tool's result.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;media_models&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;from typing import Optional\r\nfrom pydantic import BaseModel\r\n\r\nclass MediaAsset(BaseModel):\r\n    uri: str\r\n    error: Optional[str] = None&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026c58b0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Tool Dependencies (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;requirements.txt)&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While we use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to run our code, a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;requirements.txt&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; file remains essential because it defines the specific dependencies &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; needs to install for the Nano Banana server to function. This provides the necessary "ingredients" to set up the isolated environment before the server starts.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This file lists the three core libraries required for this tool:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;google-cloud-storage&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Used for hosting the generated images on the cloud.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;google-genai&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Provides the logic for the Gemini 3 Pro image generation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;fastmcp&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The framework that turns our Python script into a standardized MCP tool.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;requirements&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;.txt&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;google-cloud-storage==3.6.*\r\ngoogle-genai==1.52.*\r\nfastmcp==2.13.*&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026c5c40&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Summary&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this first part of our series, we focused on establishing the agent's core capabilities by standardizing its external integrations through the Model Context Protocol (MCP). We initialized the project using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for high-speed dependency management and successfully configured three critical toolsets: Reddit for trend discovery, Google Cloud Docs for technical grounding, and a custom Nano Banana MCP server for multimodal image generation. By utilizing the Google ADK’s &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;McpToolset&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, we’ve abstracted away complex API logic into simple, plug-and-play modules, ensuring that our tools share a common interface that decouples integration from intelligence.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For a deeper look into our technical foundation, you can explore the &lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Developer Knowledge MCP server&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to learn more about knowledge grounding or visit the &lt;/span&gt;&lt;a href="https://github.com/google/adk-python" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt;Google ADK GitHub repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to explore the framework's core capabilities&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With our toolset fully configured and ready for action, we can now move to &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run"&gt;Part 2&lt;/a&gt;, where we will build the multi-agent architecture and integrate the Vertex AI memory bank to orchestrate these capabilities. You can also jump ahead to &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory"&gt;Part 3&lt;/a&gt;, where we will show you how to test the agent locally to verify these components on your workstation. If you’d like to dive ahead, you can explore the complete code for the entire series in our &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GitHub repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Special thanks to&lt;/span&gt;&lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt; Remigiusz Samborski &lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;  for the helpful review and feedback on this article.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more content like this, follow me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Linkedin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 18 Mar 2026 09:18:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Create Expert Content: Building Capabilities for a Multi-Agent System with Google ADK, MCP, and Cloud Run</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Shir Meir Lador</name><title>Head of AI, Product DevRel</title><department></department><company></company></author></item><item><title>Introducing multi-cluster GKE Inference Gateway: Scale AI workloads around the 
world</title><link>https://cloud.google.com/blog/products/containers-kubernetes/multi-cluster-gke-inference-gateway-helps-scale-ai-workloads/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The world of artificial intelligence is moving fast, and so is the need to serve models reliably and at scale. Today, we're thrilled to announce the preview of &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;multi-cluster GKE Inference Gateway&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to enhance the scalability, resilience, and efficiency of your AI/ML inference workloads across multiple Google Kubernetes Engine (GKE) clusters — even those spanning different Google Cloud regions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Built as an extension of the&lt;/span&gt; &lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GKE Gateway API&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, the multi-cluster Inference Gateway leverages the power of &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/multi-cluster-gateways"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;multi-cluster Gateways&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to provide intelligent, model-aware load balancing for your most demanding AI applications.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_gRilinA.max-1000x1000.jpg"
        
          alt="1"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Why multi-cluster for AI inference?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As AI models grow in complexity and users become more global, single-cluster deployments can face limitations:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Availability risks:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Regional outages or cluster maintenance can impact service.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Scalability caps:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Hitting hardware limits (GPUs/TPUs) within a single cluster or region.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Resource silos:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Underutilized accelerator capacity in one cluster can’t be used by another&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Latency:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Users far from your serving cluster may experience higher latency&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The multi-cluster GKE Inference Gateway addresses these challenges head-on, providing a variety of features and benefits:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Enhanced high reliability and fault tolerance:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Intelligently route traffic across multiple GKE clusters, including across different regions. If one cluster or region experiences issues, traffic is automatically re-routed, minimizing downtime.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Improved scalability and optimized resource usage:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Pool and leverage GPU/TPU resources from various clusters. Handle demand spikes by bursting beyond the capacity of a single cluster and efficiently utilize available accelerators across your entire fleet.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Globally optimized, model-aware routing:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The Inference Gateway can make smart routing decisions using advanced signals. With &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GCPBackendPolicy&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, you can configure load balancing based on real-time custom metrics, such as the model server's KV cache utilization metric, so that requests are sent to the best-equipped backend instance. Other modes like in-flight request limits are also supported.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Simplified operations:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Manage traffic to a globally distributed AI service through a single Inference Gateway configuration in a dedicated GKE "config cluster," while your models run in multiple "target clusters."&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;How it works&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In GKE Inference Gateway there are two foundational resources,&lt;/span&gt; &lt;code style="vertical-align: baseline;"&gt;InferencePool&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;InferenceObjective&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. An&lt;/span&gt; &lt;code style="vertical-align: baseline;"&gt;InferencePool&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; acts as a resource group for pods that share the same compute hardware (like GPUs or TPUs) and model configuration, helping to ensure scalable and high-availability serving. An&lt;/span&gt; &lt;code style="vertical-align: baseline;"&gt;InferenceObjective&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; defines the specific model names and assigns serving priorities, allowing Inference Gateway to intelligently route traffic and multiplex latency-sensitive tasks alongside less urgent workloads.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_ek1kPQE.max-1000x1000.png"
        
          alt="2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With this release, the system uses Kubernetes Custom Resources to manage your distributed inference service. &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;InferencePool&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; resources in each "target cluster" group model-server backends. These backends are exported and become visible as &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GCPInferencePoolImport&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; resources in the "config cluster." Standard &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;Gateway&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;HTTPRoute&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; resources in the config cluster define the entry point and routing rules, directing traffic to these imported pools. Fine-grained load-balancing behaviors, such as using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;CUSTOM_METRICS&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; or &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;IN_FLIGHT&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; requests, are configured using the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GCPBackendPolicy&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; resource attached to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GCPInferencePoolImport&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This architecture enables use cases like global low-latency serving, disaster recovery, capacity bursting, and efficient use of heterogeneous hardware.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more information about GKE Inference Gateway core concepts check out our &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway#understand_key_concepts"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;guide&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started today&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As you scale your AI inference serving workloads to more users in more places, we're excited for you to try multi-cluster GKE Inference Gateway. To learn more and get started, check out the documentation:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/about-multi-cluster-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;About multi-cluster GKE Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/setup-multicluster-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Set up multi-cluster GKE Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/customize-backend-multicluster-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Customize backend configurations with GCPBackendPolicy&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><pubDate>Tue, 17 Mar 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/containers-kubernetes/multi-cluster-gke-inference-gateway-helps-scale-ai-workloads/</guid><category>AI &amp; Machine Learning</category><category>GKE</category><category>Networking</category><category>Developers &amp; Practitioners</category><category>Containers &amp; Kubernetes</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Introducing multi-cluster GKE Inference Gateway: Scale AI workloads around the world</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/containers-kubernetes/multi-cluster-gke-inference-gateway-helps-scale-ai-workloads/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Arman Rye</name><title>Senior Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Andres Guedez</name><title>Senior Staff Software Engineer</title><department></department><company></company></author></item><item><title>Build Resilient LLM Applications on Vertex AI and Reduce 429 Errors</title><link>https://cloud.google.com/blog/products/ai-machine-learning/reduce-429-errors-on-vertex-ai/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building applications powered by Large Language Models (LLMs) on Vertex AI is exciting, but hitting a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/provisioned-throughput/error-code-429"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;429 error&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; can be a frustrating roadblock. These errors signal that your requests are coming in faster than the service can handle them at that moment.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Last year, we &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/learn-how-to-handle-429-resource-exhaustion-errors-in-your-llms?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;published a guide&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to handling these 429 errors. In this article, we’ll dig deeper into Vertex AI’s consumption models and dives into architectural best practices for managing request flows. This way, you can build smooth, resilient, and truly scalable AI applications.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Choosing the right consumption option &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Vertex AI provides a range of consumption models designed to accommodate various API traffic types and volumes. Your primary strategy for minimizing 429 errors is selecting the consumption model that best aligns with your application’s unique traffic patterns. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Build_Resilient_LLM_Applications_on_Vertex.max-1000x1000.jpg"
        
          alt="Build Resilient LLM Applications on Vertex AI and Reduce 429 Errors"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Default options: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The default option with Gemini on Vertex AI is Standard Pay-as-you-go (Paygo). For &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/standard-paygo"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Standard Pay-as-you-go&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (Paygo) traffic, Vertex AI uses a system with Usage Tiers. This dynamic approach allocates resources from a shared pool, where your organization’s historical spend determines your Usage Tier and baseline throughput (TPM). This baseline provides a predictable performance floor for typical workloads, while still allowing your application to burst beyond it on a best-effort basis.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If your application generates critical, user-facing traffic that can be unpredictable and require higher reliability than Standard Paygo, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/priority-paygo"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Priority Paygo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is designed for you. By adding the priority header to your requests, you signal that this traffic should be prioritized, reducing the likelihood of being throttled. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For applications with consistently high volumes of real-time traffic, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/provisioned-throughput/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Provisioned Throughput&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (PT) is the only consumption option that provides isolation from the shared PayGo pool, offering a stable experience even during heavy contention on PayGo. With PT, you reserve and pay for a guaranteed throughput, ensuring your important traffic flows smoothly. &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/provisioned-throughput-on-vertex-ai?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;To learn more about PT on Vertex AI, visit our guide here.&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Cost-effective options: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For traffic that isn't latency sensitive, Vertex AI offers more cost-effective options. The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/flex-paygo"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Flex PayGo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is suited for latency-tolerant traffic, processing requests at a lower price. Large-scale, asynchronous jobs, such as offline analysis or bulk data enrichment, are best handled by &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Batch&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This service manages the entire workflow, including scaling and retries, over a longer period (around 24 hours), insulating your main application from this heavy load.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Complex applications and hybrid approaches: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Complex applications often leverage a hybrid approach: PT for essential real-time traffic, Priority Paygo for fluctuating traffic, Standard Paygo for general requests, and Batch/Flex for latency-tolerant and offline request flows. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Five ways to reduce 429 errors on Vertex AI&lt;/span&gt;&lt;/h3&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Implement smart retries&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When your application encounters a temporary overload error like a 429 (Resource Exhausted) or 503 (Service Unavailable), an immediate retry is not recommended. The best practice is to implement a retry strategy called Exponential Backoff with Jitter. Exponential backoff means that the delay between retry attempts increases exponentially usually up to a predefined maximum delay. This gives the service time to recover from the overload condition. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;SDK &amp;amp; libraries:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/retry-strategy#configuring-retries"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Gen AI SDK&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;  includes native retry behavior that can be configured via HttpRetryOptions in client parameters. However, you can also leverage specialized libraries like &lt;/span&gt;&lt;a href="https://github.com/jd/tenacity" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Tenacity&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (for Python) or build a custom solution. For a deeper dive, refer to this &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/learn-how-to-handle-429-resource-exhaustion-errors-in-your-llms"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;blog post&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Agentic workflows:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; For developing agents, the &lt;/span&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Development Kit (ADK)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;offers a&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://google.github.io/adk-docs/integrations/reflect-and-retry/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Reflect and Retry plugin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; that builds resilience into AI workflows by automatically intercepting 429 errors. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Infrastructure &amp;amp; Gateway:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Another robust option for building resilience is &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/apigee-samples/tree/main/llm-circuit-breaking" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;circuit breaking with Apigee&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which enables you to manage traffic distribution and implement graceful failure handling. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Leverage global model routing&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Vertex AI's infrastructure is distributed across multiple regions. By default, if you target a specific regional endpoint, your request is served from that region. This means your application's availability is tied to the capacity of that single region. This is where the global endpoint becomes an effective tool for enhancing availability and resilience. Instead of being locked into one region, the global endpoint routes your traffic across a fleet of regions where there may be more availability, reducing the potential error rate.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Reduce payload via context caching&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;An effective way to reduce the load on Vertex AI is to avoid making calls for repetitive queries. Many production applications, especially chatbots and support systems, see similar questions asked frequently. Instead of re-processing these, you can implement &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/vertex-ai-context-caching?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;context caching&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. With Context Caching, Gemini reuses precomputed cached tokens, allowing you to reduce your API traffic and throughput. This not only saves costs but also reduces latency for repeated content within your prompts. &lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;4. Optimize prompts&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Reducing the token count in each request directly lowers your TPM consumption and costs.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Summarization with Flash-Lite: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Before sending a long conversation history to a model like Gemini Pro, use a lightweight model like &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini 2.5 Flash-Lite&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to summarize the context.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent memory optimization: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic workloads you can leverage Vertex AI &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Engine Memory Bank&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Features like Memory Extraction and Consolidation allow you to distill meaningful facts from a conversation, ensuring your agent remains context-aware without raw chat history.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Prompt hygiene:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Review your prompts and reduce overly verbose JSON schema descriptions (if the model is already familiar) and stripping excessive whitespace or redundant formatting.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;5. Shape traffic&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Sudden bursts of requests are a primary cause of 429 errors. Even if your average traffic rate is low, sharp spikes can strain resources. The goal is to smoothen traffic, spreading requests out over time. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Get started &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to put these patterns into practice? Explore the&lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; Vertex AI samples on GitHub&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, or jumpstart your next project with the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/learn/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Beginner’s Guide&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/start"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI quickstart&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or start building your next AI agent with the  &lt;/span&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Development Kit (ADK)&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 12 Mar 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/ai-machine-learning/reduce-429-errors-on-vertex-ai/</guid><category>Developers &amp; Practitioners</category><category>AI &amp; Machine Learning</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Build Resilient LLM Applications on Vertex AI and Reduce 429 Errors</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/ai-machine-learning/reduce-429-errors-on-vertex-ai/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Richard Liu</name><title>Senior Product Manager, Google 
Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Pedro Melendez</name><title>Cloud AI Technical Evangelist</title><department></department><company></company></author></item><item><title>Calling all devs: Build the future of Multimodal AI in the Gemini Live Agent Challenge</title><link>https://cloud.google.com/blog/topics/training-certifications/join-the-gemini-live-agent-challenge/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Hey builders! Stop typing, and start interacting! We are moving beyond the text box. The future of AI is all about immersive, real-time experiences. To celebrate multimodal AI, we’re challenging you to build the next generation of agents that can help you see, hear, speak, and create in the &lt;/span&gt;&lt;a href="https://geminiliveagentchallenge.devpost.com" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Live Agent Challenge&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=-AAwoj4qN8M"
      data-glue-modal-trigger="uni-modal--AAwoj4qN8M-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        

        &lt;div class="article-video__aspect-image"
          style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/maxresdefault_n8MQKZ2.max-1000x1000.jpg);"&gt;
          &lt;span class="h-u-visually-hidden"&gt;Build multimodal AI agents in the Gemini Live Agent Challenge&lt;/span&gt;
        &lt;/div&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal--AAwoj4qN8M-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="-AAwoj4qN8M"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=-AAwoj4qN8M"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Build Multimodal AI agents in the Gemini Live Agent Challenge&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Why join?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hands-on learning with Gemini Live API:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This is your shot to build the future of immersive AI agents on Google Cloud. We have everything you need to get started: Quickstarts, tutorials, access to the Agent Development Kit (ADK), and webinars hosted by our experts.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Showcase your skills:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; You’ll have the opportunity to break out of the traditional "text box" paradigm. Choose from three exciting categories—&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The Live Agent&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The Creative Storyteller&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, or &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;The UI Navigator&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;—to demonstrate the power of your solution .&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Think you have what it takes to win?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Build a solution to showcase your multimodal agent and you could potentially win a share of &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;$80,000 in prizes&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Overall grand prize: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;A trip to Google Cloud Next ’26 in Las Vegas (includes tickets, a travel stipend, and a chance to present on stage), $25,000 in USD, $3,000 in Google Cloud Credits for use with a Cloud Billing Account, virtual coffee with a Google Cloud team member, and the potential opportunity to be featured on our social channels.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Category winners:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A trip to Google Cloud Next ’26 in Las Vegas (includes tickets), $10,000 in USD, $1,000 in Google Cloud Credits for use with a Cloud Billing Account, virtual coffee with a Google Cloud team member, and the potential opportunity to be featured on our social channels.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Subcategory winners: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;$5,000 in USD and $500 in Google Cloud Credits for use with a Cloud Billing Account&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Honorable mentions:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; $2,000 in USD and $500 in Google Cloud Credits for use with a Cloud Billing Account&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Dig into Multimodal AI&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Your mission is to build and deploy an AI agent on Google Cloud that utilizes multimodal inputs and outputs. We want you to go beyond the traditional text-in/text-out approach.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Whether you are building a real-time translator or a visual web navigator, your agent should interpret the world around it. Here is some inspiration:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The live agent:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Build an agent we can talk to naturally that handles interruptions gracefully. Think real-time translators, vision-enabled tutors, and more.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The creative storyteller:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Blend text, images, audio, and video into one seamless experience. Imagine building an interactive storybook or a full marketing asset generator in a single workflow.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The UI navigator:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Create a helping hand that interprets visual screens. Maybe you want to create a universal web navigator or a visual QA tester that performs actions based on user intent.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Crucial note:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Your project &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;must&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; use a Gemini model (like Gemini 3 or Nano Banana) and the Gen AI SDK or Agent Development Kit (ADK). Lastly, you must use at least one Google Cloud service, such as Firestore, CloudSQL, Cloud Run, or Vertex AI.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to start building?&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Head over to our hackathon website to register, watch the kickoff &lt;/span&gt;&lt;a href="https://www.youtube.com/watch?v=-AAwoj4qN8M" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;video&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and review the official rules. Submissions are open &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;until March 16, 2026&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://geminiliveagentchallenge.devpost.com" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Register for the Gemini Live Agent Challenge&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 06 Mar 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/training-certifications/join-the-gemini-live-agent-challenge/</guid><category>Developers &amp; Practitioners</category><category>Training and Certifications</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Landscape_16x9_6kmmGy3.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Calling all devs: Build the future of Multimodal AI in the Gemini Live Agent Challenge</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Landscape_16x9_6kmmGy3.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/training-certifications/join-the-gemini-live-agent-challenge/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Dilasha Panigrahi</name><title>Product Marketing Manager</title><department></department><company></company></author></item><item><title>Cost-Effective AI with Ollama, GKE GPU Sharing, and vCluster</title><link>https://cloud.google.com/blog/topics/developers-practitioners/cost-effective-ai-with-ollama-gke-gpu-sharing-and-vcluster/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As organizations scale their AI workloads, two major challenges often emerge: the high cost of underutilized GPUs and the operational complexity of managing isolated environments for multiple teams. 
Dedicating a whole GPU to a single pod is often inefficient, while managing separate clusters for every team is operationally heavy.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this post, we'll demonstrate how to solve both problems by combining Google Kubernetes Engine (GKE) &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/timesharing-gpus#gpu_time-sharing_or_nvidia_mps"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GPU time-sharing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with &lt;/span&gt;&lt;a href="https://www.vcluster.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;vCluster&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for multi-tenancy. We'll deploy &lt;/span&gt;&lt;a href="https://ollama.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Ollama&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to serve open models (like Mistral) in isolated virtual environments that share the same physical GPU infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The Architecture: Virtual Clusters on Shared Hardware&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The architecture leverages GKE Autopilot to abstract away the physical infrastructure. Instead of managing nodes, you simply deploy workloads, and Autopilot provisions the necessary hardware on demand, including GPUs, drivers, etc.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This setup lets teams have their own isolated environments, APIs, and Ollama instances, and potentially different models, while running on the same cost-effective, shared GPU nodes. For example, Team A (e.g., Legal Research) and Team B (e.g., Customer Support) can work in separate environments while they share GPU resources.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/cost-effective-ai-ollama-gke-vcluster-shar.max-1000x1000.png"
        
          alt="cost-effective-ai-ollama-gke-vcluster-shared-nodes"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;vCluster lets you create virtual Kubernetes clusters on top of an existing Kubernetes cluster. It supports various tenancy modes, including the shared nodes model that's shown in the diagram, where each virtual cluster gets its own isolated control plane while sharing the underlying worker nodes. Each virtual cluster can be accessed independently by teams who get full admin access to their cluster without interfering with others. This model also lets you leverage host cluster features when needed, and you have the ability to deploy your own controllers and CRDs inside each virtual cluster.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When you use vCluster, you can use any of these tenancy modes:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Shared nodes&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The shared nodes mode allows multiple virtual clusters to run workloads on the same physical Kubernetes nodes. This configuration is ideal for scenarios where maximizing resource utilization is a top priority—especially for internal developer environments, CI/CD pipelines, and cost-sensitive use cases.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Private nodes&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Using private nodes is a mode for vCluster where, instead of sharing the host cluster's worker nodes, individual worker nodes are joined to a vCluster. These private nodes act as the vCluster's worker nodes and they aren't shared with other vClusters on the same host cluster.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Auto nodes&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: You can configure vCluster to automatically provision and join worker nodes based on the node and resource requirements. To use auto nodes, you need vCluster Platform installed and vCluster needs to be connected to it.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Standalone&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: vCluster Standalone is a different architecture mode for vCluster for the control plane and node. The standalone mode doesn't require a host cluster. vCluster is deployed directly onto nodes like other Kubernetes distributions. vCluster Standalone can run on any type of node, whether it's a bare-metal node or a VM. It provides the strictest isolation for workloads because there's no shared host cluster for the control plane or worker nodes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Deployment&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To follow along on the deployment steps, make sure that you have the following installed:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/sdk/docs/install-sdk"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;gcloud CLI&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="http://vcluster.com/install" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;vcluster CLI&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://kubernetes.io/docs/reference/kubectl/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;kubectl&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://kubectx.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;kubectx&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 1: Set up and Create the GKE Autopilot Cluster&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Unlike GKE Standard, we don't need to calculate node counts or configure node pools manually. We create the cluster, fetch credentials, and let Autopilot provision nodes on demand.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Set environment variables and create a GKE Autopilot cluster:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;export PROJECT_ID=YOUR_PROJECT_ID
export REGION=YOUR_REGION_ID
# Create GKE Autopilot cluster
gcloud container clusters create-auto vcluster-gpu-sharing \
  --region=$REGION --project $PROJECT_ID&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Replace &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;YOUR_PROJECT_ID&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;YOUR_REGION_ID&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; with the Google Cloud project and region that you want to use.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Get the credentials to configure your local kubectl:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;gcloud container clusters get-credentials vcluster-gpu-sharing \
  --region $REGION --project $PROJECT_ID&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 2: Create Virtual Clusters (vClusters)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the Autopilot cluster running, we can now create isolated environments for our tenants. We'll create two vClusters, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;demo1&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;demo2&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. You'll need a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;vcluster.yaml&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; manifest file for configuration.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When you use GKE Autopilot, it might take a few minutes to create the first vCluster. This is because vCluster waits for its own control plane pods to be up and running. Because Autopilot provisions the underlying nodes dynamically in response to this new workload, there's a brief delay while the infrastructure is initialized.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Create the vcluster configuration file\r\ncat &amp;lt;&amp;lt;EOF &amp;gt; vcluster.yaml\r\n# Place your vCluster configuration here. \r\n# For GPU workloads on GKE Autopilot, this typically involves \r\n# enabling node synchronization so the vCluster can see the \r\n# underlying GPU nodes provided by Autopilot.\r\nsync:\r\n fromHost:\r\n   ingressClasses:\r\n     enabled: true\r\n   nodes:\r\n     enabled: true\r\n toHost:\r\n   ingresses:\r\n     enabled: true\r\nEOF\r\n\r\n# Create the first virtual cluster\r\nvcluster create demo1 -n demo1 -f vcluster.yaml\r\n\r\n# Create the second virtual cluster\r\nvcluster create demo2 -n demo2 -f vcluster.yaml&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601e8a370&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;Note&lt;/strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;: If you receive an error warning that you're trying to create a vCluster inside another, select &lt;code&gt;no&lt;/code&gt; and then switch back to the correct host context.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 3: Deploy Ollama to the Virtual Cluster&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Next, we deploy Ollama inside the first virtual cluster.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Create the deployment manifest for Ollama. This manifest deploys Ollama and it uses a Kubernetes Service to expose it on port 11434. Nodes are selected that use &lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/timesharing-gpus"&gt;GPU time-sharing&lt;/a&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Create Ollama deployment manifest
cat &amp;lt;&amp;lt;EOF &amp;gt; ollama.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
 name: ollama
 namespace: default
spec:
 replicas: 1
 selector:
   matchLabels:
     app: ollama
 template:
   metadata:
     labels:
       app: ollama
   spec:
     nodeSelector:
        # Select nodes that use GPU time-sharing, allow up to five
        # containers to share the underlying GPU, and have NVIDIA L4 GPUs.
       cloud.google.com/gke-gpu-sharing-strategy: "time-sharing"
       cloud.google.com/gke-max-shared-clients-per-gpu: "5"
       cloud.google.com/gke-accelerator: nvidia-l4
     containers:
     - name: ollama
       image: ollama/ollama:latest
       ports:
       - containerPort: 11434
       resources:
         limits:
           nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
 name: ollama
 namespace: default
spec:
 selector:
   app: ollama
 ports:
 - port: 11434
   targetPort: 11434
 type: ClusterIP
EOF&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;When the vCluster is active, switch contexts to work inside demo1:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Connect to the virtual cluster demo1
vcluster connect demo1 -n demo1&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Deploy Ollama in the virtual environment:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Apply your deployment manifest
kubectl apply -f ollama.yaml&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Even though we're in a virtual cluster, when we create pods that request GPUs, the request is synced to the host. GKE Autopilot detects this requirement and automatically attaches the necessary GPU hardware to the nodes that are running your workloads.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
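Step 4 refers to the Ollama pod as `<pod-name>`. One way to wait for the pod and capture its name is sketched below; this is a minimal sketch that assumes the `app: ollama` label from the manifest above, not a step from the original walkthrough:

```shell
# Wait for the Ollama pod to become Ready (the first start can take a while,
# because Autopilot provisions the GPU node on demand)
kubectl wait --for=condition=Ready pod -l app=ollama --timeout=15m

# Capture the pod name to use in place of <pod-name> in the next step
POD_NAME=$(kubectl get pods -l app=ollama -o jsonpath='{.items[0].metadata.name}')
echo "$POD_NAME"
```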
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 4: Pulling and Testing the Model&lt;/span&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;With the server running, perform the model pull and test entirely within the virtual cluster context:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Execute the pull command inside the pod
kubectl exec -it &amp;lt;pod-name&amp;gt; -- ollama pull mistral&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Verify the API:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Port forward the Ollama service
kubectl port-forward svc/ollama 8080:11434
# Send a chat request in a new window
curl -s http://localhost:8080/api/chat \
 -H "Content-Type: application/json" \
 -d '{ "model": "mistral", "stream": false, "messages": [ {"role": "user", "content": "Explain GKE Autopilot"} ] }' | jq -r '.message.content'&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
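The `jq -r '.message.content'` filter in the verification step pulls just the assistant text out of the JSON that Ollama's `/api/chat` endpoint returns. A minimal offline sketch of that extraction; the JSON below only illustrates the response shape, and the content value is made up:

```shell
# Illustrative non-streaming /api/chat response body (shape only; made-up content)
response='{ "model": "mistral", "message": { "role": "assistant", "content": "GKE Autopilot manages nodes for you." }, "done": true }'

# Extract just the assistant text, as the verification step above does
echo "$response" | jq -r '.message.content'
# prints: GKE Autopilot manages nodes for you.
```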
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 5: Deploy Ollama to vCluster demo2&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Repeat the steps to deploy Ollama and pull the model to the second virtual cluster:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Connect to the virtual cluster\r\nvcluster connect demo2 -n demo2\r\n\r\n# Apply your deployment manifest\r\nkubectl apply -f ollama.yaml\r\n\r\n# Execute the pull command inside the pod\r\nkubectl exec -it &amp;lt;pod-name&amp;gt; -- ollama pull mistral\r\n\r\n# Port forward the Ollama service\r\nkubectl port-forward svc/ollama 8080:11434\r\n\r\n# Send a chat request in a new window\r\ncurl -s http://localhost:8080/api/chat \\\r\n -H &amp;quot;Content-Type: application/json&amp;quot; \\\r\n -d \&amp;#x27;{ &amp;quot;model&amp;quot;: &amp;quot;mistral&amp;quot;, &amp;quot;stream&amp;quot;: false, &amp;quot;messages&amp;quot;: [ {&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Explain GKE Autopilot&amp;quot;} ] }\&amp;#x27; | jq -r \&amp;#x27;.message.content\&amp;#x27;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601e8a250&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Verify the Underlying Infrastructure&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now let's switch back to the host cluster context and see what's going on.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Check how many nodes have been provisioned and where are the Ollama pods running:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# List the available contexts
kubectx
# Switch to the host cluster context
kubectx gke_$PROJECT_ID_$REGION_vcluster-gpu-sharing
# List nodes
Kubectl nodes&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You should see two nodes. One is running the vCluster components. The other runs the Ollama instances with L4 GPUs. Your output should look like this (node names will be different):&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ kubectl get nodes
NAME                                                  STATUS   ROLES    AGE    VERSION
gk3-vcluster-gpu-sharing-nap-1w88cyly-895203e4-xbqk   Ready    &amp;lt;none&amp;gt;   7h8m   v1.33.5-gke.2072000
gk3-vcluster-gpu-sharing-pool-2-0a984fed-7mff         Ready    &amp;lt;none&amp;gt;   4d     v1.33.5-gke.2072000&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;C&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;heck where the Ollama pods are running:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Check the Nodes running the Ollama pods
kubectl get pods -n demo1 -o wide
kubectl get pods -n demo2 -o wide&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Notice that both Ollama pods are running on the same node. This node was provisioned by GKE Autopilot with L4 GPUs and GPU time-sharing configured.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
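To double-check that Autopilot honored the sharing configuration, you can inspect the GPU node's labels from the host context. This is a minimal sketch, not part of the original steps; replace `NODE_NAME` with the GPU node name from the previous output:

```shell
# Show the GPU-sharing related labels on the node that runs the Ollama pods
kubectl describe node NODE_NAME \
  | grep -E 'gke-accelerator|gke-gpu-sharing-strategy|gke-max-shared-clients'
```

If time-sharing is active, the labels mirror the nodeSelector from the manifest (strategy, max shared clients, and accelerator type).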
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Conclusion&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By using GKE Autopilot, we've removed the need to manually configure GPU node pools or time-sharing strategies. Autopilot provides resources dynamically, while vCluster ensures that Team A's Legal Research data and Team B's Customer Support bots remain completely isolated. This implementation provides a robust, low-maintenance platform for scaling AI workloads.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 06 Mar 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/cost-effective-ai-with-ollama-gke-gpu-sharing-and-vcluster/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/cost-effective-ai-ollama-gke-vcluster-hero.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Cost-Effective AI with Ollama, GKE GPU Sharing, and vCluster</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/original_images/cost-effective-ai-ollama-gke-vcluster-hero.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/cost-effective-ai-with-ollama-gke-gpu-sharing-and-vcluster/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Abdel Sghiouar</name><title>Senior Cloud Developer Advocate</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Saiyam Pathak</name><title>DevRel</title><department></department><company>vCluster</company></author></item><item><title>Data Strategy = AI Strategy Series: Transforming Developers into AI Architects with Google Cloud</title><link>https://cloud.google.com/blog/topics/developers-practitioners/data-strategy-ai-strategy-series-transforming-developers-into-ai-architects-with-google-cloud/</link><description>&lt;div 
class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Your agent is only as good as your data grounding.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; If your data is messy, your agent will be highly confident and still prone to hallucination. In 2026, your Data Strategy and your AI Strategy are the same thing; you cannot have one without the other.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This series, "Data Strategy = AI Strategy," explores the various aspects of that strategy and shows how to architect workflows that are more deterministic while still building autonomous agents. This first episode focuses on the convergence of data and AI architecture.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The industry is reaching a critical inflection point. Although 2024 and 2025 were defined by the "API era"—where developers learned to integrate LLMs through apps and endpoints—2026 demands a shift towards &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;enterprise architecture&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. To build production-ready applications, developers must transition from writing prompts to designing &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;intelligent end-to-end stacks&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The challenge isn't just about the AI model that you use; it's about the infrastructure around it. For an application to be enterprise-grade, it must meet the requirements of three critical pillars: &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;speed&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;scale&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;security&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. This article focuses on these pillars: moving away from building agents focused on AI adoption, and toward architecting agents that are grounded in well-strategized context. More specifically, this blog and the linked codelabs provide a hands-on learning path that shows you how to build such an architecture by using Google's data cloud, with relational databases that are fully PostgreSQL compatible.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The Strategic Pivot: The Database as the Context Engine&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a modern AI stack, the database is no longer just a storage layer; it has become the &lt;em&gt;context engine&lt;/em&gt;. Our strategy centers on using fully PostgreSQL-compatible services like AlloyDB for PostgreSQL and &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud SQL &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;to eliminate the primary bottlenecks of AI in production: latency, AI capabilities, and retrieval accuracy.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To enable this transition from developer to architect, the learning path focuses on eliminating infrastructure friction and prioritizing high-level architectural design.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Eliminating the Infrastructure Tax&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Historically, the transition from local prototyping to cloud-scale deployment was hindered by the &lt;em&gt;infrastructure tax&lt;/em&gt;—the hours spent on configuring clusters, instances, and VPC network peering. By introducing automated setup utilities, we let developers bypass these configuration hurdles.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The result is a shift in focus: instead of managing infrastructure, developers can spend their time on designing secure data flows and high-throughput vector pipelines. In our recent instructor-led Code Vipassana sessions, each participating developer saved over an hour of time in each lab because of this shift. This approach effectively accelerates the path to production.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Building for Scale: One Million Vectors, Zero Loops&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To build enterprise architecture, you need to move beyond small-scale demos. We focus on &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;batch processing for embeddings&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to expedite vector search processes. AlloyDB can generate embeddings at scale directly within the database layer. By using this capability, we can eliminate the latency of traditional loops, which allows us to do real-time analytics on massive datasets.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Sovereign Intelligence and Row-Level Security (RLS)&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Security in AI is more than only a firewall; it's about data governance. We emphasize the use of &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;row-level security (RLS) &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;to help ensure that AI agents can access only the specific data they're authorized to see. This &lt;em&gt;private vault&lt;/em&gt; architecture is essential for regulated industries where data isolation is a non-negotiable requirement. Imagine your user talking to your agent and learning about another user or a benchmark. Baking the data level security into the database is not an option anymore. We cannot rely on agents to make the call on who should be informed of what.&lt;/span&gt;&lt;/p&gt;
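As a hedged sketch of what baking RLS into the database can look like in PostgreSQL: the `documents` table, `tenant_id` column, and `app.tenant_id` session setting below are hypothetical names for illustration, not objects from the labs:

```shell
# Hypothetical PostgreSQL RLS setup, applied with psql.
# Assumes a "documents" table with a "tenant_id" column; adjust names to your schema.
psql "$DATABASE_URL" <<'SQL'
-- Turn on row-level security for the table the agent reads from
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

-- Only rows whose tenant_id matches the session's app.tenant_id are visible
CREATE POLICY tenant_isolation ON documents
  USING (tenant_id = current_setting('app.tenant_id'));
SQL
```

With a policy like this in place, the database filters rows before the agent ever sees them, so isolation doesn't depend on the agent's judgment.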
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The Architectural Learning Path&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We have curated a series of hands-on technical labs that form a complete narrative of enterprise AI development. Each lab represents a specific layer of the intelligent stack&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Most of the labs use AlloyDB. However, the momentum of this architectural strategy has also extended to the Cloud SQL ecosystem. Our learning path includes a couple of alternative labs for Cloud SQL for PostgreSQL users.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our recommended learning path includes the following core architectural labs:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://codelabs.developers.google.com/quick-alloydb-setup" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AlloyDB Quick Setup Lab&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This lab serves as the entry point for architects, by demonstrating how to provision a high-performance AlloyDB cluster with the required VPC and network settings in minutes. It focuses on the &lt;em&gt;day 0&lt;/em&gt; operations that help to ensure a secure and scalable foundation for all subsequent AI logic.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://codelabs.developers.google.com/connect-to-alloydb-on-cloudrun" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Connect your app to AlloyDB data and deploy on Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (or &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/connect-to-cloudsql-on-cloudrun" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Connect to Cloud SQL and deploy on Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Moving into deployment, this lab explores the architecture of serverless applications. Developers learn how to connect Cloud Run services to AlloyDB (or Cloud SQL), with a focus on using managed identities and connection strings to improve security. This approach helps ensure that the application layer is as robust as the data layer. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://codelabs.developers.google.com/gemini-3-flash-on-alloydb-sustainability-app" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Build a Real-Time Surplus Engine: Gemini 3 Flash &amp;amp; AlloyDB&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (or &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/gemini-3-on-cloudsql-sustainability-app" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini 3 Flash &amp;amp; Cloud SQL&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This lab addresses the pillar for &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;speed&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; by building an end-to-end, data-driven AI app. It demonstrates how to use the high-efficiency Gemini 3 Flash model to process streaming data and generate real-time insights. This approach creates a responsive feedback loop between the database and the end user.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://codelabs.developers.google.com/embeddings-at-scale-with-alloydb" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;One Million Vectors, Zero Loops: Scale with AlloyDB&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Focused on &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;scale&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, this lab dives deeply into the vector search process. Architects learn how to implement batch processing for embeddings directly within the database. This approach bypasses application-layer bottlenecks, which enables the ingestion and search of millions of vectors with enterprise-grade performance.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://codelabs.developers.google.com/zero-trust-agents-with-alloydb" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;The Private Vault: Zero Trust Intelligence with RLS&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The final piece of the architectural puzzle focuses on improved &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;security&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. This lab guides developers through building &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;zero trust&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; agents. By implementing RLS in PostgreSQL, developers can help ensure that their AI agents respect user-specific data boundaries. This approach provides a blueprint for compliant AI systems with enhanced security.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Designing the Future&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By removing the friction from infrastructure and focusing on the core principles of speed, scale, and security, we can empower a new generation of AI architects. This strategic shift can help ensure that the applications built today are ready for the production demands of tomorrow.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To join our upcoming instructor-led, hands-on sessions and begin your transformation from developer to architect, &lt;/span&gt;&lt;a href="https://codevipassana.dev" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;sign up for Code Vipassana&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 03 Mar 2026 19:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/data-strategy-ai-strategy-series-transforming-developers-into-ai-architects-with-google-cloud/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/ai-strategy-transform-devs-ai-architects-her.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Data Strategy = AI Strategy Series: Transforming Developers into AI Architects with Google Cloud</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/ai-strategy-transform-devs-ai-architects-her.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/data-strategy-ai-strategy-series-transforming-developers-into-ai-architects-with-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Abirami Sukumaran</name><title>Staff Developer Advocate, Google</title><department></department><company></company></author></item><item><title>Announcing the MCP Toolbox Java SDK</title><link>https://cloud.google.com/blog/topics/developers-practitioners/announcing-the-mcp-toolbox-java-sdk/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Engineering teams are moving beyond simple chatbots to build agentic systems 
that interact directly with mission-critical databases. However, building these enterprise agents often means hitting an integration wall of custom glue code, brittle APIs, and complex database logic.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To replace these hardcoded bottlenecks with a secure, unified control plane, we are thrilled to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;announce the&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://github.com/googleapis/mcp-toolbox-sdk-java" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Java SDK&lt;/strong&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; for the Model Context Protocol (MCP) Toolbox for Databases&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This release brings first-class, typesafe agent orchestration to the world’s most widely adopted enterprise ecosystem. Java's mature architecture is purpose-built for these rigorous demands, providing the high concurrency, strict transactional integrity, and robust state management required to safely scale mission-critical AI agents in production.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;MCP: The USB Type-C for AI Agents&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Think of the Model Context Protocol (MCP) as a universal translator for AI.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Created to standardize how AI models connect to external tools and datasets, MCP replaces custom, fragmented integration scripts with a secure, universal protocol. Whether your agent needs to execute a transactional SQL query, search through thousands of policy documents, or trigger a REST API, MCP provides a single, unified interface. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With &lt;/span&gt;&lt;a href="https://googleapis.github.io/genai-toolbox/getting-started/introduction" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;MCP Toolbox&lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; for &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;databases&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, we’ve made implementing this protocol effortless.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;MCP Toolbox for Databases&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://googleapis.github.io/genai-toolbox/getting-started/introduction" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;MCP Toolbox for Databases&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is an open source MCP server for databases. It natively supports 42 different data sources spanning AlloyDB, Cloud SQL, Cloud Spanner, and many more including third party data sources as well. Crucially, it gives you the ability to define &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;custom tools&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; that safely map an AI agent's natural language intents directly to specific database operations. It enables you to develop tools easier, faster, and more securely by handling the complexities such as connection pooling, authentication, and more.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We already provide robust, production-ready SDKs for &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Python, JavaScript, TypeScript, and Go&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. But when it comes to "Day 2" production workloads—where high concurrency, transactional integrity, and conversational state management are non-negotiable—&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Java and Spring Boot&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; remain the undisputed heavyweights.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the new Java SDK, you can natively build stateful, highly concurrent multi-agent systems without ever leaving your preferred tech stack. This SDK brings first-class, type-safe orchestration to Java, which is potentially a major priority for enterprise architects.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Get Started with the Java SDK&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’ve designed the MCP Toolbox Java SDK to be frictionless for enterprise teams. You can start building your agents today.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Add the Dependency&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To bring the MCP Toolbox into your Java or Spring Boot project, simply add the following dependency to your &lt;/span&gt;&lt;code&gt;&lt;span style="vertical-align: baseline;"&gt;pom.xml&lt;/span&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;&amp;lt;dependency&amp;gt;\r\n   &amp;lt;groupId&amp;gt;com.google.cloud.mcp&amp;lt;/groupId&amp;gt;\r\n   &amp;lt;artifactId&amp;gt;mcp-toolbox-sdk-java&amp;lt;/artifactId&amp;gt;\r\n   &amp;lt;version&amp;gt;0.2.0&amp;lt;/version&amp;gt;\r\n&amp;lt;/dependency&amp;gt;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee0310&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That’s it!!! You’d be good to go, just like we did in this enterprise grade use case sample below!!&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Real World Example: The Autonomous Transit Concierge&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To demonstrate the power of the Java SDK for MCP Toolbox combined with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;AlloyDB&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, let’s look at a use case from the transportation sector.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Meet &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/agents/cymbal-transit" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cymbal Transit&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a fictitious intercity bus network. Customers don't want to click through 15 dropdown menus to plan a trip. They want to ask:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I need to get from New York to Boston tomorrow morning. Can I bring my Golden Retriever? If so, book me the fastest trip."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To answer this, an AI agent must seamlessly cross-reference unstructured data (pet policies) with structured data (schedules, seat availability) and execute a transaction (booking)—all while remembering the context of the conversation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here is how we build this using the Java SDK for MCP Toolbox.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Foundation: Database! (AlloyDB Schema with Native Embeddings)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AlloyDB is the perfect engine for this because it handles relational data and high-dimensional AI vectors in a single query engine. Even better, AlloyDB can generate embeddings &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;natively&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; using the &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;google_ml_integration&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/alloydb/docs/ai/configure-vertex-ai#verify-installed-extension"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;extension&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, meaning your Java app doesn't have to shuffle text back and forth to an embedding API.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;First, set up AlloyDB cluster and instance by following this quick &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/quick-alloydb-setup" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;one-click deploy lab&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Then, set up database objects using the SQL statements below:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;quot;-- Enable necessary extensions for AI semantic search and embedding generation\r\nCREATE EXTENSION IF NOT EXISTS vector;\r\nCREATE EXTENSION IF NOT EXISTS google_ml_integration;\r\n\r\n-- Table 1: Transit Policies (Unstructured Data for RAG)\r\nCREATE TABLE transit_policies (\r\n    policy_id SERIAL PRIMARY KEY,\r\n    category VARCHAR(50),\r\n    policy_text TEXT,\r\n    policy_embedding vector(768) \r\n);\r\n\r\n-- Table 2: Intercity Bus Schedules (Structured Data)\r\nCREATE TABLE bus_schedules (\r\n    trip_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),\r\n    origin_city VARCHAR(100),\r\n    destination_city VARCHAR(100),\r\n    departure_time TIMESTAMP,\r\n    arrival_time TIMESTAMP,\r\n    available_seats INT DEFAULT 50,\r\n    ticket_price DECIMAL(6,2)\r\n);\r\n\r\n-- Table 3: Booking Ledger (Transactional Action Data)\r\nCREATE TABLE bookings (\r\n    booking_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),\r\n    trip_id UUID REFERENCES bus_schedules(trip_id),\r\n    passenger_id VARCHAR(100),\r\n    status VARCHAR(20) DEFAULT &amp;#x27;CONFIRMED&amp;#x27;,\r\n    booking_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP\r\n);&amp;quot;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee0970&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With our tables defined and vector support enabled, AlloyDB is now ready to serve as the unified brain for both our structured transactional data and semantic knowledge base.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Ingesting Records and Generating Real Embeddings&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We need a robust dataset to ensure our agent's context window has real options to reason over. Using PostgreSQL's powerful &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;generate_series &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;in AlloyDB, we can instantly seed our database with over 200 realistic bus trips for tomorrow. We have taken this approach to ingesting mock data for this demo application.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;-- 1. Insert Unstructured Policies and GENERATE REAL EMBEDDINGS natively in AlloyDB\r\nINSERT INTO transit_policies (category, policy_text, policy_embedding) \r\n... (refer repo for full statement)\r\n\r\n-- 2. Generate 200+ Realistic Schedules for the Next 7 Days using generate_series\r\nINSERT INTO bus_schedules (origin_city, destination_city, departure_time, arrival_time, ticket_price, available_seats)\r\n... (refer repo for full statement)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee0a00&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
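&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To illustrate the &lt;code&gt;generate_series&lt;/code&gt; technique itself (this is a minimal sketch with made-up routes and prices, not the repo's actual statement), seeding one departure per hour for tomorrow could look like this:&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;-- Seed an hourly New York-to-Boston departure for tomorrow, 6 AM to 10 PM
INSERT INTO bus_schedules
    (origin_city, destination_city, departure_time, arrival_time, ticket_price, available_seats)
SELECT
    'New York',
    'Boston',
    ts,
    ts + INTERVAL '4 hours',
    49.99,
    50
FROM generate_series(
    date_trunc('day', now()) + INTERVAL '1 day 6 hours',
    date_trunc('day', now()) + INTERVAL '1 day 22 hours',
    INTERVAL '1 hour'
) AS ts;&lt;/pre&gt;&lt;/div&gt;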
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Refer to the &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/agents/cymbal-transit" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;repo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for the full code.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Just like that, our database is dynamically populated with realistic schedules and natively generated embeddings, giving our AI agent immediate access to a rich, queryable environment.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Stateful Agent Architecture in Spring Boot&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The hardest part of building a conversational UI is &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;session management&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. If a user asks, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"What times are available?"&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and then says, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"Book the 8 AM one,"&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; the agent needs to remember the context.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Using the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Java MCP Toolbox SDK&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Spring Boot&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;LangChain4j&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, we can seamlessly maintain conversational memory in the HTTP Session and inject it into the agent's thought process. By pairing a modern frontend with this stateful backend, your enterprise application becomes a continuous, intelligent workspace.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of writing massive if/else blocks to parse user intent, we simply define a declarative AI interface and bind our MCP tools to it. How simple it is to bring orchestration logic to the code without having to hard-code detailed queries or add blocks of static conditional code:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;interface TransitAgent {\r\n    @SystemMessage({\r\n        &amp;quot;You are the Cymbal Transit Concierge.&amp;quot;,\r\n        &amp;quot;Use the \&amp;#x27;querySchedules\&amp;#x27; tool for finding schedules.&amp;quot;,\r\n        &amp;quot;Use \&amp;#x27;bookTicket\&amp;#x27; to execute transactions.&amp;quot;,\r\n        &amp;quot;Use \&amp;#x27;searchPolicies\&amp;#x27; to look up luggage and pet rules.&amp;quot;\r\n    })\r\n    String chat(@MemoryId String sessionId, @UserMessage String userMessage);\r\n}\r\n\r\n@Service\r\nclass TransitAgentTools {\r\n    // These methods automatically call our MCP Toolbox for Databases server!\r\n    @Tool(&amp;quot;Query specific schedules between an origin and destination city.&amp;quot;)\r\n    public String querySchedules(String origin, String destination) { ... }\r\n\r\n    @Tool(&amp;quot;Book a ticket for a passenger.&amp;quot;)\r\n    public String bookTicket(String tripId, String passengerName) { ... }\r\n}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee09a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By cleanly separating the LLM prompt logic from the actual tool execution, LangChain4j ensures our agent remains focused, predictable, and remarkably easy to maintain over time. By pairing a modern frontend with this stateful Spring Boot backend, your enterprise application becomes a continuous, intelligent workspace rather than a series of disconnected prompts.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Mapping Intents to SQL: The tools.yaml&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The true magic of MCP Toolbox for Databases is how you define these custom tools. Your Java application doesn't need direct SQL access, nor does the LLM need to hallucinate table schemas. Instead, you provide the MCP server with a clean tools.yaml configuration. This file securely maps the agent’s tool calls directly to parameterized SQL statements in AlloyDB.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;tools:\r\n  query-schedules:\r\n    kind: postgres-sql\r\n    source: alloydb\r\n    description: Find available bus schedules between cities.\r\n    parameters:\r\n      - name: origin\r\n        type: string\r\n      - name: destination\r\n        type: string\r\n    statement: |\r\n      SELECT CAST(trip_id AS TEXT) AS trip_id, departure_time, ticket_price \r\n      FROM bus_schedules \r\n      WHERE lower(origin_city) = lower($1) AND lower(destination_city) = lower($2)\r\n\r\n  search-policies:\r\n    ... refer to the repo for the full tools.yaml&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee0eb0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With this simple declarative configuration, you've bridged the gap between natural language intents and complex SQL queries—without ever exposing your database schema directly to the LLM.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Connecting the Dots: Listing, Invoking, and Executing Tools in Java&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now that our database and tools are configured, the MCP Toolbox Java SDK handles the heavy lifting of interacting with them. The SDK provides an intuitive, type-safe API to securely discover, query, and execute transactions from your Spring Boot service.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;// 1. Initialize the Client\r\nMcpToolboxClient mcpClient = McpToolboxClient.builder()\r\n    .baseUrl(&amp;quot;https://toolbox-my-project-uc.a.run.app&amp;quot;)\r\n    .apiKey(myIdToken) \r\n    .build();\r\n\r\n// 2. Listing Discoverable Tools\r\nmcpClient.listTools().thenAccept(tools -&amp;gt; {\r\n    System.out.println(&amp;quot;Successfully discovered &amp;quot; + tools.size() + &amp;quot; tools.&amp;quot;);\r\n});\r\n\r\n// 3. Invoking a Tool (Read-Only Data)\r\nString schedules = mcpClient.invokeTool(&amp;quot;query-schedules&amp;quot;, Map.of(\r\n    &amp;quot;origin&amp;quot;, &amp;quot;New York&amp;quot;,\r\n    &amp;quot;destination&amp;quot;, &amp;quot;Boston&amp;quot;\r\n)).join().content().get(0).text();\r\n\r\n// 4. Executing a Transactional Tool (Requires Bound Authentication)\r\nAuthTokenGetter toolAuthGetter = () -&amp;gt; CompletableFuture.completedFuture(myIdToken);\r\n\r\nString bookingConfirmation = mcpClient.loadTool(&amp;quot;book-ticket&amp;quot;, Map.of(&amp;quot;google_auth&amp;quot;, toolAuthGetter))\r\n    .thenCompose(tool -&amp;gt; {\r\n        // Bind the authenticated user context securely\r\n        tool.bindParam(&amp;quot;passenger_name&amp;quot;, &amp;quot;Jane Doe&amp;quot;);\r\n        // Execute the mutable transaction\r\n        return tool.execute(Map.of(&amp;quot;trip_id&amp;quot;, &amp;quot;123e4567-e89b-12d3-a456-426614174000&amp;quot;));\r\n    }).join().content().get(0).text()&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee0ac0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;And there you have it—in just a few lines of type-safe Java, your Spring Boot application is securely discovering and executing remote tools as if they were local methods. Notice the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;bindParam&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; method above? This powerful feature allows you to securely inject application-level context (like the authenticated user's identity) directly into the database transaction, bypassing the LLM entirely. You can learn more about this in the MCP Toolbox Java SDK &lt;/span&gt;&lt;a href="https://github.com/googleapis/mcp-toolbox-sdk-java?tab=readme-ov-file#why-bind-parameters" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Zero-Config Security with Application Default Credentials (ADC)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Notice there are no hardcoded secrets or JSON keys to manage! By fetching myIdToken using Google's Application Default Credentials (ADC) under the hood, your Java app automatically inherits its secure identity directly from the environment. Whether you are developing locally via the gcloud CLI or running in production, your application stays secure by default with zero manual credential configuration.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Deploying the Fleet to Cloud Run&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Remember those "Day 2" enterprise requirements we mentioned earlier? To successfully handle high concurrency, transactional integrity, and maintain stateful conversations at scale, your architecture needs to be robust. Because the MCP Toolbox for Databases and our Spring Boot Agent are fully decoupled, they can scale independently on Google Cloud Run to meet those exact demands.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;First, you deploy the open-source MCP Toolbox for Databases as its own secure service:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Download the toolbox CLI\r\nexport VERSION=0.27.0\r\ncurl -L -o toolbox https://storage.googleapis.com/genai-toolbox/v$VERSION/linux/amd64/toolbox\r\nchmod +x toolbox&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3601ee0190&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Next, deploy the local toolbox server to Cloud Run:&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Follow instructions &lt;/span&gt;&lt;a href="https://googleapis.github.io/genai-toolbox/how-to/deploy_toolbox" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the official documentation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once the Toolbox is running, it acts as the secure, highly scalable bridge between your database and the outside world.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Next, deploy your Java Spring Boot Agent, injecting the dynamically generated MCP Toolbox URL and your Vertex AI settings as environment variables:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Set up the Cymbal Transit Agent App:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Clone the &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/agents/cymbal-transit" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;repo&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Then, you deploy your Java Spring Boot Agent, injecting the dynamically generated MCP Toolbox URL and your Vertex AI settings as environment variables:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;gcloud run deploy cymbal-transit \
  --source . \
  --allow-unauthenticated \
  --set-env-vars GCP_PROJECT_ID=my-project,GCP_REGION=us-central1,GEMINI_MODEL_NAME=gemini-2.5-flash,MCP_TOOLBOX_URL=https://toolbox-...&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With a single command, your stateful, multi-agent enterprise application is live on Cloud Run, ready to securely orchestrate workflows on behalf of your users!&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The Era of Hardcoded Integrations is Over&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The transition from stateless chatbots to autonomous, transactional agents is the defining technological shift of this decade. But agents are only as powerful as the systems they can securely interact with.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the MCP Toolbox for Databases Java SDK, enterprise developers finally have a native, elegant, and highly scalable way to give their AI agents read-and-write access to the mission-critical systems of record that run their business.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to build your own stateful enterprise agents? &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Explore the official &lt;/strong&gt;&lt;a href="https://github.com/googleapis/mcp-toolbox-sdk-java" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MCP Toolbox Java SDK GitHub Repository&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to get started, and check out the demo application &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/agents/cymbal-transit" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cymbal Bus Agent GitHub Repository&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to explore the complete source code and try it out today!&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 03 Mar 2026 09:29:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/announcing-the-mcp-toolbox-java-sdk/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/MCP_Toolbox_Java_SDK_Launch.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Announcing the MCP Toolbox Java SDK</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/MCP_Toolbox_Java_SDK_Launch.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/announcing-the-mcp-toolbox-java-sdk/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Abirami Sukumaran</name><title>Staff Developer Advocate, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Anubhav 
Dhawan</name><title>Software Engineer, Google</title><department></department><company></company></author></item><item><title>Designing private network connectivity for RAG-capable gen AI apps</title><link>https://cloud.google.com/blog/products/networking/design-private-connectivity-for-rag-ai-apps/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The flexibility of Google Cloud allows enterprises to build secure and reliable architecture for their AI workloads. In this blog we will look at a reference architecture for &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/architecture/private-connectivity-rag-capable-gen-ai"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;private connectivity for retrieval-augmented generation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (RAG)-capable generative AI applications. This architecture is for scenarios where communications of the overall system must use private IP addresses and must not traverse the internet.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The power of RAG&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;RAG is a powerful technique used to optimize the output of large language models (LLMs) by grounding them in specific, authoritative knowledge bases outside of their original training data. RAG allows an application to retrieve relevant information from your documents, datasources, or databases in real time. This retrieved context is then provided to the model alongside the user’s query, helping to ensure that the AI’s responses are accurate, verifiable, and highly relevant to your business. This improves the quality of responses and reduces hallucinations. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This approach is helpful because it allows you to direct generative AI to use a designated source of truth, rather than relying solely on the model's pre-existing knowledge, and without needing to retrain or fine-tune the model itself. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Design pattern example&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To understand how to think about setting up your network for private connectivity for a RAG application in a regional design, let's look at the design pattern.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The setup comprises an &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;external network&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (on-prem and other clouds) and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud environments&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; consisting of a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;routing project&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Shared VPC host project for RAG&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, and three specialized service projects: &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;data ingestion&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;serving&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;frontend&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This design utilizes the following services to provide an end-to-end solution:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/network-connectivity/docs/interconnect/concepts/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Interconnect&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; or &lt;/strong&gt;&lt;a href="https://docs.cloud.google.com/network-connectivity/docs/vpn/concepts/topologies#vpn-overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud VPN&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To securely connect from your on-premises or other clouds to the routing VPC network&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/network-connectivity/docs/network-connectivity-center/concepts/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Network Connectivity Center&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Used as an orchestration framework to manage connectivity between the routing VPC network and the RAG VPC network via VPC spokes and hybrid spokes&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/network-connectivity/docs/router/concepts/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Router&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; In the routing project, facilitates dynamic BGP route exchange between the external network and Google Cloud&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/vpc/docs/private-service-connect"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Private Service Connect&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Provides a private endpoint in the routing VPC network to reach the Cloud Storage bucket for data ingestion without traversing the public internet&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/vpc/docs/shared-vpc"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Shared VPC&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Host project architecture that allows multiple service projects to use a common, centralized VPC network&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Google &lt;/strong&gt;&lt;a href="https://docs.cloud.google.com/armor/docs/cloud-armor-overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Armor&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; and Application &lt;/strong&gt;&lt;a href="https://docs.cloud.google.com/load-balancing/docs/application-load-balancer"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Load Balancer&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Placed in the frontend service project to provide security and traffic management for user interaction&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/security/vpc-service-controls"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;VPC Service Controls&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Creates a managed security perimeter around all resources to mitigate data exfiltration risks&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
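&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the Private Service Connect piece concrete, the sketch below reserves an internal IP address and creates an endpoint for Google APIs (which fronts Cloud Storage) in the routing VPC; the network name and address are placeholders.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;

```shell
# Sketch: reserve an internal IP and create a PSC endpoint for Google
# APIs in the routing VPC. Network name and address are placeholders.
gcloud compute addresses create psc-googleapis-ip \
  --global \
  --purpose=PRIVATE_SERVICE_CONNECT \
  --addresses=10.0.0.5 \
  --network=routing-vpc
gcloud compute forwarding-rules create pscgoogleapis \
  --global \
  --network=routing-vpc \
  --address=psc-googleapis-ip \
  --target-google-apis-bundle=all-apis
```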
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1-rag-gen-ai.max-1000x1000.png"
        
          alt="1-rag-gen-ai"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;The traffic flow &lt;/strong&gt;&lt;/h3&gt;
&lt;h4&gt;&lt;strong style="vertical-align: baseline;"&gt;RAG population flow&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the diagram, the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;green dashed line&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; shows the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;RAG population flow&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, which describes how data travels from data engineers to vector storage.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;From the external network, data travels over Cloud Interconnect or Cloud VPN.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;In the routing projects it uses the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Private Service Connect endpoint&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to get to the Cloud Storage bucket.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;From the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Storage bucket&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in the Data Ingestion service project, the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;data ingestion subsystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; processes the raw data. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The AI model creates vectors from the chunks, returns them to the data ingestion subsystem, which writes them to the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;RAG datastore&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in the serving service project.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong style="vertical-align: baseline;"&gt;Inference flow&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the diagram, the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;orange dashed line&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; shows the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;inference flow&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, which describes customer or user requests.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The request travels over Cloud Interconnect or Cloud VPN to the routing VPC network and then over the VPC spoke to the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;RAG VPC network&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The request reaches the Application Load&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Balancer&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;protected by&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; Cloud Armor&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;; once allowed, it passes it to the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;frontend subsystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The frontend subsystem forwards the request to the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;serving subsystem&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, which augments the prompt with data from the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;RAG datastore&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; and generates a response via the AI model.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The system generates a response via the AI model, and the grounded response is returned along the same path to the requestor.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong style="vertical-align: baseline;"&gt;Management and routing&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the diagram, the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;blue dotted lines&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; represent the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Network Connectivity Center hybrid and VPC spokes&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; that manage the control plane and route orchestration between the routing network and the RAG VPC network. This ensures that routes learned from the external network are appropriately propagated across the environment.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Please read the entire architecture document &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/architecture/private-connectivity-rag-capable-gen-ai"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Private connectivity for RAG-capable generative AI applications&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to understand the specific including IAM permissions, VPC Service Controls, and deployment considerations.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Next steps&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Take a deeper dive into the Cross-Cloud Network, and other guides about generative AI with RAG:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Document set: &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/architecture/rag-reference-architectures"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Generative AI with RAG&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Document: &lt;/span&gt;&lt;a href="https://cloud.google.com/architecture/ccn-distributed-apps-design"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cross-Cloud Network for distributed applications &lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Blog: &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/build-your-first-adk-agent-workforce?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Build Your First ADK Agent Workforce&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Want to ask a question, find out more or share a thought? Please connect with me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/ammett/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Linkedin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 02 Mar 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/networking/design-private-connectivity-for-rag-ai-apps/</guid><category>AI &amp; Machine Learning</category><category>Hybrid &amp; Multicloud</category><category>Developers &amp; Practitioners</category><category>Networking</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/0-rag-hero.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Designing private network connectivity for RAG-capable gen AI apps</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/0-rag-hero.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/networking/design-private-connectivity-for-rag-ai-apps/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ammett Williams</name><title>Developer Relations Engineer</title><department></department><company></company></author></item><item><title>From "Vibe Checks" to Continuous Evaluation: Engineering Reliable AI Agents</title><link>https://cloud.google.com/blog/topics/developers-practitioners/from-vibe-checks-to-continuous-evaluation-engineering-reliable-ai-agents/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;I live through the same story with every single AI agent. After weeks of experiments and tests, it works like a charm. 
Then someone asks a question that the agent fails to answer properly. I rush to fix it by tweaking one of the prompts. After a handful of tweaks, the failing prompt produces good results. I try a few of my favorite prompts, and they still work. Another new question, another perfect hit. I push it to production.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Less than 24 hours later, user reports start trickling in. The agent is hallucinating dates. It fails to cite sources for obscure topics. A little change that felt so solid ended up sabotaging dozens of other use cases that I haven't bothered to verify.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is the &lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;vibe check trap&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The Vibe Check Trap&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the classical software world, if you change a line of code, you run unit tests. The predicate &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;assert 2 + 2 == 4&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; will never statistically drift. Integration tests are more complex and flaky, but they're still largely stable in well-maintained projects. But in the world of Generative AI, we're building software on top of &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;probabilistic&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; foundations. A prompt that works 99% of the time today might work 92% of the time tomorrow just because the underlying model's weight distribution shifted slightly, or because the temperature parameter introduced a new token sequence. A minor change in the prompt or grounding data format might trigger a significant regression in the model's answers.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Relying on &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;vibe checks&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;—manually chatting with the agent to see if it feels right—is a recipe for disaster in production. It's subjective, unscalable, and susceptible to confirmation bias.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This guide is for software engineers who are ready to graduate from building demos to building production-grade AI systems. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this post, we'll explore how to apply the engineering discipline of &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;continuous evaluation&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (CE) for AI agents. With CE, you refine your agent's prompts, tools, and logic by using a combination of production monitoring, automated LLM-as-a-judge scoring, and human feedback. We'll show you how to apply CE using specific tools from the Google Cloud ecosystem: &lt;/span&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Development Kit (ADK)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Gen AI evaluation service&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;1. The Engineering Mindset: Discovery vs. Defense&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To organize our work effectively, we must distinguish between two fundamental modes of AI engineering. In traditional DevOps, these map roughly to Development and QA/Ops, but the distinction is sharper here due to the stochastic nature of large language models (LLMs). In AI engineering, these modes translate to &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;discovery mode&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;defense mode&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Discovery Mode (The Lab)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is the creative phase. You're an explorer.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Activities&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Prompt engineering, tool selection, model selection.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Goal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Raise the ceiling. You want to see if the model is &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;capable&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; of solving a complex reasoning task at least once.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Methodology&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Few-shot iteration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Providing examples in the prompt to guide the model's behavior.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Red teaming&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Actively trying to break the model with adversarial inputs to find edge cases.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Vibe checks&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Yes, here they're useful! They help you build intuition about the model's personality and latency.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Outcome&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;golden prompt&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that works perfectly for your specific reference examples.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Defense Mode (The Factory)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is the industrialization phase. You're a reliability engineer.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Activities&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Regression testing, shadow traffic, monitoring.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Goal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Protect the floor. You want to ensure that the average performance across 10,000 requests meets your Service Level Objectives (SLOs).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Methodology&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Dataset-driven evaluation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Running the prompt against hundreds of diverse examples, not just the three you memorize.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Strict gating&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Automatically failing a build if the grounding score drops below 0.85.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Automated metrics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: No humans involved in the loop.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Outcome&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A deployed system that you can sleep through the night with.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
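A strict gate like the one above can be sketched as a tiny CI check. The 0.85 threshold matches the grounding bar mentioned earlier; the score list is a placeholder for whatever your automated evaluators emit:

```python
# Sketch of a defense-mode quality gate: the build fails when the
# aggregate grounding score drops below the threshold. Scores are
# placeholders a real pipeline would get from automated evaluators.
GROUNDING_THRESHOLD = 0.85

def gate(scores: list[float], threshold: float = GROUNDING_THRESHOLD) -> bool:
    """Passes only when the mean grounding score clears the bar."""
    if not scores:
        return False  # no data is a failure, not a pass
    return sum(scores) / len(scores) >= threshold
```

In CI, a `False` result would typically translate into a non-zero exit code that blocks the deploy.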
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Comparison Table&lt;/span&gt;&lt;/h3&gt;
&lt;div align="left"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table&gt;&lt;colgroup&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Feature&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Discovery mode&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Defense mode&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Primary goal&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Innovation (new capabilities)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Stability (reliability)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Sample size&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;1 to 10 inputs&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;50 to 10,000 inputs&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Evaluation method&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Human eye (vibe check)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Automated evaluators (LLMs/code)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Latency tolerance&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;High (waiting for reasoning)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Low (SLO enforcement)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Cost sensitivity&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Low (development environments)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;High (production scale)&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;The failure mode&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Most teams stay in discovery mode forever. They treat every bug report as a reason to tweak the prompt, push it live, and pray. This creates a game of whack-a-mole where fixing one hallucination causes two more. To exit this trap, we need &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;regression testing&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, which we'll show you how to implement next.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;2. The Reference System: Architecture of a Course Creator&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To demonstrate the defense mode principles concretely, we'll analyze a &lt;/span&gt;&lt;a href="https://github.com/vladkol/agent-evaluation-lab" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Course Creator System&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This isn't a single all-in-one agent, or a monolithic prompt trying to do everything: it's a distributed multi-agent system that's composed of multiple specialized agents. This architecture follows the principle of &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Separation_of_concerns" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;separation of concerns&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Check out the Agent Evaluation Lab repository&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026429a0&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;https://github.com/vladkol/agent-evaluation-lab&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The system is built on &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for serverless scalability and it uses the&lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://a2a-protocol.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent2Agent (A2A) Protocol&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for standardized inter-agent communication.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Agent Roster&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Each agent does its specific piece of work. The researcher collects information, the judge evaluates the collected data, the content builder composes it into a well-structured course, and the orchestrator controls this mighty team! &lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;1. The Researcher (The Hunter)&lt;/span&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Role&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Information retrieval.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tools&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Custom &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;wikipedia_search&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Personality&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Objective, fact-focused.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Input&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A query string (e.g., "history of neural networks").&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Output&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Text of the most relevant Wikipedia page.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Wikipedia Search Tool&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
&lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# agents/researcher/agent.py\r\nfrom wikipedia import page, search\r\n\r\ndef wikipedia_search(query: str) -&amp;gt; str:\r\n    &amp;quot;&amp;quot;&amp;quot;Searches Wikipedia for a given query.&amp;quot;&amp;quot;&amp;quot;\r\n    pages = search(query, results=1)\r\n    if pages:\r\n        return page(pages[0], auto_suggest=False).content\r\n    return &amp;quot;&amp;quot;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026426a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;2. The Judge (The Critic)&lt;/span&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Role&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Quality assurance.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tools&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: None.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Personality&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Strict, pedantic.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Mechanism&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It uses &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;structured output&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;a href="https://docs.pydantic.dev/1.10/usage/models/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Pydantic objects&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) to return a formal verification result.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Why is the judge a separate agent?&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; An agent detecting its own hallucinations is notoriously unreliable. A separate judge agent provides a necessary adversarial check.&lt;/span&gt;&lt;/p&gt;
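The judge's contract can be sketched as a small Pydantic model. The field names below are assumptions for illustration, not the repository's actual schema:

```python
# Illustrative Pydantic schema for the judge's structured verdict.
# Field names are assumptions, not the repository's actual model.
import json
from pydantic import BaseModel

class Verdict(BaseModel):
    passed: bool              # did the research pass verification?
    grounding_score: float    # 0.0 to 1.0, assessed by the judge
    feedback: str = ""        # guidance sent back to the researcher

# Parsing the judge's raw JSON output into a validated object:
raw = '{"passed": false, "grounding_score": 0.6, "feedback": "Claim not found in source."}'
verdict = Verdict(**json.loads(raw))
```

Because the output is a typed object rather than free text, the orchestrator can branch on `verdict.passed` without any string parsing.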
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;3. The Content Builder (The Writer)&lt;/span&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Role&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Synthesis and formatting.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tools&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: None.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Personality&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Creative, educational.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Responsibility&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It takes the raw, verified facts from the researcher and it structures them into a cohesive course module, such as "Introduction", "Chapter 1", etc.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;4. The Orchestrator (The Manager)&lt;/span&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Role&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Workflow management.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Mechanism&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It implements a &lt;/span&gt;&lt;a href="https://google.github.io/adk-docs/agents/workflow-agents/sequential-agents/" rel="noopener" target="_blank"&gt;&lt;code style="text-decoration: underline; vertical-align: baseline;"&gt;SequentialAgent&lt;/code&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Logic&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Call the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;research loop&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="3" style="list-style-type: lower-roman; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Ask the researcher to gather data.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: lower-roman; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Ask the judge to evaluate data.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: lower-roman; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;If judge says "Fail"&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: Send feedback to the researcher (restart the loop).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: lower-roman; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;If the judge says "Pass"&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: Break the loop and continue to the next step.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;li aria-level="2" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Call the content builder to build the comprehensive course content.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Why is the orchestrator a separate agent?&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Using a separate orchestrator agent isolates the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;control flow logic&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; from the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;generation logic&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
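The loop above can be sketched in plain Python. The three callables are hypothetical stand-ins for the A2A calls the real SequentialAgent makes, and the retry limit is an assumption:

```python
# Plain-Python sketch of the orchestrator's control flow. The agent
# callables are hypothetical stand-ins for A2A requests; the round
# limit is an assumption for the sketch.
MAX_RESEARCH_ROUNDS = 3

def run_course_pipeline(topic, researcher, judge, builder):
    feedback = ""
    for _ in range(MAX_RESEARCH_ROUNDS):
        facts = researcher(topic, feedback)   # 1a. gather data
        verdict = judge(facts)                # 1b. evaluate data
        if verdict["passed"]:                 # 1d. "Pass": exit the loop
            return builder(facts)             # 2. build the course
        feedback = verdict["feedback"]        # 1c. "Fail": retry with feedback
    raise RuntimeError("Research loop exhausted without a passing verdict.")
```

Keeping this flow outside the generating agents means you can test the control logic with cheap stubs, no LLM calls required.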
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;The Course-Building Multi-Agent System&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The architecture of the multi-agent system is set up like this:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-multi-agent-architecture.max-1000x1000.png"
        
          alt="vibe-check-multi-agent-architecture"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This multi-agent system has a nice web app (in the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;app&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; folder of the repository). A little service exposes a frontend and calls the orchestrator agent service by using our ADK FastAPI integration. The user request flow looks like this:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--medium
      
      
        h-c-grid__col
        
        h-c-grid__col--4 h-c-grid__col--offset-4
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-user-request-flow.max-1000x1000.jpg"
        
          alt="vibe-check-user-request-flow"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The A2A Protocol Benefits&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://a2a-protocol.org/" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt;Agent2Agent (A2A) Protocol&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; standardizes how these agents communicate with &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;each other&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. Instead of wrapping the researcher as a function call or a generic tool within the orchestrator's prompt (which limits its capabilities), A2A lets the orchestrator interact with the researcher as a full peer service.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This protocol solves the "N × N" integration problem. All agents speak the same language (HTTP + JSON schemas), making the system modular and easy to extend. If we want to replace the researcher with a different implementation, the orchestrator doesn't need to change a single line of code.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This modularity is also the key to our evaluation strategy. Because the agents are loosely coupled services, we don't have to evaluate the entire system at once. Instead, we can target individual components.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Shared Architecture Components&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make this distributed system reliable and observable, we use a set of shared utility components across all of our agents:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;shared/adk_app.py&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: This is the backbone of every agent service. It builds on top of &lt;/span&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ADK &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;and &lt;/span&gt;&lt;a href="https://fastapi.tiangolo.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;FastAPI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. It automatically configures these components:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;A2A middleware&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Handles the exchange of &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;agent cards&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; or self-description, dynamically rewriting URLs to match the current deployment. A2A middleware is useful for &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;shadow revisions&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, where you deploy a new version of your agent to handle simulated traffic from an evaluation pipeline.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://opentelemetry.io/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;OpenTelemetry&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; middleware&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Automatically captures every incoming request as a &lt;/span&gt;&lt;a href="https://opentelemetry.io/docs/specs/otel/trace/api/#span" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Trace Span&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;shared/traced_authenticated_httpx.py&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: A hardened HTTP client for inter-agent communication.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Authentication&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Handles the complexities of Google Cloud service-to-service authentication (OIDC tokens), ensuring zero-trust security between agents.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Trace propagation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Injects the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;traceparent&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; header into every outgoing request. The header context lets Cloud Trace stitch together the graph that shows how the orchestrator called the researcher. We'll discuss that more later, when we take a look at distributed tracing.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code&gt;&lt;strong style="vertical-align: baseline;"&gt;shared/a2a_utils.py&lt;/strong&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: Provides the logic for dynamic &lt;/span&gt;&lt;a href="https://a2a-protocol.org/latest/tutorials/python/3-agent-skills-and-card/#agent-card" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;agent cards&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. In Cloud Run, a service might be accessed through a public URL or through a revision-specific URL. This utility ensures that the agent always tells its peers the correct address to call back.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
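For illustration, here's what the propagated header looks like at the wire level. This stdlib-only sketch builds and parses a W3C `traceparent` value; the real client delegates this to OpenTelemetry's propagator rather than constructing the header by hand:

```python
# Stdlib-only sketch of the W3C traceparent header the shared httpx
# client propagates. The real client uses OpenTelemetry's propagator
# instead of building the header manually.
import re
import secrets

def make_traceparent(trace_id: str = "") -> str:
    """Builds a traceparent value: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"          # 01 = sampled flag

def parse_trace_id(traceparent: str) -> str:
    """Extracts the trace ID so a child span joins the same trace."""
    match = re.fullmatch(
        r"00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}", traceparent
    )
    if match is None:
        raise ValueError("malformed traceparent header")
    return match.group(1)
```

Because every hop reuses the same trace ID while minting a fresh span ID, Cloud Trace can stitch the orchestrator-to-researcher call graph back together.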
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;3. The Evaluation Taxonomy: A Deep Dive&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before we write code, we must define our units of measurement. "Is this agent good?" isn't a valid engineering question. We need to define "good" as testable dimensions. To do that, we can categorize evaluation metrics into a hierarchy of sophistication.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Level 1: Computation-Based Metrics&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These are deterministic checks against a &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Ground_truth" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ground truth&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or rigid rules. They're the closest to traditional software unit and integration tests.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Classic NLP metrics like &lt;/strong&gt;&lt;a href="https://en.wikipedia.org/wiki/ROUGE_(metric)" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;ROUGE&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; and &lt;/strong&gt;&lt;a href="https://en.wikipedia.org/wiki/BLEU" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;BLEU&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Is the result sufficiently similar to the reference answer?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;JSON validity&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: If your agent must output JSON, does it parse?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Prohibited phrases&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Does the output contain "I am an AI language model"?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Exact match or Regex&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: For extraction tasks (e.g., getting a date &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;YYYY-MM-DD&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;), does the output match the pattern?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Execution trajectory (including agent tool trajectory)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Did the agent call certain tools in a particular order with specific parameters? &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
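&lt;p&gt;These checks can be sketched as plain Python functions. The helpers below are illustrative, not part of any SDK, and the trajectory entries are assumed to be dicts with a &lt;code&gt;tool_name&lt;/code&gt; key:&lt;/p&gt;

```python
import json
import re

def check_json_validity(output: str) -> bool:
    """Level 1 check: does the agent's output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def check_prohibited_phrases(output: str, phrases=("I am an AI language model",)) -> bool:
    """Level 1 check: passes only if no prohibited phrase appears."""
    return not any(p.lower() in output.lower() for p in phrases)

def check_date_format(output: str) -> bool:
    """Level 1 check: does an extraction match the YYYY-MM-DD pattern?"""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", output.strip()) is not None

def check_tool_order(trajectory, expected_tools) -> bool:
    """Level 1 check: were the expected tools called in this exact order?"""
    called = [step["tool_name"] for step in trajectory]
    return called == list(expected_tools)
```

Because these checks are deterministic, they slot directly into an ordinary unit-test suite and run without any model calls.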
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Reference-based or Reference-free metrics&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Some metrics require a ground truth reference answer to compare the result to. &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/ROUGE_(metric)" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ROUGE&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/BLEU" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;BLEU&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; are perfect examples of that. Other metrics might evaluate the result on different criteria, such as output format ("Did the agent produce correct JSON?") or prohibited words—they don't need a ground truth answer for comparison.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Level 2: Rubric-Based Metrics&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is the standard for semantic evaluation. We use a powerful LLM-as-a-judge model (like &lt;/span&gt;&lt;a href="https://deepmind.google/models/gemini/pro/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Pro&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) to grade the agent's output. To evaluate agent answers, these metrics use sophisticated, battle-tested prompts that are rigorously maintained.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These are the core concepts that are related to LLM-based evaluation metrics:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Rubrics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The criteria for how to rate the response of an LLM model or application. Basically, it's a composite prompt that can be pre-defined or dynamically generated.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Metrics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A score that measures the model output against the rating rubrics.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Rubric-based metrics&lt;/span&gt;&lt;/em&gt;&lt;span style="vertical-align: baseline;"&gt; incorporate LLMs into these kinds of evaluation workflows:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Adaptive rubrics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Rubrics are dynamically generated for each prompt. Responses are evaluated with granular, explainable pass or fail feedback that's specific to the prompt.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Static rubrics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Rubrics are defined explicitly and the same rubric applies to all prompts. Responses are evaluated with the same set of numerical scoring-based evaluators, with a single numerical score (such as 1 to 5) per prompt. Static rubrics are used when the exact same criteria is required across all prompts (e.g., "Check whether the agent answered the user's question").&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Just like with computation-based metrics, rubric-based metrics might or might not require a ground truth reference. It's critical to choose the right one for your task.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;A. Reference-Free Metrics&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You don't have a specific correct answer, but you rely on general principles of quality. Use this approach for open-ended generation like emails, poems, or generic advice. Examples of reference-free metrics include these:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Response quality&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A comprehensive and adaptive rubrics metric that evaluates the overall quality of an agent's response as follows:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;It automatically generates a broad range of criteria based on the agent configuration (developer instruction and declarations for tools that are available to the agent) and the user's prompt.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Then it assesses the generated criteria based on tool usage in intermediate events and the final answer by the agent.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Coherence&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Evaluates whether the text is logical and grammatically correct.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Safety&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Evaluates whether the text violates safety policies, such as by inclusion of hate speech or personally identifiable information (PII).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Instruction-following&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A targeted and adaptive rubrics metric that measures how well the response adheres to the specific constraints and instructions that are given in the prompt.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;B. Reference-Based Metrics&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You have a &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;golden answer&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and you want to ensure that the agent's response conveys the same meaning, even if it's phrased differently. If the reference is "Paris" and the agent says "Capital of France, which is Paris", a regex might fail, but an LLM judge will pass it. A reference-based metric checks for a &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;response match&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to determine whether the answer matches the reference response or ground truth.&lt;/span&gt;&lt;/p&gt;
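&lt;p&gt;As a sketch, a reference-based judge boils down to a grading prompt that pairs the reference with the candidate answer. The template below is purely illustrative; a managed metric uses its own calibrated prompt:&lt;/p&gt;

```python
def build_response_match_prompt(prompt: str, reference: str, response: str) -> str:
    """Assembles a hypothetical LLM-judge prompt for reference-based matching.

    The wording here is an illustration only; production judges use
    carefully calibrated, benchmarked templates.
    """
    return (
        "You are an impartial grader. Decide whether the candidate answer "
        "conveys the same meaning as the reference answer, even if it is "
        "worded differently. Reply with PASS or FAIL and a one-sentence reason.\n\n"
        f"Question: {prompt}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {response}\n"
    )
```

The resulting string would be sent to the judge model; the semantic comparison is what lets "Capital of France, which is Paris" pass against the reference "Paris".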
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Level 3: Vertex AI Managed Metrics&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google's &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Gen AI evaluation service&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides pre-built, calibrated models called &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;autoraters&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; for these dimensions. Autoraters are superior to creating your own judge prompt because they're benchmarked against human raters and they're maintained by Google. These are a few examples of autoraters:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;GROUNDING&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: The most critical metric for RAG. It takes &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;context&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; + &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;response&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and checks whether the response is fully supported by the context. It assigns a score from 0 to 1.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;SAFETY&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: Automatically flags hate speech, harassment, and dangerous content.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;TOOL_USE_QUALITY&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: Specifically tailored for agents to evaluate whether the agent made an appropriate tool call and whether the argument was correct. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;This metric doesn't require comparison to a tool call reference that's considered correct&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. Instead, the evaluator makes a judgment based on the tool description, the agent description, and the context of the conversation. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more information, see the complete list of &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/rubric-metric-details"&gt;&lt;span style="vertical-align: baseline;"&gt;managed rubric-based metrics in Vertex AI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Static and Adaptive Rubrics&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Static rubrics like "Rate helpfulness 1-5" suffer from high variance. &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Adaptive rubrics&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; solve that issue by dynamically generating a test case for &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;each prompt&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Rubric generation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The system analyzes the user prompt and reference.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Prompt&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: "Compare the battery life of Pixel 9 and iPhone 16."&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;System generates criteria&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Criteria 1: Mentions Pixel 9 mAh?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Criteria 2: Mentions iPhone 16 video playback hours?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Criteria 3: Is neutral?&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Rubric grading&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The judge checks these boolean conditions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
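&lt;p&gt;The two-step workflow can be approximated in a few lines. Here simple keyword predicates stand in for the LLM judge, and the criteria names are hypothetical:&lt;/p&gt;

```python
# Illustrative sketch of adaptive-rubric grading: each generated criterion
# becomes a boolean check, and grading yields a per-criterion report card.
# A real system generates the criteria with an LLM; keyword predicates
# stand in for the judge here.
CRITERIA = {
    "mentions_pixel_9_mah": lambda r: "mAh" in r and "Pixel 9" in r,
    "mentions_iphone_16_playback": lambda r: "iPhone 16" in r and "hours" in r,
}

def grade_with_rubric(response: str, criteria=CRITERIA) -> dict:
    """Returns a pass/fail verdict per criterion plus an overall score."""
    report = {name: check(response) for name, check in criteria.items()}
    report["score"] = sum(report.values()) / len(criteria)
    return report
```

The per-criterion booleans are what make the result an explainable report card rather than a single opaque number.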
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This turns a subjective vibe check into an objective report card that explains &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;exactly&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; what was missing. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gen AI evaluation service&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides a comprehensive set of metrics that are based on static and adaptive rubrics. You can also create your own adaptive rubrics by using the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GenAI Client in Vertex AI SDK&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;4. The Fuel: Designing Your Evaluation Dataset&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Garbage in, garbage out. Your evaluation is only as good as your dataset. A proper evaluation dataset is a collection of examples (rows). In our system, we use a JSON-based format where columns represent different inputs and expected outputs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The following is an actual example from our &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;evaluator/eval_data_researcher.json&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; dataset. It's structured as columns for efficient &lt;/span&gt;&lt;a href="https://pandas.pydata.org/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;pandas&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; loading:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;{
    "prompt": {
        "0": "History of Rome",
        "1": "Pythagorean theorem"
    },
    "reference": {
        "0": "# The History of Rome\n\nThe history of Rome spans over 2,500 years...",
        "1": "## The Pythagorean Theorem: A Cornerstone of Geometry..."
    },
    "reference_trajectory": {
        "0": [
            {
                "tool_name": "wikipedia_search",
                "tool_input": { "query": "History of Rome" }
            }
        ],
        "1": [
            {
                "tool_name": "wikipedia_search",
                "tool_input": { "query": "Pythagorean theorem" }
            }
        ]
    }
}&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
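&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;For illustration, the column-to-row pivot that pandas performs on this layout can be sketched with the standard library alone (the function name is ours, not part of the repository):&lt;/p&gt;&lt;/div&gt;

```python
import json

def load_eval_rows(path: str) -> list:
    """Pivots a column-oriented eval dataset ({"column": {"row_id": value}})
    into a list of row dicts, mirroring what pandas.read_json builds."""
    with open(path) as f:
        columns = json.load(f)
    # Row IDs come from the first column; every column shares the same IDs.
    row_ids = sorted(next(iter(columns.values())))
    return [
        {col: values.get(row_id) for col, values in columns.items()}
        for row_id in row_ids
    ]
```

Each returned row then carries a `prompt`, `reference`, and `reference_trajectory`, ready to feed into the evaluation runner.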
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Components of the Dataset&lt;/span&gt;&lt;/h3&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;prompt&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The input to the agent.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Example&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: "What is the return policy for item #123?"&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;reference&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt; (optional)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The ideal answer (for reference-based metrics).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Example&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: "Item #123 can be returned within 30 days."&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;reference_trajectory&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt; (optional)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The simple gold standard for tool usage.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;This option lets us verify whether our agent is thinking correctly. If the prompt asks for "Population of Tokyo" and the trajectory shows a call to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;get_weather("Tokyo")&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, the agent has failed fundamentally, even if it hallucinates the correct population number.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Best practice&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Store this dataset in BigQuery or in a JSON file in Cloud Storage. Treat it like source code. Version it.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Just before we call the agent with these inputs, we'll add one more column with the same value for every input prompt: &lt;/span&gt;&lt;code&gt;&lt;span style="vertical-align: baseline;"&gt;session_inputs&lt;/span&gt;&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, a simple structure with the evaluation user ID, the agent name, and an empty state dictionary.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;5. The Implementation: Building the Evaluation Engine&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When we have a dataset and metrics, we need an engine to drive the tests. A simple Python script isn't enough; we need to replicate the scale of production. To accomplish that, we build an evaluation runner (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;evaluate_agent.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;) that uses the GenAI Client in Vertex AI SDK.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;A. Parallel Inference&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agent operations are slow relative to typical API calls. A multi-step reasoning task might take 15 seconds. If our &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;golden dataset&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; has 500 examples, running them sequentially would take 2 hours. We use Python's &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;asyncio&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to run many concurrent requests against a shadow revision.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;This approach reveals another benefit of evaluating agents that are deployed to Cloud Run: unlike your developer machine, it can scale to serve parallel requests. Faster evaluation enables faster iterations.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;shared/evaluation/evaluate.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; module, we implement a throttled parallel runner:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;async def run_parallel_inference(client, prompts, shadow_url):
    tasks = []
    # Semaphore to prevent DDOSing our own service or hitting Rate Limits
    # We limit to 10 concurrent requests to match our Cloud Run capacity
    sem = asyncio.Semaphore(10)

    for prompt in prompts:
        # Each task runs the full HTTP Post -&gt; SSE Stream -&gt; Events Capture
        tasks.append(_run_inference(sem, client, shadow_url, prompt))

    return await asyncio.gather(*tasks)&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;code style="vertical-align: baseline;"&gt;_run_inference&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; calls the &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;ADK server API endpoint&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;POST /run_sse&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; of the target agent. This call initiates a streaming session where the agent pushes events while it thinks. The events are captured and processed to store the final answer and a list of the intermediate events—the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;reasoning trace&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;B. Reasoning Trace Capture&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If the agent fails, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;why&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; did it fail? In standard evaluation, you only see the final answer. In agentic evaluation, we need the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;reasoning trace&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; or execution history. It includes a list of events that occurred during the agent's execution. Each event has a type and a payload. The payload contains the event's content and metadata. The most interesting events are the tool calls. They include tool call requests from the LLM (with parameter values), and tool call responses from the tools (with return values).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We pass this entire trace to the Gen AI evaluation service. This allows for questions like: "&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Did the agent hallucinate the number 14 million, or did the tool actually return it?&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;"&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We also use this trace for tool trajectory evaluation, which we describe later in this post.&lt;/span&gt;&lt;/p&gt;
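&lt;p&gt;A simple grounding probe over the captured trace might look like the sketch below; the event shape (a &lt;code&gt;type&lt;/code&gt; and a &lt;code&gt;payload&lt;/code&gt; key) is an assumption for illustration:&lt;/p&gt;

```python
def value_is_grounded(intermediate_events, claim: str) -> bool:
    """Checks whether a claimed value appears in any tool response
    captured in the reasoning trace.

    If the claim never shows up in a tool response, the agent likely
    hallucinated it rather than reading it from a tool.
    """
    for event in intermediate_events:
        if event.get("type") == "tool_response" and claim in str(event.get("payload", "")):
            return True
    return False
```
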
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;C. Final Evaluation Dataset&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;After we run the agent for every prompt, we add two more columns to the evaluation dataset. We already have &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;prompt&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;reference&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (optional ground truth answer), &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;reference_trajectory&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (optional ground truth for tool calls and their parameters), and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;session_inputs&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. After inference, we add these columns: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;response&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: The actual final response of the agent.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;intermediate_events&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;: All events that preceded the final response, including tool calls.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;D. Runtime Schema Integration&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To verify whether the agent used tools correctly, the evaluation scorer can leverage the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;tool definition&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. If we hardcode this definition in our test suite, it will drift from the actual code. Instead, we expose an &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;/agent-info&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; endpoint on every agent, and the evaluator fetches it at runtime.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;# Fetch live schema from the running service
agent_info_response = await httpx_client.get(f"{agent_api_server}/apps/{agent_name}/agent-info")
agent_info = types.evals.AgentInfo.model_validate_json(agent_info_response.content)

# Create the Evaluation Run in Vertex AI
evaluation_run = client.evals.create_evaluation_run(
    dataset=agent_dataset_with_inference,
    agent_info=agent_info,  # Contains the LIVE tool definitions/schema
    metrics=metrics,
    dest=evaluation_storage_uri
)&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This implementation ensures that if you add a new tool parameter in your code, the evaluation automatically knows about it without manual updates to the test suite.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;6. Custom Function Metrics Deep Dive: Tool Trajectory Evaluation&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For many agents, tool usage isn't optional; it's a mandatory part of their flow. Evaluating general tool usage quality isn't enough. You need strictly defined business rules.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Rule 1&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: "Wikipedia tool must always be called."&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Rule 2&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: "It must be called with the correct search request."&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We can enforce these rules with custom metrics in the Gen AI evaluation service. We write a Python function, and the service executes it in a secure sandbox against every row of our evaluation dataset.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Implementing Tool Trajectory Metrics&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In our &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;shared/evaluation/tool_metrics.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; module, we implemented multiple custom metrics for tool trajectory evaluation. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Trajectory precision&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The agent called 5 tools. 3 were useful, 2 were noise. Precision = 3/5.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Trajectory recall&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The task required checking Database &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;and&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; Wiki. The agent only checked Wiki. Recall = 0.5.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Exact order match&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The agent called the right tools in the specified order, without calling any other tools.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;In-order match&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The agent called the right tools in the specified order, even if other tools were called in between.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Any-order match&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The agent called the right tools in any order.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The core logic relies on comparing the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;predicted trajectory&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (what the agent did) against the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;reference trajectory&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (what we wanted it to do).&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;1. The Reference (From Dataset)&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;reference_trajectory&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; looks like a clean list of expected calls:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;[\r\n  {\r\n    &amp;quot;tool_name&amp;quot;: &amp;quot;wikipedia_search&amp;quot;,\r\n    &amp;quot;tool_input&amp;quot;: { &amp;quot;query&amp;quot;: &amp;quot;History of Rome&amp;quot; }\r\n  }\r\n]&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3602642730&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;2. The Event Trace (From Agent)&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We use captured &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;intermediate_events&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to extract the actual function calls so that we can compare them to the reference trajectory.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;// One event in the stream\r\n{\r\n  &amp;quot;content&amp;quot;: {\r\n    &amp;quot;parts&amp;quot;: [{ &amp;quot;function_call&amp;quot;: {\r\n        &amp;quot;name&amp;quot;: &amp;quot;wikipedia_search&amp;quot;,\r\n        &amp;quot;args&amp;quot;: { &amp;quot;query&amp;quot;: &amp;quot;History of Rome&amp;quot; }\r\n    }}]\r\n  }\r\n}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026428e0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The helper function &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;_get_tool_calls&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; extracts the list from the trace and compares it to the reference.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Executing Custom Function Metrics in Vertex AI&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These custom metrics require running Python code. Where does the code run? Aren't we using Vertex AI for the evaluation?&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Yes, Gen AI evaluation service takes care of running that code. The service expects Python code with an &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;evaluate&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;def evaluate(\r\n    instance: dict\r\n) -&amp;gt; float:\r\n    ...&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3602642580&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our functions have other functions that they depend on, so we package the module's source code with them and we construct an extra &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;evaluate&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function to make a call. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import inspect\r\nmodule_source = inspect.getsource(\r\n    inspect.getmodule(metrics_function)\r\n)\r\nmodule_source += (\r\n    &amp;quot;\\n\\ndef evaluate(instance: dict) -&amp;gt; float:\\n&amp;quot;\r\n    f&amp;quot;    return {metrics_function.__name__}(instance)\\n&amp;quot;\r\n)\r\nreturn types.EvaluationRunMetric(\r\n    metric=metric_name,\r\n    metric_config=types.UnifiedMetric(\r\n        custom_code_execution_spec=types.CustomCodeExecutionSpec(\r\n            remote_custom_function=module_source\r\n        )\r\n    )\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36026420d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We package these functions using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;CustomCodeExecutionSpec&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and send them to Vertex AI for sandboxed execution. This approach lets us combine the flexibility of custom Python code with the massive scale of managed evaluation.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;7. Strategy: Shadow Deployments &amp;amp; Safe Rollouts&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The most common fear in deploying AI agents is: "If I change the prompt, will it break for 10% of users?" This fear can paralyze teams. To solve this issue, we borrow a technique from standard microservice engineering: &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;shadow deployments&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; or &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;dark canaries&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Concept&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of replacing the live version of your agent, you deploy a new version alongside it.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Live revision&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Serves 100% of user traffic.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Shadow revision&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Serves 0% of user traffic but handles simulated traffic from your evaluation pipeline.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This decoupling of &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;deployment&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (code on server) from &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;release&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; (users see code) lets you test in the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;exact&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; production environment—same network, same secrets, same latency characteristics—without risk.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Run Implementation&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Run makes implementation trivial. Every deployment creates a revision. We can assign a tag to a revision to give it a unique URL. We use the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Git commit's short SHA&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; as the tag (e.g., &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sha-a1b2c3d&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;). This tag creates an immutable link between your source code and your running service. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In our &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;deploy.sh&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, we use the following logic to deploy a shadow revision:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# 1. Capture the commit SHA\r\nexport COMMIT_SHA=$(git rev-parse --short HEAD)\r\nexport REVISION_TAG=&amp;quot;sha-${COMMIT_SHA}&amp;quot;\r\n# 2. Deploy with --no-traffic\r\n# This tells Cloud Run: &amp;quot;Start the container, but don\&amp;#x27;t route public requests here.&amp;quot;\r\ngcloud run deploy researcher \\\r\n  --image gcr.io/${GOOGLE_CLOUD_PROJECT}/researcher:latest \\\r\n  --region us-central1 \\\r\n  --no-traffic \\\r\n  --tag &amp;quot;${REVISION_TAG}&amp;quot;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3602642a60&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Result&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Public URL&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;https://researcher-xyz.run.app&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (unchanged, safe).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Shadow URL&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;https://&lt;strong style="vertical-align: baseline;"&gt;sha-a1b2c3d---&lt;/strong&gt;researcher-xyz.run.app&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (new, testing ground). Three dashes &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;---&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;separate the revision part.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When a user hits the service's public URL, Cloud Run distributes traffic among the revisions configured to serve it. The new shadow revision serves no requests unless it's called through its revision-specific URL. You can view and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/managing/revisions"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;manage revisions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the Google Cloud console.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
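Because tagged revision URLs follow a fixed pattern, the evaluation pipeline can derive the shadow URL from the service URL and the commit tag. A small sketch (the function name is ours, not from the repository):

```python
def shadow_url(service_url: str, revision_tag: str) -> str:
    """Build a Cloud Run tagged-revision URL: tag, three dashes, then the host."""
    scheme, host = service_url.split("://", 1)
    return f"{scheme}://{revision_tag}---{host}"
```

For example, `shadow_url("https://researcher-xyz.run.app", "sha-a1b2c3d")` yields the shadow URL shown above.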
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-revisions.max-1000x1000.jpg"
        
          alt="vibe-check-revisions"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Your continuous evaluation pipeline then targets this shadow URL. If the shadow revision metrics pass, we can run a promotion command that makes the successful revision serve the traffic:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Promotion command (only run after evaluation passes)\r\ngcloud run services update-traffic researcher --to-tags ${REVISION_TAG}=100&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f3602642d90&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It doesn't have to always be 100% to a single revision. No matter how much we test and evaluate our code, mistakes happen. Instead of switching all at once, you might want to &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;gradually migrate traffic&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; between revisions.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;8. Analyzing Evaluation Results&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When the pipeline breaks, developers don't dig through text logs. Using the Run ID from the build log, they can pull the full report.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;from google.genai import types as genai_types\r\nfrom vertexai import Client\r\n# Initialize SDK\r\nclient = Client(\r\n    project=GOOGLE_CLOUD_PROJECT,\r\n    location=GOOGLE_CLOUD_REGION,\r\n    http_options=genai_types.HttpOptions(api_version=&amp;quot;v1beta1&amp;quot;),\r\n)\r\n\r\nevaluation_run = client.evals.get_evaluation_run(\r\n    name=EVAL_RUN_ID,\r\n    include_evaluation_items=True\r\n)\r\nevaluation_run.show()&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f36006bc730&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Acting on Evaluation Results&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If the evaluation fails, the build fails. The build fails the pipeline. The pipeline fails the commit. The commit fails the PR. The PR fails the merge. The merge fails the release. The release fails the deployment. The users don't use the failed code.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, how can we understand why it failed? Let's take a closer look at an example of the evaluation run.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The request was &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;History of Rome&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. The researcher provided the content of the Wikipedia page &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/History_of_Rome" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt;History of Rome&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. The content seemed good to the judge, but the final &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;hallucination&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; metric was too low.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-eval-01.max-1000x1000.jpg"
        
          alt="vibe-check-eval-01"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The reason is because the final course that was built by the content builder contained facts that weren't present in the Wikipedia page. By looking at the reasoning trace, we can see that the researcher used the Wikipedia page as a source of information.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;However, the content builder was too creative about the course content. The content builder used a Gemini model that certainly knows a lot about Rome, so it enhanced the course with facts that weren't present in the Wikipedia page.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;How do we fix that? Let's tell the content builder to stick to the facts that are provided by the researcher.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-eval-02.max-1000x1000.jpg"
        
          alt="vibe-check-eval-02"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;And voilà! The very next run produced a perfect evaluation score.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;What's important is that the rest of the metrics are still good. We found a problem and we made changes to fix it, but the whole system stayed intact.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;9. Automating the Loop: The CI/CD Pipeline&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Finally, we operationalize this solution by using &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/build/docs"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Build&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. The goal is a &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;quality firewall&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that ensures bad code can't reach production users. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Our &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;.cloudbuild/cloudbuild.yaml&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; orchestrates the lifecycle:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Build&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Docker builds the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;researcher&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; image.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Push to Artifact Registry.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Deploy shadow&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;gcloud run deploy ... --tag=sha-${COMMIT_SHA} --no-traffic&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Evaluate&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Run &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;python -m evaluator.evaluate_agent&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The script targets the shadow URL.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;It uploads results to Vertex AI.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The gate&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It checks &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;if metric_score &amp;lt; THRESHOLD&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. If &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;true&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, it exits with error code 1, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;failing the build&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Promote &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;(only runs if evaluate passes):&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;gcloud run services update-traffic ... --to-tags sha-${COMMIT_SHA}=100&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/ol&gt;&lt;/div&gt;
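The gate in the Evaluate step reduces to a few lines. A hypothetical sketch (threshold and metric names are illustrative, not the actual evaluator script):

```python
import sys

def gate(metric_scores: dict, threshold: float = 0.8) -> bool:
    """Return True only if every metric meets the threshold."""
    failed = {name: score for name, score in metric_scores.items()
              if score < threshold}
    for name, score in failed.items():
        print(f"FAIL: {name} = {score:.2f} < {threshold}")
    return not failed

if __name__ == "__main__":
    scores = {"hallucination": 0.95, "tool_trajectory_in_order_match": 1.0}
    # Exit code 1 fails the Cloud Build step, which blocks the Promote step.
    sys.exit(0 if gate(scores) else 1)
```

Because Cloud Build runs steps sequentially and stops on a non-zero exit code, this single exit status is all that is needed to keep a failing revision in the shadow.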
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-continuous-eval.max-1000x1000.jpg"
        
          alt="vibe-check-continuous-eval"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;10. Distributed Tracing with OpenTelemetry&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Debugging a single monolithic LLM call might be easy. Debugging a distributed system of multiple agents, each making its own LLM calls and tool executions, is exponentially harder. If the orchestrator gives a wrong answer, was it bad logic in the orchestrator? Did the researcher return bad data? Or did a network timeout cause a fallback? To answer these questions, logging isn't enough. We need &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;distributed tracing&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We use &lt;/span&gt;&lt;a href="https://opentelemetry.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OpenTelemetry (OTel)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to instrument every part of the stack, capturing the entire lifecycle of a request as a Trace graph.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Waterfall View in Cloud Trace&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By integrating with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/trace/docs"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Trace&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, we get a visual waterfall of every operation.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Root span&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The initial request to the web app's backend and to the orchestrator.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Child spans&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Cross-service A2A requests to other agents, LLM invocations, and tool executions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Trace graph lets us see the system's physical execution alongside the logical reasoning.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-trace-graph.max-1000x1000.png"
        
          alt="vibe-check-trace-graph"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Enabling End-to-End Tracing with Shared Components&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;ADK comes with built-in OpenTelemetry support. However, to get a truly unified view across our microservices, we enhanced it with our shared components:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;shared/adk_app.py&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Wraps the standard ADK &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;FastAPI&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; app.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Adds &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;OpenTelemetryMiddleware&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to automatically start a trace span for every incoming HTTP request.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Correctly extracts the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;traceparent&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; context from incoming headers, connecting this agent's work to the caller's trace.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;shared/traced_authenticated_httpx.py&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;When an agent calls another agent (e.g., orchestrator -&amp;gt; researcher), we must propagate the trace ID.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;This custom client injects the OTel &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;traceparent&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; header into outgoing requests.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/ol&gt;
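The shared components above handle context propagation automatically via OpenTelemetry. To make the mechanics concrete, here is a minimal, stdlib-only sketch of what propagating a W3C <code>traceparent</code> header involves; the field layout follows the W3C Trace Context format, while the helper names are ours, not part of ADK or OTel:

```python
import secrets

def parse_traceparent(header):
    """Split a W3C traceparent header ("version-traceid-spanid-flags")
    into its four fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "parent_span_id": parent_span_id, "flags": flags}

def child_traceparent(incoming):
    """Build the header an agent sends downstream: same trace_id
    (so all spans join one trace), fresh span_id for the new hop."""
    ctx = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex chars
    return "-".join([ctx["version"], ctx["trace_id"], new_span_id, ctx["flags"]])

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

In the real system, <code>OpenTelemetryMiddleware</code> does the extraction and the traced HTTPX client does the injection; this sketch only shows why the same <code>trace_id</code> must survive every hop.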
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;System Traces vs. Reasoning Traces&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It's crucial to distinguish between the two types of traces that we discuss in this post:&lt;/span&gt;&lt;/p&gt;
&lt;div align="left"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table&gt;&lt;colgroup&gt;&lt;col/&gt;&lt;col/&gt;&lt;col/&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Feature&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Reasoning trace (intermediate events)&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;OpenTelemetry trace (system trace)&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Source&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The agent's thought process (SSE stream).&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The code's execution path (FastAPI, HTTPX, ADK).&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Content&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;"I should search for ...", tool definition, tool output.&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Latency, HTTP status codes, service errors.&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Goal&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent observability&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Did the agent make the right plan? What parameters did it use?&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;System observability&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: What service was called? Which service failed? Was it slow?&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Storage&lt;/strong&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Vertex AI Gen AI evaluation service (JSON).&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td style="vertical-align: top; border: 1px solid #000000; padding: 16px;"&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Trace (waterfall UI).&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Reasoning traces give you visibility into the cognitive process of your agent, while system traces show how API requests flow through your system.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Debugging Non-Deterministic Systems&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The combination of these types of traces is your superpower. When an evaluation fails (e.g., "grounding score &amp;lt; 0.5"), you look at the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;reasoning trace&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to see &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;what&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; the model thought. If the reasoning looks correct but the result is wrong (e.g., a tool error), you switch to the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;OpenTelemetry trace&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; in Cloud Trace. You might find that the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;wikipedia_search&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; timed out after 5000ms, causing the model to hallucinate an answer because it lacked data.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Without this x-ray vision insight into both the cognitive and physical layers of your system, you're debugging in the dark.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Conclusion&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Stop messing with the vibe checks. Instead, use the power of &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;evaluated intelligence&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. Building reliable AI agents requires a shift in mindset from discovery to defense. By implementing &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;continuous evaluation (CE)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, we treat agentic systems with the rigor that they deserve.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;This post explores concepts from the codelab &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/codelabs/production-ready-ai-roadshow/2-evaluating-multi-agent-systems/evaluating-multi-agent-systems" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;From "vibe checks" to data-driven Agent Evaluation&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;. To run the code yourself, check out the codelab.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Resources and Links&lt;/span&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/run/docs"&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; runs and scales your AI agents, isolates failure domains, and enables zero-risk shadow deployments.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview"&gt;&lt;span style="vertical-align: baseline;"&gt;Vertex AI Evaluation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides the managed metrics, the adaptive rubrics, and the compute scaling to run them without managing infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/build/docs"&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Build for CI/CD&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; creates a quality firewall, which helps guarantee that no regression goes unnoticed.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/trace/docs"&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Trace&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides the ability to capture and visualize the entire request lifecycle, from the initial HTTP request through a cascade of cross-services calls, sub-agent invocations, and LLM calls, to the final response.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Connect with Us&lt;/span&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Vlad Kolesnikov → &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/vkolesnikov/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Linkedin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://x.com/vladkol" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><pubDate>Fri, 27 Feb 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/from-vibe-checks-to-continuous-evaluation-engineering-reliable-ai-agents/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-hero.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>From "Vibe Checks" to Continuous Evaluation: Engineering Reliable AI Agents</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/vibe-check-hero.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/from-vibe-checks-to-continuous-evaluation-engineering-reliable-ai-agents/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Vlad Kolesnikov</name><title>Developer Relations Engineer</title><department></department><company></company></author></item><item><title>Give your agentic chatbots a fast and reliable long-term memory</title><link>https://cloud.google.com/blog/topics/developers-practitioners/improve-chatbot-memory-using-google-cloud/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;When scaling conversational agents, the data layer design often determines success or failure. To support millions of users, agents need conversational continuity — the ability to maintain responsive chats while preserving the context backend models need.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;This article covers how to use Google Cloud solutions to solve two data challenges in AI: fast context updates for real-time chat, and efficient retrieval for long-term history. We’ll share a polyglot approach using Redis, Bigtable, and BigQuery that ensures your agent retains detail and continuity, from recent interactions to months-old archives.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Polyglot storage approach for short, mid, and long-term history&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_-_Polyglot_Persistence_Layer.max-1000x1000.png"
        
          alt="1 - Polyglot Persistence Layer"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;What is a polyglot approach?&lt;/span&gt;&lt;/h3&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;A polyglot approach uses a multi-tiered storage strategy that leverages several specialized data services rather than a single database to manage different data lifecycles. This allows an application to use the specific strengths of various tools—such as in-memory caches for speed, NoSQL databases for scale, blob storage for unstructured artifacts, and data warehousing for analytics—to handle the "temperature" and volume of data effectively.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Define a polyglot approach on Google Cloud for short, mid, and long-term memory&lt;/span&gt;&lt;/h3&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;To maintain conversational continuity, you can implement this polyglot approach using Memorystore for Redis for sub-millisecond "hot" context retrieval, Cloud Bigtable as a petabyte-scale system of record for durable history, and BigQuery for long-term archival and analytical insights, with Cloud Storage handling unstructured multimedia and an asynchronous pipeline built using Pub/Sub and Dataflow.&lt;/span&gt;&lt;/p&gt;
&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;1. Short-term memory: Memorystore for Redis&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Users expect chat histories to load instantaneously, whether they are initiating a new chat or continuing a previous conversation. For context of a conversation, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Memorystore for Redis&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; serves as the primary cache. As a fully managed in-memory data store, it provides the sub-millisecond latency required to maintain a natural conversational flow. Since chat sessions are incrementally growing lists of messages, we store history using &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Redis Lists&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. By using the native &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;RPUSH&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; command, the application transmits only the newest message, avoiding the network-heavy "read-modify-write" cycles found in simpler stores like Memcached.&lt;/span&gt;&lt;/p&gt;
&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;2. Mid-term memory: Cloud Bigtable&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;As the conversation grows over time, the agentic applications need to account for larger and longer term storage of a growing chat history. This is where &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Bigtable&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; acts as the durable mid-term store and the definitive system of record for all chat history. Bigtable is a petabyte-scale NoSQL database designed specifically for high-velocity, write-heavy workloads, making it perfect for capturing millions of simultaneous chat interactions. While it handles massive data volumes, teams can keep the active cluster lean by implementing garbage collection policies&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;— retaining, for example, only the last 60 days of data in the high-performance tier. To make lookups fast, we use a key strategy with a &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;user_id#session_id#reverse_timestamp&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; pattern. This co-locates all messages from a single session, allowing for efficient range scans to retrieve the most recent messages for history reloads.&lt;/span&gt;&lt;/p&gt;
&lt;h4 role="presentation" style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;3. Long-term memory and analytics: BigQuery&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;For archival and analytics, data moves to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;BigQuery&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, representing the long-term memory of the system. While Bigtable is optimized for serving the live application, BigQuery is Google's premier serverless data warehouse designed for complex SQL queries at scale. This allows teams to go beyond simple logging and derive analytical insights. Ultimately, this operational data becomes a feedback loop for improving the agent and user experience without impacting the performance of the user-facing components.&lt;/span&gt;&lt;/p&gt;
&lt;h4 role="presentation" style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;4. Artifact storage: Cloud Storage (GCS)&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Unstructured data such as multimedia files — whether uploaded by a user for analysis or generated by a generative model — live in &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Storage&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, which is purpose built for unstructured artifacts. We utilize a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;pointer strategy&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; where Redis and Bigtable records contain a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;URI pointer&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (e.g., &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;gs://bucket/file&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;) to the object. To maintain security, the application serves these files using &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;signed URLs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, providing the client with time-limited access without exposing the bucket publicly.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;A hybrid sync-async strategy for optimal flow of data&lt;/span&gt;&lt;/h3&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;As shown in the sequence diagrams below, the hybrid sync-async strategy utilizes the abovementioned storage solutions to balance high-speed consistency with durable data persistence.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;The diagram below shows how a user message and corresponding agent response traverse through the architecture:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_-_Sequence_Diagram.max-1000x1000.png"
        
          alt="2 - Sequence Diagram"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The diagram below shows how data flows across the architecture when a user decides to retrieve chat history for a particular session:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_-_History_Seq_Diagram.max-1000x1000.png"
        
          alt="3 - History Seq Diagram"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Start building now&lt;/span&gt;&lt;/h3&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to build an agent with a robust persistence layer?&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Build agents quickly&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Start prototyping your agentic workflows on&lt;/span&gt; &lt;a href="https://cloud.google.com/products/agent-builder"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Agent Builder&lt;/span&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Configure your cache&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Determine which&lt;/span&gt; &lt;a href="https://cloud.google.com/memorystore/docs/redis/redis-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Memorystore for Redis configuration&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; best suits your latency and availability needs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Design a robust BigTable schema&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Review the schema design &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/bigtable/docs/schema-design"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;best practices&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Bridge to analytics&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Use the&lt;/span&gt; &lt;a href="https://docs.cloud.google.com/bigtable/docs/change-streams-to-bigquery-quickstart"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Bigtable change stream to BigQuery template&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to ready your live chat logs for actionable business insights.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;Bring data to life with analytics: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Use&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;a href="https://docs.cloud.google.com/looker/docs/conversational-analytics-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Looker Conversational Analytics&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to drive product decisions through business intelligence.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;</description><pubDate>Fri, 27 Feb 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/improve-chatbot-memory-using-google-cloud/</guid><category>Developers &amp; Practitioners</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Give your agentic chatbots a fast and reliable long-term memory</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/improve-chatbot-memory-using-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Aishwarya Prabhat</name><title>AI Solutions Acceleration Architect</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yun Pang</name><title>Principal Architect</title><department></department><company></company></author></item></channel></rss>