Serverless

AI Studio unlocks full-stack vibe coding with Cloud Run, Firebase, and Cloud SQL, no credit card required

Thu, 21 May 2026 16:00:00 +0000

At Google I/O 2026, we announced updates to the integration between Google AI Studio and Google Cloud:

New users can deploy up to two full-stack applications to the Google Cloud Starter Tier, no billing account required
An expanded choice of databases: Firestore for non-relational data, and Cloud SQL as a new relational database option
Tight integration with Google Workspace tools like Sheets, Calendar, and Gmail using Firebase Auth as the single user login flow

This is an update to the integration we announced in March, which included support for vibe-coded full-stack app deployments from AI Studio powered by Cloud Run, Firestore, and Firebase Auth.

With this expanded integration, you can use AI Studio to build a broader set of applications, using either a relational database with Cloud SQL or a non-relational database with Firestore. You don’t even need to specify a database — the AI agent can infer the right database for your app or feature.

Get started today in AI Studio at no cost with Cloud Run, Cloud SQL for PostgreSQL (coming next month), Firestore, and Firebase Auth on the Starter Tier.

Publishing a full-stack app from AI Studio to Cloud Run with a single click

An easy on-ramp: The Google Cloud Starter Tier

You can build applications in AI Studio and deploy your prototypes directly to Cloud Run, authenticate via Firebase Auth, and store your data in a Firestore or Cloud SQL database. No credit card, no Google Cloud account, no friction — just prompt and launch.

If you don’t have a Google Cloud account, AI Studio uses the Google Cloud Starter Tier to create resources for you. You can deploy up to two full-stack apps. If you outgrow the limits of the Starter Tier, you can upgrade to a standard Google Cloud project with a billing account. All your resources will be transferred to your billable Google Cloud project, so that your application can scale as it grows.

Powering full-stack vibe coding with Cloud SQL

We’re introducing an intelligent, automated data foundation that makes it easy for developers to focus on their applications, not their infrastructure.

AI Studio integration with Cloud SQL includes:

An instant on-ramp: Go from prompt to a fully deployed PostgreSQL database with instant provisioning.
Zero-cost startup: Try Cloud SQL on the Google Cloud Starter Tier at no cost, without needing a credit card or Google Cloud account.
Flexible cost control: The AI agent uses a new Cloud SQL for PostgreSQL developer edition, which enables the backend to scale to zero automatically, so you only pay while you’re using the app.
Agent-driven experience: To update your application, enter new prompts, and the AI agent automatically creates the schema and executes SQL statements in the database.
Global scalability: While the interface is simple, your app runs on Google Cloud’s robust, highly reliable, and securely designed infrastructure that can scale to support millions of users.

Creating an app powered by Cloud SQL for PostgreSQL developer edition

Full-stack vibe coding with Firestore and Firebase Auth

When you’re building an app in AI Studio, the agent proactively detects if you need data storage and authentication based on your prompt, and offers to set up a database and user authentication. For apps that benefit from a document database, the agent shows a card to turn on Firestore and Firebase Authentication with your approval.

Enable Firebase for your application when prompted by the agent

When you click “Enable Firebase,” the agent automatically:

Provisions Firestore, enables authentication, and connects your app to the database
Creates your web app’s sign-in page and configures authentication with Google Sign-In
Generates the Firestore code in your app so you can sync data across sessions and devices
Drafts and deploys Firestore Security Rules based on your app’s logic (but you should always double-check these rules before sharing or deploying your app!)

With Firebase Auth, you can:

Connect your apps to Google Workspace using natural language: When you ask for a feature involving Workspace (e.g. Sheets, Calendar, Gmail), the agent implements a “Sign in with Google” flow, powered by Firebase Authentication, designed to securely grant Google AI Studio access to your data.

Connect your app to Google Sheets, powered by Firebase Authentication

Check out more details on the What’s New from Firebase at Google I/O blog.

Getting started in AI Studio

Going from idea to app is now a reality. You can build a full-stack application at no cost using the following steps:

Log into AI Studio: Access the platform to begin your project.
Build with prompts: Start building your application using natural language prompts. For example, “Build an expense tracker app.”
Enable the database: Prompt “Add a database” and AI Studio intelligently provisions a database through an "Enable" widget. You can explicitly ask for a relational database if you’d like to make your preference clear.
Set up the system: Select “Enable” and agree to the terms.
Start sharing: Deploy and share the application through the “Publish” button.

Get started today in AI Studio to turn your ideas into live applications in seconds.

What’s new in Cloud Run at Next ’26

Wed, 22 Apr 2026 12:00:00 +0000

From vibe-coded and large-scale apps to AI models and agents, Cloud Run delivers on-demand compute with zero overhead and pay-per-use pricing for all of your workloads. Last year, the number of external active developers and applications on Cloud Run doubled, with more new customers and apps coming to Cloud Run in 2025 than in its first 6 years combined!

Today, we’re announcing new features and improvements to Cloud Run to help you run your workloads:

Build and deploy full-stack apps in Google AI Studio with Cloud Run, Firestore, and user authentication.
Build, scale, govern, and optimize reliable AI agents with the all-new Gemini Enterprise Agent Platform and Cloud Run.
Enable easy deployments from developers and agents with Cloud Run's fully managed remote MCP server.
Combine high-performance inference and serverless compute with NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs on Cloud Run.

Empowering the new era of developers

For decades, software development has had a steep learning curve, but thanks to AI, anyone can be a digital builder. With Cloud Run, you can go from prototype to deployed app in seconds.

Build full-stack apps in Google AI Studio
AI Studio now supports full-stack applications with server-side code, a Firestore database, and user authentication. Deploy your vibe-coded apps to Cloud Run with a single click, now generally available.

Cloud Run's fully managed remote MCP server
To make it even easier for developers or agents to deploy code, we are launching an official remote Cloud Run MCP (Model Context Protocol) server, giving you the tools to manage and deploy apps. Now GA.

Billing caps
Soon, you’ll be able to define your maximum spend per month. If your bill reaches this amount, your Cloud Run resources will be deactivated.

"Cloud Run has been one of the best technical choices we made in our deployments platform. It is our primary target, powering us to over 1 million live projects being hosted on Replit." - Scott Kennedy, VP of Engineering, Replit

Embracing the agentic era

AI agents are just like people in that they need access to a compute environment to perform their tasks. For a cloud-based AI agent to take complex actions, it can use Cloud Run’s on-demand compute service.

Cloud Run integration with Gemini Enterprise Agent Platform
Through its integration with Cloud Run, Agent Platform helps agents transition from experimental environments into fully managed, production-grade systems without having to rebuild them. Now in preview with select customers.

Cloud Run instances
Traditionally, Cloud Run services, jobs, and worker pools have been opinionated ways to manage Cloud Run infrastructure. Now, we are giving you access to the underlying primitive: you can create individual Cloud Run instances. Coupled with Cloud Storage volume mounts, these instances are ideal for hosting long-running background agents like OpenClaw with one simple command:

```shell
gcloud run instances create \
  --image alpine/openclaw:latest \
  --port 18789 \
  --memory 4Gi \
  --default-url \
  --add-volume mount-path=/home/node/.openclaw,type=cloud-storage,bucket=$BUCKET_NAME
```

This functionality is available in preview to select customers.

Cloud Run sandboxes
Agents often need a safe place to execute code or other commands as quickly as possible. Coming soon, while processing a single request, you will be able to very quickly spin up an ephemeral sandbox that’s strictly isolated from your agent code using a built-in sandbox tool:

```javascript
const express = require("express");
const { execFile } = require("child_process");

const app = express();
app.use(express.json()); // populate req.body from JSON request bodies

app.post("/execute", (req, res) => {
  // Pass the code as its own argument instead of interpolating it into a
  // shell string, so quotes, backticks, or $(...) cannot escape the command.
  execFile("sandbox", ["do", "--", "/usr/bin/python3", "-c", String(req.body.code)],
    (err, stdout, stderr) => {
      res.send({ stdout, stderr, error: err ? err.message : null });
    });
});
```

"Cloud Run’s concurrency model has been instrumental in simplifying our AI workloads for our customer service AI tool, Lumi. Using Cloud Run alongside Gemini and AlloyDB, we’ve created a unified action layer that enables real-time call summarization and flow guidance, leading to improved first-call resolution rates and faster onboarding for our contact center team." - Edward Wright, Head of Engineering, VirginMedia O2 UK

Automatic scaling for high-demand applications

Cloud Run automatically scales to meet your demand, making it a great fit for large customers that need to serve and respond instantly to heavy traffic spikes.

SSH support for Cloud Run
Developers can now get secure shell (SSH) access directly into a running Cloud Run container, enabling advanced troubleshooting and on-the-fly inspection of the container's file system. This is now in preview with select customers. Open a secure interactive shell session with a simple command:

```shell
gcloud run services ssh SERVICE
```

Cloud Run service bindings
Coming soon, unlock seamless service-to-service communication for your scalable microservices architectures with Cloud Run service bindings.

“Cloud Run’s serverless architecture empowers us to meet exponentially growing demand through near-instant scaling, while its streamlined developer experience simplifies building and running our applications.” - Mimi Chen, Member of Technical Staff, Anthropic

Running AI models

From serving custom models with GPUs to training and fine-tuning models with jobs, you can use Cloud Run for your most AI-intensive workloads.

Support for NVIDIA RTX PRO 6000 Blackwell GPU on Cloud Run
We’re bringing the serverless experience to high-end inference with support for NVIDIA RTX PRO™ 6000 Blackwell GPUs on Cloud Run, now GA. This means you can serve up to 70B+ parameter models without having to manage any underlying infrastructure, including scaling to zero when the resource is not in use.

Ephemeral disk
With per-instance temporary disk storage, workloads can process large files or use scratch space without eating up your container memory. Now in preview, ephemeral disk storage is created when an instance starts and deleted when it stops.

"Cloud Run has fundamentally changed how we manage our model deployments. By moving to a usage-based, scale-to-zero model, we’ve eliminated idle GPU costs for low-traffic models. We are now running over 17 model variants in production across multiple regions, each independently deployable and isolated, without the burden of capacity planning or fleet management." - Ajay Nair, Global VP, Elastic

On-demand compute for every workload

Whether you’re a seasoned software developer or a vibe coder looking to deploy the next viral app, Cloud Run delivers on-demand compute for everyone and every workload. Get started today.

How Estée Lauder Companies uses Cloud Run worker pools for its pull-based agentic workloads

Thu, 09 Apr 2026 16:00:00 +0000

Cloud Run has long provided developers with a straightforward, opinionated platform for running code. You can easily deploy request-driven web applications using Cloud Run services, or execute run-to-completion batch processing with Cloud Run jobs. However, as developers build more complex applications, like pipelines that process continuous streams of data or distributed AI workloads, they need an environment designed for continuous, background execution.

Estée Lauder Companies got just that with Cloud Run worker pools, which transform Cloud Run from a platform for web workloads and background tasks into a platform for pull-based workloads. Cloud Run worker pools are now generally available.

Estée Lauder Companies’ Rostrum platform is a polymorphic chat service for LLM-powered applications that originally ran as a standalone Cloud Run service. While this simple architecture worked for internal tools with predictable traffic, the team faced a major hurdle: consumer-facing traffic during the upcoming holiday shopping season. To launch their first consumer-facing generative AI application, Jo Malone London’s AI Scent Advisor, they needed an architecture that could sustain the load of AI prompts from thousands of simultaneous users.

In just a few weeks, Estée Lauder Companies migrated to a producer-consumer model using Cloud Run worker pools. The web tier, a FastAPI application deployed as a Cloud Run service, acts as the producer, instantly publishing user messages to Cloud Pub/Sub. The worker pool deployments act as “always-on” consumers, pulling messages from the queue to handle LLM inference.

By decoupling the user-facing web tier from LLM operations, Estée Lauder Companies achieved:

100% message durability: Pub/Sub acts as a buffer, so even during holiday spikes, no user message is lost.
Strong UI latency SLAs: Server-side rendering is decoupled from message processing load.
Minimal operations overhead: The team spent virtually no time managing servers, allowing them to focus on the user experience rather than infrastructure.
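The durability point depends on the queue keeping each message until a consumer acknowledges it. As a toy illustration only, the following in-memory model assumes Pub/Sub-style at-least-once semantics (an ack removes a message; a nack or failure makes it eligible for redelivery) plus a simple dead-letter cutoff. `AtLeastOnceQueue` and `drain` are hypothetical names, not part of the Pub/Sub client library.

```javascript
// Hypothetical in-memory model of an at-least-once queue. It mirrors the
// ack/nack contract of a Pub/Sub subscription but is NOT the Pub/Sub API.
class AtLeastOnceQueue {
  constructor() {
    this.pending = [];          // messages waiting to be pulled
    this.inFlight = new Map();  // id -> message pulled but not yet acked
    this.nextId = 1;
  }

  publish(data) {
    const msg = { id: this.nextId++, data, deliveries: 0 };
    this.pending.push(msg);
    return msg.id;
  }

  pull() {
    const msg = this.pending.shift();
    if (!msg) return null;
    msg.deliveries += 1;
    this.inFlight.set(msg.id, msg);
    return msg;
  }

  ack(id) {
    this.inFlight.delete(id);   // only an ack removes a message for good
  }

  nack(id) {
    const msg = this.inFlight.get(id);
    if (!msg) return;
    this.inFlight.delete(id);
    this.pending.push(msg);     // redeliver later instead of dropping it
  }
}

// Pull until the queue is empty. Failed messages are retried; after
// maxDeliveries attempts they are parked, loosely like a dead-letter policy.
function drain(queue, handler, maxDeliveries = 5) {
  const processed = [];
  const deadLettered = [];
  let msg;
  while ((msg = queue.pull()) !== null) {
    try {
      handler(msg);
      queue.ack(msg.id);
      processed.push(msg.data);
    } catch {
      if (msg.deliveries >= maxDeliveries) {
        queue.ack(msg.id);
        deadLettered.push(msg.data);
      } else {
        queue.nack(msg.id);
      }
    }
  }
  return { processed, deadLettered };
}

// Every message fails on its first delivery (say, an LLM timeout) and
// succeeds on retry, so nothing is lost.
const queue = new AtLeastOnceQueue();
queue.publish("hello");
queue.publish("recommend a scent");
const { processed, deadLettered } = drain(queue, (msg) => {
  if (msg.deliveries === 1) throw new Error("transient LLM failure");
});
console.log(processed, deadLettered);
```

Because delivery is at least once, a real consumer should be idempotent: the same message can arrive more than once.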

This modular architecture now serves as the blueprint for Estée Lauder Companies to rapidly launch specialized AI advisors across its diverse house of brands.

"The Jo Malone London AI Scent Advisor chains multiple LLM and tool calls — conversational discovery, deterministic scoring, copy generation — in a pipeline that had to run reliably at consumer scale without us managing infrastructure. Cloud Run worker pools was exactly the right primitive, and working directly with the product team as early adopters gave us the confidence to build on it ahead of GA. It's now the foundation for us to bring AI advisors to brands across the Estée Lauder Companies portfolio." - Chris Curro, Principal Machine Learning Engineer, The Estée Lauder Companies

Serverless for pull-based and distributed workloads

Traditional serverless models often force background work into an HTTP push format, which can lead to timeouts, overscaling, or message loss during traffic surges. Cloud Run worker pools solve this by providing an always-on environment where the worker pool instances pull tasks or messages from a queue at their own pace, providing built-in backpressure that protects your infrastructure from crashing under load.
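To make the backpressure idea concrete, here is a tick-based toy simulation in which each worker instance pulls only as many messages as it has free slots. The function and parameter names are hypothetical, not a Cloud Run API. A burst of 10 messages with 2 workers of 2 slots each drains over three ticks, and no worker ever holds more than its slot limit; the excess simply waits in the queue.

```javascript
// Toy, tick-based simulation of pull-based consumption (hypothetical names).
// Workers pull only while they have a free slot, so a burst accumulates in
// the queue instead of overloading the workers.
function simulatePull({ burst, workers, slotsPerWorker, ticksPerTask }) {
  if (workers < 1 || slotsPerWorker < 1 || ticksPerTask < 1) {
    throw new RangeError("workers, slotsPerWorker, and ticksPerTask must be >= 1");
  }
  const queue = Array.from({ length: burst }, (_, i) => i);
  // For each worker: remaining ticks of each task it is processing.
  const inFlight = Array.from({ length: workers }, () => []);
  const queueDepth = [];
  let maxConcurrentPerWorker = 0;
  let completed = 0;

  while (queue.length > 0 || inFlight.some((tasks) => tasks.length > 0)) {
    // Pull phase: each worker takes work only up to its free capacity.
    for (const tasks of inFlight) {
      while (tasks.length < slotsPerWorker && queue.length > 0) {
        queue.shift();
        tasks.push(ticksPerTask);
      }
      maxConcurrentPerWorker = Math.max(maxConcurrentPerWorker, tasks.length);
    }
    queueDepth.push(queue.length);

    // Advance one tick and retire finished tasks.
    for (let w = 0; w < inFlight.length; w++) {
      const remaining = inFlight[w].map((t) => t - 1);
      completed += remaining.filter((t) => t === 0).length;
      inFlight[w] = remaining.filter((t) => t > 0);
    }
  }
  return { completed, maxConcurrentPerWorker, queueDepth };
}

const result = simulatePull({ burst: 10, workers: 2, slotsPerWorker: 2, ticksPerTask: 1 });
console.log(result);
```

In a push model, the same burst would be delivered to the workers regardless of their capacity; here the queue absorbs it and the workers set the pace.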

Unlike Cloud Run services, worker pools are designed for workloads requiring non-HTTP protocols. When a worker pool is attached to a VPC network, every instance receives a private IP address. This enables high-performance L4 ingress, allowing you to host services previously incompatible with the Google Cloud serverless platform.

With the GA of worker pools, Cloud Run supports major new categories of workloads:

Pull-based workloads: Worker pools provide a reliable environment for running and scaling workloads that continuously pull work from sources like Pub/Sub, Kafka, GitHub Actions runner queues, or Redis task queues.
Distributed AI/ML workloads: Worker pools are a great fit for distributed LLM training or fine-tuning workloads. At GA, worker pools support NVIDIA L4 and RTX PRO 6000 (Blackwell) GPUs.

One of the most significant advantages of this new offering is its cost efficiency: for long-running background tasks, worker pools can be approximately 40% cheaper than request-driven services or jobs.

Scaling pull-based workloads using Cloud Run External Metrics Autoscaler (CREMA)

Worker pools run a set of instances that do background work, but they still need a signal to scale. To bridge this gap, we recently built, and open-sourced, Cloud Run External Metrics Autoscaler (CREMA).

CREMA uses KEDA's library of scalers, including Kafka, Pub/Sub, GitHub Actions, and Prometheus, to automatically scale your instances based on metrics emitted by these external sources. By smoothly handling traffic surges and scaling back to zero during idle periods, CREMA helps you optimize both performance and cost.

To start scaling, all you need to do is deploy CREMA as a Cloud Run service, and then define your scaling logic in a single YAML configuration file that instructs CREMA which external sources to monitor and which worker pool to scale.

Here is an example of what it looks like to automatically scale a worker pool based on GitHub Runner queue depth:

```yaml
apiVersion: crema/v1
kind: CremaConfig
metadata:
  name: gh-demo
spec:
  scaledObjects:
    - spec:
        scaleTargetRef:
          name: projects/example-project/locations/us-central1/workerpools/example-workerpool
        triggers:
          - type: github-runner
            metadata:
              owner: repo-owner
              runnerScope: repo
              repos: repo-name
              targetWorkflowQueueLength: 1
```
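KEDA scalers typically drive an HPA-style target calculation: the desired instance count is the metric value divided by the per-instance target, rounded up and clamped to a minimum and maximum. The sketch below assumes CREMA applies a comparable rule to worker pools (check the CREMA repository for its exact behavior). `desiredInstances` is a hypothetical helper, and its default bounds mirror KEDA's `minReplicaCount: 0` and `maxReplicaCount: 100` defaults.

```javascript
// Hypothetical sketch of the HPA-style rule that KEDA scalers commonly feed:
//   desired = ceil(metricValue / targetPerInstance), clamped to [min, max].
// Assumption: CREMA scales worker pools with a comparable rule.
function desiredInstances(metricValue, targetPerInstance, minInstances = 0, maxInstances = 100) {
  if (targetPerInstance <= 0) {
    throw new RangeError("targetPerInstance must be positive");
  }
  const raw = Math.ceil(metricValue / targetPerInstance);
  return Math.min(maxInstances, Math.max(minInstances, raw));
}

// With targetWorkflowQueueLength: 1, each queued workflow job maps to one instance.
console.log(desiredInstances(0, 1));          // 0: scale to zero when the queue is empty
console.log(desiredInstances(7, 1));          // 7
console.log(desiredInstances(7, 3));          // 3, i.e. ceil(7 / 3)
console.log(desiredInstances(250, 1, 0, 20)); // 20: capped by maxInstances
```

Raising the target per instance trades latency for fewer instances: each instance is expected to work through more queued jobs before another one is added.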

Get started

You can deploy your first worker pool today by referring to the documentation. To implement advanced, queue-aware scaling, explore the CREMA open-source repository to connect your workloads to KEDA-supported scalers.

To implement high-performance distributed workloads using Cloud Run worker pools and the Cloud Run External Metrics Autoscaler (CREMA), refer to the examples for the use case of your choice.

Simplify your Cloud Run security with Identity-Aware Proxy (IAP)

Fri, 13 Mar 2026 16:00:00 +0000

Cloud Run provides a powerful and scalable platform for deploying applications. Today, we’re introducing the general availability of two major enhancements to Cloud Run security: direct Identity-Aware Proxy (IAP) integration, and a way to allow public access to Cloud Run services that is compatible with Domain Restricted Sharing (DRS).

Introducing direct IAP on Cloud Run

IAP lets you easily control user access to applications running in Google Cloud. Integrating IAP with Cloud Run previously required you to manually configure application load balancers and other complex network settings. This added operational overhead detracted from Cloud Run's core promise of serverless simplicity.

That changes today! You can now enable IAP directly on Cloud Run in a single click, with no load balancers, and at no added cost. Google Cloud does not charge for IAP (with some exceptions), and it incurs no load balancer costs.

Enable IAP authentication directly on a Cloud Run service

Why this matters:

Simplified enablement: Turn on IAP in the UI or with a single flag (--iap) through gcloud, significantly simplifying deployments and saving valuable time and effort.
Enterprise-grade security for all web apps: Use IAP’s authentication and authorization policies based on user or group identities, as well as context-aware factors like IP address, geolocation, and device security status.
Support for Workforce Identity Federation: Easily manage access for your employees and partners using your existing identity providers.
Simplified Cross-Origin Resource Sharing (CORS): Configure IAP directly on Cloud Run to allow unauthenticated HTTP OPTIONS for CORS requests. This helps satisfy browser preflight checks while ensuring all other requests undergo authentication.
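When IAP lets unauthenticated OPTIONS requests through, the Cloud Run service itself receives the browser's preflight and must answer it with the appropriate CORS headers. All other methods arrive only after IAP has authenticated the user. A minimal, framework-free sketch of such a preflight handler follows. The allowed origin, the method and header lists, and the `handlePreflight` name are assumptions for illustration, and header names are expected in lowercase, as Node's `http` module delivers them.

```javascript
// Hypothetical app-side preflight handler for a service behind IAP with
// unauthenticated OPTIONS allowed. Returns null for non-preflight requests.
const ALLOWED_ORIGINS = new Set(["https://app.example.com"]); // assumption

function handlePreflight(method, headers) {
  if (method !== "OPTIONS") return null; // regular request: IAP already authenticated it
  const origin = headers["origin"];
  const requestedMethod = headers["access-control-request-method"];
  if (!origin || !requestedMethod || !ALLOWED_ORIGINS.has(origin)) {
    return { status: 403, headers: {} }; // no CORS headers: the browser blocks the call
  }
  return {
    status: 204,
    headers: {
      "Access-Control-Allow-Origin": origin,
      "Access-Control-Allow-Methods": "GET, POST, PUT, DELETE",
      "Access-Control-Allow-Headers": "Authorization, Content-Type",
      // Credentials (such as the IAP session cookie) travel on the actual request.
      "Access-Control-Allow-Credentials": "true",
      "Vary": "Origin",
    },
  };
}

console.log(handlePreflight("OPTIONS", {
  origin: "https://app.example.com",
  "access-control-request-method": "POST",
}));
```

Echoing a specific origin rather than `*` matters here, because browsers reject a wildcard origin on credentialed requests.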

We are already seeing strong uptake among organizations adopting IAP to secure Cloud Run workloads, for example at L’Oréal.

“L'Oréal relies on Google Cloud's Identity-Aware Proxy (IAP) as a critical layer of security, ensuring that access to every web application we host on Google Cloud is meticulously filtered and controlled. The beauty of IAP lies in its simplicity and effectiveness; it's a self-managed solution that's not only free but also exceptionally straightforward to implement across our diverse application landscape. This ease of deployment, combined with a security posture that surpasses what we could achieve with custom-built solutions, makes IAP an indispensable tool for protecting our digital assets.” - Antoine Castex, Group Data & A.I Architect, L'Oréal

Allow public access when using DRS

New simplified Cloud Run authentication UI

While IAP is the recommended authentication mechanism for internal business applications on Cloud Run, Cloud IAM remains essential for managing service-to-service communication.

Historically, Cloud Run's default behavior was to perform an IAM check (run.invoker role) on every request to an HTTPS endpoint. While this provided a strong security baseline, it had the potential to become a bottleneck when the intent was to create public apps, particularly when organizations also enforced the Domain Restricted Sharing policy.

You can now disable this IAM "invoker" check by selecting “Allow Public access” for your applications.

This gives you flexibility to rely on other security layers like organization policies, network-level controls, or custom authn/authz for your services. It also unlocks broader use cases:

Public websites: Host a store locator site on Cloud Run and make it accessible to everyone — even if your Org Policy restricts sharing (DRS enabled). You can do this by selecting “Allow Public access” and setting ingress to ‘All’.
Private microservices: For services behind an internal ingress where network-level security is sufficient, you can bypass the IAM check by selecting “Allow Public access”.
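The two use cases above combine two independent controls: ingress decides which network paths can reach the service, and the invoker IAM check decides whether the caller must hold `roles/run.invoker`. “Allow public access” disables only the second. The following simplified model illustrates that interaction. The ingress values match Cloud Run's settings (`all`, `internal`, `internal-and-cloud-load-balancing`), while the request fields `source` and `callerHasInvokerRole` and the `admit` function are hypothetical.

```javascript
// Simplified, hypothetical model of Cloud Run request admission:
// gate 1 is ingress (network path), gate 2 is the invoker IAM check.
function admit(request, service) {
  const ingressOk =
    service.ingress === "all" ||
    (service.ingress === "internal" && request.source === "internal") ||
    (service.ingress === "internal-and-cloud-load-balancing" &&
      (request.source === "internal" || request.source === "load-balancer"));
  if (!ingressOk) return "blocked by ingress";

  // "Allow public access" corresponds to invokerIamCheck: false.
  if (service.invokerIamCheck && !request.callerHasInvokerRole) {
    return "denied by IAM";
  }
  return "allowed";
}

const publicSite = { ingress: "all", invokerIamCheck: false };
const privateService = { ingress: "internal", invokerIamCheck: false };
const defaultService = { ingress: "all", invokerIamCheck: true };
const anonymousFromInternet = { source: "internet", callerHasInvokerRole: false };
const fromVpc = { source: "internal", callerHasInvokerRole: false };

console.log(admit(anonymousFromInternet, publicSite));     // allowed
console.log(admit(anonymousFromInternet, privateService)); // blocked by ingress
console.log(admit(fromVpc, privateService));               // allowed
console.log(admit(anonymousFromInternet, defaultService)); // denied by IAM
```

The private-microservice case stays protected because ingress still rejects traffic that does not come from an internal source, even with the IAM check off.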

“Bilt leverages the 'disable IAM' feature for multiple mission-critical Cloud Run services deployed in multi-regional topologies. By disabling IAM on these instances, we establish a direct, unimpeded path from our edge, while maintaining security using Cloud Armor on the global load balancer. This simplified approach reduces infrastructure complexity and provides a more performant solution while maintaining org-wide security posture through organizational policies.” - Kosta Krauth, CTO, Bilt

Getting started

Ready to get started? You can easily enable IAP directly on Cloud Run.


High-performance inference meets serverless compute with NVIDIA RTX PRO 6000 on Cloud Run

Mon, 02 Feb 2026 17:00:00 +0000

Running large-scale inference models can involve significant operational toil, including cluster management and manual VM maintenance. One solution is to leverage a serverless compute platform to abstract away the underlying infrastructure. Today, we’re bringing the serverless experience to high-end inference with support for NVIDIA RTX PRO™ 6000 Blackwell Server Edition GPUs on Cloud Run. Now in preview, you can deploy massive models like Gemma 3 27B or Llama 3.1 70B with the 'deploy and forget' experience you’ve come to expect from Cloud Run. No reservations. No cluster management. Just code.

A powerful GPU platform

The NVIDIA RTX PRO 6000 Blackwell GPU provides a huge leap in performance compared to the NVIDIA L4 GPU, bringing 96 GB of vGPU memory, 1.6 TB/s of bandwidth, and support for FP4 and FP6. This means you can serve models with 70B+ parameters without having to manage any underlying infrastructure. Cloud Run lets you attach an NVIDIA RTX PRO 6000 Blackwell GPU to your Cloud Run service, job, or worker pool on demand, with no reservations required. Here are some ways you can use the NVIDIA RTX PRO 6000 Blackwell GPU to accelerate your business:

Generative AI and inference: With its FP4 precision support, the NVIDIA RTX PRO 6000 Blackwell GPU’s high-efficiency compute accelerates LLM fine-tuning and inference, letting you create real-time generative AI applications such as multi-modal and text-to-image creation models. By running your model on Cloud Run services, you can also take advantage of rapid startup and scaling, going from zero instances to a GPU instance with drivers installed in under 5 seconds. When traffic drops off and no more requests are being received, Cloud Run automatically scales your GPU instances down to zero.
Fine-tuning and offline inference: NVIDIA RTX PRO 6000 Blackwell GPUs can be used in conjunction with Cloud Run jobs to fine-tune your model. The fifth-generation NVIDIA Tensor Cores can be used in conjunction with AI models to help accelerate rendering pipelines and enhance content creation.
Tailored scaling for specialized workloads: Use GPU-enabled worker pools to apply granular control over your GPU workers, whether you need to dynamically scale based on custom external metrics or manually provision "always-on" instances for complex, stateful processing.
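A back-of-the-envelope calculation shows why 96 GB of memory and FP4/FP6 support matter for 70B-class models: weight memory is roughly parameters × bits per parameter ÷ 8 bytes. The sketch below counts weights only, in decimal GB, and ignores the KV cache, activations, and runtime overhead, so its numbers are lower bounds rather than sizing guidance. `weightMemoryGB` is a hypothetical helper.

```javascript
// Back-of-the-envelope estimate of GPU memory for model weights only.
// Ignores KV cache, activations, and runtime overhead (lower bound).
const BITS_PER_PARAM = { fp16: 16, fp8: 8, fp6: 6, fp4: 4 };

function weightMemoryGB(paramsBillions, precision) {
  const bits = BITS_PER_PARAM[precision];
  if (bits === undefined) throw new Error(`unknown precision: ${precision}`);
  // (params * bits / 8) bytes; with params in billions and decimal GB
  // (1e9 bytes), the factors of 1e9 cancel.
  return (paramsBillions * bits) / 8;
}

const GPU_MEMORY_GB = 96; // RTX PRO 6000 Blackwell, per the figures above
for (const precision of Object.keys(BITS_PER_PARAM)) {
  const gb = weightMemoryGB(70, precision);
  console.log(`70B @ ${precision}: ~${gb} GB of weights, fits in ${GPU_MEMORY_GB} GB: ${gb < GPU_MEMORY_GB}`);
}
```

At 16 bits, a 70B model's weights alone (about 140 GB) exceed a single GPU's memory, while 8-bit (about 70 GB) and 4-bit (about 35 GB) weights leave headroom for the KV cache.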

We built Cloud Run to be the simplest way to run production-ready, GPU-accelerated tasks. Some highlights of Cloud Run include:

Managed GPUs with flexible compute: Cloud Run pre-installs the necessary NVIDIA drivers so you can focus on your code. Cloud Run instances using NVIDIA RTX PRO 6000 Blackwell GPUs can be configured with up to 44 vCPUs and 176 GB of memory.
Production-grade reliability: By default, Cloud Run offers zonal redundancy, helping to ensure enough capacity for your service to be resilient to a zonal outage; this also applies to Cloud Run with GPUs. Alternatively, you can turn off zonal redundancy and benefit from a lower price for best-effort failover of your GPU workloads in case of a zonal outage.
Tight integration: Cloud Run works natively with the rest of Google Cloud. You can load massive model weights by mounting Cloud Storage buckets as local volumes, or use Identity-Aware Proxy (IAP) to secure traffic that’s bound for a Cloud Run service.

Get started

The NVIDIA RTX PRO 6000 Blackwell GPU is available in preview, on demand, in us-central1 and europe-west4, with limited availability in asia-south2 and asia-southeast1. You can deploy your first service using Ollama, one of the easiest ways to run open models, on Cloud Run with NVIDIA RTX PRO 6000 GPUs enabled:

```shell
gcloud beta run deploy my-service \
  --image ollama/ollama --port 11434 \
  --cpu 20 --memory 80Gi \
  --gpu-type nvidia-rtx-pro-6000 \
  --no-gpu-zonal-redundancy \
  --region us-central1
```
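Once the service is running, clients talk to Ollama's REST API, for example `POST /api/generate`. The helper below only builds the request so that it can be inspected or passed to `fetch`. The service URL, the `gemma3:27b` model tag, and the `buildGenerateRequest` name are illustrative assumptions. The `ollama/ollama` image starts without any models, so the model has to be pulled first, and if the service requires authentication, the request needs an identity token in the `Authorization` header.

```javascript
// Hypothetical helper that builds a request for Ollama's POST /api/generate
// on the Cloud Run service deployed above.
function buildGenerateRequest(serviceUrl, model, prompt, idToken) {
  const headers = { "Content-Type": "application/json" };
  if (idToken) headers["Authorization"] = `Bearer ${idToken}`; // only if the service requires auth
  return {
    url: new URL("/api/generate", serviceUrl).toString(),
    init: {
      method: "POST",
      headers,
      // stream: false asks Ollama for a single JSON object instead of a stream.
      body: JSON.stringify({ model, prompt, stream: false }),
    },
  };
}

const req = buildGenerateRequest("https://my-service.example.run.app", "gemma3:27b", "Why is the sky blue?");
console.log(req.url, req.init.body);

// Usage with a real service (Node 18+ has a global fetch):
//   const { url, init } = buildGenerateRequest(SERVICE_URL, "gemma3:27b", prompt, token);
//   const reply = await (await fetch(url, init)).json();
//   console.log(reply.response);
```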

For more details, check out our updated Cloud Run documentation and AI inference best practices.

Elevate your applications with Firestore’s new advanced query engine

Tue, 20 Jan 2026 17:00:00 +0000

A hallmark of a valuable database is how easy it is to query the data inside, so that developers can build tailored and complex user experiences in an application. Last week marked a significant evolution for Firestore, Google Cloud’s enterprise-grade, scalable document database, with the debut of an advanced query engine designed to help you build more sophisticated applications.

Available as part of Firestore in Native mode, this powerful engine introduces over a hundred new query capabilities, called pipeline operations, available in preview, which streamline complex queries directly within the database. Alongside this, we're launching precise indexing controls and refreshed observability tools like query explain and query insights, giving you granular control over performance. All these robust capabilities are now available in the Firestore Enterprise edition, which also offers a more transparent pricing model for potential cost savings. This is all in the service of building highly expressive, performant applications that can query, transform and filter data across many dimensions, with less operational overhead. At the same time, you’re benefiting from Firestore’s unique serverless foundation, multi-region replication, and virtually unlimited scalability, freeing you from database management complexities, so you can truly focus on innovation.

Used by a vibrant community of over 600,000 developers, Firestore has long been appreciated for its simplicity. In 2019, Firestore Standard edition in Native mode streamlined the development of collaborative applications with a straightforward query interface that guaranteed high performance through the use of automatically generated indexes. However, this simplified query engine has a strong dependence on indexing for query execution, often demanding upfront planning throughout the application lifecycle. Now, with the introduction of the advanced query engine in Enterprise edition, developers can construct highly expressive applications, regardless of the explicit presence of indexes — particularly for demanding solutions like e-commerce, interactive gaming, content management, and sophisticated user personalization. The refined query engine makes it easier to create pipeline operations, complete with sophisticated new stages and expressions, including support for complex aggregations, querying directly over arrays, advanced string matching capabilities, and granular filtering options.

A new query engine and pipeline operations experience

To enable this, we’ve updated Firestore’s existing SDKs with expanded support for pipeline operations. Now, you can elegantly chain together numerous stages for essential tasks such as aggregations, grouping, and filtering. Queries now run without mandatory indexes, giving you complete autonomy over when you want to create indexes to optimize performance. Let’s take a look at an illustration of pipeline operations.

Note: This example assumes you're familiar with Firestore's data model and existing query methods.

Suppose you want to identify the top trending hashtags on an existing food recipe application that allows users to add hashtags to recipes. For essential data (like the recipe text itself), you might represent a recipe as a document with some fields. Since a hashtag can be represented with just a string, you could add hashtags directly to the recipe document as an array of strings:

```javascript
{
  title: "My recipe",
  instructions: "Cook the ingredients",
  authorId: "SomeAuthorID",
  hashtags: ["easy", "high protein", "low carb"],
  ...
}
```

Firestore users can query for specific hashtags within recipes using existing core operations. However, prior to pipeline operations, there was no direct way to extract and aggregate array data from within a document during a query.

With pipeline operations, you can “unnest” arrays directly. This makes it simple to identify and suggest trending hashtags to your users. Below is an example of how to implement this using JavaScript:

```javascript
// Fetch the 10 most popular hashtags.
// Assumes `db` is an initialized Firestore client and that `field` and
// `countAll` are imported from the Firestore SDK's pipeline support.
const snapshot = await db.pipeline()
  // Start with the collection of recipe documents.
  .collection("recipes")
  // Keep only the `hashtags` field of each document.
  .select("hashtags")
  // Unnest each tag in the `hashtags` array into its own document.
  .unnest(field("hashtags").as("tagName"))
  // Count the instances of each tag across recipes, consolidating
  // documents that share a tagName into a single document per tagName.
  .aggregate({
    accumulators: [countAll().as("tagCount")],
    groups: ["tagName"]
  })
  // Sort the resulting hashtags by their count.
  .sort(field("tagCount").descending())
  // Limit the results to the top ten hashtags.
  .limit(10)
  .execute();
```
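To make the semantics of each stage concrete, here is a plain JavaScript equivalent that runs over an in-memory array instead of Firestore. It only illustrates what the pipeline computes; `topHashtags` is a hypothetical helper, and Firestore evaluates the real pipeline server-side without loading the documents into your application.

```javascript
// In-memory equivalent of the hashtag pipeline (illustration only).
function topHashtags(recipes, limit = 10) {
  const counts = new Map();
  for (const recipe of recipes) {                   // .collection("recipes")
    for (const tagName of recipe.hashtags ?? []) {  // .select("hashtags") + .unnest(...)
      // .aggregate: countAll() grouped by tagName
      counts.set(tagName, (counts.get(tagName) ?? 0) + 1);
    }
  }
  return [...counts]
    .map(([tagName, tagCount]) => ({ tagName, tagCount }))
    .sort((a, b) => b.tagCount - a.tagCount)        // .sort(field("tagCount").descending())
    .slice(0, limit);                               // .limit(10)
}

const recipes = [
  { title: "A", hashtags: ["easy", "high protein"] },
  { title: "B", hashtags: ["easy", "low carb"] },
  { title: "C", hashtags: ["easy", "high protein", "vegan"] },
];
const top = topHashtags(recipes, 2);
console.log(top); // easy (3), then high protein (2)
```

The difference that matters at scale is where this work happens: the pipeline runs the unnest, grouping, and sorting inside the database, whereas this sketch would have to read every recipe document first.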

In addition, Firestore Enterprise edition supports a broader range of index types (including single-field, composite, sparse, non-sparse, and unique indexes) that let you maximize query performance even further. You also control when indexes are created, which improves overall write performance and storage utilization compared to Standard edition’s automatic single-field indexes. This helps to mitigate index fanout during write operations.

Because indexing is fully customizable, Enterprise edition also provides advanced observability tools (query explain and query insights) built specifically to help developers find missing indexes and optimize their queries. With query explain, developers can profile queries to understand the query planner’s details and view execution statistics, including billing information and deep, system-level visibility into the query’s execution path.

Determine if a query is using an index and analyze its total execution metrics by profiling it with query explain.

Complementing this, query insights enables ongoing monitoring of high-latency and frequently executed queries that may require tuning. By utilizing the query insights dashboard, you can identify queries that can benefit from deploying indexes to boost performance.

Leverage query insights to identify the highest latency and most frequently executed queries on your database, evaluating whether they require indexing based on the quantity of index entries scanned.

Migration for current Firestore customers

If you’re new to Firestore, getting started is easy — simply create a Firestore Enterprise edition database. For existing Firestore developers, transitioning to Firestore pipeline operations is also simple: just use the integrated import and export service to migrate data from a Firestore Standard edition database to a freshly provisioned Enterprise edition database. Crucially, Enterprise edition maintains backwards compatibility, so you can retain your existing application code for Firestore core operations. When the time is right to harness the advanced capabilities, here’s how to convert code from core operations into pipeline operations:

```javascript
const query = db.collection("recipes").where("authorId", "==", user.id);

// Convert the query into a pipeline.
const pipeline = db.pipeline().createFrom(query);
```

And then you can immediately begin working with the new pipeline capabilities:

```javascript
// Continuing from the last snippet (assumes `field` is imported
// from your SDK's pipelines module).
const pipeline = db.pipeline().createFrom(query);

// execute() returns a promise, so await the snapshot.
const snapshot = await pipeline
  .where(field("rating").greaterThan(4))
  .execute();
```

Predictable pricing and optimized costs

Firestore Enterprise edition utilizes an improved, transparent pricing model for managing costs. For all read and write operations performed against the database, you are now billed based on the size of the documents and associated index entries involved. This new approach brings potential savings of up to 86% when executing read operations on documents under 4 kibibytes.

Real-time listen query updates are metered and billed separately as they are incurred. Furthermore, there are no upfront fees or latent costs resulting from incorrect database cluster capacity planning or inefficient database sharding. Storage consumption is billed solely for the actual capacity you use, inclusive of replicated copies for high availability. And if you’re new to Firestore and want to try it out, Enterprise edition includes access to a generous free tier to make it easy to get started.

Get started with Firestore pipeline operations

Enterprise edition offers an advanced query engine to power flexible developer experiences, accessible through both Firestore in Native mode and Firestore with MongoDB compatibility mode. This lets developers make the most of existing libraries and tools from either the Firestore or MongoDB developer communities. You can get started with Firestore pipeline operations in preview today by creating a new Firestore Enterprise edition database in Native mode. To learn more about getting started with pipeline operations, refer to the documentation.

Get started with Enterprise edition today and benefit from zero upfront fees and immediate access to a generous free tier: https://cloud.google.com/products/firestore.

Responding to CVE-2025-55182: Secure your React and Next.js workloads

Wed, 03 Dec 2025 23:00:00 +0000

Editor's note: This blog was updated on Dec. 4, 5, 7, and 12, 2025, with additional guidance on Cloud Armor WAF rule syntax, and WAF enforcement across App Engine Standard, Cloud Functions, and Cloud Run.

Earlier today, Meta and Vercel publicly disclosed two vulnerabilities that expose services built using the popular open-source frameworks React Server Components (CVE-2025-55182) and Next.js to remote code execution risks when used for some server-side use cases. At Google Cloud, we understand the severity of these vulnerabilities, also known as React2Shell, and our security teams have shared their recommendations to help our customers take immediate, decisive action to secure their applications.

Vulnerability background

The React Server Components framework is commonly used for building user interfaces. On Dec. 3, 2025, CVE.org assigned this vulnerability the identifier CVE-2025-55182. Its official Common Vulnerability Scoring System (CVSS) base severity is Critical, with a score of 10.0.

Vulnerable versions: React 19.0, 19.1.0, 19.1.1, and 19.2.0
Patched in React 19.2.1
Fix: https://github.com/facebook/react/commit/7dc903cd29dac55efb4424853fd0442fef3a8700
Announcement: https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-components
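If you want a quick programmatic check while auditing dependencies, the sketch below compares an exact installed React version (for example, as recorded in your lockfile) against the vulnerable versions listed above. The helper name is hypothetical, "19.0" is read as release 19.0.0, and the upstream announcement remains the authoritative source for affected packages and versions.

```javascript
// Hypothetical helper: flags exact React versions that appear in the
// vulnerable list above ("19.0" is treated as release 19.0.0).
// Confirm against the upstream React announcement before relying on it.
const VULNERABLE_REACT_VERSIONS = new Set(["19.0.0", "19.1.0", "19.1.1", "19.2.0"]);

function isVulnerableReactVersion(installedVersion) {
  // Accept an optional leading "v". Semver ranges such as "^19.2.0" are
  // not exact versions; resolve them (e.g. from the lockfile) first.
  const exact = installedVersion.trim().replace(/^v/, "");
  return VULNERABLE_REACT_VERSIONS.has(exact);
}

for (const v of ["19.2.0", "v19.1.1", "19.2.1", "18.3.1"]) {
  console.log(v, isVulnerableReactVersion(v) ? "vulnerable" : "not in list");
}
```

A result of "not in list" only means the version is not among those named in this post; keep tracking the upstream advisories for updates.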

Next.js is a web development framework that depends on React, and is also commonly used for building user interfaces. (The Next.js vulnerability was referenced as CVE-2025-66478 before being marked as a duplicate.)

Vulnerable versions: Next.js 15.x, Next.js 16.x, Next.js 14.3.0-canary.77 and later canary releases
Patched versions are listed here.
Fix: https://github.com/vercel/next.js/commit/6ef90ef49fd32171150b6f81d14708aa54cd07b2
Announcement: https://nextjs.org/blog/CVE-2025-66478

Google Threat Intelligence Group (GTIG) has also published a new report to help understand the specific threats exploiting React2Shell.

We strongly encourage organizations who manage environments relying on the React and Next.js frameworks to update to the latest version, and take the mitigation actions outlined below.

Mitigating CVE-2025-55182

We have created and rolled out a new Cloud Armor web application firewall (WAF) rule designed to detect and block exploitation attempts related to CVE-2025-55182. This new rule is available now and is intended to help protect your internet-facing applications and services that use global or regional Application Load Balancers. We recommend deploying this rule as a temporary mitigation while your vulnerability management program patches and verifies all vulnerable instances in your environment.

For customers using App Engine Standard, Cloud Functions, Cloud Run, Firebase Hosting or Firebase App Hosting, we provide an additional layer of defense for serverless workloads by automatically enforcing platform-level WAF rules that can detect and block the most common exploitation attempts related to CVE-2025-55182.

For Project Shield users, we have deployed WAF protections for all sites and no action is necessary to enable these WAF rules. For long-term mitigation, you will need to patch your origin servers as an essential step to eliminate the vulnerability (see additional guidance below).

Cloud Armor and the Application Load Balancer can be used to deliver and protect your applications and services regardless of whether they are deployed on Google Cloud, on-premises, or on another infrastructure provider. If you are not yet using Cloud Armor and the Application Load Balancer, please follow the guidance further down to get started.

While these platform-level rules and the optional Cloud Armor WAF rules (for services behind an Application Load Balancer) help mitigate the risk from exploits of the CVE, we continue to strongly recommend updating your application dependencies as the primary long-term mitigation.

Deploying the cve-canary WAF rule for Cloud Armor

To configure Cloud Armor to detect and protect against CVE-2025-55182, you can use the cve-canary preconfigured WAF rule with the new rule IDs that we have added for this vulnerability. This rule is opt-in only, and must be added to your policy even if you are already using the cve-canary rules.

In your Cloud Armor backend security policy, create a new rule and configure the following match condition:

```
(has(request.headers['next-action']) || has(request.headers['rsc-action-id']) || request.headers['content-type'].contains('multipart/form-data') || request.headers['content-type'].contains('application/x-www-form-urlencoded')) && evaluatePreconfiguredWaf('cve-canary',{'sensitivity': 0, 'opt_in_rule_ids': ['google-mrs-v202512-id000001-rce','google-mrs-v202512-id000002-rce']})
```

This can be accomplished from the Google Cloud console by navigating to Cloud Armor and modifying an existing or creating a new policy.

Cloud Armor rule creation in the Google Cloud console.

Alternatively, the gcloud CLI can be used to create or modify a policy with the requisite rule:

```shell
gcloud compute security-policies rules create PRIORITY_NUMBER \
  --security-policy SECURITY_POLICY_NAME \
  --expression "(has(request.headers['next-action']) || has(request.headers['rsc-action-id']) || request.headers['content-type'].contains('multipart/form-data') || request.headers['content-type'].contains('application/x-www-form-urlencoded')) && evaluatePreconfiguredWaf('cve-canary',{'sensitivity': 0, 'opt_in_rule_ids': ['google-mrs-v202512-id000001-rce','google-mrs-v202512-id000002-rce']})" \
  --action=deny-403
```

Additionally, if you are managing your rules with Terraform, you may implement the rule via the following syntax:

```hcl
# Place this block inside your google_compute_security_policy resource.
rule {
  action   = "deny(403)"
  priority = "PRIORITY_NUMBER"
  match {
    expr {
      expression = "(has(request.headers['next-action']) || has(request.headers['rsc-action-id']) || request.headers['content-type'].contains('multipart/form-data') || request.headers['content-type'].contains('application/x-www-form-urlencoded')) && evaluatePreconfiguredWaf('cve-canary',{'sensitivity': 0, 'opt_in_rule_ids': ['google-mrs-v202512-id000001-rce','google-mrs-v202512-id000002-rce']})"
    }
  }
  description = "Applies protection for CVE-2025-55182 (React/Next.js)"
}
```
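The match expression used in all three variants has two parts: a header pre-filter, and the `evaluatePreconfiguredWaf` call that does the actual signature matching. The sketch below renders only the pre-filter in JavaScript, to show which requests go on to the WAF check. It assumes `headers` is a plain object keyed by lowercase header name; the WAF signatures themselves run inside Cloud Armor and are not reproduced here.

```javascript
// JavaScript rendering of the header pre-filter in the Cloud Armor
// expression. Returns true when a request would go on to
// evaluatePreconfiguredWaf(). Assumes lowercase header names.
function reachesWafCheck(headers) {
  const contentType = headers["content-type"] ?? "";
  return (
    "next-action" in headers ||
    "rsc-action-id" in headers ||
    contentType.includes("multipart/form-data") ||
    contentType.includes("application/x-www-form-urlencoded")
  );
}

console.log(reachesWafCheck({ "next-action": "abc123" }));          // true
console.log(reachesWafCheck({ "content-type": "application/json" })); // false
```

Because the two parts are joined with `&&`, a request must both match the pre-filter and trigger one of the opt-in signatures to be denied. Narrowing evaluation to Server Action and form-style requests presumably limits the rule's footprint on unrelated traffic.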

Verifying WAF rule safety for your application and consuming telemetry

Cloud Armor rules can be configured in preview mode, a logging-only mode to test or monitor the expected impact of the rule without Cloud Armor enforcing the configured action. We recommend that the new rule described above first be deployed in preview mode in your production environments so that you can see what traffic it would block.

Once you verify that the new rule is behaving as desired in your environment, then you can disable preview mode to allow Cloud Armor to actively enforce it.

Cloud Armor per-request WAF logs are emitted to Cloud Logging as part of the Application Load Balancer logs. To see Cloud Armor’s decision on every request, you first need to enable load balancer logging on each backend service. Once it is enabled, all subsequent Cloud Armor decisions are logged and can be found in Cloud Logging by following these instructions.
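While the rule is in preview mode, it can help to tally what it would have done before you turn on enforcement. The following is a hypothetical sketch over log entries exported from Cloud Logging as JSON. It assumes preview decisions appear under `jsonPayload.previewSecurityPolicy` with `priority` and `outcome` fields; verify that schema against your own logs before relying on it.

```javascript
// Hypothetical sketch: summarize what a preview-mode rule *would* have done.
// Assumed log schema: jsonPayload.previewSecurityPolicy.{priority, outcome}.
function summarizePreview(entries, rulePriority) {
  const summary = { wouldDeny: 0, wouldAccept: 0 };
  for (const entry of entries) {
    const preview = entry.jsonPayload?.previewSecurityPolicy;
    if (!preview || preview.priority !== rulePriority) continue;
    if (preview.outcome === "DENY") summary.wouldDeny += 1;
    else if (preview.outcome === "ACCEPT") summary.wouldAccept += 1;
  }
  return summary;
}

// Made-up entries for illustration.
const entries = [
  { jsonPayload: { previewSecurityPolicy: { priority: 1000, outcome: "DENY" } } },
  { jsonPayload: { previewSecurityPolicy: { priority: 1000, outcome: "ACCEPT" } } },
  { jsonPayload: { previewSecurityPolicy: { priority: 2000, outcome: "DENY" } } },
  { jsonPayload: {} },
];
console.log(summarizePreview(entries, 1000)); // { wouldDeny: 1, wouldAccept: 1 }
```

If the would-deny requests include legitimate traffic, investigate them before disabling preview mode.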

Interaction of Cloud Armor rules with vulnerability scanning tools

There has been a proliferation of scanning tools designed to help identify vulnerable instances of React and Next.js in your environments. Many of those scanners are designed to identify the version number of relevant frameworks in your servers and do so by crafting a legitimate query and inspecting the response from the server to detect the version of React and Next.js that is running.

Our WAF rule is designed to detect and prevent exploit attempts of CVE-2025-55182. As the scanners discussed above are not attempting an exploit, but sending a safe query to elicit a response revealing indications of the version of the software, the above Cloud Armor rule will not detect or block such scanners.

If these scanners report a vulnerable instance of software protected by Cloud Armor, that does not mean an actual exploit attempt will successfully get through your Cloud Armor security policy. Instead, it means that the detected version of React or Next.js is known to be vulnerable and should be patched.

How to get started with Cloud Armor for new users

If your workload is already using an Application Load Balancer to receive traffic from the internet, you can configure Cloud Armor to protect your workload from this and other application-level vulnerabilities (as well as DDoS attacks) by following these instructions.

If you are not yet using an Application Load Balancer and Cloud Armor, you can get started with the external Application Load Balancer overview, the Cloud Armor overview, and the Cloud Armor best practices.

If your workload is using Cloud Run, Cloud Run functions, or App Engine and receives traffic from the internet, you must first set up an Application Load Balancer in front of your endpoint to leverage Cloud Armor security policies to protect your workload. You will then need to configure the appropriate controls to ensure that Cloud Armor and the Application Load Balancer can’t be bypassed.

Best practices and additional risk mitigations

Once you configure Cloud Armor, we recommend consulting our best practices guide. Be sure to account for limitations discussed in the documentation to minimize risk and optimize performance while ensuring the safety and availability of your workloads.

Serverless platform protections

Google Cloud is enforcing platform-level protections across App Engine Standard, Cloud Functions, and Cloud Run to automatically help protect against common exploit attempts of CVE-2025-55182. This protection supplements the protections already in place for Firebase Hosting and Firebase App Hosting.

What this means for you:

Applications deployed to those serverless services benefit from these WAF rules that are enabled by default to help provide a base level of protection without requiring manual configuration.
These rules are designed to block known malicious payloads targeting this vulnerability.

Important considerations:

Patching is still critical: These platform-level defenses are intended to be a temporary mitigation. The most effective long-term solution is to update your application's dependencies to non-vulnerable versions of React and Next.js, and redeploy them.
Potential impacts: While unlikely, if you believe this platform-level filtering is incorrectly impacting your application's traffic, please contact Google Cloud Support and reference issue number 465748820.

Long-term mitigation: Mandatory framework update and redeployment

While WAF rules provide critical frontline defense, the most comprehensive long-term solution is to patch the underlying frameworks.

While Google Cloud is providing platform-level protections and Cloud Armor options, we urge all customers running React and Next.js applications on Google Cloud to immediately update their dependencies to the latest stable versions (React 19.2.1 or the relevant version of Next.js listed here), and redeploy their services.

This applies specifically to applications deployed on:

Cloud Run, Cloud Run functions, or App Engine: Update your application dependencies with the updated framework versions and redeploy.
Google Kubernetes Engine (GKE): Update your container images with the latest framework versions and redeploy your pods.
Compute Engine: The public OS images provided by Google Cloud do not have React or Next.js packages installed by default. If you have installed a custom OS with the affected packages, update your workloads to include the latest framework versions and enable WAF rules in front of all workloads.
Firebase: If you’re using Cloud Functions for Firebase, Firebase Hosting, or Firebase App Hosting, update your application dependencies with the updated framework versions and redeploy. Firebase Hosting and App Hosting are also automatically enforcing a rule to limit exploitation of CVE-2025-55182 through requests to custom and default domains.

Patching your applications is an essential step to eliminate the vulnerability at its source and ensure the continued integrity and security of your services.

We will continue to monitor the situation closely and provide further updates and guidance as necessary. Please refer to our official Google Cloud Security advisories for the most current information and detailed steps.

If you have any questions or require assistance, please contact Google Cloud Support and reference issue number 465748820.

11 ways to reduce your Google Cloud compute costs today

Mon, 06 Oct 2025 16:00:00 +0000

As the saying goes, "a penny saved is a penny earned," and this couldn't be more true when it comes to cloud infrastructure. In today's competitive business landscape, you need to control costs while maintaining the performance your business needs. Luckily, Google Cloud’s Compute Engine and block storage services offer numerous opportunities to reduce costs without sacrificing performance, especially in the context of your migration and modernization initiatives.

In this article, we'll explore 11 key ways to optimize your infrastructure spending on Google Cloud, from simple adjustments to strategic decisions that can result in significant long-term savings.

1. Choose the right VM instances

One of the most effective ways to reduce Compute Engine costs is to ensure that you’ve properly selected and right-sized your virtual machines (VMs) for their workloads to support your migration and modernization efforts. Whether you're new to Google Cloud or already using Compute Engine, adopting the latest-generation VMs — such as N4, C4, C4D, and C4A — can deliver substantial savings and improved price-performance.

Powered by Google Cloud’s Titanium architecture, our latest-generation VMs offer faster CPUs, higher memory bandwidth, and more efficient virtualization than their predecessors, so you can handle the same workloads with fewer resources. For existing customers, migrating from older VM generations to the newest VMs can significantly lower total costs while helping you exceed current performance levels. Organizations that have made the switch often report 20–40% better performance along with meaningful reductions in cloud compute spend. For example, Elastic used the general-purpose C4A machine series, based on Google Cloud's Arm-based Axion CPUs, to achieve a compelling efficiency and performance uplift for their workloads.

Beyond general-purpose VMs, we also offer specialized machine types to address unique customer requirements. Compute-optimized HPC VMs like H4D are designed for high-performance computing and data analytics, offering extreme performance for demanding workloads. M4 and X4 instances cater to memory-intensive applications, while Z3 instances are ideal for storage-intensive workloads. Furthermore, if you need complete control over your hardware environment and maximum performance isolation, we offer bare metal instances.

These options help ensure that even the most specialized and performance-sensitive workloads can find an optimal and cost-effective home within the Compute Engine portfolio.

2. Optimize your block storage selections

The best way to lower your block storage TCO while ensuring your workloads remain successful is to drive high resource efficiency. Hyperdisk makes it simple to achieve both high performance and high efficiency in two ways: by letting you optimize your block storage for each workload, and through Storage Pools. Below, we’ll discuss each of these capabilities and how you can use them to lower your block storage TCO.

Workload Optimized: With Hyperdisk, you can provision performance and capacity independently at the volume level, matching your block storage resources to your workload. You can use this capability to purchase just the capacity and performance you need, no more and no less. You can also take advantage of Hyperdisk Balanced’s “baseline” performance (i.e., performance included free with every volume), which can serve the vast majority of your VMs without purchasing any extra performance.

Storage Pools: Hyperdisk is the only hyperscale cloud block storage to offer thin-provisioned performance and capacity. With Hyperdisk Storage Pools, you provision the aggregate performance and capacity your workload requires, while still provisioning the volume-level capacity and performance your workloads request (also known as thin provisioning). This means you pay for the resources you actually need, not the sum of the volumes you’ve provisioned. As a result, you can lower your overall block storage TCO by as much as 50%.
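To illustrate why thin provisioning can cut costs, here is a simplified, capacity-only model with made-up volumes. It compares the sum of per-volume provisioned capacity (what you would pay for without a pool) with a pool sized to actual usage plus a growth buffer. Real Storage Pool billing depends on the pool type and also covers performance, so treat this purely as arithmetic.

```javascript
// Simplified capacity-only model with hypothetical numbers; not a pricing tool.
function compareProvisioning(volumes, headroom) {
  const sumProvisionedGiB = volumes.reduce((t, v) => t + v.provisionedGiB, 0);
  const sumUsedGiB = volumes.reduce((t, v) => t + v.usedGiB, 0);
  // Pool sized to actual usage plus a growth buffer (e.g. 0.25 = 25%).
  const poolCapacityGiB = Math.ceil(sumUsedGiB * (1 + headroom));
  return {
    sumProvisionedGiB,
    poolCapacityGiB,
    reduction: 1 - poolCapacityGiB / sumProvisionedGiB,
  };
}

// Hypothetical volumes: each requests more capacity than it actually uses.
const volumes = [
  { provisionedGiB: 500, usedGiB: 120 },
  { provisionedGiB: 500, usedGiB: 200 },
  { provisionedGiB: 1000, usedGiB: 480 },
];
console.log(compareProvisioning(volumes, 0.25));
```

Here the volumes request 2,000 GiB but use 800 GiB, so a pool sized at 1,000 GiB (usage plus 25%) halves the provisioned capacity you pay for.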

For more information on how to select the right block storage for your workload and to see how customers have benefitted from Hyperdisk, read this blog.

3. Consider custom compute classes

To get the most out of our latest-generation VMs, Google Kubernetes Engine (GKE) custom compute classes (CCC) offer an advanced way to optimize compute choices and provide high availability. Instead of being limited to a single machine type for your workloads, you can define a prioritized list of VM instance types. This allows you to set the newest, most price-performant VMs — including our latest-generation VMs — as your top priority. GKE custom compute classes provide the capability to automatically and seamlessly spin up instances based on your specified priority list. This feature helps you maximize the availability of your compute capacity while still aiming for the most cost-effective options, so your workloads can scale reliably without manual intervention.

Here are some specific use cases for how custom compute classes can help you optimize costs:

Autoscaling cost-performant fallbacks: When demand peaks, you might be tempted to autoscale using a highly available but less cost-efficient VM type. CCC allows you to take a tiered approach. You can set up several cost-efficient fallback alternatives, so that as demand increases, GKE first attempts to use the most cost-effective options, and progressively moves to the other choices in your list when necessary to meet demand.
AI/ML inference: Running AI/ML inference workloads often involves significant compute resources. Instead of maintaining a large, static reservation that might sit idle during off-peak times, CCC lets you provision a minimal base reservation and leverage more cost-effective capacity types, such as Spot VMs, to handle peak inference demand — all orchestrated through your CCC configuration.
Adopting new VM generations: Combine the power of GKE custom compute classes with Compute Flexible committed use discounts (Flex CUDs) to de-risk the adoption of new, cost-efficient VM series like N4 and C4. With CCC, you can define fallback options, providing workload resilience, while Flex CUDs offer financial adaptability, as the discounts apply across your total eligible compute spend, regardless of the specific VM series you use. This dual approach is a safe, cost-effective strategy for leveraging the latest hardware without disruption. For more information, read this blog.
Using flexible Spot VMs: Spot VMs offer significant savings but can be preempted. Being constrained to a single Spot VM shape increases the risk that capacity will not be available. With CCC, you can define multiple fallback Spot VM types. This "spot surfing" capability allows the application to remain on cost-efficient Spot capacity by automatically pivoting to alternative Spot instance types if the primary choice is unavailable.
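The prioritized-fallback idea behind these use cases can be sketched as "walk an ordered list and take the first option with capacity." The code below is a hypothetical illustration of that logic only: in GKE, the custom compute class machinery performs the selection for you, and `hasCapacity` stands in for real capacity availability.

```javascript
// Hypothetical sketch of prioritized fallback: return the first option
// in preference order that currently has capacity.
function pickNodeOption(priorities, hasCapacity) {
  for (const option of priorities) {
    if (hasCapacity(option)) return option;
  }
  return null; // nothing available right now
}

// Example priority list, most preferred first.
const priorities = [
  { machineFamily: "c4", spot: true },
  { machineFamily: "n4", spot: true },
  { machineFamily: "n4", spot: false },
];

// Suppose C4 Spot capacity is unavailable at the moment.
const choice = pickNodeOption(priorities, (o) => !(o.machineFamily === "c4" && o.spot));
console.log(choice); // { machineFamily: 'n4', spot: true }
```

The same loop describes "spot surfing": listing several Spot shapes ahead of an on-demand entry keeps workloads on Spot capacity for as long as any listed Spot shape is available.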

In short, by leveraging GKE CCC, you can artfully mix and match various VM types and consumption models, including On-Demand, Spot, DWS FlexStart, and instances covered by CUDs, to build a resilient and highly cost-optimized infrastructure that adapts to the unique needs and patterns of your workloads.

4. Leverage custom machine types (CMT)

Custom machine types, available on N4 VMs, allow you to precisely configure virtual machines to your exact specifications. Rather than selecting from predefined machine types that might include excess capacity, you can tailor the CPU-to-memory ratio specifically for your workloads, so you only pay for resources you actually use. This targeted approach minimizes waste and can significantly reduce your cloud spend, especially when migrating from on-premises to Google Cloud or from other cloud providers.

This flexibility becomes particularly valuable if your applications have unique resource profiles that don't align well with our standard offerings. Custom machine types let you create the perfect environment for your needs. By avoiding the compromise of over-provisioning certain resources while potentially constraining others, you can achieve both better performance and more efficient spending across your Compute Engine deployment.

As an example, take a memory-intensive workload that runs best with 16 vCPUs and 70 GB of memory. With our standard shapes (or in other cloud contexts), you would normally need to pick a VM with 128 GB of memory, resulting in higher costs due to the extra provisioned resources. With custom machine types, you can instead launch a VM with exactly 16 vCPUs and 70 GB of memory, resulting in an 18% cost savings compared with a standard n4-highmem-16 VM.
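The savings come from not paying for unused memory. The sketch below works through that arithmetic using made-up hourly rates, which are not Google Cloud prices. It models cost as vCPUs times a vCPU rate plus memory times a per-GB rate. The result depends entirely on the rates you plug in, and the 18% figure quoted above reflects real pricing that this sketch does not reproduce.

```javascript
// Worked arithmetic with HYPOTHETICAL hourly rates (not Google Cloud prices).
// Real custom and predefined shapes may be priced differently.
function hourlyCost({ vcpus, memoryGb }, { vcpuRate, memGbRate }) {
  return vcpus * vcpuRate + memoryGb * memGbRate;
}

function savingsFraction(custom, standard, rates) {
  return 1 - hourlyCost(custom, rates) / hourlyCost(standard, rates);
}

const rates = { vcpuRate: 0.04, memGbRate: 0.005 }; // made-up $/hour
const standardShape = { vcpus: 16, memoryGb: 128 }; // same shape as n4-highmem-16
const customShape = { vcpus: 16, memoryGb: 70 };

console.log(savingsFraction(customShape, standardShape, rates).toFixed(3));
```

To estimate savings for your own workload, substitute the current rates from the Compute Engine pricing page for your region.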

5. Make the most of committed use discounts

CUDs are a strategic cost-saving opportunity for organizations with steady, predictable computing needs. By committing to resource usage over one- or three-year periods, you can reduce cloud costs by up to 70% compared to on-demand pricing. This approach not only helps ensure budget predictability but also converts fixed infrastructure spending into a financial advantage, making it ideal for stable workloads that support core business functions.

Google Cloud offers flexible CUD structures to align with various operational models. Resource-based commitments target specific machine types and regions, while flexible commitments apply discounts across projects, regions, and machine series, which makes them a good fit for dynamic environments. By analyzing historical usage and forecasting future needs, you can identify workloads suited for these discounts, reinvesting the savings into innovation and scaling initiatives.

6. Manage unused disk space

You pay for the total provisioned disk space, regardless of how much you actually use. Many organizations tend to over-provision storage "just in case," which often leads to unnecessary and costly waste. For instance, if you provision a 100 GB disk but only use 20 GB, you're still paying for the entire 100 GB. Being intentional and precise with your storage allocations — rather than rounding up to common sizes — can lead to significant cost savings.

To optimize spending, it's important to adopt a few best practices. Using Ops Agent, regularly audit disk usage across your infrastructure to identify and eliminate inefficiencies. Resize disks to align with actual consumption, allowing a reasonable buffer for growth. Implement automated alerts in Cloud Monitoring to detect underutilized disks and take corrective action. For stateless applications, consider using smaller boot disk images to minimize overhead and reduce costs even further.

In addition, consider the following optimization strategies to further reduce costs and improve efficiency:

Use Google Cloud’s monitoring tools to track CPU, memory, and disk usage over time.
Establish a regular review cycle to identify and right-size over-provisioned resources.
Test workloads across different VM configurations to find the optimal balance between cost and performance.
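An audit like the one described above can be reduced to a simple rule: flag disks whose utilization falls below a threshold, and suggest a size that keeps a growth buffer. The following hypothetical helper assumes usage figures already collected (for example, from Ops Agent metrics in Cloud Monitoring). The 50% threshold and 25% buffer are illustrative defaults, not recommendations.

```javascript
// Hypothetical helper: flag under-utilized disks and suggest a size
// that keeps a growth buffer. Thresholds are illustrative only.
function rightsizeDisks(disks, { minUtilization = 0.5, buffer = 0.25 } = {}) {
  return disks
    .filter((d) => d.usedGb / d.provisionedGb < minUtilization)
    .map((d) => ({
      name: d.name,
      provisionedGb: d.provisionedGb,
      suggestedGb: Math.ceil(d.usedGb * (1 + buffer)),
    }));
}

const disks = [
  { name: "app-data", provisionedGb: 100, usedGb: 20 }, // the 100 GB / 20 GB case above
  { name: "db-data", provisionedGb: 500, usedGb: 400 },
];
console.log(rightsizeDisks(disks));
```

Only "app-data" is flagged, with a suggested size of 25 GB. Compute Engine disks can be enlarged but not shrunk in place, so acting on a smaller suggestion typically means creating a new, smaller disk and migrating the data to it.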

7. Use Spot VMs

Spot VMs provide the same machine types and configuration options as standard virtual machines but at a significantly reduced cost — typically offering a 60% to 91% discount. This cost efficiency comes with the tradeoff of potential preemption at short notice, making them most suitable for workloads that are fault-tolerant and can recover quickly from unexpected interruptions. Spot VMs are designed to take advantage of unused compute capacity, allowing you to optimize your cloud spending without compromising access to high-performance resources.

Strong use cases for Spot VMs include batch processing jobs, big data and analytics workloads, continuous integration and deployment (CI/CD) pipelines, stateless web servers running in autoscaling groups, and compute-heavy tasks. When properly architected to handle interruptions — for example, by using job checkpointing, load balancing, task queues, or via GKE custom compute classes (see more above) — Spot VMs can play a critical role in minimizing infrastructure costs while maintaining high availability and system resilience. Leveraging Spot VMs in these scenarios lets you scale cost-effectively, especially when compute demand is variable or time-flexible.
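Job checkpointing is what lets a preempted Spot VM pick up where it left off. Here is a minimal sketch of the pattern. The `store` parameter stands in for durable storage (hypothetically, an object in Cloud Storage); an in-memory `Map` is used so the example runs anywhere, and `stopAfter` simulates a preemption partway through a run.

```javascript
// Minimal checkpointing sketch for a batch job on Spot VMs.
// `store` stands in for durable storage; `stopAfter` simulates preemption.
function runBatch(items, store, processItem, { stopAfter = Infinity } = {}) {
  let next = store.get("checkpoint") ?? 0; // resume from the last saved position
  let done = 0;
  while (next < items.length && done < stopAfter) {
    processItem(items[next]);
    next += 1;
    done += 1;
    store.set("checkpoint", next); // persist progress after each item
  }
  return next === items.length; // true once the whole batch has finished
}

const store = new Map();
const processed = [];
const items = ["a", "b", "c", "d", "e"];

runBatch(items, store, (x) => processed.push(x), { stopAfter: 2 }); // "preempted"
runBatch(items, store, (x) => processed.push(x)); // rescheduled run resumes
console.log(processed.join("")); // abcde
```

Because a VM can be preempted between processing an item and saving the checkpoint, real work items should be idempotent so that reprocessing one after a restart is harmless.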

8. Use optimization recommendations

Google Cloud's Recommenders are a powerful tool designed to help you optimize your cloud resources efficiently. When browsing the Google Cloud console, you may see lightbulb icons next to specific resources — these indicate potential improvements identified by Google's recommendation engine. By analyzing real-time usage patterns and current resource configurations, the Recommender delivers actionable insights tailored to each user's unique environment. This intelligent system highlights opportunities not only to reduce costs but also to enhance security, performance, reliability, management efficiency, and environmental sustainability.

For example, there are idle VM recommendations to help you identify VM instances that have not been used over the last 1 to 14 days. Common recommendations include switching to more suitable machine types, rightsizing underutilized compute instances, or adopting more cost-effective storage solutions. The tool allows you to apply many of these changes directly, streamlining the optimization process. By continuously evaluating workloads and offering these automated, data-driven suggestions, the Recommendation Hub helps organizations maintain cloud performance while managing costs more effectively.

9. Take advantage of auto-scaling and scheduling

Matching your compute resources to actual demand patterns is one of the most effective ways to reduce cloud waste and improve overall cost efficiency. Many organizations over-provision their resources to handle peak workloads, leaving machines underutilized during off-peak periods. By aligning compute capacity more closely with real-time or predictable usage patterns, such as business hours or seasonal trends, you can significantly cut unnecessary spending without sacrificing performance.

Autoscaling is the key to achieving this efficiency. In fact, customers who leverage Google Compute Engine autoscaling for their virtual machines have seen average infrastructure cost savings of more than 40%.

You can implement autoscaling strategies to dynamically adjust resources based on CPU utilization, load balancing capacity, or custom application metrics, so that workloads receive the necessary compute power when needed, while scaling down automatically during low-demand periods.
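Utilization-based autoscaling usually follows a target-tracking idea: size the group so that average utilization settles near a chosen target. The sketch below is a simplified illustration of that idea only. Compute Engine's actual autoscaler applies additional logic, and the function and parameter names here are hypothetical.

```python
def recommended_size(current_instances: int, avg_cpu_pct: int, target_cpu_pct: int,
                     min_instances: int = 1, max_instances: int = 10) -> int:
    """Target-tracking sketch: size the group so average CPU lands near the target.

    Utilization values are integer percentages (e.g. 90 means 90%) so the
    ceiling division below stays exact.
    """
    if not 0 < target_cpu_pct <= 100:
        raise ValueError("target_cpu_pct must be between 1 and 100")
    # Current load in "percent-instances", divided by the load each instance should carry,
    # rounded up so the group is never undersized.
    needed = -(-current_instances * avg_cpu_pct // target_cpu_pct)
    return max(min_instances, min(max_instances, needed))
```

For example, 4 instances averaging 90% CPU with a 60% target yield 6 instances. The same 4 instances at 30% CPU shrink to 2.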

For workloads with predictable patterns, such as those that fluctuate with business hours or planned seasonal events, schedule-based scaling is a particularly powerful tool. This approach lets you proactively increase resources in anticipation of high demand and scale them down during lulls, giving you the performance you need without constant over-provisioning.

In addition to autoscaling, several practical implementation techniques can further optimize your resource usage. Setting up instance scheduling lets you automatically start and stop development and test environments according to business hours, a simple yet highly effective approach that can lead to cost savings of up to 70%. You can also use maintenance windows to reduce disruptions and resource consumption by concentrating updates and system changes into low-usage periods. Together, these tactics help maintain high availability and performance while keeping infrastructure costs under control.
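The arithmetic behind scheduling is simple. A VM that runs 10 hours a day, 5 days a week is up for 50 of the week's 168 hours, which avoids roughly 70% of always-on VM hours. The sketch below computes that fraction and decides whether a dev/test VM should be running at a given time. The 08:00 to 18:00 Monday-to-Friday window is an assumed example. The calculation also ignores charges that continue while a VM is stopped, such as persistent disk storage.

```python
from datetime import datetime

HOURS_PER_WEEK = 7 * 24


def scheduled_savings(hours_per_day: int, days_per_week: int) -> float:
    """Fraction of always-on VM hours avoided by running only during the schedule."""
    running = hours_per_day * days_per_week
    return 1 - running / HOURS_PER_WEEK


def should_run(now: datetime, start_hour: int = 8, end_hour: int = 18,
               workdays: frozenset = frozenset({0, 1, 2, 3, 4})) -> bool:
    """True if a dev/test VM should be running at `now` (Monday=0 ... Sunday=6)."""
    return now.weekday() in workdays and start_hour <= now.hour < end_hour
```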

10. Understand your spend with detailed billing analysis

Before implementing any cost-saving strategies in Google Cloud, it’s essential to understand your current spending in detail. Google Cloud’s billing panel offers granular visibility into your expenses, including costs broken down by individual SKUs. This level of transparency lets you track where your money is going and identify potential inefficiencies. Begin by regularly reviewing your billing dashboard to monitor usage trends and spot anomalies. Applying labels and tags to your resources can further help categorize and attribute costs accurately, especially in complex environments with multiple projects or departments.
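Labels only pay off if you actually group costs by them. The sketch below sums cost per label value over simplified billing rows. The row shape (`cost` plus a `labels` dictionary) is a hypothetical simplification, not the Cloud Billing export schema. Resources without the label are collected under "unlabeled", which is often where unattributed spend hides.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def cost_by_label(rows: Iterable[Mapping], label_key: str) -> Dict[str, float]:
    """Sum cost per value of `label_key`; rows lacking the label go to 'unlabeled'."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        value = row.get("labels", {}).get(label_key, "unlabeled")
        totals[value] += row["cost"]
    return dict(totals)
```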

In addition, setting up budget alerts is a practical way to stay ahead of overspending by notifying you when costs approach or exceed predefined thresholds. It’s also important to identify and eliminate unused or idle resources, such as virtual machines or persistent disks that are no longer in active use — these can often be shut down or deleted to immediately reduce costs. By thoroughly analyzing your cost structure, you can uncover “low-hanging fruit” — resources that provide little or no value — and make data-driven decisions to optimize your cloud usage efficiently.
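Budget alerts boil down to comparing spend against fractions of a budget and notifying once per threshold. The sketch below shows that logic. The 50%, 90%, and 100% levels and the function names are illustrative assumptions. This is not the Cloud Billing budgets API.

```python
from typing import Iterable, List, Sequence

# Illustrative alert levels as fractions of the budget; adjust to your own policy.
DEFAULT_THRESHOLDS = (0.5, 0.9, 1.0)


def crossed_thresholds(spend: float, budget: float,
                       thresholds: Sequence[float] = DEFAULT_THRESHOLDS) -> List[float]:
    """Return every threshold that current spend has reached or exceeded."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [t for t in thresholds if spend >= t * budget]


def new_alerts(spend: float, budget: float, already_sent: Iterable[float],
               thresholds: Sequence[float] = DEFAULT_THRESHOLDS) -> List[float]:
    """Thresholds crossed since the last check, so each alert is sent only once."""
    sent = set(already_sent)
    return [t for t in crossed_thresholds(spend, budget, thresholds) if t not in sent]
```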

11. Consider serverless alternatives

Last but not least, Google Cloud's serverless computing offerings provide a compelling alternative to traditional virtual machines and can deliver better cost efficiency, simplified operations, and greater scalability. By abstracting away infrastructure management, serverless platforms allow teams to focus on writing and deploying code without worrying about provisioning, scaling, or maintaining servers. This shift can reduce operational overhead and also cut costs by aligning compute spending directly with application usage.

There are multiple serverless options available, each tailored to different workloads. Cloud Run is designed for running containerized applications that need rapid scaling and flexible deployment. Cloud Run Functions supports lightweight, event-driven code execution for microservices or automation tasks. GKE (Autopilot mode) simplifies Kubernetes operations by automatically managing nodes and scaling, allowing you to run Kubernetes workloads without handling the underlying infrastructure. All of these options charge based on usage, not allocation, significantly reducing the costs associated with idle resources and over-provisioning. This makes them especially beneficial for variable or unpredictable workloads. Cloud Run and GKE both support GPUs, and you can move workloads between the two: start with Cloud Run and move to GKE, or vice versa. Some customers also use both offerings together. As a rule of thumb, start with GKE if you need access to the Kubernetes API; otherwise, start with Cloud Run.
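The difference between usage-based and allocation-based billing can be made concrete with a little arithmetic. Always-on capacity is billed for every hour in the month. Usage-billed capacity is billed only while it is doing work. So even if the usage rate is somewhat higher per hour, it wins below a break-even utilization. All rates below are hypothetical placeholders, not Google Cloud prices, and real pricing has more dimensions than a single hourly rate.

```python
HOURS_PER_MONTH = 730  # a common approximation of the hours in one month


def allocation_billed(rate_per_hour: float) -> float:
    """Always-on capacity: billed for every hour, busy or idle."""
    return rate_per_hour * HOURS_PER_MONTH


def usage_billed(rate_per_hour: float, busy_hours: float) -> float:
    """Usage-billed capacity: billed only while handling work (scales to zero when idle)."""
    return rate_per_hour * busy_hours


def break_even_utilization(alloc_rate: float, usage_rate: float) -> float:
    """Fraction of the month above which always-on capacity becomes the cheaper option."""
    return alloc_rate / usage_rate
```

With a hypothetical 0.10/hour always-on rate and 0.125/hour usage rate, a workload busy for 146 hours (20% of the month) costs 18.25 instead of 73. The always-on option only becomes cheaper above 80% utilization.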

Start reducing your costs today

Migrate to Google Cloud and optimize your infrastructure costs without compromising on what your workloads need. If you are new to Google Cloud, start with a migration assessment. Google Cloud’s Migration Center can give you a clear understanding of your potential savings from migrating to Google Cloud, with detailed recommended paths for your workloads and TCO reports. Apply the strategies in this article to unlock substantial cost savings.

Automate app deployment and security analysis with new Gemini CLI extensions

Wed, 10 Sep 2025 14:00:00 +0000

Find and fix security vulnerabilities. Deploy your app to the cloud. All without leaving your command-line.

Today, we’re closing the gap between your terminal and the cloud with a first look at the future of Gemini CLI, delivered through two new extensions: the Security extension and the Cloud Run extension. These extensions are designed to handle critical parts of your workflows with simple, intuitive commands:

1) /security:analyze performs a comprehensive scan right in your local repository, with support for GitHub pull requests coming soon. This makes security a natural part of your development cycle.

2) /deploy deploys your application to Cloud Run, our fully managed serverless platform, in just a few minutes.

These commands are the first expression of a new extensibility framework for Gemini CLI. While we'll be sharing more about the full Gemini CLI extension world soon, we couldn't wait to get these capabilities into your hands. Consider this a sneak peek of what’s coming next!

Security extension: automate security analysis with /security:analyze

To help teams address software vulnerabilities early in the development lifecycle, we are launching the Gemini CLI Security extension. This new open-source tool automates security analysis, enabling you to proactively catch and fix issues using the /security:analyze command at the terminal or, soon, through a GitHub Actions integration.

Integrated directly into your local development workflow and CI/CD pipeline, this extension:

Analyzes code changes: When triggered, the extension automatically takes the git diff of your local changes or pull request.
Identifies vulnerabilities: Using a specialized prompt and tools, Gemini CLI analyzes the changes for a wide range of potential vulnerabilities, such as hardcoded secrets, injection vulnerabilities, broken access control, and insecure data handling.
Provides actionable feedback: Gemini returns a detailed, easy-to-understand report directly in your terminal or as a comment on your pull request. This report doesn't just flag issues; it explains the potential risks and provides concrete suggestions for remediation, helping you fix issues quickly and learn as you go.

And after the report is generated, you can also ask Gemini CLI to save it to disk or even implement fixes for each issue.

Getting started with /security:analyze

Integrating security analysis into your workflow is simple. First, download the Gemini CLI and install the extension (requires Gemini CLI v0.4.0+):

```shell
gemini extensions install https://github.com/google-gemini/gemini-cli-security
```

Then you can run your first scan:

Locally: After making local changes, simply run /security:analyze in the Gemini CLI.
In CI/CD (Coming Soon): We're bringing security analysis directly into your CI/CD workflow. Soon, you’ll be able to configure the GitHub Action to automatically review pull requests as they are opened.

This is just the beginning. The team is actively working on further enhancing the extension's capabilities, and we are also inviting the community to contribute to this open source project by reporting bugs, suggesting features, continuously improving security practices and submitting code improvements.

For complete documentation and to contribute, visit the official GitHub repository.

Cloud Run extension: automate deployment with /deploy

The /deploy command in Gemini CLI automates the entire deployment pipeline for your web applications. You can now deploy a project directly from your local workspace. Once you issue the command, Gemini returns a public URL for your live application.

The /deploy command automates a full CI/CD pipeline to deploy web applications and cloud services from the command line using the Cloud Run MCP server. What used to be a multi-step process of building, containerizing, pushing, and configuring is now a single, intuitive command from within the Gemini CLI.

You can access this feature across three different surfaces – in Gemini CLI in the terminal, in VS Code via Gemini Code Assist agent mode, and in Gemini CLI in Cloud Shell.

Use /deploy command in Gemini CLI at the terminal to deploy application to Cloud Run

Get started with /deploy:

For existing Google Cloud users, getting started with /deploy is straightforward in Gemini CLI at the terminal:

Prerequisites: You'll need the gcloud CLI installed and configured on your machine, plus an existing app (or you can use Gemini CLI to create one).

Step 1: Install the Cloud Run extension
The /deploy command is enabled through a Model Context Protocol (MCP) server, which is included in the Cloud Run extension. To install the Cloud Run extension (requires Gemini CLI v0.4.0+), run this command:

```shell
gemini extensions install https://github.com/GoogleCloudPlatform/cloud-run-mcp
```

Step 2: Authenticate with Google Cloud
Ensure your local environment is authenticated to your Google Cloud account by running:

```shell
gcloud auth login
gcloud auth application-default login
```

Step 3: Deploy your app
Navigate to your application's root directory in your terminal and type gemini to launch Gemini CLI. Once inside, type /deploy to deploy your app to Cloud Run.

That's it! In a few moments, Gemini CLI will return a public URL where you can access your newly deployed application. You can also visit the Google Cloud Console to see your new service running in Cloud Run.

Besides Gemini CLI at the terminal, this feature can also be accessed in VS Code via Gemini Code Assist agent mode, powered by Gemini CLI, and in Gemini CLI in Cloud Shell, where the authentication step will be automatically handled out of the box.

Use /deploy command to deploy application to Cloud Run in VS Code via Gemini Code Assist agent mode.

Building a robust extension ecosystem

The Security and Cloud Run extensions are two of the first extensions from Google built on our new framework, which is designed to create a rich and open ecosystem for the Gemini CLI. We are building a platform that will allow any developer to extend and customize the CLI's capabilities, and this is just an early preview of the full platform's potential. We will be sharing a more comprehensive look at our extensions platform soon, including how you can start building and sharing your own.

Try Gemini CLI today: visit the GitHub repository here.

From localhost to launch: Simplify AI app deployment with Cloud Run and Docker Compose

Thu, 10 Jul 2025 09:30:00 +0000

At Google Cloud, we are committed to making it as seamless as possible for you to build and deploy the next generation of AI and agentic applications. Today, we’re thrilled to announce that we are collaborating with Docker to drastically simplify your deployment workflows, enabling you to bring your sophisticated AI applications from local development to Cloud Run with ease.

Deploy your compose.yaml directly to Cloud Run

Previously, bridging the gap between your development environment and managed platforms like Cloud Run required you to manually translate and configure your infrastructure. Agentic applications that use MCP servers and self-hosted models added further complexity.

The open-source Compose Specification is one of the most popular ways for developers to iterate on complex applications in their local environment, and is the basis of Docker Compose. And now, gcloud run compose up brings the simplicity of Docker Compose to Cloud Run, automating this entire process. Now in private preview, you can deploy your existing compose.yaml file to Cloud Run with a single command, including building containers from source and leveraging Cloud Run’s volume mounts for data persistence.

Supporting the Compose Specification with Cloud Run makes for easy transitions across your local and cloud deployments, where you can keep the same configuration format, ensuring consistency and accelerating your dev cycle.

“We’ve recently evolved Docker Compose to support agentic applications, and we’re excited to see that innovation extend to Google Cloud Run with support for GPU-backed execution. Using Docker and Cloud Run, developers can now iterate locally and deploy intelligent agents to production at scale with a single command. It’s a major step forward in making AI-native development accessible and composable. We’re looking forward to continuing our close collaboration with Google Cloud to simplify how developers build and run the next generation of intelligent applications.” - Tushar Jain, EVP Engineering and Product, Docker

Cloud Run, your home for AI applications

Support for the Compose Specification isn’t the only AI-friendly innovation you’ll find in Cloud Run. We recently announced general availability of Cloud Run GPUs, removing a significant barrier to entry for developers who want access to GPUs for AI workloads. With its pay-per-second billing, scale to zero, and rapid scaling (approximately 19 seconds to first token for a gemma3:4b model when starting from zero instances), Cloud Run is a great hosting solution for deploying and serving LLMs.

This also makes Cloud Run a strong solution for Docker’s recently announced OSS MCP Gateway and Model Runner, making it easy for developers to take AI applications from local development to production in the cloud. Because Cloud Run supports Docker’s recent addition of ‘models’ to the open Compose Spec, you can deploy these complex solutions to the cloud with a single command.

Bringing it all together

Let's review the compose file for the above demo. It consists of a multi-container application (defined in services) built from source and leveraging a storage volume (defined in volumes). It also uses the new models attribute to define AI models, and a Cloud Run extension (x-google-cloudrun) that defines the runtime image to use:

```yaml
name: agent
services:
  webapp:
    build: .
    ports:
      - "8080:8080"
    volumes:
      - web_images:/assets/images
    depends_on:
      - adk

  adk:
    image: us-central1-docker.pkg.dev/jmahood-demo/adk:latest
    ports:
      - "3000:3000"
    models:
      - ai-model

models:
  ai-model:
    model: ai/gemma3-qat:4B-Q4_K_M
    x-google-cloudrun:
      inference-endpoint: docker/model-runner:latest-cuda12.2.2

volumes:
  web_images:
```

Building the future of AI

We’re committed to offering developers maximum flexibility and choice by adopting open standards and supporting various agent frameworks. This collaboration on Cloud Run and Docker is another example of how we aim to simplify the process for developers to build and deploy intelligent applications.

Compose Specification support is available for our trusted users — sign up here for the private preview.

Making it easier to scale Kafka workloads with Cloud Run worker pools

Thu, 26 Jun 2025 16:00:00 +0000

Apache Kafka is vital to many event-driven architectures and streaming data pipelines. However, effectively scaling Kafka consumers — the applications processing data from Kafka topics — can be challenging.

Today, we’re excited to discuss two capabilities that make it more efficient and cost-effective to autoscale your Kafka consumer workloads on Cloud Run: Cloud Run worker pools (in public preview), and the open-source Cloud Run Kafka Autoscaler. We announced both of these capabilities at Google Cloud Next ’25.

The challenge: Scaling pull-based workloads

Kafka consumers operate on a “pull” model, where they actively fetch data from Kafka brokers. This architecture fundamentally differs from “push” systems, where data is sent to consumers. Consequently, metrics such as CPU utilization or incoming HTTP request throughput are not sufficient to determine processing demand. The true indicator of workload for a Kafka consumer is “offset lag”, which is the delta between the latest message offset available in a topic partition and the last offset committed by the consumer group for that partition.

Incorporating queue-aware metrics like offset lag (which reside in the Kafka broker) as an autoscaling input can minimize message backlogs and optimize resource utilization.
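Offset lag for a consumer group is the per-partition difference described above, summed across partitions. The sketch below computes it from two plain dictionaries keyed by (topic, partition). In a real deployment these values would come from the broker's end offsets and the group's committed offsets via a Kafka admin client; the function name is hypothetical. Treating a partition with no committed offset as fully unconsumed is an assumption of this sketch, because actual behavior depends on the consumer's offset-reset policy and topic retention.

```python
from typing import Mapping, Tuple

TopicPartition = Tuple[str, int]


def total_offset_lag(end_offsets: Mapping[TopicPartition, int],
                     committed: Mapping[TopicPartition, int]) -> int:
    """Sum of (latest offset - committed offset) across every partition."""
    total = 0
    for tp, end in end_offsets.items():
        # Assumption: no committed offset means the partition is fully unconsumed.
        consumed = committed.get(tp, 0)
        total += max(0, end - consumed)
    return total
```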


Cloud Run worker pools for pull-based workloads

To solve the scaling challenge, you’ll first need an environment designed to run these pull-based workloads efficiently. This is where Cloud Run worker pools come in. They provide a purpose-built foundation for running Kafka consumers and other background processors, which was previously a challenging task on Cloud Run.

The three main Cloud Run resource types

While Cloud Run services are tailored for request-driven HTTP workloads and Cloud Run jobs for batch tasks that run to completion, worker pools are a distinct resource type well-suited for continuous, non-HTTP, pull-based background processing. They offer specific features that make them ideal for Kafka consumers:

Designed for background processing: Unlike services, worker pools don't require public HTTP endpoints. This reduces the network attack surface and simplifies application code, as you no longer need to manage ports for health checks.
Gradual deployments with instance splitting: Worker pools use deployment strategies tailored for pull-based workloads. Since these workloads don't handle HTTP traffic, rollouts are managed by splitting instances between revisions, rather than splitting traffic. For example, for a worker pool with four instances, you can allocate 25% (one instance) to a new canary revision and 75% (three instances) to the current, stable revision.
Significant cost savings: With worker pools, we charge up to 40% less for CPU and memory, compared to instance-billed Cloud Run services.
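Instance splitting has to turn percentages into whole instances. The sketch below shows one way to do that so the counts always add up to the pool size. Largest-remainder rounding is our own illustrative choice, not necessarily the rounding Cloud Run applies, and the function name is hypothetical.

```python
from typing import Dict


def split_instances(total: int, percents: Dict[str, int]) -> Dict[str, int]:
    """Divide `total` worker-pool instances among revisions by whole-number percentage.

    Uses integer largest-remainder rounding so the counts always sum to `total`.
    """
    if sum(percents.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts = {rev: total * pct // 100 for rev, pct in percents.items()}
    remainders = {rev: total * pct % 100 for rev, pct in percents.items()}
    leftover = total - sum(counts.values())
    # Hand any remaining instances to the revisions with the largest remainders.
    for rev in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[rev] += 1
    return counts
```

With four instances, a 25/75 split gives one canary instance and three stable ones, matching the example above.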

Worker pools are available in the Google Cloud CLI (gcloud beta run worker-pools), as an official Terraform resource, and in the reorganized Google Cloud console interface:

The Cloud Run user interface with the new worker pool resource

Queue-aware autoscaling with Kafka Autoscaler

While worker pools provide the right environment, you still need a mechanism to scale based on offset lag. The open-source Cloud Run Kafka Autoscaler is a tool you deploy that works with worker pools (or instance-billed services) to dynamically adjust consumer instances based on real-time demand.

It’s important to note that this is not a managed Google Cloud platform feature – it is an open-source tool that you control and deploy in your own project.

Key benefits:

Scaling based on actual Kafka metrics: The autoscaler connects directly to your Kafka cluster to monitor the total offset lag across partitions in your consumer group, and can also factor in consumer CPU utilization.
Automatically scales consumers down to zero: This eliminates costs during idle periods.
Cost-effective: Deployed as a request-billed Cloud Run service, the autoscaler itself is very cheap to run (less than $1 per month), since it is only active for brief periods during scaling checks.
Fine-grained and configurable scaling behavior: The autoscaler offers precise control over scaling policies, similar to the Kubernetes Horizontal Pod Autoscaler (HPA), allowing you to tailor the scaling behavior to meet your specific cost and performance goals. It provides several configurable levers, including:

- Target lag and CPU utilization thresholds
- A stabilization window to prevent rapid fluctuations in instance counts
- Scaling increment/decrement limits to control the rate at which instances are added or removed in a single scaling action

For a complete list of configuration options, please refer to the project documentation.
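The levers above can be combined into a single scaling decision. The sketch below does this in two steps: it sizes for lag, optionally also for CPU, and takes the larger result; it then limits how far one action may move the instance count. It is an HPA-style illustration under our own assumptions. The parameter names are hypothetical and are not the autoscaler's actual configuration keys.

```python
from typing import Optional


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating-point rounding surprises."""
    return -(-a // b)


def desired_instances(current: int, total_lag: int, target_lag_per_instance: int,
                      avg_cpu_pct: Optional[int] = None, target_cpu_pct: int = 60,
                      max_step_up: int = 5, max_step_down: int = 2,
                      min_instances: int = 0, max_instances: int = 50) -> int:
    """Pick a consumer instance count from offset lag and (optionally) CPU.

    All parameter names are hypothetical, not the autoscaler's real config keys.
    """
    # Enough instances that each handles at most target_lag_per_instance messages.
    by_lag = ceil_div(total_lag, target_lag_per_instance)
    # Target-tracking on CPU: scale so average utilization approaches the target.
    by_cpu = ceil_div(current * avg_cpu_pct, target_cpu_pct) if avg_cpu_pct is not None else 0
    wanted = max(by_lag, by_cpu)
    # Bound how far one scaling action may move the count in either direction.
    wanted = min(wanted, current + max_step_up)
    wanted = max(wanted, current - max_step_down)
    return max(min_instances, min(max_instances, wanted))
```

Because `min_instances` defaults to 0 here, a consumer group with no lag and no CPU signal steps down toward zero, one bounded step at a time.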

Cloud Run Kafka Autoscaler architecture diagram

Here’s how it works:

Perform autoscaling check: Cloud Scheduler periodically triggers the autoscaler to initiate a scaling evaluation.
Read Kafka offset lag: Once triggered, the autoscaler connects to the Kafka cluster to read offset lag, and (optionally) to Cloud Monitoring for the consumer’s CPU utilization.
Make scaling decision and actuate: Based on the collected metrics and user-defined scaling policies, the autoscaler computes the optimal number of consumer instances and uses Cloud Run’s manual scaling API to dynamically adjust the instance count without a new deployment.
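The three steps above can be sketched as a single check that an external trigger, such as Cloud Scheduler, invokes periodically. The check reads lag and computes a recommendation. It then applies a stabilization window, scaling down only to the highest recommendation seen recently, and actuates only when the count changes. The `read_lag`, `read_instances`, and `set_instances` hooks are hypothetical stand-ins for the Kafka query and Cloud Run's manual scaling API. The injectable clock exists only to make the sketch testable.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class ScalingCheck:
    """One queue-aware autoscaler check, invoked periodically by an external trigger."""

    def __init__(self, read_lag: Callable[[], int], read_instances: Callable[[], int],
                 set_instances: Callable[[int], None], target_lag_per_instance: int,
                 stabilization_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.read_lag = read_lag
        self.read_instances = read_instances
        self.set_instances = set_instances
        self.target = target_lag_per_instance
        self.window = stabilization_seconds
        self.clock = clock
        self.history: Deque[Tuple[float, int]] = deque()

    def run_once(self) -> int:
        now = self.clock()
        current = self.read_instances()
        # Read offset lag and turn it into a raw recommendation (ceiling division).
        recommendation = -(-self.read_lag() // self.target)
        self.history.append((now, recommendation))
        # Drop recommendations older than the stabilization window.
        while now - self.history[0][0] > self.window:
            self.history.popleft()
        if recommendation < current:
            # Scale down only to the highest recent recommendation, so a brief dip
            # in lag does not remove instances that are still needed.
            recommendation = min(current, max(r for _, r in self.history))
        # Actuate only when the instance count actually changes.
        if recommendation != current:
            self.set_instances(recommendation)
        return recommendation
```

Scale-ups take effect immediately, while scale-downs wait until the higher recommendations have aged out of the window. This asymmetry favors draining backlogs quickly over saving a few instance-minutes.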

Generalizing the pattern

The core architectural pattern of the Kafka autoscaler is simple: a Cloud Run service is periodically triggered to read custom metrics and adjust instance counts. This flexible model can be adapted for any pull-based workload, allowing you to scale your Cloud Run worker pools based on the metrics that matter most to your application.

If your application consumes from a different message queue or requires scaling based on your business metrics, you can build a similar dedicated autoscaler. Here are a few examples:

Autoscaling self-hosted GitHub runners: Dynamically scale your pool of self-hosted runners based on the number of pending jobs in your CI/CD queue. This ensures your builds run without delay while minimizing costs by scaling down, even to zero, when runners are idle.
Scaling on custom Prometheus metrics: Scale your worker pools based on any custom business metric you already expose in Prometheus, such as the number of items in a processing queue or active user sessions. This allows you to tie your infrastructure costs directly to real-time application demand.
Processing a Pub/Sub backlog: Adjust your number of workers based on the number of undelivered messages in a Pub/Sub subscription. This ensures timely message processing during traffic spikes, and saves money during quiet periods.
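All three examples share one shape: a source reports how much work is pending, and the scaler converts that into a worker count. The sketch below captures the shape with a small pluggable interface. `BacklogSource`, `FixedBacklog`, and `workers_for` are hypothetical names. A real source would query, for example, a Pub/Sub subscription's undelivered-message count or the number of queued GitHub jobs.

```python
from dataclasses import dataclass
from typing import Protocol


class BacklogSource(Protocol):
    def backlog(self) -> int:
        """Return the current amount of pending work (messages, jobs, items...)."""
        ...


@dataclass
class FixedBacklog:
    """Stand-in source for testing; a real one would query your queue or CI system."""
    value: int

    def backlog(self) -> int:
        return self.value


def workers_for(source: BacklogSource, items_per_worker: int, max_workers: int) -> int:
    """Workers needed to drain the backlog, capped at max_workers; zero when idle."""
    pending = source.backlog()
    return min(max_workers, -(-pending // items_per_worker))
```

For self-hosted CI runners that each take one job at a time, `items_per_worker` would simply be 1.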

Cloud Run worker pools and the Kafka Autoscaler bring a new level of flexibility and ease of use to running Kafka, and we’re excited to see what you do with them. To learn more and get started:

Try out the open-source Cloud Run Kafka Autoscaler:

https://github.com/GoogleCloudPlatform/cloud-run-kafka-scaler (Terraform module)

Learn more about Cloud Run worker pools (documentation)
For feedback/questions on the autoscaler, please reach out to run-oss-autoscaler-feedback@google.com

If you are looking for a managed service for Apache Kafka, Google Cloud also offers a Managed Service for Apache Kafka with automated cluster management, Kafka Connect, and schema registry (in Preview) with built-in Google Cloud monitoring, logging, and IAM for simplified operations.

We would like to thank the Google Cloud team members who helped with this blog post: Andrew Manalo (Software Engineer, Serverless Scaling), Sagar Randive (Product Manager, Serverless) and Matt Larkin (Product Manager, Serverless).

Cloud Run GPUs, now GA, makes running AI workloads easier for everyone

Mon, 02 Jun 2025 16:00:00 +0000

Developers love Cloud Run, Google Cloud’s serverless runtime, for its simplicity, flexibility, and scalability. And today, we’re thrilled to announce that NVIDIA GPU support for Cloud Run is now generally available, offering a powerful runtime for a variety of use cases that’s also remarkably cost-efficient.

Now, you can enjoy the following benefits across both GPUs and CPUs:

Pay-per-second billing: You are only charged for the GPU resources you consume, down to the second.
Scale to zero: Cloud Run automatically scales your GPU instances down to zero when no requests are received, eliminating idle costs. This is a game-changer for sporadic or unpredictable workloads.
Rapid startup and scaling: Go from zero to an instance with a GPU and drivers installed in under 5 seconds, allowing your applications to respond to demand very quickly. For example, when scaling from zero (cold start), we achieved an impressive Time-to-First-Token of approximately 19 seconds for a gemma3:4b model (this includes startup time, model loading time, and running the inference).
Full streaming support: Build truly interactive applications with out-of-the-box support for HTTP and WebSocket streaming, allowing you to provide LLM responses to your users as they are generated.

Support for GPUs in Cloud Run is a significant milestone, underscoring our leadership in making GPU-accelerated applications simpler, faster, and more cost-effective than ever before.

“Serverless GPU acceleration represents a major advancement in making cutting-edge AI computing more accessible. With seamless access to NVIDIA L4 GPUs, developers can now bring AI applications to production faster and more cost-effectively than ever before.” - Dave Salvator, director of accelerated computing products, NVIDIA


AI inference for everyone

One of the most exciting aspects of this GA release is that Cloud Run GPUs are now available to everyone for NVIDIA L4 GPUs, with no quota request required. This removes a significant barrier to entry, allowing you to immediately tap into GPU acceleration for your Cloud Run services. Simply use --gpu 1 from the Cloud Run command line, or check the "GPU" checkbox in the console.

Production-ready

With general availability, Cloud Run with GPU support is now covered by Cloud Run's Service Level Agreement (SLA), providing you with assurances for reliability and uptime. By default, Cloud Run offers zonal redundancy, helping to ensure enough capacity for your service to be resilient to a zonal outage; this also applies to Cloud Run with GPUs. Alternatively, you can turn off zonal redundancy and benefit from a lower price for best-effort failover of your GPU workloads in case of a zonal outage.

Multi-regional GPUs

To support global applications, Cloud Run GPUs are available in five Google Cloud regions: us-central1 (Iowa, USA), europe-west1 (Belgium), europe-west4 (Netherlands), asia-southeast1 (Singapore), and asia-south1 (Mumbai, India), with more to come.

Cloud Run also simplifies deploying your services across multiple regions. For instance, you can deploy a service across the US, Europe and Asia with a single command, providing global users with lower latency and higher availability. Here’s how to deploy Ollama, one of the easiest ways to run open models, on Cloud Run across three regions:

```shell
gcloud run deploy my-global-service \
  --image ollama/ollama --port 11434 \
  --gpu 1 \
  --regions us-central1,europe-west1,asia-southeast1
```

See it in action: 0 to 100 NVIDIA GPUs in four minutes

You can witness the incredible scalability of Cloud Run with GPUs for yourself with this live demo from Google Cloud Next 25, showcasing how we scaled from 0 to 100 GPUs in just four minutes.

Load testing a Stable Diffusion service running on Cloud Run GPUs to 100 GPU instances in four minutes.

Unlock new use cases with NVIDIA GPUs on Cloud Run jobs

The power of Cloud Run with GPUs isn't just for real-time inference using request-driven Cloud Run services. We're also excited to announce the availability of GPUs on Cloud Run jobs, unlocking new use cases, particularly for batch processing and asynchronous tasks:

Model fine-tuning: Easily fine-tune a pre-trained model on specific datasets without having to manage the underlying infrastructure. Spin up a GPU-powered job, process your data, and scale down to zero when it’s complete.
Batch AI inferencing: Run large-scale batch inference tasks efficiently. Whether you're analyzing images, processing natural language, or generating recommendations, Cloud Run jobs with GPUs can handle the load.
Batch media processing: Transcode videos, generate thumbnails, or perform complex image manipulations at scale.

What Cloud Run customers are saying

Don't just take our word for it. Here's what some early adopters of Cloud Run GPUs are saying:

"Cloud Run helps vivo quickly iterate AI applications and greatly reduces our operation and maintenance costs. The automatically scalable GPU service also greatly improves the efficiency of our AI going overseas.” - Guangchao Li, AI Architect, vivo

"L4 GPUs offer really strong performance at a reasonable cost profile. Combined with the fast auto scaling, we were really able to optimize our costs and saw an 85% reduction in cost. We've been very excited about the availability of GPUs on Cloud Run." - John Gill at Next'25, Sr. Software Engineer, Wayfair

"At Midjourney, we have found Cloud Run GPUs to be incredibly valuable for our image processing tasks. Cloud Run has a simple developer experience that lets us focus more on innovation and less on infrastructure management. Cloud Run GPU’s scalability also lets us easily analyze and process millions of images." - Sam Schickler, Data Team Lead, Midjourney

Get started today

Cloud Run with GPU is ready to power your next generation of applications. Dive into the documentation, explore our quickstarts, and review our best practices for optimizing model loading. We can't wait to see what you build!

Flipping out: Modernizing a classic pinball machine with cloud connectivity

Mon, 04 Nov 2024 17:00:00 +0000

In today's cloud-centric world, we often take for granted the ease with which we can integrate our applications with a vast array of powerful cloud services. However, there are still countless legacy systems and other constrained environments where integration is far from straightforward.

We faced this challenge head-on when building Backlogged Pinball, a custom pinball game that we built as a demo for integrating cloud services in uncommon environments. Backlogged Pinball is a physical pinball machine that connects to the cloud for a variety of services, such as keeping track of data about current and completed games and updating leaderboards. To build it, we used a commercially available programmable pinball machine as a base so we could focus on game code and cloud integration. However, the machine's software environment was limited, running on a sandboxed version of .NET Framework 3.5, which was first released 17 years ago. Practically, this meant that we couldn't use any of the modern Google Cloud SDKs available for C#, and we couldn’t install tools like gcloud to help communicate with the cloud.


There’s a catch

We knew we wanted to take advantage of the cloud for databases (for high scores, and stats from the game), logging (of game events and results), and a custom service (to change the game experience on the fly). But developing software for such a constrained environment presented a variety of challenges, which might be familiar to you:

Minimal library support: If you have full control over your stack, there's no shortage of great libraries to help you connect to cloud services. But sometimes you don't get to pick where your software runs. For our pinball machine, it was difficult to find compatible libraries for the cloud services we wanted. For example, we wanted to insert records into a Firestore database to drive a real-time visualization of everything going on in the game. Firestore has great SDKs, but they don't support anything older than .NET Framework 4.6.2 (which is 8 years old). We might have been able to connect to a traditional relational database over a TCP connection, but we didn't want to be limited in the cloud tools and services we could use. Building a real-time web application is also much less practical with MySQL than with Firestore, which is designed from the ground up to push data to the browser in real time.

Difficult deployment process: Maybe you have other limitations that make updating your on-device software difficult, but you still want to add new features and cloud integrations. As third-party developers, we had to manually install each version of our game during development using a USB stick. This kind of limitation slows down the rate at which you can test, deploy and ship new versions of your code, which is never good. It’s much easier to add new features in a modern, flexible cloud environment.

Fundamentally, we found it challenging to use modern cloud services in a constrained legacy environment.

Flipper-ing the script

At first glance, there was no practical way to integrate all the services we wanted with the code that would run on the pinball machine. But what if there was another way? What if we turned the pinball machine itself into a service, and gave it a single minimal integration? Then we could have it send a message every time something happened in the game and sort out the results in a modern cloud environment.

We decided that Pub/Sub would be an excellent way to achieve this goal. It provided a way to get information to (and from!) the cloud with a single interface, with minimal complexity. It was just a basic HTTP POST of whatever message format we wanted.

To achieve this, we designed a custom Pub/Sub messaging system. We wrote our own lightweight Pub/Sub library for the pinball machine to handle authentication and message sending over the REST API, making it incredibly easy to post events whenever a player launched a ball, hit a target, or even pressed a flipper button. Check out a simplified version of that code on GitHub!
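The authors' library was written in C# for .NET Framework 3.5; the following is only a Python sketch of the same idea, not their code. The Pub/Sub REST `publish` method takes a JSON body whose messages carry base64-encoded `data` and optional string `attributes`. The project ID, topic name, event fields, and the `eventType` attribute are hypothetical, and obtaining the OAuth 2.0 access token (the hard part mentioned later in this post) is deliberately left out.

```python
import base64
import json
import urllib.request

PUBSUB_ENDPOINT = "https://pubsub.googleapis.com/v1"


def build_publish_request(project_id: str, topic: str, event: dict,
                          access_token: str) -> urllib.request.Request:
    """Build (but do not send) a Pub/Sub REST publish request for one game event."""
    url = f"{PUBSUB_ENDPOINT}/projects/{project_id}/topics/{topic}:publish"
    # Pub/Sub requires the message payload to be base64-encoded.
    data = base64.b64encode(json.dumps(event).encode("utf-8")).decode("ascii")
    body = {
        "messages": [
            {
                "data": data,
                # Attributes are string key/value pairs subscribers can filter on.
                "attributes": {"eventType": str(event.get("type", "unknown"))},
            }
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical usage; urllib.request.urlopen(req) would actually send it.
req = build_publish_request("my-project", "pinball-events",
                            {"type": "flipper_pressed", "side": "left"},
                            access_token="ya29.EXAMPLE")
```

Keeping the device side down to "serialize, encode, POST" is what lets everything else move to the cloud.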

On the cloud side, our team used multiple Cloud Run subscribers to process these events in real time. We also used Firestore to store data and drive visualizations.
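On the receiving side, a Cloud Run subscriber fed by a Pub/Sub push subscription receives each message wrapped in a JSON envelope with a base64-encoded `data` field. Below is a minimal, framework-free Python sketch of unwrapping it. It assumes push delivery and reuses the hypothetical event fields from the publishing sketch; wiring it into an HTTP server is omitted.

```python
import base64
import json


def decode_push_envelope(body: bytes) -> tuple[dict, dict]:
    """Return (event, attributes) from a Pub/Sub push request body."""
    envelope = json.loads(body)
    message = envelope["message"]
    # "data" may be absent for attribute-only messages.
    raw = base64.b64decode(message.get("data", ""))
    event = json.loads(raw) if raw else {}
    return event, message.get("attributes", {})


# A hypothetical push body, shaped like what Pub/Sub sends to the service.
sample = json.dumps({
    "message": {
        "data": base64.b64encode(b'{"type": "target_hit", "points": 500}').decode(),
        "attributes": {"eventType": "target_hit"},
        "messageId": "1",
    },
    "subscription": "projects/my-project/subscriptions/leaderboard",
}).encode()
event, attrs = decode_push_envelope(sample)
```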

Jackpot! Cloud advantages

Pushing the complexity of integration into the cloud brought numerous advantages:

Single interface: Writing our own Pub/Sub client was no small task (authentication alone could be its own blog post!). But once it was working, we could focus on processing all the events in the cloud using whatever modern client libraries and tools we wanted.

Real-time updates: At Google Cloud Next, we helped users write their own Cloud Run services to receive pinball events, process them, and send messages back to the machine. Building and deploying these services took less than a minute, which meant you could conceivably change the game while a friend was playing it!

Rich data insights: We ended up with a fine-grained log of everything that happened in a game. This proved very helpful in troubleshooting issues during development and fine-tuning scoring based on playtesting.

Plunging forward

We’re already planning the next iteration of Backlogged Pinball with features we hadn’t originally considered. For example, we’re adding AI-powered game analysis and advice based on the player’s style. Thanks to this flexible cloud-based architecture, almost all the work will be in a modern cloud environment rather than fighting with dependencies on a legacy system. And the lessons we learned from this project are broadly applicable to any constrained environment. Whether it's an embedded system, an IoT device, or an old server running legacy software, by leveraging Pub/Sub messaging and adopting a cloud-first mindset, you can break free from the limitations of your environment and unlock the full potential of the cloud.

We’ll be showing off the latest Backlogged Pinball at KubeCon North America in November 2024. If you’re there, stop by to check it out!

Special thanks to Mofi Rahman, Google Cloud Advocate, for his contributions to this project and this post.

Run your AI inference applications on Cloud Run with NVIDIA GPUs

Wed, 21 Aug 2024 15:00:00 +0000

Developers love Cloud Run for its simplicity, fast autoscaling, scale-to-zero capabilities, and pay-per-use pricing. Those same benefits come into play for real-time inference apps serving open gen AI models. That's why today, we’re adding support for NVIDIA L4 GPUs to Cloud Run, in preview.

This opens the door to many new use cases for Cloud Run developers:

Performing real-time inference with lightweight open models such as Google’s open Gemma (2B/7B) models or Meta’s Llama 3 (8B) to build custom chatbots or on-the-fly document summarization, while scaling to handle spiky user traffic.
Serving custom fine-tuned gen AI models, such as image generation tailored to your company's brand, and scaling down to optimize costs when nobody's using them.
Speeding up your compute-intensive Cloud Run services, such as on-demand image recognition, video transcoding and streaming, and 3D rendering.

As a fully managed platform, Cloud Run lets you run your code directly on top of Google’s scalable infrastructure, combining the flexibility of containers with the simplicity of serverless to help boost your productivity. With Cloud Run, you can run frontend and backend services, batch jobs, deploy websites and applications, and handle queue processing workloads — all without having to manage the underlying infrastructure.

At the same time, many AI inference workloads, especially applications that demand real-time processing, require GPU acceleration to deliver responsive user experiences. With support for NVIDIA GPUs, you can perform on-demand online AI inference using the LLMs of your choice in seconds. With 24 GB of VRAM, you can expect fast token rates for models with up to 9 billion parameters, including Llama 3.1 (8B), Mistral (7B), and Gemma 2 (9B). When your app is not in use, the service automatically scales down to zero so that you are not charged for it.
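As a rough check on the 24 GB figure, weight memory is approximately parameter count times bytes per parameter. This sketch counts model weights only; it ignores the KV cache, activations, and framework overhead, so real memory use is higher.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    # params * 1e9 * (bits / 8) bytes, divided by 1e9 bytes per GB.
    return params_billion * bits_per_param / 8


L4_VRAM_GB = 24

for name, params in [("Gemma 2 (9B)", 9), ("Llama 3.1 (8B)", 8), ("Mistral (7B)", 7)]:
    for bits in (16, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{name} @ {bits}-bit: ~{gb:.1f} GB of weights, "
              f"~{L4_VRAM_GB - gb:.1f} GB left for cache and overhead")
```

At 16-bit precision, a 9B model's weights (about 18 GB) fit in 24 GB with limited headroom, while 4-bit quantized versions need roughly a quarter of that.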

“With the addition of NVIDIA L4 Tensor GPU and NVIDIA NIM support, Cloud Run provides users a real-time, fast-scaling AI inference platform to help customers accelerate their AI projects and get their solutions to market faster — with minimal infrastructure management overhead.” - Anne Hecht, Senior Director of Product Marketing, NVIDIA

Early customers are excited about the combination of Cloud Run and NVIDIA GPUs.

“Cloud Run's GPU support has been a game-changer for our real-time inference applications. The low cold-start latency is impressive, allowing our models to serve predictions almost instantly, which is critical for time-sensitive customer experiences. Additionally, Cloud Run GPUs maintain consistently minimal serving latency under varying loads, ensuring our generative AI applications are always responsive and dependable — all while effortlessly scaling to zero during periods of inactivity. Overall, Cloud Run GPUs have significantly enhanced our ability to provide fast, accurate, and efficient results to our end users.” - Thomas MENARD, Head of AI - Global Beauty Tech, L’Oreal

“Cloud Run GPUs are hands-down the best way to consume GPU compute on Google Cloud. I love how it provides a high degree of control and customizability using open-source standards (Knative) as well as great observability tools out of the box, together with fully managed infrastructure that scales to zero. And since we can easily migrate to GKE using Knative primitives, there is always an option to get even more control at the cost of higher complexity and maintenance. GPU allocation and startup times were also faster for our use-case compared to most competing services.” - Alex Bielski, Director of Innovation, Chaptr

Using NVIDIA GPUs on Cloud Run

Today, we support attaching one NVIDIA L4 GPU per Cloud Run instance, and you do not need to reserve your GPUs in advance. Cloud Run GPUs are available today in us-central1 (Iowa), with availability in europe-west4 (Netherlands) and asia-southeast1 (Singapore) expected before the end of the year.

To deploy a Cloud Run service with NVIDIA GPUs from the command line, add the --gpu=1 flag to specify the number of GPUs and the --gpu-type=nvidia-l4 flag to specify the GPU type. You can also configure GPUs in the Google Cloud console.

And with the recently announced Cloud Run functions, you can also attach a GPU to your functions to perform event-driven AI inference.

“The newly released Cloud Run functions with GPU support enables Python developers to use Hugging Face models without having to worry about infrastructure, GPU drivers or containers. Cloud Run's scale-to-zero and fast startup capabilities are a great match for developers looking at getting started with AI using HuggingFace models with just a few lines of serverless code” - Julien Chaumond, CTO, Hugging Face

Performance

Along with simple operations, Cloud Run with NVIDIA GPUs also offers strong performance. We keep our infrastructure latency to a minimum so that you can get the best performance when serving your models.

Cloud Run instances with an attached L4 GPU and pre-installed drivers start in approximately 5 seconds, at which point the processes running in your container can start to use the GPU. The framework and model then need a few more seconds to load and initialize. The table below shows cold-start times for Gemma 2B, Gemma 2 9B, Llama 2 7B/13B, and Llama 3.1 8B with the Ollama framework, ranging from 11 to 35 seconds. Each measurement covers starting an instance from zero, loading the model into the GPU, and the LLM returning its first word.

Model	Model Size	Cold Start Time
gemma:2b	1.7 GB	11-17 seconds
gemma2:9b	5.1 GB	25-30 seconds
llama2:7b	3.8 GB	14-21 seconds
llama2:13b	7.4 GB	23-35 seconds
llama3.1:8b	4.7 GB	15-21 seconds

Cold start time: time from the first invocation of the service URL until the Cloud Run instance scales from zero to one and serves the first word of the response.
Models: we used 4-bit quantized versions of each of the models above, deployed using the Ollama framework.
Note that these numbers were observed in a controlled lab environment; actual performance may vary depending on a variety of factors.
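The totals in the table can be split using the roughly 5-second instance start-up mentioned above: whatever remains covers Ollama start-up, loading the model into the GPU, and generating the first word. This is only arithmetic on the published lab ranges, with the 5-second figure treated as a constant for simplicity.

```python
INSTANCE_START_S = 5  # approximate L4 instance start time, per the text

# (model, size in GB, min cold start s, max cold start s), from the table
COLD_STARTS = [
    ("gemma:2b", 1.7, 11, 17),
    ("gemma2:9b", 5.1, 25, 30),
    ("llama2:7b", 3.8, 14, 21),
    ("llama2:13b", 7.4, 23, 35),
    ("llama3.1:8b", 4.7, 15, 21),
]


def residual_range(min_s: int, max_s: int) -> tuple[int, int]:
    """Seconds left for framework start, model load, and first token."""
    return min_s - INSTANCE_START_S, max_s - INSTANCE_START_S


for model, size_gb, lo, hi in COLD_STARTS:
    r_lo, r_hi = residual_range(lo, hi)
    print(f"{model} ({size_gb} GB): ~{r_lo}-{r_hi} s after the instance is up")
```

In other words, most of the cold start is spent in the framework and model, not in acquiring the GPU instance.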

Deploy a sample app using Ollama

Below, you can see how to deploy Google’s Gemma 2 9B model with Ollama using Cloud Run with NVIDIA GPUs. Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Ollama is a framework that provides a simple API to manage large language models.

First, create a container image with Ollama and the model with this Dockerfile:

```dockerfile
# Start from the official Ollama image.
FROM ollama/ollama
ENV HOME=/root
WORKDIR /
# Run the server in the background just long enough to download the
# gemma2 model (its default tag is the 9B variant) into the image.
RUN ollama serve & sleep 10 && ollama pull gemma2
# At runtime, serve the Ollama API on port 11434.
ENTRYPOINT ["ollama","serve"]
```

Then deploy using the following command:

```shell
gcloud beta run deploy \
  --source . \
  --port 11434 \
  --region us-central1 \
  --no-cpu-throttling \
  --cpu 8 \
  --memory 32Gi \
  --gpu 1 \
  --gpu-type=nvidia-l4
```

And that’s it! Once deployed, you can use the Ollama API to start chatting with Gemma 2!
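A minimal Python sketch of calling the deployed service through Ollama's `/api/generate` endpoint. The service URL is a placeholder, and the sketch assumes the service was deployed to require authentication, so you supply an identity token (for example, from `gcloud auth print-identity-token`).

```python
import json
import urllib.request


def build_generate_request(service_url: str, prompt: str,
                           identity_token: str,
                           model: str = "gemma2") -> urllib.request.Request:
    """Build a non-streaming Ollama /api/generate request (not sent here)."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{service_url.rstrip('/')}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Cloud Run checks this token when the service requires auth.
            "Authorization": f"Bearer {identity_token}",
        },
        method="POST",
    )


def generate(service_url: str, prompt: str, identity_token: str) -> str:
    """Send the request and return the generated text."""
    req = build_generate_request(service_url, prompt, identity_token)
    with urllib.request.urlopen(req) as resp:
        # With stream=False, Ollama returns a single JSON object whose
        # "response" field holds the full completion.
        return json.loads(resp.read())["response"]
```

For example, `generate("https://YOUR-SERVICE-URL", "Why is the sky blue?", token)` would return Gemma 2's answer as a string.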

“Deploying a Large Language Model using Ollama on Cloud Run is remarkably straightforward, thanks to the latest GPU support. With just a few commands, you can leverage Ollama’s seamless integration with your app and Cloud Run’s serverless infrastructure to deploy, and manage your LLMs effortlessly. The fast coldstarts and rapid scaling of Cloud Run let you scale your application reliably. No deep knowledge of infrastructure or machine learning is required — simply focus on your application and let the tools handle the rest.” - Jeffrey Morgan, Founder, Ollama

Additionally, you can leverage NVIDIA NIM inference microservices, part of the NVIDIA AI Enterprise software suite available in Google Cloud Marketplace. NIM provides secure, reliable deployment of high-performance, accelerated AI model inference, simplifying AI inference deployments and maximizing performance on NVIDIA L4 GPUs on Cloud Run. Check out NVIDIA's blog to learn how to get started.

Get started today

Cloud Run makes it super easy to host your web applications. And now with GPU support, we are extending the best of serverless, simplicity and scalability to your AI inference applications too! To start using Cloud Run with NVIDIA GPUs, sign up at g.co/cloudrun/gpu to join our preview program today and wait for our welcome email.

To learn more about Cloud Run with GPUs, join this livestream on August 21, 2024 with NVIDIA and Ollama. We will discuss new features for Cloud Run and demo how to use Cloud Run in different scenarios.

Cloud Functions is now Cloud Run functions — event-driven programming in one unified serverless platform

Wed, 21 Aug 2024 15:00:00 +0000

Cloud Functions and its familiar event-driven programming model is now Cloud Run functions, complete with the fine-grained control and scalability that developers love about the serverless platform. With Cloud Run functions, we’ve created a unified serverless platform for all your workloads, so you don’t have to choose between the two.

This goes beyond a simple name change. We’ve unified the Cloud Functions infrastructure with Cloud Run, and developers of Cloud Functions (2nd gen) get immediate access to all new Cloud Run features, including NVIDIA GPUs.

Now that Cloud Functions is Cloud Run functions, you can write and deploy functions directly with Cloud Run, giving you complete control over the underlying service configuration:

```shell
gcloud beta run deploy hello-function \
  --source . \
  --function hello_get \
  --base-image nodejs20
```
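The `--function` flag names the entry point in your source code. The command above deploys Node.js code; purely as a Python illustration, an HTTP function named `hello_get` could look like the sketch below. It assumes a Python base image in place of `nodejs20` (check the Cloud Run documentation for the exact identifier) and that, as with Cloud Functions, the function receives a Flask `Request` object.

```python
def hello_get(request):
    """HTTP function entry point; `request` is assumed to be a Flask Request.

    Returns a greeting that uses the optional ?name= query parameter.
    """
    name = request.args.get("name", "World")
    return f"Hello {name}!"
```

Because it is a plain function, it can be tested locally with any object that has an `args` mapping.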

A new deployment option for Cloud Run: the function

Further, all functions that were created with Google Cloud Functions (2nd gen) have access to all of Cloud Run’s capabilities, including:

Multi-event trigger management on functions
High-performance Direct VPC egress
The ability to mount Cloud Storage volumes
Google-managed language runtimes, with automatic security updates on base images
Traffic splitting and revision control
Managed Prometheus and OpenTelemetry support with sidecar containers
Inference functions with NVIDIA GPUs

"The newly released Cloud Run functions with GPU support enables Python developers to use Hugging Face models without having to worry about infrastructure, GPU drivers or containers. Cloud Run's scale-to-zero and fast startup capabilities are a great match for developers looking at getting started with AI using HuggingFace models with just a few lines of serverless code” - Julien Chaumond, CTO, Hugging Face

Continued support for existing APIs, gcloud commands and Terraform modules

Cloud Functions (2nd gen) functions will automatically be converted into Cloud Run functions. With Cloud Run functions, we are committed to continuing support for the existing functions APIs, gcloud commands and Terraform modules (Gen 2). This lets you enable Cloud Run features on your function without having to refactor your deployment automation.

1st gen functions will continue to be available as Cloud Run functions (1st gen). 1st gen functions need to be upgraded to Cloud Run functions before you can get full access to the underlying Cloud Run features. Cloud Run functions (1st gen) APIs, gcloud commands and Terraform modules (Gen1) will continue to be supported.

Connecting your platform with functions

Cloud Run functions makes the connections across your platform simple to build and easy to maintain: you’re only responsible for the code, and we handle the rest. Anyone on your team with coding knowledge can create a solution without having to package up the code, in any of seven popular languages. Data scientists, for example, can get a Python script running in the cloud even with limited infrastructure knowledge.

Edit your function in a new inline editor

Cloud Run functions keeps productivity high and operations low by making each function its own independent component, isolating it from directly impacting other workloads. Changes and updates to one function are unlikely to impact another function.

A common use case for functions is responding when an object is added to a Cloud Storage bucket. The function might generate thumbnails of an image or run sentiment analysis on a text file. But there are many other examples for which customers choose Cloud Functions:

Transforming data and loading it into BigQuery
Creating a webhook that’s called by a third party (e.g., GitHub)
Using ML APIs to analyze data added to a database or storage bucket
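To make the Cloud Storage use case above concrete, here is a hedged Python sketch of just the decision step of such a function: given the data of an object-finalized event, decide whether the new object is an image that needs a thumbnail. The `bucket`, `name`, and `contentType` fields follow Cloud Storage object metadata; the `thumbnails/` prefix and the function name are hypothetical, and the resizing itself is left out.

```python
from typing import Optional

THUMBNAIL_PREFIX = "thumbnails/"  # hypothetical output location


def thumbnail_target(data: dict) -> Optional[str]:
    """Return where a thumbnail should be written, or None to skip.

    `data` is the payload of a Cloud Storage object-finalized event.
    """
    name = data.get("name", "")
    content_type = data.get("contentType", "")
    # Skip non-images, and skip our own output so that writing a thumbnail
    # into the same bucket does not re-trigger the function forever.
    if not content_type.startswith("image/") or name.startswith(THUMBNAIL_PREFIX):
        return None
    return f"gs://{data['bucket']}/{THUMBNAIL_PREFIX}{name}"
```

The prefix check matters whenever a function writes back into the bucket that triggers it.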

Get started with Cloud Run functions

Whether you're new to serverless or a seasoned pro, Cloud Run functions make it easier than ever to build and manage event-driven applications.

Learn more about improvements to the Cloud Functions experience
Deploy an HTTP function on Cloud Run
Deploy an event-driven function on Cloud Run
Learn more about running inference applications on Cloud Run with NVIDIA GPUs

Learn more about Cloud Run functions and Cloud Run in this live webinar.

Flexible committed-use discounts are now even more flexible

Mon, 15 Jul 2024 18:00:00 +0000

Google Cloud offers many great ways to run your workloads: low-level VMs in Google Compute Engine, container orchestration with Google Kubernetes Engine (GKE) — including via fully-managed Autopilot mode — and Cloud Run. Until now, to optimize your spend, you needed to purchase several Committed-use Discounts (CUDs) to cover each of these different products. For example, you might have purchased a Compute Engine Flexible CUD for VM spend including workloads running on GKE’s standard mode, a Cloud Run CUD for Cloud Run always-on instances, and an Autopilot CUD for workloads running in GKE Autopilot.

Expanding Compute Flexible CUDs

Today we are excited to announce that the Compute Engine Flexible CUD, now known as the Compute Flexible CUD, has been expanded to cover Cloud Run on-demand resources, most GKE Autopilot Pods and the premiums for Autopilot Performance and Accelerator compute classes. The documentation and our SKU list have the precise details on what’s included.

With one CUD purchase, you can cover eligible spend on all three products: Compute Engine, GKE, and Cloud Run. You can save 46% for a three-year commitment, and 28% for one-year commitments. With this single unified CUD, you can now make a single commitment and spend it across all these products, maximizing its flexibility. Furthermore, these commitments are not region-specific, so you can use them on resources in any region across these products.
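As a worked example of those rates, the sketch below computes the hourly cost of eligible spend that is fully covered by a commitment. The $100 per hour figure is hypothetical, and the sketch assumes the commitment exactly matches usage; it ignores how commitments are denominated and billed, and any usage above or below the commitment.

```python
def cud_cost(on_demand_hourly: float, years: int) -> float:
    """Hourly cost of spend fully covered by a Compute Flexible CUD.

    Uses the discount rates quoted in the text: 28% (1-year), 46% (3-year).
    """
    discounts = {1: 0.28, 3: 0.46}
    return round(on_demand_hourly * (1 - discounts[years]), 2)


for term in (1, 3):
    hourly = cud_cost(100.0, term)
    print(f"{term}-year: $100.00/h on demand -> ${hourly:.2f}/h "
          f"(~${(100.0 - hourly) * 24 * 365:,.0f} saved per year)")
```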

Retiring the Autopilot CUD

Since the new expanded Compute Flexible CUD has a higher discount than the GKE Autopilot CUD and greater overall flexibility, we’re retiring the GKE Autopilot CUD. You can still purchase the legacy GKE Autopilot CUD until October 15, after which it will no longer be available for purchase. Any existing CUDs will continue to apply through their term regardless of when you purchase them. That said, we recommend looking into the newly expanded Compute Flexible CUD for your needs now and in the future, for its greater flexibility and better discounts!

How to get started

If you're already using Flexible CUDs for Compute Engine, you'll automatically see the discounts applied to eligible Cloud Run and GKE Autopilot usage (if you have product-specific CUDs like the legacy GKE Autopilot CUD, those will apply first). If you're new to the Compute Flexible CUD, it's easy to get started: estimate your hourly spend across eligible SKUs, purchase a commitment that matches your expected sustained usage over the one- or three-year term, and start enjoying the savings! You can add more CUDs as your usage grows.
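One way to turn "expected sustained usage" into a number is to take a low percentile of observed hourly eligible spend, so usage stays above the commitment almost all of the time. This is a heuristic sketch, not Google's sizing guidance; the 10th percentile default and the sample data are assumptions.

```python
import math


def suggest_commitment(hourly_spend: list[float], percentile: float = 10.0) -> float:
    """Suggest an hourly commitment from observed eligible spend.

    Picks a low nearest-rank percentile, because committed spend is paid
    whether or not it is used.
    """
    if not hourly_spend:
        raise ValueError("need at least one observation")
    ordered = sorted(hourly_spend)
    # Nearest rank: smallest value with at least `percentile`% of
    # observations at or below it (rank is 1-based).
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical hourly spend samples, in dollars.
samples = [10, 12, 8, 15, 9, 11, 20, 7, 13, 14]
print(suggest_commitment(samples))
```

A higher percentile captures more savings when usage is steady, at the risk of paying for unused commitment during quiet hours.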

We hope you find this new flexibility useful when it comes to platforming your workloads on Google Cloud!

Next steps

Learn about Compute Flexible CUDs
View Cloud Run pricing
View GKE pricing and CUD options
Purchase a Compute Flexible CUD in the console

Releasing Artifact Registry assets across Organizations and Projects with serverless

Mon, 20 May 2024 16:00:00 +0000

Have you ever wondered whether there is a more automated way to copy Artifact Registry or Container Registry images across different projects and organizations? In this article, we go over an opinionated process for doing so using serverless components in Google Cloud, deployed with Infrastructure as Code (IaC).

This article assumes knowledge of coding in Python, a basic understanding of running commands in a terminal, and the HashiCorp Configuration Language (HCL), i.e., Terraform, for IaC.

In this use case, we have at least one frequently updated container image in an Artifact Registry repository, and each update needs to be propagated to Artifact Registry repositories in other organizations. Although the images are released to external organizations, they must remain private and must not be publicly available.

To clearly articulate how this approach works, let's first cover the individual components of the architecture and then tie them all together.

As discussed earlier, we have two Artifact Registry (AR) repositories in question; let’s call them “Source AR” (the AR where the image is periodically built and updated, the source of truth) and “Target AR” (AR in a different organization or project where the image needs to be consumed and propagated periodically) for ease going forward. The next component in the architecture is Cloud Pub/Sub; we need an Artifact Registry Pub/Sub topic in the source project that automatically captures updates made to the source AR. When the Artifact Registry API is enabled, Artifact Registry automatically creates this Pub/Sub topic; the topic is called “gcr” and is shared between Artifact Registry and Google Container Registry (if used). Artifact Registry publishes messages for the following changes to the topic:

Image uploads
New tags added to images
Image deletion

Although the topic is created for us, we will need to create a Pub/Sub subscription to consume the messages from the topic. This brings us to the next component of the architecture, Cloud Run. We will create a Cloud Run deployment that will perform the following:

Parse through the Pub/Sub messages
Compare the contents of the message to validate if the change in the Source AR warrants an update to the Target AR
If the validation conditions are met, then the Cloud Run service moves the latest Docker image to the Target AR
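The repository's app.py is the authoritative implementation; the following is only an illustrative Python sketch of the parse-and-validate steps above. It assumes the notification JSON published to the gcr topic (with `action`, `digest`, and `tag` fields) arrives wrapped in a Pub/Sub push envelope, and the image path below is a hypothetical stand-in for Image X.

```python
import base64
import json

# Hypothetical values; the real app.py keeps its own source/target settings.
SOURCE_IMAGE = "us-central1-docker.pkg.dev/source-project/source-repo/image-x"
WATCHED_TAG = "latest"


def should_copy(push_body: bytes) -> tuple[bool, str]:
    """Decide whether an Artifact Registry notification warrants a copy.

    Returns (copy?, message to log and return with the HTTP response).
    """
    message = json.loads(push_body)["message"]
    notification = json.loads(base64.b64decode(message["data"]))
    # Uploads and new tags arrive as INSERT; deletions are ignored here.
    if notification.get("action") != "INSERT":
        return False, "Not an image upload or tag. No image will be updated."
    # Digest-only inserts carry no "tag" field, so they fail this check too.
    if notification.get("tag") != f"{SOURCE_IMAGE}:{WATCHED_TAG}":
        return False, ("The source AR updates were not made to the "
                       f"[{SOURCE_IMAGE}]. No image will be updated.")
    return True, f"Copying {SOURCE_IMAGE}:{WATCHED_TAG} to the target AR."
```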

Now, let’s dive into how Cloud Run integrates with the Pub/Sub AR topic. For Cloud Run to be able to read the Pub/Sub messages, we need two additional components: an Eventarc trigger and a Pub/Sub subscription. The Eventarc trigger is critical to the workflow because it is what invokes the Cloud Run service.

In addition to the components described above, the below prerequisites need to be met for the entire flow to function correctly.

The Cloud SDK needs to be installed on the user’s machine so that you can run gcloud commands.
The project service account (SA) needs “Read” permission on the Source AR.
The project SA needs “Write” permission on the Target AR.
VPC-SC requirements on the destination organization (if enabled):
- Egress permissions to the target repository for the SA running the job
- Ingress permissions for the account running the 'make' commands (instructions below) and writing to Artifact Registry or Container Registry
- Ingress permissions to read the Pub/Sub gcr topic of the source repository
- VPC-SC ingress for [project-name]-sa@[project-name].iam.gserviceaccount.com for the Artifact Registry methods
- VPC-SC ingress for [project-name]-sa@[project-name].iam.gserviceaccount.com for the Cloud Run methods
- var.gcp_project
- var.service_account

Below, we discuss the Python code, Dockerfile, and Terraform code, which are all you need to implement this yourself. We recommend opening our GitHub repository, where all the open source code for this solution lives, while reading the section below: https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/devops/inter-org-artifacts-release

What we deploy to Cloud Run is a custom Docker container. It consists of the following files:

App.py: This file contains the variables for the source and target containers, as well as the Python execution code that runs in response to the Pub/Sub messages.

Copy_image.py: This file contains the copy command that app.py uses to run gcrane, which copies images from the source AR to the target AR.
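As a rough illustration of what Copy_image.py does (not the repository's code), the copy reduces to invoking `gcrane cp SOURCE TARGET`. The sketch assumes gcrane is on the container's PATH, as the Dockerfile packages it, and that gcrane can find credentials in the environment; the image references are hypothetical.

```python
import subprocess


def gcrane_copy_command(source_ref: str, target_ref: str) -> list[str]:
    """Build the argv for copying one image reference with gcrane."""
    return ["gcrane", "cp", source_ref, target_ref]


def copy_image(source_ref: str, target_ref: str) -> None:
    """Run gcrane; raises CalledProcessError if the copy fails."""
    # Passing a list (no shell) avoids quoting problems with image refs.
    subprocess.run(gcrane_copy_command(source_ref, target_ref), check=True)
```

Raising on failure lets the service return an error status, so the failed copy shows up in the logs instead of being silently ignored.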

Dockerfile: This file contains the instructions needed to package gcrane and the requirements needed to build the Cloud Run image

Since we have now covered all of the individual components that are associated with this architecture, let’s walk through the flow that ties all the individual components together.

Let’s say your engineering team has built and released a new version of the Docker image “Image X” per their release schedule and added the “latest” tag to it. When this new version is created in the Source AR, Artifact Registry publishes a message to the AR Pub/Sub topic reflecting that a new version of “Image X” has been added to the Source AR. The Eventarc trigger then automatically delivers that message from the Pub/Sub subscription to the Cloud Run service.

Our Cloud Run service uses the logic in App.py to check whether the action that happened in the Source AR matches the specified criteria (Image X with the tag “latest”). If it matches and warrants a downstream action, Cloud Run triggers Copy_image.py to execute the gcrane command that copies the image name and tag from the Source AR to the Target AR.

If the image or tag does not match the criteria specified in App.py (for example, Image Y with the tag “latest”), the Cloud Run service returns an HTTP 200 response with the message “The source AR updates were not made to the [Image X]. No image will be updated.”, confirming that no action will be taken.

Note: Because the Source AR may contain multiple images and we are only concerned with updating specific images in the Target AR, we have integrated output responses within the Cloud Run service that can be viewed in Google Cloud logs for troubleshooting and diagnosing issues. This filtering also prevents unwanted publishing of images other than the desired image(s).

Why did we not go with an alternative approach?

Versatility: The Source and Target ARs were in different organizations.
Compatibility: The Artifacts were not in a Code/Git repository compatible with solutions like Cloud Build.
Security: VPC-SC perimeters limit the tools we can leverage while using cloud native serverless options.
Immutability: We wanted a solution that could be fully deployed with Infrastructure as Code.
Scalability and Portability: We wanted to be able to update multiple Artifact Registries in multiple Organizations simultaneously.
Efficiency and Automation: Avoids a time-based pull method when no resources are being moved. Avoids human interaction to ensure consistency.
Cloud Native: Alleviates the dependency on third-party tools or solutions like a CI/CD pipeline or a repository outside of the Google Cloud environment.

If the upstream projects that your images come from all reside in the same Google Cloud region or multi-region, virtual repositories are a great alternative solution.

How do we deploy it with IaC?

The Terraform code we used to solve this problem is included in the repository.
The following variables are used in the code. They need to be declared in a .tfvars file (or replaced directly) and assigned values for your specific project:
- var.gcp_project
- var.service_account

In conclusion, there are multiple ways to bootstrap a process for releasing artifacts across organizations. Each method has its pros and cons, and the best one depends on the use case at hand. Things to consider include whether the artifacts can reside in a Git repository, whether the target repository is in the same organization or a child organization, and whether CI/CD tooling is preferred.

If you have gotten this far, you may well have a good use case for this solution. This pattern can also be applied to similar use cases. Here are a couple of examples to get you started:

Copying other types of artifacts from AR repositories like Kubeflow Pipeline Templates (kfp)
Copying bucket objects behind a VPC-SC between projects or Orgs

Learn more

Our solution code can be found here: https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/devops/inter-org-artifacts-release
GCrane: https://github.com/google/go-containerregistry/blob/main/cmd/gcrane/README.md
Configuring Pub/Sub GCR notifications: https://cloud.google.com/artifact-registry/docs/configure-notifications

Firestore integration with Eventarc reaches GA with Auth Context

Mon, 13 May 2024 16:00:00 +0000

Creating event-driven architectures using Eventarc together with Firestore is an increasingly popular pattern. Recently, the Firestore integration with Eventarc became generally available, adding new functionality. You can now register multiple Cloud Functions in different regions against a multi-regional Firestore database for increased reliability, and there are new event types, including the Auth Context extension for CloudEvents.

Determining who or what — the user, a service account, the system, or a third-party — is making a modification to a Firestore document as a change event has long been a top-requested feature. With the new Firestore event types with Auth Context extension, events now embed metadata about the principal that triggered a document change in the open and portable CloudEvents format.

Example walkthrough

Let’s say that you want different logic in your destinations to process events for different auth contexts (for example, unauthenticated or system). To set up your trigger, navigate to the Eventarc section of the Google Cloud console and create a new trigger for Firestore using the event types that include authentication information. These event types end with the suffix *.withAuthContext. Because we want to capture newly written entities, we select google.cloud.firestore.document.v1.written.withAuthContext events.

You can also specify additional filters so that only relevant events from a given database and collection are delivered. In this case, we filter for events from the (default) database and for documents in the Ops collection.

On the same screen, you’ll also need to specify a destination. Trigger events can be delivered to any supported Eventarc destination, such as Cloud Run, Cloud Functions (2nd gen), and Google Kubernetes Engine. Let’s say we have a Cloud Run service named demo that exposes an HTTP endpoint to receive the events; select it as the trigger’s destination.
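If you prefer scripting to the console, the same trigger can be created with gcloud eventarc triggers create. The minimal sketch below assembles that invocation as a Python argument list (suitable for subprocess.run) using the event type, (default) database filter, Ops collection, and demo destination described above. The trigger name, location, region, service account, and the {docId} wildcard name are hypothetical placeholders, and we assume the trigger location should match the Firestore database's location.

```python
import shlex


def firestore_trigger_command(
    trigger: str,
    location: str,
    run_service: str,
    run_region: str,
    service_account: str,
    database: str = "(default)",
    collection: str = "Ops",
) -> list[str]:
    """Build a `gcloud eventarc triggers create` argv mirroring the console setup.

    All argument values are illustrative placeholders; substitute your own.
    """
    return [
        "gcloud", "eventarc", "triggers", "create", trigger,
        # Assumed to be the Firestore database's location (e.g. a multi-region).
        f"--location={location}",
        # Destination: the Cloud Run service that receives the CloudEvents.
        f"--destination-run-service={run_service}",
        f"--destination-run-region={run_region}",
        # Event type that carries the Auth Context extension.
        "--event-filters=type=google.cloud.firestore.document.v1.written.withAuthContext",
        # Only events from this database...
        f"--event-filters=database={database}",
        # ...and only for documents directly inside this collection.
        f"--event-filters-path-pattern=document={collection}/{{docId}}",
        # Firestore document payloads delivered as protobuf.
        "--event-data-content-type=application/protobuf",
        f"--service-account={service_account}",
    ]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    cmd = firestore_trigger_command(
        trigger="ops-trigger",
        location="nam5",
        run_service="demo",
        run_region="us-central1",
        service_account="trigger-sa@my-project.iam.gserviceaccount.com",
    )
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell-quoting surprises with values like (default) and {docId}.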

That’s it! When any write is applied to a document in the Ops collection of your (default) database, a CloudEvent carrying the Auth Context is delivered to the Cloud Run service demo almost immediately. You can inspect the authtype attribute, as defined in the Auth Context extension, to identify unauthenticated and system principals, as shown in https://cloud.google.com/firestore/docs/extend-with-functions-2nd-gen#event_attributes.
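The routing described above could look roughly like the following on the demo service. This is a minimal sketch using only the Python standard library, assuming Eventarc delivers events in CloudEvents binary content mode, where each attribute (including the Auth Context extension's authtype and authid) arrives as a ce-prefixed HTTP header. The route names and the actions suggested in the comments are hypothetical, and the Firestore document payload is not decoded.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def route_event(headers) -> str:
    """Choose a processing path from the CloudEvents `authtype` extension attribute.

    Assumes binary content mode: every CloudEvents attribute, including the
    Auth Context extension attributes, arrives as a `ce-<name>` HTTP header.
    """
    # Header names are case-insensitive, so normalize them before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    auth_type = normalized.get("ce-authtype", "unknown")
    if auth_type == "unauthenticated":
        return "unauthenticated"  # hypothetical path, e.g. extra validation
    if auth_type == "system":
        return "system"  # hypothetical path, e.g. log and skip business logic
    return "default"  # every other principal type


class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Drain the request body; the Firestore document payload is not decoded here.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        path = route_event(self.headers)
        print(
            f"event={self.headers.get('ce-id')} subject={self.headers.get('ce-subject')} "
            f"authid={self.headers.get('ce-authid')} -> {path}"
        )
        # Reply with a 2xx status so the delivery is treated as successful.
        self.send_response(204)
        self.end_headers()


def main() -> None:
    # Cloud Run passes the port to listen on in the PORT environment variable.
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("", port), EventHandler).serve_forever()
```

On Cloud Run, the container's entrypoint would call main(). Keeping route_event separate from the HTTP plumbing makes the auth-context logic easy to unit test without a running server.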

Next steps

For more information on how to set up and configure Firestore triggers, check out our documentation.

Thanks to both Minh Nguyen, Senior Product Manager Lead for Firestore, and Juan Lara, Senior Technical Writer for Firestore, for their contributions to this blog post.

Direct VPC egress on Cloud Run is now generally available

Tue, 23 Apr 2024 16:00:00 +0000

Today, we're launching the general availability (GA) of Direct VPC egress for Cloud Run. This feature enables your Cloud Run resources to send traffic directly to a VPC network without proxying it through Serverless VPC Access connectors, making it easier to set up, faster, and cheaper.

In fact, Direct VPC egress delivers approximately twice the throughput of both VPC connectors and the default Cloud Run internet egress path, offering up to 1 GB per second per instance. Whether you're sending traffic to destinations on the VPC, to other Google Cloud services like Cloud Storage, or to other destinations on the public internet, Direct VPC egress offers higher throughput and lower latency for performance-sensitive apps.

What's new since the preview

Notable improvements and new features:

Direct VPC egress is now supported in every region where Cloud Run is available.
Each Cloud Run service revision using Direct VPC egress can now scale beyond 100 instances, subject to a quota. If you need to scale even further, you can request a quota increase through the standard process.
Cloud NAT is supported, and Direct VPC egress traffic is now included in VPC Flow Logs and Firewall Rules Logging.

These updates address the top issues reported by our preview customers, especially larger customers with advanced scalability, networking, and security requirements.

Customer feedback

Many customers have been trying Direct VPC egress in preview since last year and have given us great feedback, including DZ BANK:

"With Direct VPC egress for Cloud Run, the platform team can more easily onboard new Cloud Run workloads because we no longer need to maintain Serverless VPC Access connectors and their associated dedicated /28 subnets. In our dynamic environment, where new Cloud Run services are created regularly, this simpler networking architecture saves us 4-6 hours per week of manual toil. We have also deprovisioned 30+ VPC connectors, saving on the additional compute costs for running them." - Tim Harpe, Senior Cloud Engineer, DZ BANK

If you enable Direct VPC egress and route all your egress traffic to a VPC, you can apply the same tools and capabilities to all your traffic, whether it comes from Cloud Run, GKE, or VMs.
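Sending all egress through the VPC is chosen at deploy time. The minimal sketch below assembles a gcloud run deploy invocation as a Python argument list, using the --network, --subnet, and --vpc-egress flags that attach a service to a subnet with Direct VPC egress. The service name, image path, region, network, and subnet are hypothetical placeholders.

```python
import shlex

# Supported values for --vpc-egress.
VALID_EGRESS = ("all-traffic", "private-ranges-only")


def direct_vpc_deploy_command(
    service: str,
    image: str,
    region: str,
    network: str,
    subnet: str,
    egress: str = "all-traffic",
) -> list[str]:
    """Build a `gcloud run deploy` argv that uses Direct VPC egress."""
    if egress not in VALID_EGRESS:
        raise ValueError(f"egress must be one of {VALID_EGRESS}, got {egress!r}")
    return [
        "gcloud", "run", "deploy", service,
        f"--image={image}",
        f"--region={region}",
        # Direct VPC egress: name a network and subnet instead of a VPC connector.
        f"--network={network}",
        f"--subnet={subnet}",
        # all-traffic routes every outbound request through the VPC;
        # private-ranges-only routes only traffic to private IP ranges.
        f"--vpc-egress={egress}",
    ]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    cmd = direct_vpc_deploy_command(
        service="demo",
        image="us-docker.pkg.dev/my-project/my-repo/demo:latest",
        region="us-central1",
        network="default",
        subnet="default",
    )
    print(shlex.join(cmd))
```

Defaulting to all-traffic matches the scenario above, where Cloud NAT, VPC Flow Logs, and firewall rules then see all of the service's outbound traffic.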

Next steps

Direct VPC egress is ready for your production workloads. Try it today and enjoy better performance and lower cost.

For a primer about how Direct VPC egress works, check out our preview blog post and its attached explainer video.

Attention DevOps engineers: Top managed container sessions to add to your Next ’24 agenda

Fri, 05 Apr 2024 16:00:00 +0000

Google Cloud Next ’24 is around the corner, and it’s the place to be if you’re serious about cloud development! Starting April 9 in Las Vegas, this global event promises a deep dive into the latest updates, features, and integrations for the services of Google Cloud’s managed container platform, Google Kubernetes Engine (GKE) and Cloud Run. From effortless scaling and optimizing AI models to providing tailored environments across a range of workloads, there’s a session for everyone. Whether you’re a seasoned cloud pro or just starting your serverless journey, you can expect to gain new insights and skills to help you deliver powerful, yet flexible, managed container environments in this next era of AI innovation.

Don’t forget to add these sessions to your event agenda — you won’t want to miss them.

Google Kubernetes Engine sessions

OPS212: How Anthropic uses Google Kubernetes Engine to run inference for Claude

Learn how Anthropic is using GKE's resource management and scaling capabilities to run inference for Claude, its family of foundation AI models, on TPU v5e [Recording, Slides].

OPS200: The past, present, and future of Google Kubernetes Engine

Kubernetes turns 10 this June! Since its launch, Kubernetes has become the de facto platform to run and scale containerized workloads. The Google team will reflect on the past decade, highlight how some of the top GKE customers use our managed solution to run their businesses, and share what the future holds [Recording, Slides].

DEV201: Go from large language model to market faster with Ray, Hugging Face, and LangChain

Learn how to deploy Retrieval-Augmented Generation (RAG) applications on GKE using open-source models and tools such as Ray, Hugging Face, and LangChain. We’ll also show you how to augment the application with your own enterprise data using the pgvector extension in Cloud SQL. After this session, you’ll be able to deploy your own RAG app on GKE and customize it [Recording, Slides].

DEV240: Run workloads not infrastructure with Google Kubernetes Engine

Join this session to learn how GKE's automated infrastructure can simplify running Kubernetes in production. You’ll explore cost optimization, autoscaling, and Day 2 operations, and learn how GKE allows you to focus on building and running applications instead of managing infrastructure [Slides].

OPS217: Access traffic management for your fleet using Google Kubernetes Engine Enterprise

Multi-cluster and tenant management are becoming increasingly important topics. The platform team will show you how GKE Enterprise makes operating a fleet of clusters easy, and how to set up multi-cluster networking to manage traffic by combining it with the Kubernetes Gateway API controllers for GKE [Slides].

OPS304: Build an internal developer platform on Google Kubernetes Engine Enterprise

Internal developer platforms (IDPs) simplify how developers work, making them more productive by letting them focus on delivering value while the platform does the heavy lifting. In this session, the platform team will show you how GKE Enterprise can serve as a great starting point for launching your IDP and demo the GKE Enterprise capabilities that make it all possible [Recording, Slides].

Cloud Run sessions

DEV205: Cloud Run – What's new

Join this session to learn what's new and improved in Cloud Run in two major areas — enterprise architecture and application management [Recording, Slides].

DEV222: Live-code an app with Cloud Run and Flutter

During this session, see the Cloud Run developer experience in real time. Follow along as two Google Developer Relations Engineers live-code a Flutter application backed by Firestore and powered by an API running on Cloud Run [Slides].

DEV208: Navigating Google Cloud - A comprehensive guide for website deployment

Learn about the major options for deploying websites on Google Cloud. This session will cover the full range of tools and services available to match different deployment strategies, from simple buckets to containerized solutions to serverless platforms like Cloud Run [Recording, Slides].

DEV235: Java on Google Cloud - The enterprise, the serverless, and the native

In this session, you’ll learn how to deploy Java apps to Google Cloud and explore all the options for running Java workloads using various frameworks [Recording, Slides].

DEV237: Roll up your sleeves - Craft real-world generative AI Java in Cloud Run

In this session, you’ll learn how to build powerful gen AI applications in Java and deploy them on Cloud Run using Vertex AI and Gemini models [Slides].

DEV253: Building generative AI apps on Google Cloud with LangChain

Join this session to learn how to combine the popular open-source framework LangChain and Cloud Run to build LLM-based applications [Recording, Slides].

DEV228: How to deploy all the JavaScript frameworks to Cloud Run

Have you ever wondered if you can deploy JavaScript applications to Cloud Run? Find out in this session as one Google Cloud developer advocate sets out to prove that you can by deploying as many JavaScript frameworks to Cloud Run as possible [Slides].

DEV241: Cloud-powered, API-first testing with Testcontainers and Kotlin

Testcontainers is a popular API-first framework for testing applications. In this session, you’ll learn how to use the framework through an end-to-end example that uses Kotlin with BigQuery, Pub/Sub, Cloud Build, and Cloud Run to improve the testing feedback cycle [Slides].

ARC104: The ultimate hybrid example - A fireside chat about how Google Cloud powers (part of) Alphabet

Join this fireside chat to learn about the ultimate hybrid use case: running Alphabet services on some of Google Cloud’s most popular offerings. Learn how Alphabet leverages Google Cloud runtimes like GKE, why it doesn’t run everything on Google Cloud, and why some products run partially on cloud [Slides].

DEV202: Accelerate your AI with Serverless

Serverless platforms and generative AI applications are a great match. In this talk, you'll learn how Google Cloud's pay-as-you-go model for serverless runtimes can be used to supplement your generative AI model with function calling [Recording, Slides].

DEV299: A Java developer walks into a serverless bar

This session is for Java developers who want to learn how to deploy their apps to Google Cloud. It offers a practical guide to considerations, challenges, and tips and tricks for optimizing your JVM for serverless environments [Recording, Slides].

Firebase sessions

DEV221: Use Firebase for faster, easier mobile application development

Firebase is a beloved platform for developers, helping them develop apps faster and more efficiently. This session will show you how Firebase can accelerate application development with prebuilt backend services, including authentication, databases and storage [Recording, Slides].

DEV243: Build full stack applications using Firebase and Google Cloud

Firebase and Google Cloud can be used together to build and run full stack applications. In this session, you’ll learn how to combine these two powerful platforms to enable enterprise-grade application development and create better experiences for users [Slides].

DEV107: Make your app super with Google Cloud Firebase

Learn how Firebase and Google Cloud are the superhero duo you need to build enterprise-scale AI applications. This session will show you how to extend Firebase with Google Cloud using Gemini, our most capable and flexible AI model yet, to build, secure, and scale your AI apps [Slides].

DEV250: Generative AI web development with Angular

In this session, you’ll explore how to use Angular v18 and Firebase Hosting to build and deploy lightning-fast applications with Google's Gemini generative AI [Recording, Slides].

See you at the show!