<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Cloud Operations</title><link>https://cloud.google.com/blog/products/operations/</link><description>Cloud Operations</description><atom:link href="https://cloudblog.withgoogle.com/blog/products/operations/rss/" rel="self"></atom:link><language>en</language><lastBuildDate>Fri, 04 Nov 2022 16:43:25 +0000</lastBuildDate><image><url>https://cloud.google.com/blog/products/operations/static/blog/images/google.a51985becaa6.png</url><title>Cloud Operations</title><link>https://cloud.google.com/blog/products/operations/</link></image><item><title>Flexible committed use discounts — a simple new way to discount Compute Engine instances</title><link>https://cloud.google.com/blog/products/compute/save-money-with-the-new-compute-engine-flexible-cuds/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;Saving money never goes out of style. Today, many of our customers use Compute Engine resource-based &lt;a href="https://cloud.google.com/compute/docs/instances/signing-up-committed-use-discounts#memory-optimized_commitments"&gt;committed use discounts (CUDs)&lt;/a&gt; to help them save on steady-state compute usage within a specific machine family and region. As part of our commitment to offer more flexible and easy ways for you to manage your spend, we now offer a new type of committed use discount for Compute Engine: flexible CUDs.&lt;/p&gt;&lt;p&gt;Flexible CUDs are spend-based commitments that offer predictable and simple flat-rate discounts (28% off 1-year, and 46% off 3-years) that apply across multiple VM families and regions. 
Similar to &lt;a href="https://cloud.google.com/compute/docs/instances/signing-up-committed-use-discounts#memory-optimized_commitments"&gt;resource-based CUDs&lt;/a&gt;, you can apply flexible CUDs across projects within the same billing account, and to VMs of different sizes and tenancy, to support changing workload requirements while keeping costs down. &lt;/p&gt;&lt;p&gt;Today, Compute Engine flexible CUDs are available for most general-purpose (N1, N2, E2, N2D) and compute-optimized (C2, C2D) VM usage, including instance CPU and memory usage across all regions (refer to &lt;a href="https://cloud.google.com/skus/sku-groups/compute-engine-flexible-cud-eligible-skus"&gt;the complete list&lt;/a&gt; with more VM families to come). &lt;/p&gt;&lt;p&gt;Similar to resource-based CUDs, flexible CUDs are discounts over usage, not capacity reservations. To ensure capacity availability, make a separate &lt;a href="https://cloud.google.com/compute/docs/instances/reserving-zonal-resources"&gt;reservation&lt;/a&gt;, and CUDs will apply automatically to any eligible usage as a result.&lt;/p&gt;&lt;p&gt;You can &lt;a href="https://cloud.google.com/docs/cuds-spend-based#purchasing"&gt;purchase&lt;/a&gt; CUDs from any billing account, and the discount can apply to any eligible usage in projects paid for by that billing account. When you purchase a flexible CUD, you pay the same commitment fee for the entirety of the commitment term, even if your usage falls below this commitment value. The commitment fee is billed monthly. Once a commitment is purchased, it cannot be canceled.&lt;/p&gt;&lt;p&gt;For the best combination of savings and flexibility, you can combine resource-based CUDs and flexible CUDs together. You can have standard resource-based CUDs to cover your most stable resource usage and flexible spend based CUDs to cover your more dynamic resource usage. Every hour, standard CUDs apply first to any eligible usage followed by flexible CUDs, optimizing the use of your CUDs. 
Finally, any usage overage, or usage that’s not eligible for either type of CUD, is charged at your on-demand rates. &lt;/p&gt;&lt;p&gt;Here is a quick summary of the differences between resource-based CUDs and flexible CUDs:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_cud.max-1000x1000.jpg"
        
          alt="1 cud.jpg"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/medianet.max-1000x1000.jpg"
        
          alt="medianet.jpg"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;h3&gt;What customers are saying about flexible CUDs&lt;/h3&gt;&lt;p&gt;&lt;i&gt;“Media.net is a global company with dynamic resource requirements. With flexible CUDs, Media.net is able to quickly and easily save money on baseline workload requirements, while giving it the flexibility to use different machine types and regions. Media.net chose Spot VMs after exploring various options to support spiky workloads, as they provided Media.net with both deep discounts and simple, predictable pricing. Flexible CUDs and Spot VMs were the perfect combination to optimize costs for the dynamic capacity needs of the business.” &lt;b&gt;— Amit Bhawani, Sr VP of Engineering, Media.net&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_with_image"&gt;&lt;div class="article-module h-c-page"&gt;
  &lt;div class="h-c-grid uni-paragraph-wrap"&gt;
    &lt;div class="uni-paragraph
      h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;

      






  

    &lt;figure class="article-image--wrap-small
      
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/lucidworks.max-1000x1000.jpg"
        
          alt="lucidworks.jpg"&gt;
        
      
    &lt;/figure&gt;

  





      &lt;p&gt;&lt;i&gt;“As Lucidworks expands our product offerings, Google Cloud's Flexible CUDs have been the perfect solution to optimize cost while giving us the flexibility to shift workloads to different regions based on customer demographics and different instance families based on performance characteristics.” — &lt;b&gt;Matt Roca, Director of Cloud Governance and Security, Lucidworks&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h3&gt;Understanding flexible CUDs &lt;/h3&gt;&lt;p&gt;You can &lt;a href="https://cloud.google.com/docs/cuds-spend-based#purchasing"&gt;purchase&lt;/a&gt; a flexible CUD in the Google Cloud console or via the API. A flexible CUD goes into effect one hour after purchase, and the discounts will automatically be applied to any eligible usage. &lt;/p&gt;&lt;p&gt;Your flexible CUD is applied to eligible on-demand spend by the hour. If during a given hour, you spend less than what you committed to, you will not fully utilize your commitment or realize your full discount.&lt;/p&gt;&lt;p&gt;For example: If you want to cover $100 worth of on-demand spend every hour by a flexible CUD, you will pay $54/hour (46% off) for 3 years (payable monthly), and receive a $100 credit that applies automatically to any eligible on-demand spend. The $100 credit burns down at the eligible on-demand rate for every eligible SKU, and expires if unused.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_cud.max-1000x1000.jpg"
        
          alt="2 cud.jpg"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;&lt;b&gt;Attributing flexible CUDs credits&lt;/b&gt;&lt;b&gt;&lt;br/&gt;&lt;/b&gt;If you are running multiple projects within the same billing account, the credits from flexible CUDs are attributed proportionally across projects within the billing account and across SKUs within the same project according to their usage proportion.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Planning for flexible CUDs purchases&lt;/b&gt;&lt;b&gt;&lt;br/&gt;&lt;/b&gt;A good way to think about how to purchase and use resource based CUDs with flexible CUDs is to first forecast and purchase resource based CUDs based on your steady state resource spend, to get the deepest discounts. A best practice is to use flexible CUDs for more variable and growing workloads, and to use on-demand VMs, or &lt;a href="https://cloud.google.com/blog/products/compute/google-cloud-spot-vm-use-cases-and-best-practices"&gt;Spot VMs&lt;/a&gt;, for the rest of your usage. &lt;/p&gt;&lt;h3&gt;Get started with flexible CUDs today&lt;/h3&gt;&lt;p&gt;Building a business in the cloud can be complicated; paying for it should be easy. We designed flexible CUDs to make it easy for organizations to enjoy significant discounts across a wide variety of Google Cloud resources in a way that’s simple and predictable. 
For more details on how to &lt;a href="https://cloud.google.com/docs/cuds-spend-based#purchasing"&gt;purchase&lt;/a&gt; and use flexible CUDs and to get started, refer to &lt;a href="https://cloud.google.com/compute/docs/instances/committed-use-discounts-overview"&gt;this documentation&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 04 Nov 2022 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/compute/save-money-with-the-new-compute-engine-flexible-cuds/</guid><category>Cloud Operations</category><category>Infrastructure</category><category>Google Cloud</category><category>Compute</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Flexible committed use discounts — a simple new way to discount Compute Engine instances</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/compute/save-money-with-the-new-compute-engine-flexible-cuds/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yasmin Mowafy</name><title>Sr. Product Manager</title><department></department><company></company></author></item><item><title>Some beans and gems, some snakes and elephants, with Java 17, Ruby 3, Python 3.10 and PHP 8.1 in App Engine and Cloud Functions</title><link>https://cloud.google.com/blog/topics/developers-practitioners/new-java-ruby-python-php-runtimes/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;Time to spill the beans and show the gems, to our friendly snakes and elephants: we’ve got some great news for Java, Ruby, Python and PHP serverless developers today. 
Google App Engine and Cloud Functions are adding new modern runtimes, allowing you to update to the major version release trains of those programming languages.&lt;/p&gt;&lt;p&gt;In short, here’s what’s new:&lt;br/&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Access to App Engine legacy bundled services for the Java 11/17, Python 3 and Go 1.12+ runtimes is Generally Available&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The Java 17, Ruby 3.0, Python 3.10, and PHP 8.1 runtimes come into preview in App Engine and Cloud Functions&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p/&gt;&lt;p&gt;Let’s have a closer look. First of all, access to App Engine legacy bundled services for second generation runtimes for &lt;a href="https://cloud.google.com/appengine/docs/standard/java-gen2/services/access"&gt;Java&lt;/a&gt;, &lt;a href="https://cloud.google.com/appengine/docs/standard/python3/services/access"&gt;Python&lt;/a&gt; and &lt;a href="https://cloud.google.com/appengine/docs/standard/go/services/access"&gt;Go&lt;/a&gt; is now &lt;b&gt;Generally Available&lt;/b&gt;. In the past, for example on the Java platform, only Java 8 (a first generation runtime) could access the &lt;a href="https://cloud.google.com/appengine/docs/standard/java-gen2/reference/services/bundled"&gt;built-in APIs&lt;/a&gt; like Memcache, Images, Mail, or Task Queues. Now, if you use the Java 11 runtime (a second generation runtime), you can access those services as well as all the &lt;a href="https://cloud.google.com/java/docs/reference"&gt;Google Cloud APIs&lt;/a&gt;. For example, you can now store transient cached data in Memcache, or send an email to the users of your application, from a second generation runtime. The same goes for Python and Go developers: you can take advantage of the bundled services as well. If you’re still using an old runtime version, this will further ease the transition to newer versions. 
Be sure to check it out and upgrade.&lt;/p&gt;&lt;p/&gt;&lt;p&gt;Next, let’s continue with a fresh bean and a shiny gem, mixed in with some friendly animals, with the &lt;b&gt;preview of the Java 17, Ruby 3.0, Python 3.10 and PHP 8.1 runtimes for both App Engine and Cloud Functions&lt;/b&gt;. Let’s take a look at what’s new in those language versions.&lt;/p&gt;&lt;h3&gt;&lt;b&gt;Java&lt;/b&gt;&lt;/h3&gt;&lt;p&gt;Between the two Long-Term-Support versions Java 11 and 17, a lot of new features have landed. Java developers can now write text blocks for strings spanning several lines, without having to concatenate multiple strings manually. The switch construct has evolved to become an expression, which lets you break away from the &lt;code&gt;break&lt;/code&gt; keyword, and paves the way for more advanced pattern matching capabilities. Speaking of which, the &lt;code&gt;instanceof&lt;/code&gt; keyword now supports pattern matching, avoiding obvious but redundant casts. Records allow you to create more streamlined immutable data classes, rather than writing your own Java beans with the proper &lt;code&gt;hashCode()&lt;/code&gt;, &lt;code&gt;equals()&lt;/code&gt; or &lt;code&gt;toString()&lt;/code&gt; methods. And sealed classes give you finer control over which classes can extend your class hierarchy.&lt;/p&gt;&lt;h3&gt;&lt;b&gt;Ruby&lt;/b&gt;&lt;/h3&gt;&lt;p&gt;With Ruby 3.0, the big highlights were performance, static type checking, and concurrency. The goal of making Ruby 3.0 three times faster than Ruby 2.0 on some benchmarks was reached, making your code run more swiftly. Also, Ruby programs can now be annotated with typing information, which allows type checkers to provide static type checking and improve the quality of your code. 
For concurrency and parallelism, a new actor-inspired concurrency primitive called Ractor helps tame multiple cores in parallel, for your demanding workloads. A fiber scheduler is also introduced for intercepting blocking operations. Beyond those headlines, many improvements to various Ruby APIs have also landed.&lt;/p&gt;&lt;h3&gt;&lt;b&gt;Python&lt;/b&gt;&lt;/h3&gt;&lt;p&gt;In Python 3.10, the parser gives better and clearer error messages for syntax errors (with more accurate error locations), as well as for indentation, attribute, and name errors, which greatly helps developers find the problems in their code. Structural pattern matching lands with a new &lt;code&gt;match&lt;/code&gt; and &lt;code&gt;case&lt;/code&gt; statement construct. Further PEPs bring more robust type hints for static type checkers. Parenthesized context managers have been added to make the code prettier when a long collection of context managers spans several lines. &lt;/p&gt;&lt;h3&gt;&lt;b&gt;PHP&lt;/b&gt;&lt;/h3&gt;&lt;p&gt;With version 8.1, PHP gets a pretty major update. First, a new &lt;code&gt;enum&lt;/code&gt; syntax construct replaces the pattern of defining constants in classes, and you get validation out of the box. Classes can now define final class constants. The new &lt;code&gt;readonly&lt;/code&gt; properties can’t be changed after initialization, which is great for value objects and DTOs. A first-class callable syntax is introduced, allowing you to get a reference to any function with a short syntax. Developers will also find further improvements to initializers that make it possible to use nested attributes, objects as default parameter values, static variables, and global constants. 
One last nice addition we can mention is the introduction of fibers to implement lightweight cooperative concurrency.&lt;/p&gt;&lt;h3&gt;&lt;b&gt;Your turn&lt;/b&gt;&lt;/h3&gt;&lt;p&gt;Gems, beans, elephants, snakes: there’s something great in those new language versions for every developer. Thus, with these new runtimes in Preview, Java, Ruby, Python and PHP developers can update or develop new App Engine apps and Cloud Functions using the latest and greatest versions of their favorite languages. Be sure to check out the documentation for App Engine (&lt;a href="https://cloud.google.com/appengine/docs/standard/java-gen2/runtime"&gt;Java&lt;/a&gt;, &lt;a href="https://cloud.google.com/appengine/docs/standard/ruby/runtime"&gt;Ruby&lt;/a&gt;, &lt;a href="https://cloud.google.com/appengine/docs/standard/python3/runtime"&gt;Python&lt;/a&gt;, &lt;a href="https://cloud.google.com/appengine/docs/standard/php7/runtime"&gt;PHP&lt;/a&gt;) and Cloud Functions (&lt;a href="https://cloud.google.com/functions/docs/concepts/java-runtime"&gt;Java&lt;/a&gt;, &lt;a href="https://cloud.google.com/functions/docs/concepts/ruby-runtime"&gt;Ruby&lt;/a&gt;, &lt;a href="https://cloud.google.com/functions/docs/concepts/python-runtime"&gt;Python&lt;/a&gt;, &lt;a href="https://cloud.google.com/functions/docs/concepts/php-runtime"&gt;PHP&lt;/a&gt;). We’re looking forward to hearing from you about how you’ll take advantage of those new language runtimes.&lt;/p&gt;&lt;p/&gt;&lt;p/&gt;&lt;/div&gt;
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/serverless/introducing-the-next-generation-of-cloud-functions/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('https://storage.googleapis.com/gweb-cloudblog-publish/images/cloudfunctions1.max-500x500.jpg')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;Supercharge your event-driven architecture with new Cloud Functions (2nd gen)&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;The next generation of our Cloud Functions Functions-as-a-Service platform gives you more features, control, performance, scalability and...&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Wed, 13 Apr 2022 18:30:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/new-java-ruby-python-php-runtimes/</guid><category>App Engine</category><category>Cloud Operations</category><category>Developers &amp; Practitioners</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Some beans and gems, some snakes and elephants, with Java 17, Ruby 3, Python 3.10 and PHP 8.1 in App Engine and Cloud Functions</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/new-java-ruby-python-php-runtimes/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Guillaume Laforge</name><title>Cloud Developer Advocate</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ludovic Champenois</name><title>App Engine Java Runtime Tech Lead</title><department></department><company></company></author></item><item><title>Optimize your applications using Google Vertex AI Vizier</title><link>https://cloud.google.com/blog/products/ai-machine-learning/optimize-your-applications-using-google-vertex-ai-vizier/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;Businesses around the globe are continuing to benefit from innovations in Artificial Intelligence (AI) and Machine Learning (ML). At F5, we are using AI/ML in meaningful ways to improve data security, fraud detection, bot attack prevention and more. While the benefits of AI/ML for these business processes are well articulated, at F5, we also use AI/ML to optimize our software applications. &lt;/p&gt;&lt;p&gt;Using AI/ML for better software engineering is still in its early days. We are seeing use cases around AI-assisted code completion and auto-code generation for no-code/low-code platforms, but we are not seeing broad usage of AI/ML in optimizing the software application architecture itself. 
In this blog, we will demonstrate workload optimization of a data pipeline using black-box optimization with Google’s Vertex AI Vizier.&lt;/p&gt;&lt;h2&gt;Performance Optimization&lt;/h2&gt;&lt;p&gt;Today, software optimization is an iterative and mostly manual process where profilers are used to identify the performance bottlenecks in software code. Profilers measure software performance and generate reports that developers review to further optimize the code. The drawback of this manual approach is that the optimization depends on the developer’s experience and hence is very subjective. It is slow, non-exhaustive, error-prone and susceptible to human bias. The distributed nature of cloud-native applications further complicates the manual optimization process.&lt;/p&gt;&lt;p&gt;An under-utilized and more global approach is another type of performance engineering that relies on performance experiments and black-box optimization algorithms. More specifically, we aim to optimize the operational cost of a complex system with many parameters. Other experiment-based performance optimization techniques exist, such as causal profiling, but they are outside the scope of this post. &lt;/p&gt;&lt;p&gt;As illustrated in Figure 1, the optimization process is iterative and automated. A succession of controlled trials is performed on a system to study the value of a cost function characterizing the system to be optimized. New candidate parameters are generated, and more trials are performed, until further improvement is too small to be worthwhile. More details on this process later in this post.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Vertex_AI_Vizier.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Vertex_AI_Vizier.max-1000x1000.jpg"
        
          alt="1 Vertex AI Vizier.jpg"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;i&gt;Figure 1: Black-box optimization - Iterative experiments to arrive at optimal output as a cost function&lt;/i&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h2&gt;What is the problem &lt;/h2&gt;&lt;p&gt;Let's first set the stage - partly inspired by our experience, partly fictitious for the purpose of  this discussion.  &lt;/p&gt;&lt;p&gt;Our objective is to build an efficient way to get data from PubSub to BigQuery. Google Cloud offers a fully managed data processing service, Dataflow for executing a wide variety of data processing patterns which we use for multiple other realtime streaming needs. We opted to leverage a simplified custom stream processor for this use case for processing and transformations to benefit from the 'columnar' orientation of BQ — sort of 'E(t)LT' model.  &lt;/p&gt;&lt;p&gt;The setup for our study is illustrated in more detail in Figure 2. The notebook in the central position plays the role of orchestrator for the study of the 'system under optimization'.  The main objectives (and components involved) are: &lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Reproducibility&lt;/b&gt;: in addition to an automated process, a pub/sub snapshot is used to  initialize a subscription specifically created to feed the stream processor to reproduce  the same conditions for each experiment. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Scalability&lt;/b&gt;: the  Vertex AI Workbench implements a set of automated procedures used to run  multiple experiments in parallel with different input parameters to speed up the overall  optimization process.  &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Debuggability&lt;/b&gt;: for every experiment the study and trial ids are systematically injected as  labels for each log and metric produced by the stream processor. In this way, we can easily isolate, analyze, and understand the reasons for a failed experiment or one with  surprising results.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_Vertex_AI_Vizier.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_Vertex_AI_Vizier.max-1000x1000.jpg"
        
          alt="2 Vertex AI Vizier.jpg"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;i&gt;Figure 2: High level architecture for conducting the experiments&lt;/i&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;To move the data from PubSub to BigQuery efficiently, we designed and developed some code and now want to refine it to be as efficient as possible. We have a program, and we want to  optimize based on performance metrics that are easy to capture from running it. Our question  now is how do we select the best variant?&lt;/p&gt;&lt;p&gt;Not too surprisingly, this is an &lt;i&gt;optimization problem&lt;/i&gt;: the world is full of them! Essentially, these  problems are all about optimizing (minimizing or maximizing) an objective function under some  constraints and finding the minima or maxima where this happens. Because of their widespread  applicability, this is a domain that has been studied extensively. &lt;/p&gt;&lt;p&gt;The form is typically:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--medium
      
      
        h-c-grid__col
        
        h-c-grid__col--4 h-c-grid__col--offset-4
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_Vertex_AI_Vizier.max-1000x1000.jpg"
        
          alt="3 Vertex AI Vizier.jpg"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;- read as we want the x of a certain domain &lt;i&gt;X&lt;/i&gt; that minimizes a cost function &lt;i&gt;f&lt;/i&gt;. Since it is a minimization problem here, such &lt;i&gt;x &lt;/i&gt;are called minima. Minima don’t necessarily exist and when they do are not necessarily unique.  Not all optimization problems are equal: continuous and linear programming is 'easy', convex  optimization is still relatively easy, combinatorial optimization is harder... and this is assuming we can describe the objective function we want to optimize — even partially as  with being able to compute gradients.  &lt;/p&gt;&lt;p&gt;In our case, the objective function is some performance (still TBD at this point) of some  program in some execution environment. This is far from &lt;i&gt;f(x)=x&lt;sup&gt;2&lt;/sup&gt;&lt;/i&gt;: we have no analytical  expression for our program performance, no derivatives, no guarantee that the function is  convex, the evaluation is costly, and the observation can be noisy. This type of optimization is  called 'black-box optimization' for the reason that we cannot describe it in simple mathematical  terms our objective function. Nonetheless we are very much interested in finding the  parameters that deliver the best result. &lt;/p&gt;&lt;p&gt;Let's now frame our situation as a concrete optimization problem before we introduce further  black-box optimization and some tools as we are looking for a way to automate solving this  type of problems rather than doing it manually — 'time is money' as they say.  &lt;/p&gt;&lt;h2&gt;Framing as an optimization problem&lt;/h2&gt;&lt;p&gt;Our problem has many moving parts but not all have the same nature.  &lt;/p&gt;&lt;h3&gt;Objective &lt;/h3&gt;&lt;p&gt;First comes the objective. In our case, we want to minimize the cost per byte of moving data  from PubSub to BigQuery. 
Assuming that the system scales linearly in the domain we are interested in, the cost per byte processed is independent of the number of instances, and allows us to extrapolate precisely the cost to reach a defined throughput. &lt;/p&gt;&lt;p&gt;How do we get there? &lt;/p&gt;&lt;p&gt;We run our program on a significant and known volume of data in a specified execution environment — think specific machine type, location, and program configuration — measure how long it takes to process it, and calculate the cost of the resources, named `cost_dollar` below. This is our cost function &lt;i&gt;f&lt;/i&gt;. &lt;/p&gt;&lt;p&gt;As mentioned earlier, there is no simple mathematical expression defining the cost function of our system, and evaluating it actually involves running a program, which is 'costly'. &lt;/p&gt;&lt;h3&gt;Parameter space&lt;/h3&gt;&lt;p&gt;Our system has numerous knobs: the program has many configuration parameters corresponding to alternative ways of doing things we want to explore, and sizing parameters such as different queue sizes or numbers of workers. The execution environment defines even more parameters: VM configuration, machine type, OS image, location, ... In general, the number of parameters can vary wildly — for this scenario, we have a dozen. &lt;/p&gt;&lt;p&gt;In the end, our parameter space is described by Table 1, which for each `parameter_id` gives the type of value (integer, discrete or categorical).&lt;/p&gt;&lt;/div&gt;
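To make the idea concrete, a parameter space of this shape could be declared and validated as follows. The parameter names and domains here are illustrative, not the actual contents of Table 1.

```python
# Illustrative declaration of a parameter space like the one described
# above: each entry pairs a parameter_id with its domain (integer,
# discrete, or categorical). Names and ranges are made up, not the
# actual Table 1.
PARAMETER_SPACE = [
    {"parameter_id": "num_workers",  "integer": {"min": 1, "max": 16}},
    {"parameter_id": "queue_size",   "discrete": [128, 256, 512, 1024]},
    {"parameter_id": "machine_type",
     "categorical": ["e2-standard-4", "n2-standard-4", "c2-standard-4"]},
]

def in_domain(assignment):
    """True if a trial's parameter assignment lies inside the declared domain."""
    for spec in PARAMETER_SPACE:
        value = assignment[spec["parameter_id"]]
        if "integer" in spec:
            if not spec["integer"]["min"] <= value <= spec["integer"]["max"]:
                return False
        elif "discrete" in spec:
            if value not in spec["discrete"]:
                return False
        elif value not in spec["categorical"]:
            return False
    return True

print(in_domain({"num_workers": 8, "queue_size": 256,
                 "machine_type": "e2-standard-4"}))   # True
print(in_domain({"num_workers": 99, "queue_size": 256,
                 "machine_type": "e2-standard-4"}))   # False
```

A declarative domain like this is exactly what the optimizer needs: it bounds the search without saying anything about which combinations are good.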
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_Vertex_AI_Vizier.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_Vertex_AI_Vizier.max-1000x1000.jpg"
        
          alt="4 Vertex AI Vizier.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
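To make the setup concrete, here is a minimal sketch of how a parameter space like the one in Table 1 could be represented in code. The parameter names, types, and domains below are invented for illustration; they are not the actual study's parameters.

```python
import random

# Hypothetical parameter space mirroring the structure of Table 1:
# each parameter_id maps to a type (integer, discrete, or categorical)
# and its domain. Names and values below are illustrative only.
parameter_space = {
    "machine_type": {"type": "categorical",
                     "values": ["e2-standard-4", "n2-highcpu-4", "c2-standard-8"]},
    "worker_count": {"type": "integer", "min": 1, "max": 16},
    "queue_size": {"type": "discrete", "values": [128, 256, 512, 1024]},
}

def sample_assignment(space, rng=random):
    """Draw one random assignment from the parameter space."""
    assignment = {}
    for name, spec in space.items():
        if spec["type"] == "integer":
            assignment[name] = rng.randint(spec["min"], spec["max"])
        else:  # discrete or categorical: pick one of the listed values
            assignment[name] = rng.choice(spec["values"])
    return assignment

trial = sample_assignment(parameter_space)
print(trial)
```

Each such assignment corresponds to one experiment whose `cost_dollar` we would measure.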
&lt;div class="block-paragraph"&gt;&lt;p&gt;The objective has been identified, we know how to evaluate it for some given assignment of a  collection of identified parameters and we have defined the domain of these parameters.  That's what we need to allow us to do some black-box optimization. &lt;/p&gt;&lt;h2&gt;Approach &lt;/h2&gt;&lt;p&gt;Back to the black-box optimization: we already stated this is a problem dealing with  minimization/maximization of a function for which we have no expression, we can still evaluate  it! We just need to run an experiment and determine the cost. &lt;/p&gt;&lt;p&gt;The issue is running the experiment has a cost and given the parameter space, exploring them  all is rapidly not a viable option. Assuming you pick only 3 values for each of the 12-ish  parameters: 3&lt;sup&gt;12&lt;/sup&gt;=531,441 — it's large already. This method of exploring systematically all  the combinations generated from a subset of each parameter taken individually is called &lt;b&gt;grid  search&lt;/b&gt;. &lt;/p&gt;&lt;p&gt;Instead, we use a form of &lt;i&gt;surrogate optimization&lt;/i&gt;: In case like this one where there is no  convenient representation of our objective function, it can be beneficial to introduce a  surrogate function with better properties that models the actual function. Certainly, instead of  one problem: minimizing our cost function, we have two: fitting a function on our problem and  minimizing it. But we gained a recipe to move forward: fit a model on the observations and use  this model to help choose a promising candidate for which we need to run an experiment. Once we have the result of the experiment, the model can be refined and new candidates can be  generated, until marginal improvements are not worth the effort.  &lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs/vizier/overview"&gt;Google Cloud Vertex AI Vizier&lt;/a&gt; offers  this type of optimization as a service. 
If you want to read more about what is behind it (spoiler: it relies on Gaussian process (GP) optimization), check this publication for a complete description: &lt;a href="https://research.google/pubs/pub46180/" target="_blank"&gt;Google Vizier: A Service for Black-Box Optimization&lt;/a&gt;. &lt;/p&gt;&lt;p&gt;We performed 148 different experiments with different combinations of input parameters. What have we learned? &lt;/p&gt;&lt;h2&gt;Results of our study&lt;/h2&gt;&lt;p&gt;The point of this discussion is not to detail precisely which parameters gave us the best cost; that knowledge is not transferable, as your program, setup, and pretty much everything else will be different. But to give an idea of the method's potential: in our case, over 148 runs, our cost function went from $0.0780/run with our initial guessed configuration down to $0.0443/run with the best parameters, a cost reduction of 43%. Unsurprisingly, the `machine_type` parameter plays a major role here, but even with the same machine type as the one offering the best results, the (explored) portion of our cost function ranges between $0.0443/run and $0.0531/run, a variation of 16%. &lt;/p&gt;&lt;p&gt;The most promising runs are represented in Figure 3. All axes but the last two correspond to parameters; the last two represent, respectively, the objective `cost_dollar` and whether the run completed. Each line represents one run and connects the values that run takes on each axis.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_Vertex_AI_Vizier.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_Vertex_AI_Vizier.max-1000x1000.jpg"
        
          alt="5 Vertex AI Vizier.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
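The surrogate recipe from the Approach section (fit a model on past observations, use it to pick the next trial, run the expensive experiment only there) can be illustrated with a deliberately tiny sketch. It uses a quadratic fit on a made-up one-dimensional objective; Vizier itself relies on Gaussian processes and a much richer search strategy, so this is a didactic stand-in, not its algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the costly black-box objective; in the post this is a
# full pipeline run returning cost_dollar. The function below is made up.
def run_experiment(x):
    return (x - 3.2) ** 2 + 0.5 + rng.normal(0, 0.01)

# Seed the surrogate with a few random trials.
xs = list(rng.uniform(0, 10, size=5))
ys = [run_experiment(x) for x in xs]

for _ in range(15):
    # Fit a cheap surrogate model to all observations so far
    # (a quadratic here; Vizier uses Gaussian processes).
    coeffs = np.polyfit(xs, ys, deg=2)
    # Ask the surrogate to rank many candidates, which is cheap...
    candidates = rng.uniform(0, 10, size=256)
    x_next = candidates[np.argmin(np.polyval(coeffs, candidates))]
    # ...and run the real, expensive experiment only at the most promising one.
    xs.append(x_next)
    ys.append(run_experiment(x_next))

best_x = xs[int(np.argmin(ys))]
print(f"best x: {best_x:.2f}, best observed cost: {min(ys):.3f}")
```

The loop concentrates the expensive evaluations near the surrogate's predicted minimum instead of spreading them over a grid.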
&lt;div class="block-paragraph"&gt;&lt;p&gt;To conclude on the study part, this helped us uncover substantial cost improvement with  almost no intervention from our end. Let's explore that aspect more in the next section.&lt;/p&gt;&lt;h2&gt;Learnings on the method &lt;/h2&gt;&lt;p&gt;One of the main advantages of this method is that: provided you have been through the initial  effort of setting things up suitably, it can run on its own and require little to no human  intervention.  &lt;/p&gt;&lt;p&gt;Black-box optimization assumes the evaluation of &lt;i&gt;f(x)&lt;/i&gt; only depends on x not on what else is going on at the same time. &lt;/p&gt;&lt;p&gt;We don't want to see interactions between the different evaluations of &lt;i&gt;f(x)&lt;/i&gt;. &lt;/p&gt;&lt;p&gt;One of the main applications of Vizier is deep learning model hyper-parameter optimization.  The training and evaluation are essentially devoid of side effects — cost aside, but we already  said black-box optimization methods assume the evaluation is costly and are designed to  reduce the number of runs needed to find the optimal parameters. Our scenario has definitively  side-effects: it is moving data from one place to another.  &lt;/p&gt;&lt;p&gt;So, if we ensure all side-effects are removed from our performance experiment, life should  be easy on us. Black-box optimization methods apply, and Vizier in particular can be used. This can be achieved by wrapping the execution of our scenario in some logic to setup and tear  down an isolated environment: making this whole new system essentially without side-effect. 
&lt;/p&gt;&lt;p&gt;A couple of lessons on running these kinds of tests that we think are worth highlighting: &lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Parameterize everything, even if there is a single value at first&lt;/b&gt;: if another value becomes necessary, it is easy to add; at worst, the values are recorded along with your data, making it easier to compare different experiments if needed. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Isolation between runs and everything else&lt;/b&gt;: anything that is not parameterized but has an impact on the objective makes the measurements noisier and makes it harder for the optimization process to be decisive about where to explore next. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Isolation between concurrent runs&lt;/b&gt;: so that we can run multiple experiments at once. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Robust runs&lt;/b&gt;: not all combinations of parameters are feasible, and Vizier supports reporting them as such. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Enough runs&lt;/b&gt;: Vizier leverages the results of previous runs to decide what to explore next, and you can request a number of experiments to run at once, without having to provide the measurements yet. This is useful for starting to run experiments in parallel, but in our experience it is also useful for ensuring broad initial coverage of the parameter space before the exploration starts trying to pinpoint local extrema. For example, in the set of runs described earlier in the post, 'n2-highcpu-4' didn't get tried until run 107. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Tools exist today&lt;/b&gt;: Vizier is one example, available as a service. There are also many Python libraries for black-box optimization.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Definitely something to have in your toolbox if you don't want to spend hours turning knobs and would rather have a machine do it.
&lt;/p&gt;&lt;h2&gt;Conclusion and next steps &lt;/h2&gt;&lt;p&gt;Black-box optimization is unavoidable for ML hyper-parameter tuning. Google Vertex AI Vizier is a black-box optimization service with a wider range of applications. We believe it is also a great tool for engineering complex systems characterized by many parameters with essentially unknown or difficult-to-describe interactions. For small systems, manual and/or systematic exploration of the parameters might be possible, but the point of this post is that it can be automated!&lt;/p&gt;&lt;p&gt;Optimizing performance is a recurring challenge, as everything keeps changing and new options and/or new usage patterns appear. &lt;/p&gt;&lt;p&gt;The setup presented in this post is relatively simple and very static. There are natural extensions of it to continuous online optimization, such as multi-armed bandits, that are worth exploring from a software engineering perspective. &lt;/p&gt;&lt;p&gt;What if the future of application optimization were already here, just not very evenly distributed, to paraphrase William Gibson? &lt;/p&gt;&lt;p&gt;Think this is cool? The F5 AI &amp;amp; Data group is &lt;a href="https://www.f5.com/company/careers" target="_blank"&gt;hiring&lt;/a&gt;! 
&lt;/p&gt;&lt;h2&gt;References &lt;/h2&gt;&lt;p&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://research.google/pubs/pub46180/" target="_blank"&gt;Google Vizier: A Service for Black-Box Optimization&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs/vizier/overview"&gt;Google Cloud Vertex AI Vizier&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 26 Jan 2022 18:30:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/ai-machine-learning/optimize-your-applications-using-google-vertex-ai-vizier/</guid><category>Data Analytics</category><category>Cost Management</category><category>Cloud Operations</category><category>AI &amp; Machine Learning</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Optimize your applications using Google Vertex AI Vizier</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/ai-machine-learning/optimize-your-applications-using-google-vertex-ai-vizier/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Sebastien Soudan</name><title>Senior Architect - F5</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Laurent Querel</name><title>Distinguished Engineer - F5</title><department></department><company></company></author></item><item><title>Creating custom notifications with Cloud Monitoring and Cloud Run</title><link>https://cloud.google.com/blog/products/operations/write-and-deploy-cloud-monitoring-alert-notifications-to-third-party-services/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;The uniqueness of each organization in the enterprise IT space creates interesting challenges in how they need to handle alerts. 
With many commercial tools in the IT Service Management (ITSM) market, and lots of custom internal tools, we equip teams with tools that are both flexible and powerful.&lt;/p&gt;&lt;p&gt;This post is for Google Cloud customers who want to deliver &lt;a href="https://cloud.google.com/monitoring/alerts"&gt;Cloud Monitoring alert notifications&lt;/a&gt; to third-party services that don’t have &lt;a href="https://cloud.google.com/monitoring/support/notification-options"&gt;supported notification channels&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;It provides a working implementation that integrates Cloud Pub/Sub notification channels with the &lt;a href="https://developers.google.com/chat/quickstart/incoming-bot-python" target="_blank"&gt;Google Chat&lt;/a&gt; service to forward alert notifications to Google Chat rooms, and demonstrates how this is deployed on Google Cloud. Moreover, it outlines steps for continuous integration using &lt;a href="https://cloud.google.com/cloud-build"&gt;Cloud Build&lt;/a&gt;, &lt;a href="https://cloud.google.com/docs/terraform"&gt;Terraform&lt;/a&gt;, and GitHub. All the source code for this project can be found in &lt;a href="https://github.com/googlecloudplatform/cloud-alerting-notification-forwarding" target="_blank"&gt;this GitHub repository&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;It is worth noting that the tutorial provides a generic framework that can be adapted by Google Cloud customers to deliver alert notifications to any third-party service that provides a webhook/HTTP API interface. &lt;/p&gt;&lt;p&gt;Instructions for how to modify the sample code to integrate with other third-party services are provided in the section &lt;a href="https://docs.google.com/document/d/1Zqq6368E9gGAhPzwZJNtgdGnJcMN0GjBBqviTrMkqQI/edit?resourcekey=0-N6Y-cAc5TAHPbGeeA4ycgA#bookmark=id.7wkuzsqr0kki" target="_blank"&gt;“Extending to other 3rd-party services“&lt;/a&gt;. 
&lt;/p&gt;&lt;h2&gt;Objectives&lt;/h2&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Write a service to forward Google Cloud Monitoring alert notifications from Cloud Monitoring Pub/Sub notification channels to a third-party service.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Build and deploy the service to Cloud Run using Cloud Build, Terraform, and GitHub.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h2&gt;Costs&lt;/h2&gt;&lt;p&gt;This tutorial uses billable components of Google Cloud:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Cloud Build&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Compute Engine (GCE)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Container Registry&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Cloud Pub/Sub&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Cloud Run&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Cloud Storage&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Use the &lt;a href="https://cloud.google.com/products/calculator"&gt;pricing calculator&lt;/a&gt; to generate a cost estimate based on your projected usage.&lt;/p&gt;&lt;h2&gt;Before you begin&lt;/h2&gt;&lt;p&gt;For this tutorial, you need a Google Cloud &lt;a href="https://cloud.google.com/resource-manager/docs/cloud-platform-resource-hierarchy#projects"&gt;project&lt;/a&gt;. You can create a new project or select a project that you've already created:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Select or create a Google Cloud project.&lt;br/&gt;&lt;a href="https://console.cloud.google.com/projectselector2/home/dashboard" target="_blank"&gt;Go to the project selector page&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Enable billing for your project.&lt;br/&gt;&lt;a href="https://support.google.com/cloud/answer/6293499#enable-billing" target="_blank"&gt;Enable billing&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;When you finish this tutorial, you can avoid continued billing by deleting the resources you created. 
For details, see the "Cleaning up" section at the end of this tutorial.&lt;/p&gt;&lt;h2&gt;Integration with Google Chat&lt;/h2&gt;&lt;p&gt;This tutorial provides a sample integration to enable Google Cloud customers to forward alert notifications to their Google Chat rooms. The system architecture is as follows:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/Creating_custom_notificat.0400033308000317.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Creating_custom_notifications_with_Cloud_M.max-1000x1000.jpg"
        
          alt="Creating custom notifications with Cloud Monitoring and Cloud Run- fastlane blog post.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;In the example, two monitoring alerting policies are created using Terraform: one is based on the GCE instance CPU &lt;a href="https://cloud.google.com/monitoring/api/metrics_gcp#gcp-compute"&gt;usage_time&lt;/a&gt; metric and the other is based on the GCE instance disk &lt;a href="https://cloud.google.com/monitoring/api/metrics_gcp#gcp-compute"&gt;read_bytes_count&lt;/a&gt; metric. Both alert policies use Cloud Monitoring Pub/Sub notification channels to send alert notifications. A &lt;a href="https://cloud.google.com/pubsub/docs/push"&gt;Cloud Pub/Sub push subscription&lt;/a&gt; is configured for each Cloud Pub/Sub notification channel. The push endpoints of the Cloud Pub/Sub push subscriptions are pointed to the Cloud Run service we implement so that all the alert notifications sent to the Cloud Pub/Sub notification channels are forwarded to the Cloud Run service. The Cloud Run service is a simple Http server that transforms the incoming Cloud Pub/Sub messages into Google Chat messages and sends them to the configured Google Chat rooms via their &lt;a href="https://developers.google.com/chat/how-tos/webhooks" target="_blank"&gt;incoming Webhook URLs&lt;/a&gt;. &lt;/p&gt;&lt;p&gt;All the infrastructure components are automatically created and configured using Terraform, which include:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Cloud Pub/Sub topics, push subscriptions, and service account setup.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Cloud Pub/Sub notification channels&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Cloud Monitoring Alerting policies&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Cloud Run service and service account setup. 
&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The Terraform code can be found at &lt;a href="https://github.com/GoogleCloudPlatform/cloud-alerting-notification-forwarding/tree/main/tf-modules" target="_blank"&gt;./tf-modules&lt;/a&gt; and &lt;a href="https://github.com/GoogleCloudPlatform/cloud-alerting-notification-forwarding/tree/main/environments" target="_blank"&gt;./environments&lt;/a&gt;.&lt;/p&gt;&lt;h3&gt;Looking at the Cloud Run code&lt;/h3&gt;&lt;p&gt;The Cloud Run service is responsible for delivering the Cloud Pub/Sub alert notifications to the configured Google Chat rooms. The integration code is located in the &lt;a href="https://github.com/GoogleCloudPlatform/cloud-alerting-notification-forwarding/tree/main/notification_integration" target="_blank"&gt;&lt;code&gt;./notification_integration&lt;/code&gt;&lt;/a&gt; folder.&lt;/p&gt;&lt;p&gt;In this example, a basic Flask HTTP server is set up in main.py to handle incoming Cloud Monitoring alert notifications from Cloud Monitoring Pub/Sub channels. We use Cloud Pub/Sub push subscriptions to forward the Pub/Sub notification messages to the Flask server in real time. More information on Cloud Pub/Sub subscriptions can be found in the &lt;a href="https://cloud.google.com/pubsub/docs/subscriber#push_pull"&gt;Subscriber overview&lt;/a&gt;. &lt;/p&gt;&lt;p&gt;Below is a handler that processes the Pub/Sub message:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
&lt;dd&gt;&lt;pre&gt;@app.route('/&amp;lt;config_id&amp;gt;', methods=['POST'])
def handle_pubsub_message(config_id):
    try:
        config_param = config_params_server.GetConfig(config_id)
    except BaseException as e:
        err_msg = 'Failed to get config parameters for {}: {}'.format(config_id, e)
        logging.error(err_msg)
        return (f'500: {err_msg}', 200)
    if 'service_name' not in config_param:
        err_msg = '"service_name" not found in the config parameters: {}'.format(config_id)
        logging.error(err_msg)
        return (f'500: {err_msg}', 200)
    if config_param['service_name'] not in service_names_to_handlers:
        err_msg = 'No handler found for the service {}'.format(config_param['service_name'])
        logging.error(err_msg)
        return (f'500: {err_msg}', 200)

    handler = service_names_to_handlers[config_param['service_name']]

    # Parse the raw Pub/Sub message to get the notification.
    pubsub_received_message = request.get_json()
    try:
        notification = pubsub.ExtractNotificationFromPubSubMsg(pubsub_received_message)
        response, status_code = handler.SendNotification(config_param, notification)
        logging.info(f'Notification was sent with the status code = {status_code}: {response}')
        return (f'{status_code}: {response}', 200)
    except pubsub.DataParseError as e:
        logging.error(f'Pubsub message parse error: {e}')
        return (f'400: {e}', 200)
    except BaseException as e:
        logging.error(f'Unexpected error when processing Pubsub message: {e}')
        return (f'400: {e}', 200)&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;The handler calls the &lt;code&gt;ExtractNotificationFromPubSubMsg()&lt;/code&gt; function in &lt;code&gt;utilities/pubsub.py&lt;/code&gt; to parse the relevant notification data from the Pub/Sub message, and then loads the notification data into a dictionary. The output is a json object with the schema defined &lt;a href="https://cloud.google.com/monitoring/support/notification-options#webhooks"&gt;here&lt;/a&gt;. &lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
&lt;dd&gt;&lt;pre&gt;def ExtractNotificationFromPubSubMsg(pubsub_msg: Dict[Text, Any]) -&amp;gt; Dict[Text, Any]:
    """Parses a notification message from Pub/Sub.

    Args:
        pubsub_msg: Dictionary containing the Pub/Sub message.
            The message data itself should be a base64-encoded string.
    Returns:
        The decoded 'data' value of the provided Pub/Sub message, returned
        as a JSON dictionary.
    Raises:
        DataParseError: If the data cannot be parsed.
    """
    try:
        data_base64_string = pubsub_msg['message']['data']
    except (KeyError, TypeError) as e:
        raise DataParseError('invalid Pub/Sub message format') from e

    try:
        data_bytes = base64.b64decode(data_base64_string)
    except (binascii.Error, ValueError) as e:
        raise DataParseError('data should be base64-encoded') from e
    except TypeError as e:
        raise DataParseError('data should be in a string format') from e
    data_string = data_bytes.decode('utf-8').strip()

    try:
        data_json = json.loads(data_string)
    except json.JSONDecodeError as e:
        raise DataParseError('data can not be loaded as a json object: {}'.format(e), 400)

    return data_json&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
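To see what this parsing undoes, here is a small standalone sketch of the wrapping applied by a Pub/Sub push delivery: the notification JSON travels base64-encoded under `message.data`. The incident payload and subscription name below are invented for illustration.

```python
import base64
import json

# Made-up notification payload for illustration.
notification = {"incident": {"policy_name": "cpu-usage-policy", "state": "open"}}

# Shape of a Pub/Sub push delivery body: the payload is base64-encoded
# under message.data (messageId and subscription values are placeholders).
push_body = {
    "message": {
        "data": base64.b64encode(json.dumps(notification).encode("utf-8")).decode("ascii"),
        "messageId": "1234567890",
    },
    "subscription": "projects/my-project/subscriptions/tf-topic-cpu-sub",
}

# Reverse the wrapping, as the extraction function above does.
decoded = json.loads(base64.b64decode(push_body["message"]["data"]).decode("utf-8"))
print(decoded["incident"]["policy_name"])
```

The round trip recovers the original notification dictionary exactly.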
&lt;div class="block-paragraph"&gt;&lt;p&gt;This notification dictionary is then passed to &lt;code&gt;SendNotification()&lt;/code&gt; which sends the notification along with config_params to the &lt;code&gt;_SendHttpRequest()&lt;/code&gt;, in &lt;code&gt;utilities/service_handler.py&lt;/code&gt;, which appropriately notifies the third-party service about the alert with an API client. There is a URL parameter “&lt;code&gt;config_id&lt;/code&gt;”, which is the configuration ID used by the Cloud Run service to retrieve the configuration data “config_params”. “Config_params” includes all the needed parameters (e.g. HTTP URL and user credentials) for the Cloud Run service to forward the incoming notification to the third-party service. In this example, “&lt;code&gt;config_id&lt;/code&gt;” corresponds to the Pub/Sub topics defined &lt;a href="https://github.com/GoogleCloudPlatform/cloud-alerting-notification-forwarding/blob/main/environments/main/main.tf#L17" target="_blank"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;You can modify this dispatch function to forward alerts to any third-party service.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
&lt;dd&gt;&lt;pre&gt;def _SendHttpRequest(self, config_params: Dict[str, Any],
                     notification: Dict[Any, Any]) -&amp;gt; Tuple[httplib2.Response, Text]:
    """Sends a notification to a 3rd-party service via an HTTP request."""
    http_url = self._GetHttpUrl(config_params, notification)
    messages_headers = self._BuildHttpRequestHeaders(config_params, notification)
    message_body = self._BuildHttpRequestBody(config_params, notification)
    http_obj = httplib2.Http()
    # content is a bytes object.
    http_response, content = http_obj.request(
        uri=http_url,
        method=self._http_method,
        headers=messages_headers,
        body=message_body,
    )
    return http_response, content.decode('utf-8')&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
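For the Google Chat case, the HTTP request that `_SendHttpRequest()` ends up issuing is a POST of a JSON body to the room's incoming webhook URL; the simplest valid body is a `text` message. Here is a minimal sketch using only the standard library; the webhook URL is a placeholder, and the request is built but never actually sent.

```python
import json
import urllib.request

# Placeholder: substitute your room's incoming webhook URL.
WEBHOOK_URL = "https://chat.googleapis.com/v1/spaces/AAAA/messages?key=KEY&token=TOKEN"

def build_chat_request(webhook_url, text):
    """Build (but do not send) a POST request for a Google Chat webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json; charset=UTF-8"},
        method="POST",
    )

req = build_chat_request(WEBHOOK_URL, "Alert: CPU usage policy fired")
print(req.get_method(), json.loads(req.data)["text"])
```

Sending it would be a `urllib.request.urlopen(req)` call; the card-formatted messages used in the sample follow the same pattern with a richer JSON body.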
&lt;div class="block-paragraph"&gt;&lt;p&gt;Remember to acknowledge the Pub/Sub message on success by returning a success HTTP status code (&lt;code&gt;200&lt;/code&gt; or &lt;code&gt;204&lt;/code&gt;). See &lt;a href="https://cloud.google.com/pubsub/docs/push#receiving_push_messages"&gt;Receiving push messages&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;All the logs written in the Cloud Run service can be easily accessed either from the &lt;a href="https://cloud.google.com/logging/docs/view/logs-viewer-interface"&gt;Cloud Logging Logs Explorer&lt;/a&gt; or the Cloud Run UI. The logs are very useful for debugging the Cloud Run service. Moreover, users can create an extra pull subscription of the Pub/Sub topic used by the Cloud Pub/Sub notification channel to simplify the triage of notification delivery issues. For example, if some alert notifications were not delivered to users’ Google Chat room, users could first check if the pull subscription received the Cloud Pub/Sub messages of the missing alert notifications. If the pull subscription correctly received the missing alert notifications, then it means the alert notifications got lost in the Cloud Run service. Otherwise, it was the Cloud Pub/Sub notification channel issue.  &lt;/p&gt;&lt;p&gt;Finally, there is a Dockerfile containing instructions to build an image that hosts the Flask server when deployed to Cloud Run:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
&lt;dd&gt;&lt;pre&gt;# [START run_pubsub_dockerfile]

# Use the official Python image.
# https://hub.docker.com/_/python
FROM python:3.8

# Allow statements and log messages to immediately appear in the Cloud Run logs.
ENV PYTHONUNBUFFERED True

# Copy application dependency manifests to the container image.
# Copying this separately prevents re-running pip install on every code change.
COPY requirements.txt ./

# Install production dependencies.
RUN pip install -r requirements.txt

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./
ARG PROJECT_ID
ENV PROJECT_ID=$PROJECT_ID
# Flag to control what type of config server to use: in-memory or gcs.
ARG CONFIG_SERVER_TYPE
ENV CONFIG_SERVER_TYPE=$CONFIG_SERVER_TYPE

# Run the web service on container startup.
# Use the gunicorn webserver with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the number of cores available.
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app

# [END run_pubsub_dockerfile]&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h2&gt;Deploying the app&lt;/h2&gt;&lt;p&gt;This section describes how to deploy and set up continuous integration using Cloud Build, Terraform, and GitHub, following the GitOps methodology. The instructions are based on &lt;a href="https://cloud.google.com/solutions/managing-infrastructure-as-code"&gt;Managing infrastructure as code with Terraform, Cloud Build, and GitOps&lt;/a&gt;, which also explains the GitOps methodology and architecture. Sections from the guide are also referenced in the steps below. An important difference is that this document assumes that separate Google Cloud projects are used for the dev and prod environments, whereas the referenced guide configures the environments as virtual private clouds (VPCs). As a result, the following deployment steps (with the exception of “Setting up your GitHub repository”) need to be executed for each of the dev and prod projects.&lt;/p&gt;&lt;h3&gt;Set up your GitHub repository&lt;/h3&gt;&lt;p&gt;To get all the code and understand the repository structure needed to deploy your app, follow the steps in &lt;a href="https://cloud.google.com/solutions/managing-infrastructure-as-code#setting_up_your_github_repository"&gt;Setting up your GitHub repository&lt;/a&gt;. &lt;/p&gt;&lt;h2&gt;Deploy the Google Chat integration&lt;/h2&gt;&lt;h2&gt;Setting up webhook urls&lt;/h2&gt;&lt;h3&gt;Hardcoded Webhook URLs&lt;/h3&gt;&lt;p&gt;We provided within &lt;code&gt;main.py&lt;/code&gt; a &lt;code&gt;config_map&lt;/code&gt; variable to store your webhook urls. You’ll first need to locate your Google Chat &lt;a href="https://developers.google.com/chat/how-tos/webhooks#define_an_incoming_webhook" target="_blank"&gt;webhook url&lt;/a&gt; and replace the value for the key ‘&lt;code&gt;webhook_url&lt;/code&gt;’ within the &lt;code&gt;config_map&lt;/code&gt; dictionary.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
&lt;dd&gt;&lt;pre&gt;config_map = {
    'tf-topic-cpu': {
        'service_name': 'google_chat',
        'msg_format': 'card',
        'webhook_url': '&amp;lt;YOUR_GOOGLE_CHAT_ROOM_WEBHOOK_URL&amp;gt;'},
    'tf-topic-disk': {
        'service_name': 'google_chat',
        'msg_format': 'card',
        'webhook_url': '&amp;lt;YOUR_GOOGLE_CHAT_ROOM_WEBHOOK_URL&amp;gt;'}
}&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
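&lt;div class="block-paragraph"&gt;&lt;p&gt;As an illustration of how the service can consume this configuration, here is a minimal, hypothetical sketch (not the repository’s actual code) that looks up a topic’s entry in &lt;code&gt;config_map&lt;/code&gt; and fails fast if the webhook URL placeholder was never replaced:&lt;/p&gt;&lt;/div&gt;

```python
# Hypothetical helper, for illustration only: the real main.py in the
# sample repository may structure this differently.
PLACEHOLDER = "YOUR_GOOGLE_CHAT_ROOM_WEBHOOK_URL"

config_map = {
    "tf-topic-cpu": {
        "service_name": "google_chat",
        "msg_format": "card",
        "webhook_url": "https://chat.googleapis.com/v1/spaces/EXAMPLE/messages",
    },
}

def get_webhook_url(config_map, topic):
    """Return the webhook URL configured for a Pub/Sub topic.

    Raises ValueError when the topic is unknown or the placeholder
    value was never replaced with a real Google Chat webhook URL.
    """
    config = config_map.get(topic)
    if config is None:
        raise ValueError("no notification config for topic %r" % topic)
    url = config.get("webhook_url", "")
    if not url or PLACEHOLDER in url:
        raise ValueError("webhook_url for topic %r has not been set" % topic)
    return url
```

&lt;div class="block-paragraph"&gt;&lt;p&gt;Validating the configuration up front like this surfaces a forgotten placeholder at startup rather than as a failed HTTP call at notification time.&lt;/p&gt;&lt;/div&gt;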
&lt;div class="block-paragraph"&gt;&lt;h3&gt;Manual GCS Bucket Webhook URLs&lt;/h3&gt;&lt;p&gt;Alternatively if you’d like to have a more secure option to store your webhook urls, you can create a GCS bucket to store your webhook urls.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Locate and store your Google Chat &lt;a href="https://developers.google.com/chat/how-tos/webhooks#define_an_incoming_webhook" target="_blank"&gt;webhook url&lt;/a&gt; for your gchat rooms in a json file named &lt;code&gt;config_params.json&lt;/code&gt; in the format of:&lt;/p&gt;&lt;/li&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;{“topic”: “webhook url”, “topic”: “webhook url”}&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/storage/docs/creating-buckets"&gt;Create a Cloud Storage bucket&lt;/a&gt; to store the json file with the name &lt;code&gt;gcs_config_bucket_{PROJECT_ID}&lt;/code&gt;. &lt;/p&gt;&lt;/li&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;You can also run this command in the cloud console: &lt;code&gt;gsutil mb gs://gcs_config_bucket_{PROJECT_ID}&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;li&gt;Grant the read permissions (Storage Legacy Bucket Reader and Storage Legacy Object Reader) to the default Cloud Run service account &lt;code&gt;&amp;lt;PROJECT_NUMBER&amp;gt;&lt;/code&gt;&lt;a href="mailto:-compute@developer.gserviceaccount.com"&gt;&lt;code&gt;-compute@developer.gserviceaccount.com&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;To deploy the notification channel integration sample for the first time automatically, we’ve provided a script deploy.py that will handle a majority of the required actions for deployment. After completing the webhook url step above run the following command:&lt;/p&gt;&lt;p&gt;&lt;code&gt;Python3 deploy.py -p &amp;lt;PROJECT_ID&amp;gt;&lt;/code&gt;&lt;/p&gt;&lt;h3&gt;Manual Deployment&lt;/h3&gt;&lt;p&gt;To deploy the notification channel integration manually, you’ll have to complete the following steps:&lt;/p&gt;&lt;p&gt;1. 
Set the Cloud Platform project in Cloud Shell, replacing &amp;lt;PROJECT_ID&amp;gt; with your Cloud Platform project ID:&lt;br/&gt;&lt;code&gt;gcloud config set project &amp;lt;PROJECT_ID&amp;gt;&lt;/code&gt;&lt;/p&gt;&lt;p&gt;2. Enable the Cloud Build service:&lt;br/&gt;&lt;code&gt;gcloud services enable cloudbuild.googleapis.com&lt;/code&gt;&lt;/p&gt;&lt;p&gt;3. Enable the Cloud Resource Manager service:&lt;br/&gt;&lt;code&gt;gcloud services enable cloudresourcemanager.googleapis.com&lt;/code&gt;&lt;/p&gt;&lt;p&gt;4. Enable the Service Usage service:&lt;br/&gt;&lt;code&gt;gcloud services enable serviceusage.googleapis.com&lt;/code&gt;&lt;/p&gt;&lt;p&gt;5. Grant the required permissions to your Cloud Build service account. Note that &lt;code&gt;PROJECT_ID&lt;/code&gt; is set first, because the commands that follow depend on it:&lt;br/&gt;&lt;code&gt;PROJECT_ID=$(gcloud config get-value project)&lt;/code&gt;&lt;br/&gt;&lt;code&gt;CLOUDBUILD_SA="$(gcloud projects describe $PROJECT_ID --format 'value(projectNumber)')@cloudbuild.gserviceaccount.com"&lt;/code&gt;&lt;br/&gt;&lt;br/&gt;&lt;code&gt;gcloud projects add-iam-policy-binding $PROJECT_ID --member serviceAccount:$CLOUDBUILD_SA --role roles/iam.securityAdmin&lt;/code&gt;&lt;br/&gt;&lt;br/&gt;&lt;code&gt;gcloud projects add-iam-policy-binding $PROJECT_ID --member serviceAccount:$CLOUDBUILD_SA --role roles/run.admin&lt;/code&gt;&lt;br/&gt;&lt;br/&gt;&lt;code&gt;gcloud projects add-iam-policy-binding $PROJECT_ID --member serviceAccount:$CLOUDBUILD_SA --role roles/editor&lt;/code&gt;&lt;/p&gt;&lt;p&gt;6. Create a Cloud Storage bucket to store Terraform state remotely:&lt;br/&gt;&lt;code&gt;gsutil mb gs://${PROJECT_ID}-tfstate&lt;/code&gt;&lt;/p&gt;&lt;p&gt;7. (Optional) Enable Object Versioning to keep the history of your deployments:&lt;br/&gt;&lt;code&gt;gsutil versioning set on gs://${PROJECT_ID}-tfstate&lt;/code&gt;&lt;/p&gt;&lt;p&gt;8. 
Trigger a build and deploy to Cloud Run:&lt;br/&gt;If you used the in-memory config server, run the following, replacing &amp;lt;BRANCH&amp;gt; with the current environment branch:&lt;br/&gt;&lt;code&gt;gcloud builds submit . --config cloudbuild.yaml --substitutions BRANCH_NAME=&amp;lt;BRANCH&amp;gt;,_CONFIG_SERVER_TYPE=in-memory&lt;/code&gt;&lt;/p&gt;&lt;p&gt;If you used the GCS-based config server, run:&lt;br/&gt;&lt;code&gt;gcloud builds submit . --config cloudbuild.yaml --substitutions BRANCH_NAME=&amp;lt;BRANCH&amp;gt;,_CONFIG_SERVER_TYPE=gcs&lt;/code&gt;&lt;/p&gt;&lt;h2&gt;Continuous Deployment setup&lt;/h2&gt;&lt;p&gt;This optional section describes how to set up continuous deployment using Cloud Build &lt;a href="https://cloud.google.com/build/docs/automating-builds/create-manage-triggers"&gt;triggers&lt;/a&gt;. The flow is shown in the following diagram: every time a user pushes a new version to the Git repository, the Cloud Build trigger fires, and Cloud Build runs the YAML file to rebuild the Cloud Run Docker image, update the infrastructure setup, and redeploy the Cloud Run service.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/Creating_custom_notifications_with_Cloud_M.max-2800x2800_rnh25rO.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Creating_custom_notifications_with_Cloud_M.max-1000x1000_gobFWDM.jpg"
        
          alt="Creating custom notifications with Cloud Monitoring and Cloud Run- fastlane blog post (1).jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;The instructions are based on &lt;a href="https://cloud.google.com/source-repositories/docs/integrating-with-cloud-build"&gt;Automating builds with Cloud Build&lt;/a&gt;. &lt;/p&gt;&lt;p&gt;Set up a code repository, this could be GitHub, Google Cloud Source repository or any private repository.  &lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Clone the repository from our GitHub.  &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Switch to the new project and push the cloned repository to the remote repository.&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud init &amp;amp;&amp;amp; gconfig –global credential.https//source.developers.google.com.helper gcloud.sh \r\n\r\ngit remote add &amp;lt;connection name&amp;gt; &amp;lt;repo url&amp;gt;\r\n\r\ngit push &amp;lt;connection name&amp;gt;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f2e26144be0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/cloud_source_repositories.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/cloud_source_repositories.max-1000x1000.jpg"
        
          alt="cloud source repositories.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;Next we create a new trigger in Cloud Build. &lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Step 1: &lt;a href="https://screenshot.googleplex.com/7VdXghFD93biGVr" target="_blank"&gt;Go to Cloud Build and Click “Triggers”&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Step 2: &lt;a href="https://screenshot.googleplex.com/5g8B95FA2HZu3pm" target="_blank"&gt;Click “Create Trigger”&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Step 3: &lt;a href="https://screenshot.googleplex.com/C7DXNq7xGFB9ekb" target="_blank"&gt;Select “Push to a branch” and set up the repository and branch you want to use, don’t forget to add the cloud run YAML file in the branch.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h2&gt;Cleaning up&lt;/h2&gt;&lt;p&gt;If you created a new project for this tutorial, delete the project. If you used an existing project and wish to keep it without the changes added in this tutorial, delete resources created for the tutorial.&lt;/p&gt;&lt;h3&gt;Delete the project&lt;/h3&gt;&lt;p&gt;The easiest way to eliminate billing is to delete the project you created for the tutorial.&lt;/p&gt;&lt;p&gt;Deleting a project has the following effects:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Everything in the project is deleted. If you used an existing project for this tutorial, when you delete it, you also delete any other work you've done in the project.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. 
To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;If you plan to explore multiple tutorials and quickstarts, reusing projects can help you avoid exceeding project quota limits.&lt;/p&gt;&lt;p&gt;To delete a project, do the following:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;In the Cloud Console, go to the &lt;a href="https://console.cloud.google.com/iam-admin/projects"&gt;Manage resources page&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;In the project list, select the project that you want to delete and then click Delete.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;In the dialog, type the project ID and then click Shut down to delete the project.&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;h3&gt;Delete tutorial resources&lt;/h3&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Delete the Cloud resources provisioned by Terraform:&lt;br/&gt;&lt;code&gt;terraform destroy&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/storage/docs/deleting-buckets"&gt;Delete the Cloud Storage bucket&lt;/a&gt; called &lt;code&gt;{PROJECT_ID}-tfstate&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Delete the permissions that were granted to the Cloud Build service account:&lt;br/&gt;&lt;code&gt;gcloud projects remove-iam-policy-binding $PROJECT_ID --member serviceAccount:$CLOUDBUILD_SA --role roles/iam.securityAdmin&lt;/code&gt;&lt;br/&gt;&lt;code&gt;gcloud projects remove-iam-policy-binding $PROJECT_ID --member serviceAccount:$CLOUDBUILD_SA --role roles/run.admin&lt;/code&gt;&lt;br/&gt;&lt;code&gt;gcloud projects remove-iam-policy-binding $PROJECT_ID --member serviceAccount:$CLOUDBUILD_SA --role roles/storage.admin&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Delete the permission for the service account to publish to tf-topic:&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;gcloud pubsub topics 
remove-iam-policy-binding projects/[PROJECT_NUMBER]/topics/tf-topic --role=roles/pubsub.publisher --member=serviceAccount:service-[PROJECT_NUMBER]@gcp-sa-monitoring-notification.iam.gserviceaccount.com&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/monitoring/support/notification-options#editing_and_deleting_channels"&gt;Delete the notification channel&lt;/a&gt; that uses tf-topic.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.github.com/en/github/administering-a-repository/deleting-a-repository" target="_blank"&gt;Delete your forked GitHub repository&lt;/a&gt; notification_integration.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Disconnect the GitHub repository from Cloud Build by &lt;a href="https://cloud.google.com/cloud-build/docs/automating-builds/create-manage-triggers#deleting_a_build_trigger"&gt;deleting the Cloud Build triggers&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/service-usage/docs/enable-disable#disabling"&gt;Disable Google Cloud APIs&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;h2&gt;Expanding to other 3rd-party services&lt;/h2&gt;&lt;p&gt;The sample code in the tutorial provides a generic framework that Google Cloud customers can easily customize to deliver alert notifications to any 3rd-party service that provides a webhook/HTTP API interface.&lt;/p&gt;&lt;p&gt;To integrate with a new 3rd-party service, create a new derived class of the abstract class &lt;code&gt;HttpRequestBasedHandler&lt;/code&gt; defined in &lt;code&gt;./notification_channel/service_handler.py&lt;/code&gt; and implement the following member functions:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;code&gt;CheckConfigParams()&lt;/code&gt;: checks that a given integration configuration is valid, e.g. that a required API key is given.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;code&gt;_GetHttpUrl()&lt;/code&gt;: gets the HTTP URL (where to send HTTP requests) from the configuration data.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;code&gt;_BuildHttpRequestHeaders()&lt;/code&gt;: constructs the HTTP request headers.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;code&gt;_BuildHttpRequestBody()&lt;/code&gt;: constructs the HTTP request message body based on the incoming Cloud Pub/Sub message.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;code&gt;SendNotification()&lt;/code&gt;: you can reuse the one defined in the &lt;code&gt;GchatHandler&lt;/code&gt; class.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;There is no need to update the Terraform code, except to customize your alert policies. Community feedback is always welcome; please submit pull requests to continue building the &lt;a href="https://github.com/googlecloudplatform/cloud-alerting-notification-forwarding" target="_blank"&gt;GitHub repository&lt;/a&gt; together.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 19 Jan 2022 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/write-and-deploy-cloud-monitoring-alert-notifications-to-third-party-services/</guid><category>Compute</category><category>Open Source</category><category>Management Tools</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Creating custom notifications with Cloud Monitoring and Cloud Run</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/write-and-deploy-cloud-monitoring-alert-notifications-to-third-party-services/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Dong Wang</name><title>Software 
Engineer</title><department></department><company></company></author></item><item><title>Webhook, Pub/Sub, and Slack Alerting notification channels launched</title><link>https://cloud.google.com/blog/products/operations/pub-sub-webook-and-slack-notifications-are-now-available/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;When an alert fires from your applications, your team needs to know as soon as possible to mitigate any user-facing issues. Customers with complex operating environments rely on incident management or related services to organize and coordinate their responses to issues. They need the flexibility to route alert notifications to platforms or services in the formats those platforms can accept.&lt;/p&gt;&lt;p&gt;We’re excited to share that Google Cloud Monitoring’s Webhooks, Pub/Sub, and Slack notification channels for alerting are now Generally Available (GA). Along with our existing notification channels of email, SMS, mobile, and PagerDuty (currently in Beta), Google Cloud alerts can now be routed to many widely used services. These new notification channels can be used to integrate alerts with the most popular collaboration, ITSM, and incident management tools, and virtually any other service or software that supports webhooks or Pub/Sub integration.&lt;/p&gt;&lt;p&gt;You can configure your Google Cloud alerts to be sent to any vendor or custom-built tool used by your team. For example, your GKE cluster uptime checks can send the alert data to a third-party communication tool via the Pub/Sub notification channel. Or if you’re tracking security concerns such as unexpected IP addresses, you can send a &lt;a href="https://cloud.google.com/logging/docs/alerting/log-based-alerts"&gt;log-based alert&lt;/a&gt; to your incident management provider. 
&lt;/p&gt;&lt;h3&gt;How to Configure Webhook, Pub/Sub, or Slack Notifications&lt;/h3&gt;&lt;p&gt;For custom integrations, &lt;a href="https://cloud.google.com/monitoring/support/notification-options#pubsub"&gt;Pub/Sub&lt;/a&gt; is the recommended approach for sending notifications to a private network. &lt;a href="https://cloud.google.com/monitoring/support/notification-options#webhooks"&gt;Webhooks&lt;/a&gt; are supported for public endpoints and are available with basic and token authentication. Both of these notification channels can be enabled programmatically through an automation tool like &lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_notification_channel" target="_blank"&gt;Terraform&lt;/a&gt;.  &lt;/p&gt;&lt;p&gt;If you’re using Slack, you can enable Cloud Monitoring access to your Slack channel/workspace and then create the notification channel. If you'd like to automate Slack channel notification deployments, you'll need to &lt;a href="https://api.slack.com/authentication/basics" target="_blank"&gt;create and install your own Slack app&lt;/a&gt; and reuse the OAuth token instead of using the Google Cloud Monitoring app.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/cloud_console.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/cloud_console.max-1000x1000.jpg"
        
          alt="cloud console.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
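&lt;div class="block-paragraph"&gt;&lt;p&gt;To make the Pub/Sub channel concrete: a custom receiver on a push subscription gets each alert as a standard Pub/Sub push envelope whose message data field is base64-encoded JSON. The sketch below is illustrative only; the envelope layout is standard Pub/Sub push format, but the incident fields read out of the decoded payload are assumptions, so consult the Cloud Monitoring notification documentation for the authoritative schema:&lt;/p&gt;&lt;/div&gt;

```python
import base64
import json

def parse_alert_push(envelope):
    """Decode a Pub/Sub push envelope carrying an alert notification.

    The {"message": {"data": ...}} layout is the standard Pub/Sub push
    format; the "incident" keys below are illustrative, not a complete
    or guaranteed schema.
    """
    raw = base64.b64decode(envelope["message"]["data"])
    payload = json.loads(raw.decode("utf-8"))
    incident = payload.get("incident", {})
    return {
        "policy": incident.get("policy_name"),
        "state": incident.get("state"),
    }

# Simulated push delivery, standing in for what Pub/Sub would POST:
sample = {"incident": {"policy_name": "cpu-high", "state": "open"}}
envelope = {
    "message": {
        "data": base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
    }
}
```

&lt;div class="block-paragraph"&gt;&lt;p&gt;A Cloud Run service, for example, could apply this decoding in its HTTP handler before forwarding the alert to a downstream tool.&lt;/p&gt;&lt;/div&gt;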
&lt;div class="block-paragraph"&gt;&lt;h3&gt;What’s Next &lt;/h3&gt;If you’d like to learn more, check out our &lt;a href="https://cloud.google.com/blog/products/operations/write-and-deploy-cloud-monitoring-alert-notifications-to-third-party-services"&gt;example tutorial blog&lt;/a&gt; on how to send pub/sub notifications to external vendors using Cloud Run and Cloud Build. Please feel free to share your comments and feedback with us in the &lt;a href="https://www.googlecloudcommunity.com/gc/Google-Cloud-s-operations-suite/bd-p/cloud-operations" target="_blank"&gt;Google Cloud Community&lt;/a&gt;.  &lt;p&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 19 Jan 2022 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/pub-sub-webook-and-slack-notifications-are-now-available/</guid><category>Events</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Webhook, Pub/Sub, and Slack Alerting notification channels launched</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/pub-sub-webook-and-slack-notifications-are-now-available/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Alisa Goldstein</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Kyle Benson</name><title>Product Manager, Cloud Ops</title><department></department><company></company></author></item><item><title>Patterns for better insights and troubleshooting with hybrid cloud logs</title><link>https://cloud.google.com/blog/products/operations/get-better-hybrid-and-multicloud-log-insights/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;Hybrid and multi-cloud environments produce a boundless array of logs including application and server logs, logs related to cloud services, APIs, orchestrators, 
gateways and just about anything else running in the environment. Due to this high volume, logging systems can become slow and unmanageable just when you urgently need them to troubleshoot an issue, and even harder to use for insights.&lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/products/operations"&gt;Google Cloud's operations suite&lt;/a&gt; plays a vital role in any application modernization framework and is essential to ensuring a reliable and secure application, providing monitoring, logging and alerting, the baseline of SRE and Google’s holistic approach to operations.&lt;/p&gt;&lt;p&gt;As customer engineers, we see many organizations with hybrid and multi-cloud applications that want to integrate &lt;b&gt;logs&lt;/b&gt; and &lt;b&gt;metrics&lt;/b&gt; from various sources into a single console. &lt;b&gt;Metrics&lt;/b&gt; for all critical services can be collected not just for daily operations but also to measure internal and external SLIs, SLOs and SLAs of modern applications.&lt;/p&gt;&lt;h3&gt;Improving customer operations&lt;/h3&gt;&lt;p&gt;We recently worked with two customers, a large media-processing company and a large telecom provider, that have applications running on Google Cloud, other clouds and on-premises. Each customer faced issues with their logs:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Customer 1, a large media-processing company: their existing logging environment was too slow to effectively support real-time troubleshooting using system and application logs from across all their environments; and&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Customer 2, a large telecom provider: they did not feel they were getting valuable insights from the many terabytes of network logs they were ingesting per day. These insights did not need to be real-time.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Both customers used self-managed, popular open source products for ingestion, storage and retrieval. 
But as the volume of their logs grew, the cost of their infrastructure and operational overhead to support logs rose too. They shared other characteristics:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Both setups required SSD storage as well as large VMs for ingestion pipelines, storage and retrieval. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Both faced risk due to dependencies on a single resource for elements of their log management.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;And finally, both had data pipeline queues piling up, which for one of them meant they were not getting the logs when they needed them most, when application/infrastructure failures impacted the SLA of their products and services. &lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;They needed to reduce costs, reduce failures and determine the volume of logs they need to ingest and store to get analytical insights from them. We looked at different patterns and proposed the following two options:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Customer 1: Route system and application logs from their hybrid and multi-cloud services to Cloud Logging for &lt;b&gt;real-time&lt;/b&gt; troubleshooting at scale, reducing cost and operational burden.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Customer 2: Route their network logs directly to BigQuery, so they could manage costs and get better insights from their data.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Let’s take a deeper look at the two patterns:&lt;/p&gt;&lt;h3&gt;Pattern 1: Cloud Logging for resource management and troubleshooting &lt;/h3&gt;&lt;p&gt;This pattern fits well in the scenarios where the primary objective is to troubleshoot based on real time logs. Here the main focus is on important logs &amp;amp; metrics, while logs that are not needed for real-time troubleshooting are sampled and filtered as needed. Customer 1 had a large volume of logs, so scale and timely ingestion were critical for troubleshooting problems. 
To help them meet their objective, we proposed the following pattern:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/Pattern_1.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Pattern_1.max-1000x1000.jpg"
        
          alt="Pattern 1.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;In this pattern, the customer uses &lt;a href="https://cloud.google.com/logging"&gt;Cloud Logging&lt;/a&gt; to collect the logs from Google Cloud, other clouds and VMs. Google Cloud resources such as Google Kubernetes Engine (GKE), Compute Engine with the Ops Agent, and Cloud Storage automatically send logs to Cloud Logging, while logging agents such as &lt;a href="https://cloud.google.com/logging/docs/agent/logging/configuration#third-party_application_log_input_configuration"&gt;fluentd&lt;/a&gt; and &lt;a href="https://github.com/observIQ/stanza-plugins" target="_blank"&gt;stanza&lt;/a&gt; brings in application and system logs from other sources such as other clouds and on-prem systems (Additionally, partner tools such as &lt;a href="https://cloud.google.com/blog/products/management-tools/extending-stackdriver-logging-across-clouds-and-providers-with-new-bindplane-integration"&gt;BlueMedora&lt;/a&gt; can be used to bring logs from a wide range of other sources including Azure Kubernetes Service).&lt;/p&gt;&lt;p&gt;Logging agents collect and send the logs using Cloud Logging API to the &lt;a href="https://cloud.google.com/logging/docs/routing/overview#log-router"&gt;Logs Router&lt;/a&gt;, where you can apply filters to capture only the important logs. Our customer’s hybrid deployment was generating 20 TB of logs every day, so we identified logs which were not important for real-time troubleshooting (in their case detailed network logs and debug logs) and applied filters at both the agent and Logs Router level. Further, we advised them that they could export their network logs to BigQuery or other third party tools for future analysis for deriving patterns and insights.&lt;/p&gt;&lt;p&gt;This pattern suited the customer’s requirements as their use case was primarily focused on troubleshooting and detailed network logs were only needed for analytics. 
It had the following advantages:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Fully managed, so they could focus on app development&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Cost-effective with the combination of Cloud Logging and BigQuery&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Provided the basis for future self-healing operations with logs-based metrics&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Pattern 2: BigQuery for log analytics in hybrid and multi-cloud scenarios&lt;/h3&gt;&lt;p&gt;This pattern fits well in Customer 2’s scenario, where log volumes are high and logs are leveraged primarily for analytics. In this customer’s case, they wanted deeper insights into their network logs.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/Pattern_2.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Pattern_2.max-1000x1000.jpg"
        
          alt="Pattern 2.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;This pattern is recommended for customers that need to run models for anomaly detection, pattern recognition, etc., specifically related to their network logs. Network logs tend to be high volume, and since these logs were primarily used for analytics, the customer did not really benefit from capabilities like sorting, filtering, dashboarding, etc. This is where &lt;a href="https://cloud.google.com/bigquery"&gt;BigQuery&lt;/a&gt; was useful, with low costs, built-in AI/ML, and global scale. Google's fully managed data warehouse and analytical engine performed queries on terabytes of logs in tens of seconds making it ideal for the customer’s advanced analytics needs.&lt;/p&gt;&lt;p&gt;In this pattern, the Log Collector agent uses an output &lt;a href="https://docs.fluentd.org/output" target="_blank"&gt;plugin&lt;/a&gt; that configures &lt;a href="https://www.fluentd.org/plugins#google-cloud-platform" target="_blank"&gt;BigQuery as a destination&lt;/a&gt; for storing the logs collected from hybrid cloud, other cloud providers and from Google Cloud. Using the plugin, the customer can directly load logs into BigQuery in near-real-time from many servers. BigQuery automatically creates the schema for incoming logs, giving the customer more control over defining the schema and data format for the metrics. Once the logs are in BigQuery, the customer can make use of &lt;a href="https://cloud.google.com/looker"&gt;Looker&lt;/a&gt; to perform basic RegEx and build machine learning models on top of it. The customer can also visualize their log data by creating a dashboard that's updated frequently using &lt;a href="https://datastudio.google.com/" target="_blank"&gt;Data Studio&lt;/a&gt;, Looker or any visualization tool. 
&lt;/p&gt;&lt;p&gt;This pattern provides the following advantages:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;A cost-effective solution for high-volume networking logs&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;A built-in analytics engine for log insights&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Use of the BigQuery API to display data on a dashboard and consume analytical insights&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Final Thoughts&lt;/h3&gt;&lt;p&gt;Google Cloud’s operations suite is easy to use right out of the box for Google Cloud users, but as we demonstrate for our customers every day, it supports a variety of other use cases. As shown in the two patterns above, if you need to troubleshoot with one subset of your logs and need analytical capabilities for another, you can use the Cloud Logging API to send the first subset to Cloud Logging and route the other directly to BigQuery.&lt;/p&gt;&lt;p&gt;Furthermore, there is ongoing work to simplify many of these tasks. For example, we recently released the preview of &lt;a href="https://cloud.google.com/logging/docs/log-analytics"&gt;Log Analytics&lt;/a&gt;, which automatically imports logs from Google Cloud services into BigQuery and gives you the ability to analyze that data directly from the Cloud Logging interface.&lt;/p&gt;&lt;h3&gt;Get started today&lt;/h3&gt;&lt;p&gt;If you want to achieve the maximum benefit from your logs without the cost and operational overhead, set up time with your account team today or &lt;a href="https://cloud.google.com/contact"&gt;contact our sales team&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;If you have any questions or topics that you want to discuss with the operations community at Google Cloud, please visit our &lt;a href="https://www.googlecloudcommunity.com/gc/Google-Cloud-s-operations-suite/bd-p/cloud-operations" target="_blank"&gt;Google Cloud Community site&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;
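&lt;div class="block-paragraph"&gt;&lt;p&gt;As a small illustration of the kind of preprocessing this pattern can involve, the sketch below flattens a nested log record into the flat, column-style keys that a tabular sink such as BigQuery typically stores. This is a generic illustration, not code from either customer’s pipeline, and BigQuery itself can also handle nested records natively:&lt;/p&gt;&lt;/div&gt;

```python
def flatten_record(record, parent_key="", sep="_"):
    """Flatten a nested log record into single-level column-style keys.

    For example, {"jsonPayload": {"src": {"ip": "10.0.0.1"}}} becomes
    {"jsonPayload_src_ip": "10.0.0.1"}, which maps naturally onto a
    tabular schema in a data warehouse.
    """
    flat = {}
    for key, value in record.items():
        # Build the dotted-style column name for this field.
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested structures, accumulating the prefix.
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat
```

&lt;div class="block-paragraph"&gt;&lt;p&gt;A transform like this would typically run in the collection pipeline, before rows are loaded into the warehouse.&lt;/p&gt;&lt;/div&gt;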
&lt;div class="block-related_article_tout_external"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href=""
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Tue, 18 Jan 2022 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/get-better-hybrid-and-multicloud-log-insights/</guid><category>Data Analytics</category><category>Hybrid &amp; Multicloud</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Patterns for better insights and troubleshooting with hybrid cloud logs</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/get-better-hybrid-and-multicloud-log-insights/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Meenaxi Gunjati</name><title>Customer Engineer</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Tinu Anand Chandrasekar</name><title>Customer Engineer, Google Cloud</title><department></department><company></company></author></item><item><title>How to deploy the Google Cloud Ops Agent with Ansible</title><link>https://cloud.google.com/blog/products/operations/use-ansible-to-deploy-the-google-cloud-ops-agent/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;Site Reliability Engineering (SRE) and Operations teams responsible for operating virtual machines (VMs) are always looking for ways to provide a more reliable, more scalable environment for their development partners. Part of providing that stable experience is having telemetry data (metrics, logs and traces) from systems and applications so you can monitor and troubleshoot effectively. Many Google Cloud services, including Google Compute Engine, provide basic system metrics out of the box. 
However, if you want in-depth VM metrics or application telemetry, you need to install the Google Cloud Ops Agent.&lt;/p&gt;&lt;p&gt;At Cloud Ops we make it easy to install the Ops Agent from our UI on one or a handful of VMs, but installing, configuring, and managing an agent on a fleet of VMs, especially when many are hosting production workloads at an enterprise organization, can be incredibly taxing. There are many configuration and provisioning tools, and often a great deal of complexity. In that vein, we at Cloud Operations want to meet our users where they are in their process of digital transformation. That’s why we’ve introduced support for the &lt;a href="https://cloud.google.com/monitoring/agent/monitoring/fleet-installation"&gt;most common automation tools&lt;/a&gt; in the configuration and provisioning space to deploy the &lt;a href="https://cloud.google.com/blog/products/operations/ops-agent-now-ga-and-it-includes-opentelemetry"&gt;Cloud Ops Agent&lt;/a&gt;. This lets our users prioritize automation as a way to reduce operational toil so they can focus on building and managing reliable and highly performant infrastructure.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/agent_installation.max.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/agent_installation.max.max-1000x1000.jpg"
        
          alt="agent installation.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;Today we’ll be taking a look at how to deploy the Cloud Ops agent in an automated fashion across a fleet of VMs, and in this example we’ll use Ansible. Ansible is a popular open source configuration management tool that provides a lightweight way to get started automating your infrastructure. We’ll also look at a more advanced example, using some templating tools available to streamline your automation code. But first let's talk a little about what Ansible is, and how it works.&lt;/p&gt;&lt;h3&gt;What is Ansible, and how does it work?&lt;/h3&gt;&lt;p&gt;&lt;a href="https://www.ansible.com/" target="_blank"&gt;Ansible&lt;/a&gt; is an open source tool written in Python which provides an agentless framework for connecting and interacting with machines. To do this it leverages the native connection protocols for Linux and Windows, SSH and Powershell respectively. The key benefit of using existing connection protocols is that it helps to reduce overhead on the systems, while benefiting from the security of these longstanding and heavily adopted protocols. When working with Ansible, one of the simplest units of work is a playbook:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;---\r\n- name: Sample playbook\r\n  hosts: localhost\r\n  tasks:\r\n    - ansible.builtin.debug:\r\n        msg: &amp;quot;Hello World!&amp;quot;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f2e278a28e0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
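To run a playbook like this against remote machines instead of localhost, you would point Ansible at an inventory and set `hosts:` accordingly. A minimal sketch, where the file name and hostnames are placeholders rather than anything from this post:

```yaml
# inventory.yaml: hypothetical inventory of SSH-reachable VMs
all:
  hosts:
    vm-1.example.internal:
    vm-2.example.internal:
```

You could then execute it with `ansible-playbook -i inventory.yaml playbook.yaml`.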
&lt;div class="block-paragraph"&gt;&lt;p&gt;This really simple playbook runs against your localhost, and executes a task essentially equivalent to echoing “Hello World!”&lt;/p&gt;&lt;h3&gt;Deploying the Ops Agent to monitor and troubleshoot VMs&lt;/h3&gt;&lt;p&gt;The new Google Cloud Ops Agent makes it really easy to immediately start collecting telemetry data from your systems at a high level. By simply installing the agent we can immediately ingest standard system logs and additional telemetry about the system beyond the defaults, including running processes.&lt;/p&gt;&lt;h3&gt;Adding workload specifics to your configuration&lt;/h3&gt;&lt;p&gt;Now let’s take a look at a more complex example, like a playbook that will deploy Nginx and a custom configuration for the Ops Agent to collect telemetry.&lt;/p&gt;&lt;p&gt;Here’s what the simple custom configuration file looks like for the Ops Agent, to collect default metrics and logs from Nginx, also written in YAML format:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;logging:\r\n  receivers:\r\n    nginx_default_access:\r\n      type: nginx_access\r\n    nginx_default_error:\r\n      type: nginx_error\r\n  service:\r\n    pipelines:\r\n      nginx:\r\n        receivers:\r\n          - nginx_default_access\r\n          - nginx_default_error\r\nmetrics:\r\n  receivers:\r\n    nginx_metrics:\r\n      type: nginx\r\n      stub_status_url: http://127.0.0.1:80/status\r\n      collection_interval: 60s\r\n  service:\r\n    pipelines:\r\n      nginx_pipeline:\r\n        receivers:\r\n          - nginx_metrics&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f2e278a2070&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
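On a single machine, applying a configuration like this by hand would look roughly like the commands below; the Ops Agent reads its user configuration from `/etc/google-cloud-ops-agent/config.yaml` and picks up changes on restart. The playbook that follows automates the equivalent steps across the fleet.

```shell
# Copy the custom Ops Agent config into place, then restart the agent
# so it picks up the new Nginx logging and metrics pipelines.
sudo cp ops_agent.yaml /etc/google-cloud-ops-agent/config.yaml
sudo systemctl restart google-cloud-ops-agent
```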
&lt;div class="block-paragraph"&gt;&lt;p&gt;And here’s a playbook, specifying the custom `ops_agent.yaml` configuration file in the role: &lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;---\r\n- name: Deploy and configure Cloud Ops Agent\r\n  hosts: all\r\n  become: true\r\n  roles:\r\n    - role: googlecloudplatform.google_cloud_ops_agents\r\n      vars:\r\n        agent_type: ops-agent\r\n        version: 1.0.1\r\n        main_config_file: ops_agent.yaml\r\n     notify:\r\n        - Restart Ops Agent\r\n\r\n  tasks:\r\n    - name: Install nginx\r\n      ansible.builtin.package: \r\n        name: nginx\r\n        state: present\r\n\r\n    - name: Customize nginx config for telemetry\r\n      ansible.builtin.template:\r\n        src: ansible_templates/status.conf\r\n        dest: /etc/nginx/conf.d/status.conf\r\n      notify:\r\n        - Restart Nginx\r\n\r\n\r\n    - name: Start nginx\r\n      ansible.builtin.service:\r\n        name: nginx\r\n        state: started\r\n        enabled: yes\r\n\r\n    - name: Start Ops Agent\r\n      ansible.builtin.service:\r\n        name: google-cloud-ops-agent\r\n        state: started\r\n        enabled: yes\r\n        \r\n  handlers:\r\n    - name: Restart Nginx\r\n      ansible.builtin.service:\r\n        name: nginx\r\n        state: restarted\r\n        enabled: yes\r\n\r\n    - name: Restart Ops Agent\r\n      ansible.builtin.service:\r\n        name: google-cloud-ops-agent\r\n        state: restarted\r\n        enabled: yes&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f2e278a29d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
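The playbook templates a `status.conf` into Nginx’s conf.d directory so that the `stub_status_url` in the agent configuration has an endpoint to scrape. The original template isn’t included in the post; a plausible sketch, assuming the agent scrapes locally on port 80:

```nginx
# Hypothetical status.conf: expose Nginx stub_status at /status,
# restricted to local requests from the Ops Agent.
server {
    listen 80;
    server_name 127.0.0.1;
    location /status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
```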
&lt;div class="block-paragraph"&gt;&lt;p&gt;After running this playbook we should have successfully installed NGINX in all hosts within our inventory, and should be submitting both metrics and data from Nginx! To copy the example playbook check out this &lt;a href="https://gist.github.com/kyleabenson/4f218e74f98d9fef01ca6166de9c9033" target="_blank"&gt;GitHub sample&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;Now it’s time to visualize some of this information! We provide an out of the box dashboard for Nginx, that you can import like so:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/nginx_demo.gif"
        
          alt="nginx_demo.gif"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;And that’s it! Now we can see the metrics we’ve been collecting from Nginx with the Cloud Ops Agent&lt;/p&gt;&lt;h3&gt;Get started today&lt;/h3&gt;&lt;p&gt;Whether you are managing a handful of VMs or an entire fleet, ensuring robust observability data is available from systems and applications is key to effective monitoring and troubleshooting. With the VM Instances dashboard in Cloud Monitoring, Agent Policies, or use of open source tooling such as Ansible, Chef, Puppet and Terraform, you have many options to install agents on your Google Cloud VMs. The &lt;a href="https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent"&gt;Ops Agent&lt;/a&gt; helps you gather data to keep your infrastructure and applications performing their very best, and automating the deployment makes day to day management all that much easier. &lt;/p&gt;&lt;p&gt;If you’d like to watch a video where I walk through these steps, &lt;a href="https://www.youtube.com/watch?v=GQgNygd-XJU&amp;amp;feature=youtu.be" target="_blank"&gt;check out our YouTube video&lt;/a&gt; that demonstrates this blog post, and see the rest of our &lt;a href="https://www.youtube.com/playlist?list=PLBgogxgQVM9uB-hc8aFYedHrXGf688N9O" target="_blank"&gt;O11y In Depth playlist&lt;/a&gt;!&lt;/p&gt;&lt;p&gt;Or if you’d like to get started with a tutorial, you can also use our &lt;a href="https://github.com/GoogleCloudPlatform/google-cloud-ops-agents-ansible/tree/master/tutorial" target="_blank"&gt;Cloud Ops Agent tutorial for Ansible&lt;/a&gt; to walkthrough a simple deployment in Google Cloud Shell.&lt;/p&gt;&lt;p&gt;Lastly, if you have feedback or want to ask us questions, drop us a line on the &lt;a href="https://www.googlecloudcommunity.com/gc/Google-Cloud-s-operations-suite/bd-p/cloud-operations" target="_blank"&gt;Google Cloud Community Cloud Ops&lt;/a&gt; area!&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=GQgNygd-XJU"
      data-glue-modal-trigger="uni-modal-GQgNygd-XJU-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        &lt;img src="//img.youtube.com/vi/GQgNygd-XJU/maxresdefault.jpg"
             alt="Learn to use the new Ops Agent to isolate, troubleshoot, and resolve issues on third party applications. Get out-of-the-box support and dashboards for Apache, NGINX, MySQL, and more, to better understand what&amp;#x27;s happening with apps running on your VMs."/&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal-GQgNygd-XJU-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="GQgNygd-XJU"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=GQgNygd-XJU"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/operations/ops-agent-now-ga-and-it-includes-opentelemetry/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;The Ops Agent is now GA and it leverages OpenTelemetry&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;Today, we’re happy to announce the General Availability of the new Ops Agent, which replaces both the Logging and Monitoring agents and s...&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Wed, 12 Jan 2022 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/use-ansible-to-deploy-the-google-cloud-ops-agent/</guid><category>Compute</category><category>Open Source</category><category>Management Tools</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>How to deploy the Google Cloud Ops Agent with Ansible</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/use-ansible-to-deploy-the-google-cloud-ops-agent/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Kyle Benson</name><title>Product Manager, Cloud Ops</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Rahul Harpalani</name><title>Product Manager</title><department></department><company></company></author></item><item><title>How Sabre is using SRE to lead a successful digital transformation</title><link>https://cloud.google.com/blog/products/devops-sre/sabre-leverages-google-cloud-and-site-reliability-engineering/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;&lt;i&gt;&lt;b&gt;Editor’s note&lt;/b&gt;: Today we hear from Kenny Kon, an SRE Director at Sabre. Kenny shares about how they have been able to successfully adopt Google’s SRE framework by leveraging their partnership with Google Cloud. &lt;/i&gt;&lt;/p&gt;&lt;p&gt;As a leader in the travel industry, Sabre Corporation is driving innovation in the global travel industry and developing solutions that help airlines, hotels, and travel agencies transform the traveler experience and satisfy the ever-evolving needs of its customers. &lt;/p&gt;&lt;p&gt;In order to build these solutions, we joined forces with Google Cloud as our preferred cloud provider to accelerate our digital transformation. 
We chose Google because they understand our industry, as they also manage travel products such as Google Travel. Google also created &lt;a href="http://cloud.google.com/sre"&gt;SRE (Site Reliability Engineering)&lt;/a&gt; and operates with SRE principles at Google scale, which is what intrigued us the most.&lt;/p&gt;&lt;p&gt;Initially we started with a multi-cloud model, but that didn’t help us move faster, so we consolidated to just Google Cloud. To speed our transformation along, we have adopted Google SRE practices, which enable us to balance reliability and speed. We have been able to make this transformation with the direct help of Google Cloud’s &lt;a href="https://cloud.google.com/consulting"&gt;Professional Services Organization (PSO)&lt;/a&gt; along with Google Cloud’s tooling, like &lt;a href="https://cloud.google.com/monitoring"&gt;Cloud Monitoring&lt;/a&gt; and &lt;a href="https://cloud.google.com/logging"&gt;Cloud Logging&lt;/a&gt;, and by operating on &lt;a href="https://cloud.google.com/kubernetes-engine"&gt;Google Kubernetes Engine (GKE)&lt;/a&gt; and &lt;a href="https://cloud.google.com/spanner"&gt;Cloud Spanner&lt;/a&gt;. &lt;/p&gt;&lt;p&gt;In adopting SRE at Sabre, we’d like to highlight three key takeaways from the journey: &lt;/p&gt;&lt;h3&gt;1. Find colleagues who are also passionate about shifting culture and adopting SRE&lt;/h3&gt;&lt;p&gt;Create a community within your organization that is dedicated to the SRE journey and motivated to make things happen. As we adopted SRE at Sabre, I saw more and more people rallying together to support the culture change. With some momentum built, it was great to bring shared experiences to the team as we all spoke the same language about SLOs, SLIs, and how we measure things. &lt;/p&gt;&lt;p&gt;One of the ways we built our community was by hosting monthly brown-bag sessions. 
These are informal gatherings where teams come in and share their experiences and challenges, or teach specific SRE topics such as SLOs or toil. We also created a &lt;a href="https://gdg.community.dev/gdg-cloud-southlake/" target="_blank"&gt;public Google Developer Group (GDG)&lt;/a&gt; and have hosted several Google SRE subject matter experts to speak on SRE principles and best practices. &lt;/p&gt;&lt;h3&gt;2. Get your mid-level leadership stakeholders on board&lt;/h3&gt;&lt;p&gt;We know how important &lt;a href="https://cloud.google.com/blog/products/devops-sre/sre-success-starts-with-getting-leadership-on-board"&gt;getting leadership buy-in&lt;/a&gt; is to creating a successful SRE movement within an organization. Top-level buy-in is essential for securing resources and driving transformation across the organization, but what is sometimes missed is making it a priority to get mid-level leadership on board as well. It’s difficult to enact change from the ground up, starting with practitioners at the bottom, and it’s also difficult to rely on leadership buy-in alone, as things may fall apart once change reaches the middle. It is imperative to have mid-level leaders on board, as they directly affect the culture and decisions of their teams. To avoid resistance, it is also important that mid-level leadership (product, operations, and engineering managers, i.e. people managers) understand the motivations behind the change so they are on board. Without that understanding, mid-level leaders will struggle to communicate changes to the practitioner level, which can impact teams’ goals and allocated bandwidth.&lt;/p&gt;&lt;h3&gt;3. Don’t be afraid to get help from professionals&lt;/h3&gt;&lt;p&gt;Adopting SRE at a large organization is no simple feat. 
Partnering with &lt;a href="https://services.google.com/fh/files/misc/pso_sre_google_cloud.pdf" target="_blank"&gt;Google’s SRE consulting experts&lt;/a&gt; has brought about a huge shift at Sabre. The value PSO brings is not just training; it’s also listening. We’ve had experienced Googlers who understand our problems and have been at our stage in the SRE journey listen, analyze, and tailor the approach to our teams’ goals. PSO helped us shift our engineering teams to be more customer-centric and align our product, operations, and development teams. Most importantly, they’ve helped make our teams happier, because they’re not spinning their wheels waiting on blocked requests.&lt;/p&gt;&lt;p&gt;When we partnered with PSO, we knew who the key stakeholders in our organization were: mid-level leadership and people managers. We made sure to bring them into our PSO discussions and decision-making sessions, which helped us gain more traction and close the gap we had by enabling mid-level leaders and bringing them on board.&lt;/p&gt;&lt;p&gt;Some of the actions we have taken with help from our PSO SRE partners include adding a &lt;a href="https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started"&gt;tiers of service&lt;/a&gt; approach, improving incident management through &lt;a href="https://cloud.google.com/blog/products/management-tools/shrinking-the-time-to-mitigate-production-incidents"&gt;wheels of misfortune (WoM)&lt;/a&gt;, defining &lt;a href="https://cloud.google.com/blog/products/management-tools/practical-guide-to-setting-slos"&gt;critical user journeys (CUJs)&lt;/a&gt;, and implementing &lt;a href="https://cloud.google.com/blog/products/management-tools/sre-error-budgets-and-maintenance-windows"&gt;error budgets&lt;/a&gt;. &lt;/p&gt;&lt;p&gt;Since putting these SRE practices into place, our business is more aligned to customer experience. 
We now invest org resources according to the needs of our customers and with that have reduced silos across our teams. Our Ops team is much happier since they can move faster and not have to block requests. SRE has taught us a common language, a common framework. Moreover, it gives this whole discipline a culture and meaning.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/devops-sre/four-steps-to-jumpstarting-your-sre-practice/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;Four steps to jumpstarting your SRE practice&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;Once you have leadership buy-in, there are some things you can do to get the SRE ball rolling, fast.&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Mon, 22 Nov 2021 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/devops-sre/sabre-leverages-google-cloud-and-site-reliability-engineering/</guid><category>Cloud Operations</category><category>Customers</category><category>DevOps &amp; SRE</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>How Sabre is using SRE to lead a successful digital transformation</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/devops-sre/sabre-leverages-google-cloud-and-site-reliability-engineering/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Kenny Kon</name><title>SRE Director at Sabre</title><department></department><company></company></author></item><item><title>Get planet-scale monitoring with Managed Service for Prometheus</title><link>https://cloud.google.com/blog/products/operations/introducing-google-cloud-managed-service-for-prometheus/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;&lt;i&gt;&lt;b&gt;Editor’s note, &lt;/b&gt;&lt;/i&gt;&lt;i&gt;&lt;b&gt;March 01, 2022:&lt;/b&gt; &lt;/i&gt;&lt;i&gt;We are pleased to announce that Managed Service for Prometheus has now reached General Availability! For more information, see our &lt;a href="https://cloud.google.com/blog/products/devops-sre/easy-managed-prometheus-metrics-service-for-kubernetes"&gt;announcement blog&lt;/a&gt; or visit the &lt;a href="https://cloud.google.com/managed-prometheus"&gt;website&lt;/a&gt;.&lt;/i&gt;&lt;/p&gt;&lt;hr/&gt;&lt;p&gt;&lt;a href="http://prometheus.io" target="_blank"&gt;Prometheus&lt;/a&gt;, the de facto standard for Kubernetes monitoring, works well for many basic deployments, but managing Prometheus infrastructure can become challenging at scale. 
As Kubernetes deployments continue to play a &lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/building-the-future-with-google-kubernetes-engine"&gt;bigger role&lt;/a&gt; in enterprise IT, scaling Prometheus for a large number of metrics across a global footprint has become a pressing need for many organizations. Today, we’re excited to announce the public preview of &lt;a href="http://cloud.google.com/managed-prometheus"&gt;Google Cloud Managed Service for Prometheus&lt;/a&gt;, a new monitoring offering designed for scale and ease of use that maintains compatibility with the open-source Prometheus ecosystem. &lt;/p&gt;&lt;p&gt;With Google Cloud Managed Service for Prometheus, you have an alternative to taking on the toil of self-managing a Prometheus or &lt;a href="http://thanos.io" target="_blank"&gt;Thanos&lt;/a&gt; stack in perpetuity. Instead, you can get global and globally scalable monitoring with Prometheus interfaces, letting you keep open-source ecosystem compatibility and portability.&lt;/p&gt;&lt;h3&gt;Details about Google Cloud Managed Service for Prometheus, currently in preview&lt;/h3&gt;&lt;p&gt;Managed Service for Prometheus is Google Cloud's fully managed collection, storage, and query service for Prometheus metrics. It is built on top of Monarch, the same &lt;a href="https://research.google/pubs/pub50652/" target="_blank"&gt;globally scalable data store&lt;/a&gt; that powers all application monitoring at Google. Managed Service for Prometheus lets you monitor and alert on your Kubernetes deployments using Prometheus without having to manually manage and operate Prometheus infrastructure at scale. The service is built as a drop-in replacement for Prometheus and eliminates the need for self-managed solutions like Thanos or Cortex.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Google_Cloud_Managed_Service_for_Prometheu.max-1000x1000.jpg"
        
          alt="Google Cloud Managed Service for Prometheus.jpg"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;i&gt;Google Cloud Managed Service for Prometheus is built on Monarch, which powers all application monitoring at Google&lt;/i&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h3&gt;An easy to use service with open source interfaces&lt;/h3&gt;&lt;p&gt;As a drop-in replacement for Prometheus, Managed Service for Prometheus was designed to have easy onboarding by letting you reuse your existing Prometheus configs. You can also opt to deploy managed collectors for even greater simplicity. It has hybrid- and multi-cloud compatibility, meaning you can monitor any environment where Prometheus can run.&lt;/p&gt;&lt;p&gt;Looking beyond data collection, you can keep your existing dashboards in Grafana and PromQL-based rules and alerts without modifying any queries. This means you maintain portability with open source-compatible interfaces that proprietary managed solutions usually do not support.&lt;/p&gt;&lt;p&gt;Managed Service for Prometheus is built on the same global and globally scalable backend that Google uses for its own metrics. Collecting over 2 trillion active time series holding 65 quadrillion points, the service can support practically any metric volume that your business produces. The system supports ad-hoc global aggregations at query time over regionally-stored raw metric data. Plus, you get 2-year data retention by default, at no extra cost. &lt;/p&gt;&lt;p&gt;Being on the same backend as the rest of Google Cloud’s monitoring services means Managed Service for Prometheus is compatible with Cloud Monitoring. 
In fact, &lt;a href="https://cloud.google.com/monitoring/api/metrics_gcp?hl=en"&gt;free Google Cloud platform metrics&lt;/a&gt; can be queried alongside your Managed Service for Prometheus metrics within Cloud Monitoring.&lt;/p&gt;&lt;h3&gt;Customers already seeing success in preview&lt;/h3&gt;&lt;p&gt;When we first announced the preview at Next 2021, &lt;a href="https://youtu.be/7m3CzLULM-8?t=578" target="_blank"&gt;we were joined&lt;/a&gt; by Bartosz Jakubski from OpenX, who described how Managed Service for Prometheus was supporting the growth of their business by offloading the management of Prometheus and Thanos. &lt;/p&gt;&lt;p&gt;Horizon Blockchain Games is another customer previewing the service, and they’re already seeing benefits. “We have been running Prometheus ourselves for GKE metrics, but the ongoing maintenance took up too many development hours and we did not want to deal with scaling our metrics infrastructure with our growing business,” said Peter Kieltyka, CEO and Chief Architect. “We started using Google Cloud Managed Service for Prometheus and it just works. It can handle whatever volume we have because it’s built on the same backend that Google uses itself, and we get to keep using the same Grafana dashboards as before while keeping open standards and protocols.”&lt;/p&gt;&lt;h3&gt;How to get started&lt;/h3&gt;&lt;p&gt;Configuring ingestion for Managed Service for Prometheus is straightforward. Just swap out your existing Prometheus binaries for the Managed Service for Prometheus binary, and set up a new data source so you can configure Grafana to read your metrics. 
See Managed Service for Prometheus &lt;a href="https://cloud.google.com/stackdriver/docs/managed-prometheus"&gt;documentation&lt;/a&gt; to get started.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Mon, 15 Nov 2021 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/introducing-google-cloud-managed-service-for-prometheus/</guid><category>Containers &amp; Kubernetes</category><category>Open Source</category><category>Google Cloud</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Prometheus.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Get planet-scale monitoring with Managed Service for Prometheus</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Prometheus.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/introducing-google-cloud-managed-service-for-prometheus/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Lee Yanco</name><title>Senior Product Manager</title><department></department><company></company></author></item><item><title>Enabling SRE best practices: new contextual traces in Cloud Logging</title><link>https://cloud.google.com/blog/products/operations/faster-debugging-with-traces-and-logs-together/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;The need for relevant and contextual telemetry data to support online services has grown in the last decade as businesses undergo digital transformation. These data are typically the difference between proactively remediating application performance issues or &lt;a href="https://cloud.google.com/blog/products/application-development/show-me-the-money-how-you-can-see-returns-up-to-259m-with-a-devops-transformation"&gt;costly service downtime&lt;/a&gt;. 
Distributed tracing is a key capability for improving application performance and reliability, as noted in &lt;a href="https://sre.google/" target="_blank"&gt;SRE best practices&lt;/a&gt;. Today, we’re making it easier for you to understand what is happening within your applications by making trace information available directly in &lt;a href="http://cloud.google.com/logging"&gt;Cloud Logging&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;Tracing provides critical insight into the overall performance of applications running in a distributed architecture by stitching together relevant information about events from the propagated request’s point of view. These events are referred to as spans, and they are the building blocks for trace objects. &lt;/p&gt;&lt;h3&gt;Faster insights and correlations with logs and traces&lt;/h3&gt;&lt;p&gt;Distributed tracing offers the unique ability to reduce Mean Time To Repair (MTTR) by correlating log information with the sources of latency in a distributed system. This capability is especially critical when users have workloads running in, or interacting with, distributed compute environments like Google Kubernetes Engine (GKE). &lt;/p&gt;&lt;p&gt;When your applications are instrumented to generate &lt;a href="https://cloud.google.com/logging/docs/structured-logging?hl=hr"&gt;structured log outputs&lt;/a&gt; (especially with the &lt;a href="https://cloud.google.com/logging/docs/reference/libraries#using_the_client_library"&gt;Google Cloud Logging libraries&lt;/a&gt;) and to generate &lt;a href="https://cloud.google.com/trace/docs/setup#recommended_client_libraries"&gt;traces&lt;/a&gt; with &lt;a href="https://cloud.google.com/learn/what-is-opentelemetry"&gt;OpenTelemetry&lt;/a&gt;, trace facets will automatically appear on log lines within Cloud Logging. This makes it easy for you to quickly understand causally related events.&lt;/p&gt;&lt;p&gt;To illustrate this capability, consider the simplicity of troubleshooting the situation below. 
This is a Cloud Run instance making search invocations to a GKE cluster and then to a database layer managed by Cloud SQL. The application performing the invocation in Cloud Run is deployed using Go and the middle tier in GKE is deployed using Python (Flask).&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Cloud_Logging.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Cloud_Logging.max-1000x1000.jpg"
        
          alt="1 Cloud Logging.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
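The correlation shown above hinges on trace context propagating with each request. On Google Cloud, the trace ID arrives in the `X-Cloud-Trace-Context` header (format `TRACE_ID/SPAN_ID;o=OPTIONS`). A minimal sketch of deriving the trace name that Cloud Logging expects — the project ID and trace ID below are placeholders:

```python
# Sketch: derive the Cloud Logging trace field from the X-Cloud-Trace-Context
# header attached to incoming requests on Google Cloud.
# Header format (per Cloud Trace docs): "TRACE_ID/SPAN_ID;o=OPTIONS".

def trace_from_header(header: str, project_id: str) -> str:
    """Return the 'logging.googleapis.com/trace' value for a request."""
    trace_id = header.split("/", 1)[0]  # TRACE_ID precedes the first "/"
    return f"projects/{project_id}/traces/{trace_id}"

# Example with a made-up trace ID and project:
header = "105445aa7843bc8bf206b12000100000/1;o=1"
print(trace_from_header(header, "my-project"))
# -> projects/my-project/traces/105445aa7843bc8bf206b12000100000
```

Attaching this value to log entries is what lets Cloud Logging group logs from different services under the same trace.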
&lt;div class="block-paragraph"&gt;&lt;p&gt;In this example, a member of the support staff observes a notification in the activity log of their Google Cloud console, that their microservices-based application has slowed down considerably. One typical way of troubleshooting this is to dig through all the solutions logs for that timeframe to find the root cause. However, if the operations team has instrumented all workloads to generate traces, the application owners can use that information to narrow down which service is the source of the issue. After identifying the lagging service, they can follow up with the service owners to troubleshoot, drastically reducing MTTR.&lt;/p&gt;&lt;p&gt;The video capture below showcases the integration of trace information in the Logs Explorer of the Cloud Logging product:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_Cloud_Logging.gif"
        
          alt="2 Cloud Logging.gif"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h3&gt;How traces are generated in Google Cloud services&lt;/h3&gt;&lt;p&gt;To create the trace in the sample above, the Cloud Trace backend stitched together all the spans that were generated as the request propagated through the different Google Cloud services (Cloud Run, GKE and Cloud SQL). Then it surfaced that data into the Logs Explorer in Cloud Logging. A summary of how spans were created in each service is below: &lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Cloud Run: the generated spans are an out-of-the-box (OOTB) feature and are representative of the ingress and egress out of the preceding load balancers and Cloud Run compute instances. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;GKE pods: the Python Flask application generates spans as a result of the developer implementing the OpenTelemetry Flask Instrumentor into their application.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Cloud SQL: spans are generated automatically for its query execution time when the SQL statements are augmented with &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/introducing-sqlcommenter-open-source-orm-auto-instrumentation-library"&gt;Sqlcommenter&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;A sample of the resulting trace hierarchy embedded in the log line is shown below.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_Cloud_Logging.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_Cloud_Logging.max-1000x1000.jpg"
        
          alt="3 Cloud Logging.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h3&gt;Get started today&lt;/h3&gt;&lt;p&gt;To view traces in Cloud Logging, you need to first instrument your applications running on Google Cloud to generate &lt;a href="https://cloud.google.com/logging/docs/structured-logging?hl=hr"&gt;structured log outputs&lt;/a&gt; and &lt;a href="https://cloud.google.com/trace/docs/setup#recommended_client_libraries"&gt;traces&lt;/a&gt;. GKE will automatically capture logs written to stdout and stderr or you can use our &lt;a href="https://cloud.google.com/logging/docs/reference/libraries#using_the_client_library"&gt;Google Cloud Logging libraries&lt;/a&gt; to use the Cloud Logging API. To capture traces, we recommend instrumenting your applications with &lt;a href="https://cloud.google.com/learn/what-is-opentelemetry"&gt;OpenTelemetry&lt;/a&gt;. Check out this &lt;a href="https://codelabs.developers.google.com/codelabs/otel-cloudtrace#0" target="_blank"&gt;codelab&lt;/a&gt; to experience instrumenting an application with OpenTelemetry and sending the traces to Cloud Trace.&lt;/p&gt;&lt;p&gt;If you have questions or want to provide feedback, please visit our &lt;a href="https://www.googlecloudcommunity.com/gc/Google-Cloud-s-operations-suite/bd-p/cloud-operations" target="_blank"&gt;Google Cloud Community page&lt;/a&gt; and leave a comment.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/operations/opentelemetry-specification-enables-standardized-tracing/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('https://storage.googleapis.com/gweb-cloudblog-publish/images/opentelemetry.max-500x500.jpg')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;OpenTelemetry Trace 1.0 is now available&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;Google Cloud continues to invest in OpenTelemetry with many of our partners to provide standardized metrics, logs and traces for our users.&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Wed, 10 Nov 2021 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/faster-debugging-with-traces-and-logs-together/</guid><category>Containers &amp; Kubernetes</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Enabling SRE best practices: new contextual traces in Cloud Logging</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/faster-debugging-with-traces-and-logs-together/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Eyamba Ita</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yoshi Yamaguchi</name><title>Developer Advocate, Cloud Developer Relations</title><department></department><company></company></author></item><item><title>Google Cloud Monitoring 101: Understanding metric types</title><link>https://cloud.google.com/blog/products/operations/in-depth-explanation-of-operational-metrics-at-google-cloud/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;Whether you are moving your applications to the cloud or modernizing them using Kubernetes, observing cloud-based workloads is more challenging than observing traditional deployments.  When monitoring on-prem monoliths, operations teams had full visibility over the entire stack and full control over how/what telemetry data is collected (from infrastructure to platform to application data). 
In cloud-based applications, aggregating, integrating, and analyzing telemetry for full visibility is more complex because: &lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Data originates from more sources: in addition to telemetry data from your application/workload components, there is a need to integrate telemetry data from the cloud infrastructure (VMs, load balancers, etc.), cloud platform (Kubernetes, Docker, etc.), and cloud services (storage, databases, etc.). &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Systems have different levels of visibility: the variety of sources from which telemetry data is collected provides different mechanisms for data collection. Some expose this data through APIs; others require users to install agents. While some services push data to an endpoint, others require users to pull data from an endpoint. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The granularity of data collection, retention, and aggregation can differ: for some sources, data can be collected with fine granularity (seconds), while other sources still only expose data at coarser granularity (minutes). Some telemetry data is retained for short periods of time (days), while other data is retained for much longer periods (years).&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;In addition to these technical dimensions, the cost of running an observability solution can be a concern. Given that you have control over only some of the metric data you can collect, you may ask yourself: “What telemetry data do I have to pay for, and what telemetry data is available at no cost as a part of the service I am using?” &lt;/p&gt;&lt;p&gt;While observability encompasses different signals, including metrics, events, traces, and logs, this blog focuses on collecting metrics in Google Cloud. 
This blog will describe three aspects of metric collection in Google Cloud:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;The variety of types of metrics that are collected.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;How different types of metrics are collected.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Which metrics are chargeable and which are non-chargeable.&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;We will explore these aspects through three offerings: Google Compute Engine (GCE), Google Kubernetes Engine (GKE) and Google BigQuery (BQ). These three examples represent the different levels of control users have over visibility into Google Cloud services. &lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;GCE: almost full control over deploying agents to collect metrics &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;GKE: some control over deploying agents to collect metrics&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;BQ: no control over deploying agents to collect metrics&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;h3&gt;Types of Metrics&lt;/h3&gt;&lt;p&gt;There are many types of metrics collected across these three classes of services, but they can be broadly grouped into four categories: system metrics, agent metrics, user-defined metrics, and logs-based metrics.&lt;/p&gt;&lt;p&gt;&lt;b&gt;1. System metrics:&lt;br/&gt;&lt;/b&gt;System metrics are instrumented and collected by Google Cloud to provide visibility into how platform-managed services are behaving. You do not need to deploy an agent to collect them, and they are automatically sent to Google Cloud Monitoring.&lt;/p&gt;&lt;p&gt;Depending on the usage context, these metrics may be referred to by different names, but broadly speaking they are all grouped into system metrics. System metrics are also commonly called &lt;a href="https://cloud.google.com/monitoring/api/metrics_gcp"&gt;Google Cloud Metrics&lt;/a&gt;, GCP Metrics, “built-in” metrics, system-defined metrics, platform metrics, or infrastructure metrics. 
Different service types may also refer to them with different terms:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Infrastructure as a service (IaaS) users may refer to them as infrastructure metrics. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Platform as a service (PaaS)/containers as a service (CaaS) users may refer to them as platform metrics. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Software as a service (SaaS) users may refer to them as service metrics. &lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Regardless of the usage context, they are all “built-in metrics.” When the usage context is for a specific Google Cloud service, sometimes you also see them referred to as:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Kubernetes metrics&lt;/b&gt;: collected from GKE. Older versions of GKE called them container metrics. Metric names of this type are prefixed with kubernetes.io. These are resource metrics for containers, pods, and nodes in your Kubernetes cluster. &lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Anthos metrics&lt;/b&gt;: collected from Anthos on-prem and Anthos on bare metal. Metrics of this type are prefixed with kubernetes.io/anthos.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Istio metrics&lt;/b&gt;: collected from Istio on Google Kubernetes Engine. Metrics of this type are prefixed with istio.io. &lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Knative metrics&lt;/b&gt;: collected from Knative on Google Kubernetes Engine. Metrics of this type are prefixed with knative.dev.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;b&gt;2. Agent metrics: &lt;br/&gt;&lt;/b&gt;Agent metrics refer to a broad set of metric types. As the name suggests, Agent Metrics require you to install an agent (either the Cloud Monitoring agent or the unified &lt;a href="https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent"&gt;Ops Agent&lt;/a&gt;) for metric collection. 
Agent-based metrics do not require application developers to instrument metrics and are available as pre-packaged receivers/collectors that need to be configured in the agent. User-installed agents collect metrics of the following types:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Resource metrics&lt;/b&gt;: metrics about any resource, including compute, network, or storage for a virtual machine (VM). These resources could be Google Cloud managed (like a GCE VM), customer managed (e.g., an on-prem host), or a resource on a different cloud (e.g., an AWS VM).&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Process metrics&lt;/b&gt;: resource metrics provide high-level visibility (at the VM level, for example). Process metrics are fine-grained and include measurements such as CPU, memory, I/O, number of threads, and more for specific processes (like a data backup process) running in VMs.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Third-party metrics&lt;/b&gt;: metrics about any third-party or open source software running in a VM (GCE or somewhere else) or a container (Nginx, Kafka, MySQL, etc.). These metrics provide purpose-built visibility into the internal operations of these software components.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;b&gt;Note&lt;/b&gt;: the Ops Agent can also collect metrics about itself, prefixed by &lt;b&gt;&lt;i&gt;agent.googleapis.com/agent&lt;/i&gt;&lt;/b&gt;. Metrics collected by the agent about other software components are prefixed by &lt;b&gt;&lt;i&gt;agent.googleapis.com/&amp;lt;name of component&amp;gt;&lt;/i&gt;&lt;/b&gt;.&lt;/p&gt;&lt;p&gt;&lt;b&gt;3. User-defined metrics: &lt;br/&gt;&lt;/b&gt;User-defined metrics provide visibility into your deployed applications or workloads and are defined and instrumented by you. 
User-defined metrics can include:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Custom metrics&lt;/b&gt;: custom metrics can be ingested either by using the client libraries or the Cloud Monitoring API, or by deploying the Ops Agent to collect metrics and then ingest them into Cloud Monitoring. They are identified with the prefix custom.googleapis.com. &lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Workload metrics&lt;/b&gt;: workload metrics encompass a wide range of data produced by applications running on your resources. Whether these applications are monoliths, containers, or data processing ETL jobs, you have to instrument your code to generate metrics specifically relevant to the task at hand. These metrics capture something about the resources the workload is using (e.g., memory consumption by application objects, or the number of data records processed by a Spark job) or possibly some business metrics (number of users placing orders or total dollar volume processed). Workload metric names are identified with the prefix workload.googleapis.com. Again, depending on the context, workload metrics may also be referred to as “application metrics” or “job metrics.” &lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;External metrics&lt;/b&gt;: collected from open source or third-party applications. Metrics sent to Google Cloud projects with a metric type beginning with external.googleapis.com are known as external metrics. &lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Prometheus metrics&lt;/b&gt;: some Kubernetes users use Prometheus to monitor their Kubernetes environments, and propagate Prometheus metrics to Google Cloud Monitoring to take advantage of its rich capabilities. If you configure Prometheus with Cloud Monitoring, metrics exported by Prometheus are converted to Cloud Monitoring metric types.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;b&gt;4. 
Logs-based metrics: &lt;br/&gt;&lt;/b&gt;Logs-based metrics are generated from logs ingested into Cloud Logging. These metrics can be created either by counting log events that match a certain pattern or by extracting and aggregating the fields in specific log events. Logs-based metrics are then written to Cloud Monitoring and can be used for alerting, charting, and dashboarding. Logs-based metrics can be of two types: user-defined (where the definition is created by you) and system-defined (where the definition is available out of the box and cannot be modified).&lt;/p&gt;&lt;h3&gt;Metric Collection&lt;/h3&gt;&lt;p&gt;Metrics are collected from Google services and your applications running on Google services in several different ways. Let’s take a look at the different collection mechanisms without going into each specific metric. &lt;/p&gt;&lt;p&gt;&lt;b&gt;1. Collecting metrics from GCE:&lt;/b&gt;&lt;/p&gt;&lt;p&gt;GCE metrics are collected using two different mechanisms:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;As mentioned earlier, system metrics (or infrastructure metrics) for GCE do not require you to install any metric collection agents. Google automatically collects and pushes these metrics to your project (see Figure 1). System metrics are generally batched before ingestion into Cloud Monitoring.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The second mechanism for collecting GCE metrics is to install the Ops Agent or legacy Monitoring agent. Installing the agent gives you specific advantages in two areas:&lt;/p&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Collect system metrics at much finer granularity (less than one minute) and access process metrics for individual Linux or Windows processes.&lt;/li&gt;&lt;li&gt;As your developers deploy their applications in GCE VMs, they can also generate application metrics. 
The agent collects and loads metrics into your projects almost instantly for faster analysis, alerting and dashboarding.&lt;/li&gt;&lt;/ul&gt;&lt;/ol&gt;&lt;/div&gt;
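The metric-type prefixes called out in this section (kubernetes.io, agent.googleapis.com, custom.googleapis.com, and so on) double as a practical way to tell the categories apart. A small sketch — the category labels and example metric names are illustrative, and the mapping is deliberately partial:

```python
# Group metric types by the prefixes described in this article.
# More specific prefixes must come first (kubernetes.io/anthos is a
# distinct category from plain kubernetes.io).
PREFIX_CATEGORIES = {
    "kubernetes.io/anthos/": "system (Anthos)",
    "kubernetes.io/": "system (Kubernetes)",
    "istio.io/": "system (Istio)",
    "knative.dev/": "system (Knative)",
    "agent.googleapis.com/": "agent",
    "custom.googleapis.com/": "user-defined (custom)",
    "workload.googleapis.com/": "user-defined (workload)",
    "external.googleapis.com/": "user-defined (external)",
}

def categorize(metric_type: str) -> str:
    """Return the category for a metric type based on its prefix."""
    for prefix, category in PREFIX_CATEGORIES.items():
        if metric_type.startswith(prefix):
            return category
    return "other"  # partial mapping: anything unlisted lands here

print(categorize("kubernetes.io/container/cpu/core_usage_time"))  # system (Kubernetes)
print(categorize("custom.googleapis.com/orders_placed"))          # user-defined (custom)
```

The same prefixes can be used in Cloud Monitoring filter expressions to select whole families of metrics at once.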
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_GCE_Matric_data_collection.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_GCE_Matric_data_collection.max-1000x1000.jpg"
        
          alt="1 GCE Matric data collection.jpg"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;i&gt;Figure 1&lt;/i&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;&lt;b&gt;2. Collecting metrics from GKE (without Prometheus):&lt;br/&gt;&lt;/b&gt;GKE metrics are also collected using two different mechanisms when you are not using Prometheus. The metric collection scenario is a bit complex because a GKE cluster has some nodes that are user managed and others, like the control plane nodes, that are Google managed. Much like the GCE environment, system metrics are collected without deploying an agent.&lt;/p&gt;&lt;p&gt;For GKE nodes that are customer managed, metrics can be collected for different resources and workloads. For the GCE VM (underlying the Kubernetes nodes), system metrics are captured by Google collectors and sent to your Cloud project. For collecting platform metrics, Google collectors are automatically deployed in the GKE customer managed nodes when the Kubernetes cluster is created. These collectors gather Kubernetes metrics via the Kubelet and publish them to your project.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_GKE_metric_data_collection.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_GKE_metric_data_collection.max-1000x1000.jpg"
        
          alt="2 GKE metric data collection.jpg"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;i&gt;Figure 2&lt;/i&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;For the GKE control plane, the nodes are Google managed and metrics from these nodes are not published to Cloud Monitoring. However, these metrics are collected automatically and used for scaling and resource management decisions by the Kubernetes scheduler. Again, for collecting these metrics, the user does not need to deploy any agent software. These Google collectors are deployed and managed by Google when the clusters are created.&lt;/p&gt;&lt;p&gt;Lastly, your developers can collect Prometheus-compatible metrics emitted from workloads, such as CronJobs or Deployments, on GKE clusters using a &lt;a href="https://cloud.google.com/blog/products/operations/managed-metric-collection-for-google-kubernetes-engine"&gt;fully managed, configurable pipeline&lt;/a&gt; from Cloud Monitoring. Your developers configure which metrics to collect, and GKE does everything else.  &lt;/p&gt;&lt;p&gt;&lt;b&gt;3. Collecting metrics from GKE (with Prometheus):&lt;br/&gt;&lt;/b&gt;Prometheus is a popular choice for monitoring Kubernetes environments. Customers deploying their applications in GKE can continue to use Prometheus for monitoring. Metrics generated by services using the Prometheus exposition format can be exported from the cluster and made visible in Cloud Monitoring.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_GKE_metra_data_collection.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_GKE_metra_data_collection.max-1000x1000.jpg"
        
          alt="3 GKE metra data collection.jpg"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;i&gt;Figure 3&lt;/i&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;Collecting metrics from the control plane nodes in this use case is quite similar to the non-Prometheus use case. However, metric collection from the worker nodes is different. Prometheus is deployed on one of the Kubernetes worker nodes (Figure 3) and scrapes metrics from the pods and through the Kubelet API. An adapter collects the metrics from Prometheus and uploads them to Cloud Monitoring.&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;h3&gt;Chargeable and nonchargeable metrics&lt;/h3&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Operations teams are always concerned about IT costs and need general clarity on what is chargeable and what is not chargeable in their monitoring, logging, diagnostics, and troubleshooting tools. Visit the Google Cloud’s operations suite &lt;a href="https://cloud.google.com/stackdriver/pricing"&gt;pricing page&lt;/a&gt; for definitive and current guidance on what is chargeable and what is non chargeable, setting consumption alert thresholds and more. Of the four broad categories of metrics mentioned above, system metrics are non-chargeable. All other categories of metrics are chargeable. There are two exceptions to the above general statement. While Agent collected metrics are chargeable, metrics about the agent itself (in the &lt;b&gt;&lt;i&gt;agent.googleapis.com/agent&lt;/i&gt;&lt;/b&gt; namespace) are not chargeable. Similarly, while logs-based metrics are chargeable, system-defined logs-based metrics are not chargeable.&lt;/p&gt;&lt;p&gt;Always refer to the &lt;a href="https://cloud.google.com/stackdriver/pricing"&gt;pricing page&lt;/a&gt; for more details and latest updates.&lt;/p&gt;&lt;h3&gt;Summary&lt;/h3&gt;&lt;p&gt;Observability of cloud-based services and workloads requires an understanding of a diverse set of metrics that need to be collected and analyzed. These metrics are categorized into system metrics, agent metrics, user defined metrics, and logs-based metrics. 
This blog discussed metric types, the general architectures used for collecting these metrics and a brief summary on which of these metrics are chargeable metrics and which metrics are non chargeable metrics when using Google Cloud Monitoring.&lt;/p&gt;&lt;p&gt;If you have questions or feedback, please share it in the &lt;a href="https://www.googlecloudcommunity.com/gc/Google-Cloud-s-operations-suite/bd-p/cloud-operations" target="_blank"&gt;operations suite section&lt;/a&gt; of the Google Cloud Community.&lt;/p&gt;&lt;/div&gt;
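The chargeability rules described in this post can be condensed into a quick lookup. This is only an illustrative sketch of the rules as stated here (the category names are invented shorthand), not an official pricing tool; always consult the pricing page for current rules:

```shell
# Illustrative summary of the chargeability rules described in the post.
# Category names below are invented shorthand, not official identifiers.
is_chargeable() {
  case "$1" in
    system)            echo "no"  ;;   # system metrics: non-chargeable
    agent-self)        echo "no"  ;;   # agent.googleapis.com/agent namespace
    logs-based-system) echo "no"  ;;   # system-defined logs-based metrics
    agent|user-defined|logs-based) echo "yes" ;;
    *)                 echo "unknown" ;;
  esac
}

is_chargeable system        # -> no
is_chargeable agent         # -> yes
is_chargeable agent-self    # -> no
```

The two exception cases (agent self-metrics and system-defined logs-based metrics) are what make a simple "system vs. everything else" rule insufficient.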
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/operations/troubleshoot-gke-faster-with-monitoring-data-in-your-logs/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;Troubleshoot GKE apps faster with monitoring data in Cloud Logging&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;View contextual Monitoring data in your GKE log lines. Easily see the relevant pod, node and cluster events and metrics for your pod.&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Mon, 18 Oct 2021 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/in-depth-explanation-of-operational-metrics-at-google-cloud/</guid><category>Containers &amp; Kubernetes</category><category>GKE</category><category>Compute</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Google Cloud Monitoring 101: Understanding metric types</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/in-depth-explanation-of-operational-metrics-at-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Rakesh Dhoopar</name><title>Director, Outbound Product Management</title><department></department><company></company></author></item><item><title>Better Kubernetes application monitoring with GKE workload metrics</title><link>https://cloud.google.com/blog/products/operations/managed-metric-collection-for-google-kubernetes-engine/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;&lt;i&gt;&lt;b&gt;Editor’s note (12/15/21)&lt;/b&gt;: The date that we will begin charging for GKE workload metrics has been rescheduled from December 1, 2021 to February 1, 2022. Please see &lt;a href="https://cloud.google.com/monitoring/docs/release-notes#November_24_2021"&gt;this page&lt;/a&gt; for more information.&lt;br/&gt;&lt;/i&gt;&lt;/p&gt;&lt;hr/&gt;&lt;p&gt;The newly released &lt;a href="https://cloud.google.com/blog/products/devops-sre/announcing-dora-2021-accelerate-state-of-devops-report"&gt;2021 Accelerate State of DevOps Report&lt;/a&gt; found that teams who excel at modern operational practices are 1.4 times more likely to report greater software delivery and operational performance and 1.8 times more likely to report better business outcomes. 
A foundational element of modern operational practices is having monitoring tooling in place to track, analyze, and alert on important metrics. Today, we’re announcing a new capability that makes it easier than ever to monitor your Google Kubernetes Engine (GKE) deployments: GKE workload metrics.&lt;/p&gt;&lt;h3&gt;Introducing GKE workload metrics, currently in preview&lt;/h3&gt;&lt;p&gt;For applications running on GKE, we're excited to introduce the preview of &lt;a href="https://cloud.google.com/stackdriver/docs/solutions/gke/managing-metrics#workload-metrics"&gt;GKE workload metrics&lt;/a&gt;. This fully managed and highly configurable pipeline collects Prometheus-compatible metrics emitted by workloads running on GKE and sends them to &lt;a href="https://cloud.google.com/monitoring"&gt;Cloud Monitoring&lt;/a&gt;. GKE workload metrics simplifies the collection of metrics exposed by any GKE workload, such as a CronJob or a Deployment, so you don’t need to dedicate any time to the management of your metrics collection pipeline. Simply configure which metrics to collect, and GKE does everything else.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/gke_node.max-1000x1000.jpg"
        
          alt="gke node.jpg"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;Benefits of GKE workload metrics include:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Easy setup&lt;/b&gt;: With a single &lt;code&gt;kubectl apply&lt;/code&gt; command to deploy a PodMonitor custom resource, you can start collecting metrics. No manual installation of an agent is required.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Highly configurable&lt;/b&gt;: Adjust scrape endpoints, frequency and other parameters.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Fully managed&lt;/b&gt;: Google maintains the pipeline, lowering total cost of ownership.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Control cost&lt;/b&gt;s: Easily manage Cloud Monitoring costs through flexible metric filtering.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Open standard&lt;/b&gt;: Configure workload metrics using the PodMonitor custom resource, which is modeled after the Prometheus Operator’s PodMonitor resource.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;HPA support&lt;/b&gt;: Compatible with the Stackdriver Custom Metrics Adapter to &lt;a href="https://cloud.google.com/kubernetes-engine/docs/tutorials/workload-metrics-autoscaling"&gt;enable horizontal scaling on custom metrics&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Better pricing&lt;/b&gt;: More intuitive, predictable, and lower &lt;a href="https://cloud.google.com/stackdriver/pricing#monitoring-costs"&gt;pricing&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Autopilot support&lt;/b&gt;: GKE workload metrics is available for both GKE Standard and GKE Autopilot clusters.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Customers are already seeing the benefits of this simplified model.&lt;/p&gt;&lt;p&gt;&lt;i&gt;&amp;quot;With GKE workload metrics, we no longer need to deploy and manage a separate Prometheus server to scrape our custom metrics - it's all managed by Google. 
We can now focus on leveraging the value of our custom metrics without hassle!&amp;quot; - Carlos Alexandre, Cloud Architect, NOS SGPS S.A., a Portuguese telecommunications and media company.&lt;/i&gt;&lt;/p&gt;&lt;h3&gt;How to get started&lt;/h3&gt;&lt;p&gt;​​Follow these instructions to enable the GKE workload metrics pipeline in your GKE cluster:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;gcloud beta container clusters update YOUR_CLUSTER_NAME \
  --zone=YOUR_ZONE \
  --project=YOUR_PROJECT \
  --monitoring=SYSTEM,WORKLOAD&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;GKE workload metrics is currently available in Preview, so be sure to use the &lt;code&gt;gcloud &lt;b&gt;beta&lt;/b&gt;&lt;/code&gt; command.&lt;/p&gt;&lt;p&gt;See the &lt;a href="https://cloud.google.com/stackdriver/docs/solutions/gke/managing-metrics#workload-metrics"&gt;GKE workload metrics guide&lt;/a&gt; for details about how to configure which metrics are collected as well as a guide for &lt;a href="https://cloud.google.com/stackdriver/docs/solutions/gke/sidecar-to-workload-metrics"&gt;migrating to GKE workload metrics&lt;/a&gt; from the &lt;a href="https://cloud.google.com/stackdriver/docs/solutions/gke/prometheus"&gt;Stackdriver Prometheus sidecar&lt;/a&gt;.&lt;/p&gt;&lt;h3&gt;Pricing&lt;/h3&gt;&lt;p&gt;Ingestion of GKE workload metrics into Cloud Monitoring is not currently charged, but it will be charged starting February 1, 2022 (see Editor’s note above. Previous version of the blog noted December 1, 2021 as the start date). See more about &lt;a href="https://cloud.google.com/stackdriver/pricing#monitoring-costs"&gt;Cloud Monitoring pricing&lt;/a&gt;.&lt;br/&gt;&lt;/p&gt;&lt;h3&gt;Cloud Monitoring for modern operations&lt;/h3&gt;&lt;p&gt;Once GKE workload metrics are ingested into Cloud Monitoring, you can start using all of the great features of the service including global scalability, long-term (24 month) storage options, integration with &lt;a href="https://cloud.google.com/logging"&gt;Cloud Logging&lt;/a&gt;, &lt;a href="https://cloud.google.com/monitoring/dashboards"&gt;custom dashboards&lt;/a&gt;, &lt;a href="https://cloud.google.com/monitoring/alerts"&gt;alerting&lt;/a&gt;, and &lt;a href="https://cloud.google.com/stackdriver/docs/solutions/slo-monitoring"&gt;SLO monitoring&lt;/a&gt;. 
These same benefits already exist for &lt;a href="https://cloud.google.com/stackdriver/docs/solutions/gke/managing-metrics#system-metrics"&gt;GKE system metrics&lt;/a&gt;, which are non-chargeable and are collected by default from GKE clusters and made available to you in the &lt;a href="https://console.cloud.google.com/monitoring/dashboards/resourceList/kubernetes"&gt;GKE Dashboard&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;If you have any questions or want to provide feedback, please visit the &lt;a href="https://www.googlecloudcommunity.com/gc/Google-Cloud-s-operations-suite/bd-p/cloud-operations" target="_blank"&gt;operations suite page&lt;/a&gt; on the Google Cloud Community.&lt;/p&gt;&lt;/div&gt;
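As a sketch of what the PodMonitor configuration step can look like, the following writes a hypothetical manifest. The field names follow the Prometheus Operator's PodMonitor resource, which the post says GKE's resource is modeled after; the apiVersion and all names here are assumptions, so check the GKE workload metrics guide for the exact schema before applying:

```shell
# Hypothetical PodMonitor sketch; apiVersion, names, and labels are
# assumptions modeled on the Prometheus Operator's PodMonitor. Verify
# the exact schema in the GKE workload metrics guide.
cat <<'EOF' > podmonitor.yaml
apiVersion: monitoring.gke.io/v1alpha1   # assumed; confirm in the docs
kind: PodMonitor
metadata:
  name: example-pod-monitor              # hypothetical name
  namespace: default
spec:
  selector:
    matchLabels:
      app: example-app                   # hypothetical workload label
  podMetricsEndpoints:
  - port: metrics                        # container port serving /metrics
    interval: 60s                        # scrape frequency
EOF

# Deploy with: kubectl apply -f podmonitor.yaml
```

This is the "single `kubectl apply`" setup the benefits list refers to: once the resource exists, GKE manages the scrape-and-ingest pipeline.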
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/operations/troubleshoot-gke-faster-with-monitoring-data-in-your-logs/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;Troubleshoot GKE apps faster with monitoring data in Cloud Logging&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;View contextual Monitoring data in your GKE log lines. Easily see the relevant pod, node and cluster events and metrics for your pod.&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Tue, 05 Oct 2021 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/managed-metric-collection-for-google-kubernetes-engine/</guid><category>Containers &amp; Kubernetes</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Better Kubernetes application monitoring with GKE workload metrics</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/managed-metric-collection-for-google-kubernetes-engine/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nathan Beach</name><title>Group Product Manager, Google Kubernetes Engine</title><department></department><company></company></author></item><item><title>To the cloud and beyond! Migration Enablement with Google Cloud’s Professional Services Organization</title><link>https://cloud.google.com/blog/products/gcp/google-cloud-pso-enabling-google-cloud-migration/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;Google Cloud’s &lt;a href="https://cloud.google.com/consulting"&gt;Professional Services Organization (PSO)&lt;/a&gt; engages with customers to ensure effective and efficient operations in the cloud, from the time they begin considering how cloud can help them overcome their operational, business or technical challenges, to the time they’re looking to optimize their cloud workloads. &lt;/p&gt;&lt;p&gt;We know that all parts of the &lt;a href="https://cloud.google.com/solutions/migration-center"&gt;cloud journey&lt;/a&gt; are important and can be complex.  
In this blog post, we want to focus specifically on the migration process and how PSO engages in a myriad of activities to ensure a successful migration.&lt;/p&gt;&lt;p&gt;As a team of trusted technical advisors, PSO will approach migrations in three phases:&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;a href="https://cloud.google.com/resources/cloud-migration-checklist"&gt;Pre-Migration Planning&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Cutover Activities&lt;/li&gt;&lt;li&gt;Post-Migration Operations&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;While this post will not cover in detail all of the steps required for a migration, it will focus on how PSO engages in specific activities to meet customer objectives, manage risk, and deliver value.  We will discuss the assets, processes and tools that we leverage to ensure success.&lt;/p&gt;&lt;h2&gt;Pre-Migration Planning&lt;/h2&gt;&lt;h3&gt;Assess Scope&lt;/h3&gt;&lt;p&gt;Before the migration happens, you will need to understand and clarify the future state that you’re working towards.  From a logistical perspective, PSO will be helping you with &lt;b&gt;capacity planning&lt;/b&gt; to ensure sufficient resources are available for your envisioned future state.&lt;/p&gt;&lt;p&gt;While migration into the cloud does allow you to eliminate many of the considerations for the physical, logistical, and financial concerns of traditional data centers and co-locations, it does not remove the need for active management of quotas, preparation for large migrations, and forecasting.  PSO will help you forecast your needs in advance and work with the capacity team to adjust quotas, manage resources, and ensure availability. &lt;/p&gt;&lt;p&gt;Once the future state has been determined, PSO will also work with the product teams to determine any gaps in functionality.  
PSO captures &lt;b&gt;feature requests&lt;/b&gt; across Google Cloud services and makes sure they are understood, logged, tracked, and prioritized appropriately with the relevant product teams. From there, PSO works closely with the customer to determine any interim workarounds that can be leveraged while waiting for the feature to land, as well as providing updates on the upcoming roadmap.&lt;/p&gt;&lt;h3&gt;Develop Migration Approach and Tooling&lt;/h3&gt;&lt;p&gt;Within Google Cloud, &lt;a href="https://cloud.google.com/solutions/application-migration"&gt;we have a library of assets and tools we use to assist in the migration process&lt;/a&gt;. We have seen these assets help us successfully complete migrations for other customers efficiently and effectively.&lt;/p&gt;&lt;p&gt;Based on the scoping requirements and tooling available to assist in the migration, &lt;a href="https://cloud.google.com/resources/understanding-cloud-migration-frameworks-whitepaper"&gt;PSO will help recommend a migration approach&lt;/a&gt;. We understand that enterprises have specific needs; differing levels of complexity and scale; and regulatory, operational, or organizational challenges that will need to be factored into the migration. PSO will help customers think through the different migration options and how all of the considerations will play out.&lt;/p&gt;&lt;p&gt;PSO will work with the customer team to determine the best migration approach for moving servers from on-prem to Google Cloud. PSO will walk customers through different migration approaches, such as refactoring, lift-and-shift, or new installs. From there, the customer can determine the best fit for their migration. PSO will provide guidance on best practices, drawing on other customers with similar use cases.&lt;/p&gt;&lt;p&gt;Google offers a variety of cloud-native tools that can assist with asset discovery, the migration itself, and post-migration optimization. 
As one example, PSO will work with project managers to determine the tooling that best accommodates the customer's requirements for migrating servers. PSO will also engage the Google product teams to ensure the customer fully understands the capabilities of each tool and the best fit for the use case. Google understands that, from a tooling perspective, one size does not fit all, so PSO will work with the customer to determine the best migration approach and tooling for different requirements.&lt;/p&gt;&lt;h2&gt;Cutover Activities&lt;/h2&gt;&lt;p&gt;Once all of the planning activities have been completed, PSO will assist in making sure the cutover is successful.&lt;/p&gt;&lt;p&gt;During and leading up to critical customer events, PSO can provide proactive &lt;b&gt;event management services&lt;/b&gt;, which deliver increased support and readiness for key workloads. Beyond having a solid architecture and infrastructure on the platform, support for this infrastructure is essential, and Technical Account Managers (TAMs) will help ensure that there are additional resources to support and unblock the customer when challenges arise.&lt;/p&gt;&lt;p&gt;As part of event management activities, PSO liaises with the Google Cloud Support Organization to ensure quick remediation and high resilience when challenges arise. A war room is usually created to facilitate quick communication about critical activities and roadblocks. These war rooms can give customers a direct line to the support and engineering teams that will triage and resolve their issues.&lt;/p&gt;&lt;h2&gt;Post-Migration Activities&lt;/h2&gt;&lt;p&gt;Once cutover is complete, PSO will continue to provide support in areas such as incident management, capacity planning, continuous operational support, and optimization to ensure the customer is successful from start to finish.&lt;/p&gt;&lt;p&gt;PSO will serve as the liaison between the customer and Google engineers. 
If support cases need to be escalated, PSO will ensure the appropriate parties are involved and work to get the case resolved in a timely manner. Through operational rigor, PSO will work with the customer to determine whether certain Google Cloud services will be beneficial to the customer's objectives. If services will add value to the customer, PSO will help enable them so they align with the customer's goals and current cloud architecture. In cases where there are gaps in services, PSO will proactively work with the customer and Google engineering teams to close the gaps by enabling additional functionality in the services.&lt;/p&gt;&lt;p&gt;PSO will continue to work with the engineering teams to consistently review and provide recommendations on the customer’s cloud architecture, ensuring an optimal, cost-efficient design that adheres to Google's best-practice guidelines.&lt;/p&gt;&lt;p&gt;Aside from migrations, PSO is also responsible for providing customers with continuous Google Cloud training. To support consistent skills development, PSO will work with the customer to jointly develop a learning roadmap so the customer has the skills needed to deliver successful projects in Google Cloud.&lt;/p&gt;&lt;h2&gt;Conclusion&lt;/h2&gt;&lt;p&gt;Google PSO will be actively engaged throughout the customer’s cloud journey to ensure the necessary guidance, methodology, and tools are presented to the customer. PSO will engage in a series of activities from pre-migration planning to post-migration operations, from capacity planning that ensures sufficient resources are allocated for future workloads, to support on technical cases for troubleshooting. 
PSO will serve as a long-term trusted advisor who will be the voice of the customer and help ensure the reliability and stability of the customer's Google Cloud environment.&lt;/p&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/consulting"&gt;Click here if you’d like to engage with our PSO team on your migration&lt;/a&gt;. Or, you can also &lt;a href="https://inthecloud.withgoogle.com/tco-assessment-19/form.html" target="_blank"&gt;get started with a free discovery and assessment&lt;/a&gt; of your current IT landscape.&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 09 Sep 2021 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/gcp/google-cloud-pso-enabling-google-cloud-migration/</guid><category>Cloud Migration</category><category>Cloud Operations</category><category>Google Cloud</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>To the cloud and beyond! Migration Enablement with Google Cloud’s Professional Services Organization</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/gcp/google-cloud-pso-enabling-google-cloud-migration/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nelson Lam</name><title>Technical Account Manager, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Charlotte Wen</name><title>Technical Account Manager, Google Cloud</title><department></department><company></company></author></item><item><title>How Lowe’s SRE reduced its mean time to recovery (MTTR) by over 80 percent</title><link>https://cloud.google.com/blog/products/devops-sre/how-lowes-improved-incident-response-processes-with-sre/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;&lt;i&gt;&lt;b&gt;Editor’s Note: &lt;/b&gt;In a &lt;a href="https://cloud.google.com/blog/products/devops-sre/how-lowes-leverages-google-sre-practices"&gt;previous blog&lt;/a&gt;, we 
discussed how home improvement retailer Lowe’s was able to increase the number of releases it supports by &lt;a href="https://cloud.google.com/sre"&gt;adopting Google’s Site Reliability Engineering (SRE) framework on Google Cloud&lt;/a&gt;. Lowe’s went from one release every two weeks to 20+ releases daily, helping meet its customer needs faster and more effectively. Today, the Lowe’s SRE team shares how they used SRE principles to decrease their mean-time-to-recovery (MTTR) by over 80 percent.&lt;/i&gt;&lt;/p&gt;&lt;p&gt;The stakes of managing Lowes.com have never been higher, and that means spotting, troubleshooting and recovering from incidents as quickly as possible, so that customers can continue to do business on our site. &lt;/p&gt;&lt;p&gt;To do that, it’s crucial to have solid incident engineering practices in place. Resolving an incident means mitigating the impact and/or restoring the service to its previous condition. The average time it takes to do this is called mean time to recovery (MTTR). Tracking this metric helps us stay on top of the overall reliability of our systems at Lowe’s, while simultaneously improving the speed with which we recover. Our goal is to keep the MTTR metric as low as possible, so that failures don’t negatively impact our business. Here are the four areas we addressed to drive holistic improvement in our MTTR.&lt;/p&gt;&lt;h3&gt;Lowe’s incident reporting process&lt;/h3&gt;&lt;p&gt;To reduce MTTR, we created a seamless incident reporting process following SRE principles. Our incident reporting process is a workflow that starts at the time an incident occurs, and ends with an SRE captain who closes the action items after a postmortem report. With this approach, we are able to limit the number of critical incidents. 
The reporting process involves three core components: monitoring, alerting, and blameless postmortems.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;b&gt;Monitoring and alerting&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Having proper monitoring and alerting in place is crucial when it comes to incident management. Monitoring and alerting tools let you detect issues as soon as they occur, and notify the right person in the shortest possible time to take action. From a measurement standpoint, we track this as our mean time to acknowledge (MTTA). This is the average time it takes from when an alert is triggered, to when work on the issue begins.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;At the time of an incident, our &lt;a href="https://cloud.google.com/monitoring"&gt;monitoring and alerting tools&lt;/a&gt; notify the on-call SRE first responder via &lt;a href="https://cloud.google.com/monitoring/support/notification-options"&gt;PagerDuty&lt;/a&gt; in the form of a phone call, text message and email. Our SRE software engineering team has done a lot of automation to enable various &lt;a href="https://cloud.google.com/blog/products/devops-sre/sre-fundamentals-sli-vs-slo-vs-sla"&gt;Service Level Indicator (SLI) alerts and Service Level Agreement (SLA)&lt;/a&gt; notifications. The on-call SRE then initiates a triage call with our service/domain stakeholders to resolve the incident. As a result, we reduced our MTTA from 30 minutes in 2019, to one minute – a 97 percent decrease. &lt;/p&gt;&lt;p&gt;&lt;b&gt;Blameless postmortems: learning from incidents&lt;/b&gt;&lt;/p&gt;&lt;p&gt;A postmortem is a written record of an incident, its impact, the actions taken to resolve it, the root cause and the follow-up actions to prevent the incident from recurring (&lt;a href="https://sre.google/sre-book/example-postmortem/" target="_blank"&gt;see example here&lt;/a&gt;). A blameless postmortem builds on that and is a core part of an SRE culture, and our culture at Lowe’s. 
We ensure that individuals are not singled out, and the outcomes of all postmortems are directed toward learnings and process improvement.&lt;/p&gt;&lt;p&gt;For us, the postmortem process is the biggest part of our incident workflow. When an SRE creates a new postmortem report, the first step is to conduct a &lt;a href="https://cloud.google.com/blog/products/gcp/getting-the-most-out-of-shared-postmortems-cre-life-lessons"&gt;postmortem session&lt;/a&gt; with domain stakeholders to review the report. The postmortem then goes into the review stage and gets reviewed by more stakeholders in our weekly postmortem meeting. In the final stage of this process, the SRE captain will close the report once everyone in the weekly meeting agrees that the report is complete.&lt;/p&gt;&lt;p&gt;To conduct a successful postmortem, it is critical to keep the focus on identifying gaps and issues with the system and operations processes, rather than on an individual, and to generate concrete actions to address the problems we’ve identified. To ensure this, we follow a few best practices:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;We start by gathering the facts from the person who identified the problem, and each SLI owner has to identify a gap or the next SLI upstream owner who created the impact for them.&lt;/li&gt;&lt;li&gt;Every SLI owner is provided full opportunity to present their case, and identifying the issue is done as a community exercise.&lt;/li&gt;&lt;li&gt;Once action items and process changes are identified, an owner is nominated to complete the actions, or they will volunteer.&lt;/li&gt;&lt;li&gt;For easy reference, we publish and store postmortems in our incident knowledge base. This process helps SREs continuously improve as future incidents arise. 
&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;b&gt;Continuous Improvement &lt;/b&gt;&lt;/p&gt;&lt;p&gt;Building the culture of honest, transparent, and direct feedback that blameless postmortems require is often an iterative process: it needs executive sponsorship, and incident captains who are empowered to lead the discussion and its outcomes. Running successful postmortems, and completing the action items from them, should be recognized and accounted for in SRE performance objective assessments. As shared in &lt;a href="https://sre.google/sre-book/postmortem-culture/" target="_blank"&gt;Google’s SRE book&lt;/a&gt;, the best practice is to ensure that writing effective postmortems is a rewarded and celebrated practice, with leadership’s acknowledgement and participation. During a cultural transformation, this is possibly the hardest part of an effective postmortem to accomplish unless you have full buy-in from leadership.&lt;/p&gt;&lt;p&gt;However, it’s all well worth it. This process is a key part of how we improved our MTTR over time: from two hours in 2019 to just 17 minutes! &lt;/p&gt;&lt;p&gt;Our SRE incident reporting process has also transformed how our company solves issues. By streamlining this workflow from alerting, to solving an issue, to blameless postmortems, we have reduced our MTTR by 82 percent and our MTTA by 97 percent. Most importantly, our team is learning from every incident and becoming better engineers as a result. Visit the &lt;a href="https://cloud.google.com/sre"&gt;SRE Google Cloud website&lt;/a&gt; to learn more about implementing SRE best practices in the cloud.&lt;/p&gt;&lt;hr/&gt;&lt;p&gt;&lt;i&gt;Acknowledgement&lt;/i&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;i&gt;Special thanks to Rahul Mohan Kola Kandy, Vivek Balivada, and the Digital SRE team at Lowe’s for contributing to this blog post. &lt;/i&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;/div&gt;
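The MTTA and MTTR figures above can be computed directly from incident timestamps. A minimal sketch in Python; the record layout and field names are invented for illustration, not taken from any Lowe’s or Google tooling:

```python
from datetime import datetime
from statistics import mean

# Each incident records when the alert fired, when work began
# (acknowledged), and when the issue was resolved.
incidents = [
    {"alerted": "2021-06-01T10:00:00", "acked": "2021-06-01T10:01:00",
     "resolved": "2021-06-01T10:17:00"},
    {"alerted": "2021-06-09T02:30:00", "acked": "2021-06-09T02:31:00",
     "resolved": "2021-06-09T02:47:00"},
]

def minutes_between(start: str, end: str) -> float:
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60

# MTTA: mean time from alert to acknowledgement.
mtta = mean(minutes_between(i["alerted"], i["acked"]) for i in incidents)
# MTTR: mean time from alert to resolution.
mttr = mean(minutes_between(i["alerted"], i["resolved"]) for i in incidents)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")  # MTTA: 1.0 min, MTTR: 17.0 min
```

Tracking these two numbers over quarters is what makes improvements like the 97 and 82 percent reductions above measurable.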
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/devops-sre/how-lowes-leverages-google-sre-practices/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;How Lowe’s meets customer demand with Google SRE practices&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;Lowe’s has adopted Google SRE practices to help developer and operations teams keep up with ecommerce demand.&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Tue, 07 Sep 2021 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/devops-sre/how-lowes-improved-incident-response-processes-with-sre/</guid><category>Google Cloud</category><category>Cloud Operations</category><category>Customers</category><category>Retail</category><category>DevOps &amp; SRE</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>How Lowe’s SRE reduced its mean time to recovery (MTTR) by over 80 percent</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/devops-sre/how-lowes-improved-incident-response-processes-with-sre/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Shyam Palani</name><title>Sr. Manager, E-commerce SRE, Lowe’s Companies, Inc.</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nishanth Prasad</name><title>Lead Software Engineer, Digital SRE, Lowe’s Companies, Inc.</title><department></department><company></company></author></item><item><title>Zero effort performance insights for popular serverless offerings</title><link>https://cloud.google.com/blog/products/operations/find-serverless-application-performance-issues-easier/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;Inevitably, in the lifetime of a service or application, developers, DevOps, and SREs will need to investigate the cause of latency. Usually you will start by determining whether it is the application or the underlying infrastructure causing the latency. You have to look for signals that indicate the performance of those resources when the issue occurred. &lt;/p&gt;&lt;h3&gt;Using traces as your latency signals&lt;/h3&gt;&lt;p&gt;In most instances, the signals that provide the richest information for latency are traces. 
Traces represent the total time it takes for a request to propagate through every layer of a distributed system during execution, including load balancers, compute, databases, and more. The segments of a trace that represent each layer of the execution are referred to as spans.&lt;/p&gt;&lt;p&gt;The difficulty of generating traces has prevented many users from accessing this useful troubleshooting resource. To make them more easily available to developers, we've started instrumenting our most popular serverless compute options, &lt;a href="https://cloud.google.com/appengine"&gt;App Engine&lt;/a&gt;, &lt;a href="https://cloud.google.com/run"&gt;Cloud Run&lt;/a&gt; and &lt;a href="https://cloud.google.com/functions"&gt;Cloud Functions&lt;/a&gt;, to generate traces by default. While this will not provide the full picture of what is going on in a complex distributed system, it will provide crucial pieces of information needed to decide which area to focus on during troubleshooting. &lt;/p&gt;&lt;h3&gt;What do I need to do to get this benefit today?&lt;/h3&gt;&lt;p&gt;The simple answer is: nothing!  Once your code is deployed in any serverless compute option like App Engine, Cloud Run or Cloud Functions, any ingress or egress traffic through the compute automatically generates spans that are captured and stored in &lt;a href="https://cloud.google.com/trace"&gt;Cloud Trace&lt;/a&gt;.  These spans are stored for 30 days at no additional cost; see the additional terms &lt;a href="https://cloud.google.com/stackdriver/pricing#trace-costs"&gt;here&lt;/a&gt;. The resulting traces can be visualized as waterfall graphs with representative values of latency. In addition, we have extended this capability to Google Cloud databases, with &lt;a href="https://cloud.google.com/sql/docs/postgres/using-query-insights#identifying_the_source_of_the_problem_using_tracing"&gt;Cloud SQL Insights&lt;/a&gt; generating traces representative of query plans for PostgreSQL and sending them to Cloud Trace. 
&lt;/p&gt;&lt;p&gt;The screenshot below is a Day 1 trace captured from a simple “Hello World” application deployed in Cloud Run. The load balancer span (i.e., the root span) is indicative of the total time through Google Cloud’s infrastructure, and the Cloud Run span is indicative of the time it took for the compute to execute and service the request. &lt;/p&gt;&lt;p&gt;As you can see in the graphic below, the load balancer span is roughly equal to the Cloud Run span, so we can conclude that any observed latency is not being caused by Google’s infrastructure. At this point you can focus more on your code.&lt;/p&gt;&lt;/div&gt;
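The reasoning above (compare the root load balancer span against the Cloud Run span to decide where latency lives) can be sketched as a toy span comparison. This is plain Python, not the Cloud Trace API; the span names and the tolerance threshold are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class Span:
    name: str
    duration_ms: float

def locate_latency(root: Span, child: Span, tolerance: float = 0.1) -> str:
    """If the child (compute) span accounts for nearly all of the root
    (load balancer) span, the latency is in your code, not the platform."""
    infra_ms = root.duration_ms - child.duration_ms
    if infra_ms <= tolerance * root.duration_ms:
        return "application"   # spans roughly equal: focus on your code
    return "infrastructure"    # a large gap points at the platform layers

root = Span("http-load-balancer", 212.0)
run = Span("cloud-run-service", 205.0)
print(locate_latency(root, run))  # application
```

Real traces carry many more spans, but the first triage question is usually exactly this root-versus-child comparison.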
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/trace_list.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/trace_list.max-1000x1000.jpg"
        
          alt="trace list.jpg"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h3&gt;This is awesome, how do I extend it?&lt;/h3&gt;&lt;p&gt;You must still instrument your application if you want it to generate more granular spans representative of the code's execution. You can start &lt;a href="https://cloud.google.com/trace/docs/setup#recommended_client_libraries"&gt;here&lt;/a&gt; to pick the library that matches your development language and for instructions on how to instrument your code. Once this is done, your traces will get richer, encompassing more spans with information about both the performance of the infrastructure and application in one single waterfall view.  &lt;/p&gt;&lt;h3&gt;Cloud Trace – Google Cloud’s hub for Infrastructure traces&lt;/h3&gt;&lt;p&gt;We are excited about the future of telemetry in Google Cloud. Upcoming releases in the next six months will touch on infrastructure instrumentation and areas like trace analysis, metrics, integrations to other Google Cloud products and integrations with third party APM products. &lt;/p&gt;&lt;h3&gt;Next Steps&lt;/h3&gt;&lt;p&gt;Explore the traces from your infrastructure in &lt;a href="https://console.cloud.google.com/traces/overview"&gt;your Cloud Trace console&lt;/a&gt; and explore the available &lt;a href="https://cloud.google.com/trace/docs/setup#recommended_client_libraries"&gt;libraries and procedures&lt;/a&gt; for application instrumentation. If you have questions or feedback about this new feature, head to the &lt;a href="https://www.googlecloudcommunity.com/gc/Cloud-Operations/bd-p/cloud-operations" target="_blank"&gt;Cloud Operations Community page&lt;/a&gt; and let us know!&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=CjGv1bDy9rI"
      data-glue-modal-trigger="uni-modal-CjGv1bDy9rI-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        &lt;img src="//img.youtube.com/vi/CjGv1bDy9rI/maxresdefault.jpg"
             alt="Google Cloud offers many tools that can help you manage your application services. In this video, we teach you how to set up and utilize Cloud Trace, Cloud Profiler, and Cloud Debugger to collect latency data across different services, memory-allocation information, and inspect application code locations without compromising the performance of your web application."/&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal-CjGv1bDy9rI-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="CjGv1bDy9rI"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=CjGv1bDy9rI"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Fri, 20 Aug 2021 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/find-serverless-application-performance-issues-easier/</guid><category>GKE</category><category>Compute Engine</category><category>Google Cloud</category><category>App Engine</category><category>Serverless</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Zero effort performance insights for popular serverless offerings</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/find-serverless-application-performance-issues-easier/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Eyamba Ita</name><title>Product Manager</title><department></department><company></company></author></item><item><title>Use Process Metrics for troubleshooting and resource attribution</title><link>https://cloud.google.com/blog/products/operations/troubleshoot-and-manage-your-vms-with-process-metrics/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;When you are experiencing an issue with your application or service, having deep visibility into both the infrastructure and the software powering your apps and services is critical. Most monitoring services provide insights at the Virtual Machine (VM) level, but few go further. To get a full picture of the state of your application or service, you need to know what processes are running on your infrastructure. That visibility into the processes running on your VMs is provided out of the box by the new &lt;a href="https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent"&gt;Ops Agent&lt;/a&gt; and made available by default in &lt;a href="https://cloud.google.com/monitoring"&gt;Cloud Monitoring&lt;/a&gt;. Today we will cover how to access process metrics and why you should start monitoring them. 
&lt;/p&gt;&lt;h3&gt;Better visibility with process metrics&lt;/h3&gt;&lt;p&gt;The data gathered by process metrics includes CPU, memory, I/O, number of threads, and &lt;a href="https://cloud.google.com/monitoring/api/metrics_agent#agent-processes"&gt;more&lt;/a&gt;, for any running processes and services on your VMs. When the Ops Agent or the Cloud Monitoring agent is installed, these metrics are captured at 60-second intervals and sent to Cloud Monitoring so you can visualize, analyze, track, and alert on them. A single VM may run tens or hundreds of processes, while you may have tens of thousands running across your fleet of VMs. &lt;/p&gt;&lt;p&gt;As a developer, you may only care about seeing inside a single VM to &lt;b&gt;troubleshoot&lt;/b&gt; and identify memory leaks or the source of performance issues.&lt;/p&gt;&lt;p&gt;As an operator or IT admin, you may be interested in aggregate &lt;b&gt;resource consumption&lt;/b&gt;, building baseline views of compute, storage, and networking usage across your VM fleet. Then, when consumption breaks from those baselines, you will know it is time to investigate your systems.&lt;/p&gt;&lt;h3&gt;Built for scale and ease of use&lt;/h3&gt;&lt;p&gt;Cloud Monitoring is built on the same advanced backend that powers metrics across Google. This proven scalability means your metrics ingestion will be supported even at extremely high cardinality. Additionally, our agents do not require any config file changes to turn on process metric monitoring.&lt;/p&gt;&lt;p&gt;Lastly, our goal is to provide you with observability and telemetry data where, and when, you need it. So, like the &lt;a href="https://cloud.google.com/blog/products/operations/better-access-to-observability-data-for-virtual-machines"&gt;rest of the operations suite&lt;/a&gt;, we deliver process metrics in the context of your infrastructure, directly in the VM admin console.&lt;/p&gt;&lt;/div&gt;
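Because process metrics land in Cloud Monitoring like any other agent metric, you can also query them directly. A hedged sketch in the Monitoring Query Language, assuming the agent's processes/rss_usage metric type; the grouping label and sort expression may differ in your environment:

```
fetch gce_instance
| metric 'agent.googleapis.com/processes/rss_usage'
| group_by [metric.command], [mean(val())]
| top 5, max(val())
```

A query like this surfaces the five commands consuming the most resident memory, which pairs well with the troubleshooting and attribution use cases described above.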
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/Navigating_to_a_single_VMs_in-context_process_monitoring_in_GCE.gif"
        
          alt="Navigating to a single VM’s in-context process monitoring in GCE.gif"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;i&gt;Navigating to a single VM’s in-context process monitoring in GCE&lt;/i&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;The navigation is simple. Once you have the Ops Agent or the Cloud Monitoring agent installed in your VMs:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Go to the Compute Engine console page and click on &lt;a href="https://console.cloud.google.com/compute/instances"&gt;VM Instances&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Select the VM that you want to investigate&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;In the navigation menu on the top, click Observability&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Click on Metrics&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Lastly, click on Processes&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;In the window on the right you will see a chart and a table with all of the processes in your VM. You can also filter by time frame and sort by name or value. You do not need to do anything, other than have the agent installed, for the process to be detected and displayed.&lt;/p&gt;&lt;h3&gt;Fleet-wide metrics monitoring&lt;/h3&gt;&lt;p&gt;Cloud Monitoring gives you a look across your fleet of VMs so you can identify the aggregated usage of resources by processes. This level of broad, yet granular, insight can drive your decisions around which software to run or how many VMs you need to optimally power your apps and services. Admins can perform a cost-savings analysis if they determine that certain processes are slowing down the work of a large number of VMs. The larger numbers of less powerful VMs can be replaced by fewer, more capable VMs.   
&lt;/p&gt;&lt;p&gt;To get this fleet-wide view:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Navigate to &lt;a href="https://console.cloud.google.com/monitoring"&gt;Cloud Monitoring&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Click Dashboards in the left menu&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;In the All Dashboards list, click on VM Instances&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Towards the top of the window, click on &lt;a href="https://console.cloud.google.com/monitoring/dashboards/resourceList/gce_instance"&gt;Processes&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p/&gt;&lt;p&gt;This provides many charts detailing the processes running across your fleet of VMs.&lt;/p&gt;&lt;/div&gt;
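The fleet-wide view aggregates the same per-process samples across VMs. A minimal stdlib sketch of that aggregation step; the sample data and field names are invented for illustration:

```python
from collections import defaultdict

# One sample per (vm, process): CPU percent and resident memory in MiB.
samples = [
    {"vm": "web-1", "command": "nginx",  "cpu_pct": 12.0, "rss_mib": 80},
    {"vm": "web-2", "command": "nginx",  "cpu_pct": 15.0, "rss_mib": 95},
    {"vm": "job-1", "command": "python", "cpu_pct": 64.0, "rss_mib": 512},
]

def aggregate_by_command(samples):
    """Sum CPU and memory per command name across the whole fleet."""
    totals = defaultdict(lambda: {"cpu_pct": 0.0, "rss_mib": 0})
    for s in samples:
        totals[s["command"]]["cpu_pct"] += s["cpu_pct"]
        totals[s["command"]]["rss_mib"] += s["rss_mib"]
    return dict(totals)

fleet = aggregate_by_command(samples)
print(fleet["nginx"])  # {'cpu_pct': 27.0, 'rss_mib': 175}
```

This is the shape of analysis the Processes dashboard performs for you: per-command totals across VMs that reveal which software is driving fleet-wide consumption.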
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/new_Cloud_Monitoring_VM_Fleet-wide_Process_view.gif"
        
          alt="new Cloud Monitoring VM Fleet-wide Process view.gif"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;i&gt;The new Cloud Monitoring VM Fleet-wide Process view in the VM Instance Dashboard&lt;/i&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h3&gt;Get started today&lt;/h3&gt;&lt;p&gt;To start identifying and monitoring your process metrics, you must first &lt;a href="https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent"&gt;install the Ops Agent&lt;/a&gt;, or have installed the legacy Cloud Monitoring agent. Once that is complete, the process metrics data will automatically be ingested into Cloud Monitoring and the &lt;a href="https://console.cloud.google.com/compute/instances"&gt;VM admin console&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;If you have any questions, or to join the conversation with other developers, operators, DevOps, and SREs, visit the &lt;a href="https://www.googlecloudcommunity.com/gc/Cloud-Operations/bd-p/cloud-operations" target="_blank"&gt;Cloud Operations page&lt;/a&gt; in the Google Cloud Community.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/operations/ops-agent-now-ga-and-it-includes-opentelemetry/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;The Ops Agent is now GA and it leverages OpenTelemetry&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;Today, we’re happy to announce the General Availability of the new Ops Agent, which replaces both the Logging and Monitoring agents and s...&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Wed, 18 Aug 2021 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/troubleshoot-and-manage-your-vms-with-process-metrics/</guid><category>Compute</category><category>Google Cloud</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Use Process Metrics for troubleshooting and resource attribution</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/troubleshoot-and-manage-your-vms-with-process-metrics/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Rahul Harpalani</name><title>Product Manager</title><department></department><company></company></author></item><item><title>Verify GKE Service Availability with new dedicated uptime checks</title><link>https://cloud.google.com/blog/products/operations/verify-gke-services-are-up-with-dedicated-uptime-checks/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;Keeping the experience of your end user in mind is important when developing applications. Observability tools help your team measure the performance indicators that matter to your users, like uptime. It’s generally a good practice to measure your service internally via metrics and logs, which can give you indications of uptime, but an external signal is very useful as well, wherever feasible. &lt;/p&gt;&lt;p&gt;One of the easiest ways to measure your services externally is to use an established and trusted technology: the uptime check. Uptime checks closely monitor the availability of your service and can serve as a leading indicator of a problem, helping you reduce or eliminate the time an issue affects your users. &lt;/p&gt;&lt;h3&gt;Uptime checks for GKE services &lt;/h3&gt;&lt;p&gt;With the proliferation of the microservices architecture, more services means more endpoints to measure. 
Trying to track, isolate, and resolve issues can be increasingly complex. That’s why we’re excited to introduce the new uptime check for Google Kubernetes Engine (GKE) LoadBalancer services. &lt;/p&gt;&lt;p&gt;Google Cloud has offered &lt;a href="https://cloud.google.com/monitoring/uptime-checks"&gt;uptime checks&lt;/a&gt; for different types of resources, but none of these provided a direct association with GKE. With our new integration, the GKE LoadBalancer uptime check directly associates a service load balancer with an uptime check, helping to ensure the uptime check is managed dynamically. As the underlying network for a service changes, the uptime check changes with it, allowing you to quickly correlate a service with an uptime failure. &lt;/p&gt;&lt;p&gt;You can also set up an alert policy based on your uptime check, allowing your SRE or Ops team to be notified of a meaningful issue that’s impacting your service. Once notified, you can jump straight into the associated &lt;a href="https://cloud.google.com/stackdriver/docs/solutions/gke/observing#skm-dashboard"&gt;GKE Dashboard&lt;/a&gt; to better isolate the root cause. &lt;/p&gt;&lt;h3&gt;Creating a new uptime check&lt;/h3&gt;&lt;p&gt;To get started, head to Monitoring &amp;gt; Uptime, select “+ Create Uptime Check”, and then select the new Kubernetes Loadbalancer Service option.&lt;/p&gt;&lt;/div&gt;
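Uptime checks can also be managed as code rather than through the console. A sketch using Terraform's google_monitoring_uptime_check_config resource; the display name, host IP, path, and project are placeholders, and this creates a generic HTTP uptime check against a service's external address rather than the dedicated GKE LoadBalancer variant, which is configured through the console UI:

```hcl
resource "google_monitoring_uptime_check_config" "frontend" {
  display_name = "frontend-uptime"
  timeout      = "10s"
  period       = "60s"

  http_check {
    path = "/healthz" # health endpoint exposed by the service
    port = 80
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = "my-project"
      host       = "203.0.113.10" # external IP of the LoadBalancer service
    }
  }
}
```

An alert policy can then reference this check's uptime metric so your SRE or Ops team is paged when availability drops.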
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/uptime_check.gif"
        
          alt="uptime_check.gif"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h3&gt;More information&lt;/h3&gt;&lt;p&gt;Visit our &lt;a href="https://cloud.google.com/monitoring/uptime-checks"&gt;documentation for Managing uptime checks&lt;/a&gt;, where you can get additional information and step by step instructions for creating your first uptime check.&lt;/p&gt;&lt;p&gt;Lastly, if you have questions or feedback about this new feature, head to the &lt;a href="https://www.googlecloudcommunity.com/gc/Cloud-Operations/bd-p/cloud-operations" target="_blank"&gt;Cloud Operations Community page&lt;/a&gt; and let us know!&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=--4WWwx4Log"
      data-glue-modal-trigger="uni-modal---4WWwx4Log-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        &lt;img src="//img.youtube.com/vi/--4WWwx4Log/maxresdefault.jpg"
             alt="In this episode of Stack Doctor, we show you how to use the alerts timeline on your GKE monitoring dashboards to troubleshoot your services. Watch to learn how you can easily spot and resolve issues in your applications and infrastructure!"/&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal---4WWwx4Log-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="--4WWwx4Log"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=--4WWwx4Log"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Fri, 13 Aug 2021 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/verify-gke-services-are-up-with-dedicated-uptime-checks/</guid><category>GKE</category><category>Containers &amp; Kubernetes</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Verify GKE Service Availability with new dedicated uptime checks</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/verify-gke-services-are-up-with-dedicated-uptime-checks/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Roy Nuriel</name><title>Product Manager, Cloud Ops</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Kyle Benson</name><title>Product Manager, Cloud Ops</title><department></department><company></company></author></item><item><title>Monitor and troubleshoot your VMs in context for faster resolution</title><link>https://cloud.google.com/blog/products/operations/better-access-to-observability-data-for-virtual-machines/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;Troubleshooting production issues with virtual machines (VMs) can be complex and often requires correlating multiple data points and signals across infrastructure and application metrics, as well as raw logs. When your end users are experiencing latency, downtime, or errors, switching between different tools and UIs to perform a root cause analysis can slow your developers down. 
Saving time when accessing the necessary data, deploying fixes, and verifying those fixes can save your organization money and keep the confidence of your users.&lt;/p&gt;&lt;p&gt;We are happy to announce the general availability of an enhanced “in-context” set of UI-based tools for &lt;a href="https://cloud.google.com/compute"&gt;Compute Engine&lt;/a&gt; users to help make the troubleshooting journey easier and more intuitive. From the Google Cloud Console, developers can click into any VM and access a rich set of pre-built visualizations designed to give insights into common scenarios and issues associated with CPU, disk, memory, networking, and live processes. With access to all of this data in one location, you can easily correlate between signals over a given timeframe. &lt;/p&gt;&lt;h3&gt;Bringing more operations data to your VMs&lt;/h3&gt;&lt;p&gt;A collection of high-level metrics has always been available on the Compute Engine console page. However, your feedback let us know that you still had to navigate between different tools to perform a proper root cause analysis. For example, seeing that CPU utilization peaked during a certain time frame might be a helpful starting point, but resolving the issue will require a deeper understanding of what is driving the utilization. Furthermore, you will want to correlate this data with &lt;a href="https://cloud.google.com/monitoring/agent/process-metrics"&gt;processes&lt;/a&gt; and other signals such as I/O wait time versus user space versus kernel space. &lt;/p&gt;&lt;p&gt;With this in mind, we added metrics, charts, and a variety of new visualizations to the Compute Engine page, many requiring zero setup time. 
Some of these new additions are populated with in-depth metrics provided by the Google Cloud &lt;a href="https://cloud.google.com/blog/products/operations/ops-agent-now-ga-and-it-includes-opentelemetry"&gt;Ops Agent&lt;/a&gt; (or legacy agents if you’re currently using them), which can easily be installed via &lt;a href="https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/install-index"&gt;Terraform, Puppet, Ansible or an install script&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;
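For a single Linux VM, the install-script route from the linked docs is the quickest path. A minimal sketch, using the documented one-step script (run on the VM itself; for whole fleets, prefer the Terraform, Puppet, Ansible, or agent-policy options):

```shell
# Download Google's documented Ops Agent repo script and install the agent.
# Run this on the target VM, not on your workstation.
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install

# Confirm the agent is running before expecting the new charts to populate.
sudo systemctl status google-cloud-ops-agent"*"
```

Once the agent reports healthy, the agent-backed charts on the VM's Observability tab begin filling in within a few minutes.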
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        &lt;a href="https://storage.googleapis.com/gweb-cloudblog-publish/images/observability_data_available.max-2800x2800.jpg" rel="external" target="_blank"&gt;
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/observability_data_available.max-1000x1000.jpg"
        
          alt="observability data available.jpg"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;i&gt;Some of the observability data available when you click into your VM&lt;/i&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;New charts that leverage the metrics from the Ops Agent include: CPU utilization as reported by the OS, Memory Utilization, Memory breakdown by User, Kernel, and Disk Cache, I/O Latency, Disk Utilization and Queue Length, Process Metrics, and many more.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/oic_hgaron.gif"
        
          alt="oic hgaron.gif"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;i&gt;A more detailed look at the data available when you click into the VM, including metrics and logs&lt;/i&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;While no single troubleshooting journey fits all needs, this enhanced set of observability tools should make the following scenarios faster and more intuitive:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Identifying networking changes via metrics and logs&lt;/b&gt;. By comparing unexpected increases in network traffic, network packet size, or spikes in new network connections against logs by severity, developers might identify a correlation between this traffic increase and critical logs errors. By further navigating to the Logs section of the tools, one can quickly filter to critical logs only and expand sample log messages to discover detailed logs around timeout messages or errors caused by the increased load. Deep links to the &lt;a href="https://cloud.google.com/logging/docs/view/logs-viewer-interface"&gt;Logs Explorer&lt;/a&gt; filtered to the VM of interest allows for fast and seamless navigation between Compute Engine and Cloud Logging.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Determining the impact of specific processes on utilization&lt;/b&gt;. By comparing times of high CPU or memory utilization against top processes, operators can determine whether a specific process (as denoted by command line or PID) is over-consuming. They can then refactor or terminate a process altogether, or choose to run a process on a machine better suited for its compute and memory requirements. Alternatively, there may be many short-lived processes that do not show up in the processes snapshot, but are visible as a spike in the Process Creation Rate chart. This can lead to a decision to refactor so that process duration is distributed more efficiently.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Choosing appropriate disk size for workloads&lt;/b&gt;. 
A developer may notice that the "Peak 1-second IOPS" have begun to hit a flat line, indicating the disk is hitting a performance limit. If the "I/O Latency Avg" also shows a corresponding increase, this could indicate that I/O throttling is occurring. Finally, breaking down the Peak IOPS by Storage Type, one might see that Persistent Disk SSD is responsible for the majority of the peak IOPS, which could lead to a decision to increase the size of the disk to get a &lt;a href="https://cloud.google.com/compute/docs/disks/performance#performance_factors"&gt;higher block storage performance limit&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Security Operations and Data Sovereignty&lt;/b&gt;. Operators may be in charge of enforcing security protocols around external data access, or creating technical architecture for keeping data within specific regions for privacy and regulatory compliance. Using the Network Summary, operators can determine at a glance whether a VM is establishing connections and sending traffic primarily to VMs and Google services within the same project, or whether connections and traffic ingress / egress may be occurring externally. Likewise, operators can determine whether new connections are being established or traffic is being sent to different regions or zones, which may lead to new protocols to block inter-region data transfer. &lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Cost optimization via networking changes&lt;/b&gt;. A developer may notice that the majority of VM to VM traffic is being sent inter-region, as opposed to traffic remaining in the same region. 
Because this inter-region traffic is slower and is &lt;a href="https://cloud.google.com/vpc/network-pricing#egress-within-gcp"&gt;charged at an inter-region rate&lt;/a&gt;, the developer can choose to reconfigure the VM to communicate instead with local replicas of the data it needs in its same region, thus reducing both latency and cost.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;b&gt;Measuring and tuning memory performance&lt;/b&gt;. The Ops Agent is required for most VM families to collect memory utilization. By examining the memory usage by top processes, a developer may detect a memory leak, and reconfigure or terminate the offending process. Additionally, an operator may examine the breakdown of memory usage by type and notice that disk cache usage has hit the limit of using all memory not in use by applications, correlating with an increase in disk latency. They may choose to &lt;a href="https://cloud.google.com/blog/products/compute/choose-the-right-google-compute-engine-machine-type-for-you"&gt;upsize to a memory-optimized VM&lt;/a&gt; to allow enough memory for both applications and disk caching.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;These are just a few of the use cases where your team may leverage these new capabilities to spend less time troubleshooting, optimize for costs, and improve your overall experience with Compute Engine. &lt;/p&gt;&lt;h3&gt;Get Started Today&lt;/h3&gt;&lt;p&gt;To get started, navigate to &lt;b&gt;Compute Engine &amp;gt; VM Instances&lt;/b&gt;, click into a specific VM of interest, and navigate to the &lt;b&gt;Observability&lt;/b&gt; tab. 
You can also check out our &lt;a href="https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-performance"&gt;developer docs&lt;/a&gt; for additional guidance on how to use these tools to troubleshoot VM performance, and we recommend &lt;a href="https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/install-index"&gt;installing the Ops Agent&lt;/a&gt; on your Compute Engine VMs to get the most out of these new tools. If you have specific questions or feedback, please join the discussion on the &lt;a href="https://www.googlecloudcommunity.com/gc/Cloud-Operations/bd-p/cloud-operations" target="_blank"&gt;Cloud Operations page&lt;/a&gt; of our Google Cloud Community.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/operations/agent-installation-options-google-cloud-vms/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;Agent installation options for Google Cloud VMs&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;Google Cloud makes it easy to install agents on your single VMs or your whole fleet of VMs so you can collect data for monitoring and tro...&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Thu, 12 Aug 2021 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/better-access-to-observability-data-for-virtual-machines/</guid><category>Compute</category><category>Google Cloud</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Monitor and troubleshoot your VMs in context for faster resolution</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/better-access-to-observability-data-for-virtual-machines/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Haskell Garon</name><title>Sr. Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Dave Raffensperger</name><title>Staff Software Engineer</title><department></department><company></company></author></item><item><title>Troubleshoot GKE apps faster with monitoring data in Cloud Logging</title><link>https://cloud.google.com/blog/products/operations/troubleshoot-gke-faster-with-monitoring-data-in-your-logs/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;When you’re troubleshooting an application on Google Kubernetes Engine (GKE), the more context that you have on the issue, the faster you can resolve it. For example, did the pod exceed its memory allocation? Was there a permissions error reserving the storage volume? Did a rogue regex in the app pin the CPU? All of these questions require developers and operators to build a lot of troubleshooting context. &lt;/p&gt;&lt;h3&gt;Cloud Monitoring data for GKE in Cloud Logging&lt;/h3&gt;&lt;p&gt;To make it easier to troubleshoot GKE apps, we’ve added contextual &lt;a href="https://cloud.google.com/monitoring"&gt;Cloud Monitoring&lt;/a&gt; data accessible right from &lt;a href="https://cloud.google.com/logging"&gt;Cloud Logging&lt;/a&gt;. 
With this new feature, you can easily see the relevant pod, node and cluster events, metrics, alerts, and SLOs right from the log line itself. Additionally, the data loaded for a specific log entry is scoped to the Kubernetes resource, which saves you valuable time while investigating an app error.&lt;/p&gt;&lt;p&gt;Today’s announcement builds on other recent integrations including the addition of a logs tab nested in the details page of &lt;a href="https://cloud.google.com/blog/products/operations/easy-access-to-your-gke-logs-from-the-cloud-console"&gt;each of your GKE resources&lt;/a&gt; and combining metrics and logs in the GKE Dashboard in Monitoring. Now, wherever you start your troubleshooting journey – in Monitoring, Logging or GKE – you have the observability data at your fingertips. &lt;/p&gt;&lt;p&gt;For example, if you’re troubleshooting a GKE app error in Cloud Logging and looking at the app logs, you can now view the metric charts for container restarts, uptime, memory, CPU and storage without leaving the log entry. Active alerts are highlighted on the alerts tab, which can provide helpful context for troubleshooting. This unique and integrated experience brings together critical log and metric data for the specific Kubernetes resource where your app is running.&lt;/p&gt;&lt;/div&gt;
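The same scoped view is reproducible from the command line: any Logs Explorer filter works with `gcloud logging read`. A sketch, where the cluster and namespace names are illustrative placeholders:

```shell
# Logging query language filter scoped to GKE container logs in one cluster;
# "my-cluster" and "default" below are placeholder names for illustration.
FILTER='resource.type="k8s_container"
  AND resource.labels.cluster_name="my-cluster"
  AND resource.labels.namespace_name="default"
  AND severity>=ERROR'

# Fetch the ten most recent matching entries (requires an authenticated gcloud).
if command -v gcloud >/dev/null 2>&1; then
  gcloud logging read "$FILTER" --limit=10 --format=json || true
fi
```

The same filter string can be pasted into the Logs Explorer query box to land on the exact entries whose blue resource chips open the metrics panel.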
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/GCP_container_details.gif"
        
          alt="GCP container details.gif"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h3&gt;Viewing Monitoring data for GKE from a log line&lt;/h3&gt;From a &lt;i&gt;k8s_container, k8s_pod, k8s_node, or k8s_cluster&lt;/i&gt; log, select the blue chip with the resource.labels resource name and then select “View Monitoring details” to access an integrated metrics panel directly from the Logs Explorer. Selecting “View in GKE” opens the detailed view of the GKE resource in the Cloud Console on a new tab.&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_Viewing_Monitoring_data_for_GKE_from_a_l.max-1000x1000.jpg"
        
          alt="2 Viewing Monitoring data for GKE from a log line.jpg"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;The metrics panel provides a lot of contextual data including &lt;i&gt;alerts, Kubernetes events&lt;/i&gt; and &lt;i&gt;metrics&lt;/i&gt; related to the GKE resource.&lt;/p&gt;&lt;h3&gt;Alerts &lt;/h3&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/monitoring/alerts"&gt;Alerts&lt;/a&gt; triggered by the GKE resource are displayed under the alerts tab. The color-coded alert status provides an easy way to see ongoing, acknowledged and closed incidents. Selecting “VIEW INCIDENT” opens the incident details in Cloud Monitoring. If you want to create a new alert, use the link to create a brand new alert policy.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_Alerts.max-1000x1000.jpg"
        
          alt="3 Alerts.jpg"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h3&gt;Kubernetes events for clusters and pods&lt;/h3&gt;&lt;p&gt;The metrics panel provides select events for clusters and pods.  For each event, the name, associated resource and a link to view/copy the log message are displayed. Kubernetes events can provide important information to help determine the root cause of an issue. For example, if a FailedScheduling event is displayed, this can quickly guide troubleshooting to check the resources available to the Kubernetes resource.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_Kubernetes_events_for_clusters_and_pods_.max-1000x1000.jpg"
        
          alt="4 Kubernetes events for clusters and pods .jpg"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h3&gt;Metrics for containers, pods and nodes&lt;/h3&gt;&lt;p&gt;The metrics tab contains metrics bundles for container (default), pod and node metrics collected from the GKE cluster and reported in Cloud Monitoring. Each metric bundle offers pre-built charts that can be selected to view the CPU, memory, storage and container restarts. For example, by looking at the CPU or memory, you can determine whether there were any spikes in the metrics for the Kubernetes resources.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_Metrics_for_containers_pods_and_nodes.max-1000x1000.jpg"
        
          alt="5 Metrics for containers, pods and nodes.jpg"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;h3&gt;More to come&lt;/h3&gt;&lt;p&gt;We’re committed to making &lt;a href="https://cloud.google.com/products/operations"&gt;Google Cloud’s operations suite&lt;/a&gt; the best place to troubleshoot your GKE apps. We’ve integrated logs directly into GKE resource details pages and built a specialized integrated GKE Dashboard, all to make it easier to troubleshoot GKE apps. However, there is still more coming and we’re already working hard to add new features to the metrics panels to surface even more context for troubleshooting GKE apps.&lt;/p&gt;&lt;h3&gt;Get started today&lt;/h3&gt;&lt;p&gt;If you haven’t already, to get started with Cloud Logging and Cloud Monitoring on GKE, view&lt;a href="https://cloud.google.com/monitoring/kubernetes-engine/observing"&gt; documentation&lt;/a&gt;, watch a quick video on &lt;a href="https://youtu.be/--4WWwx4Log" target="_blank"&gt;troubleshooting services on GKE&lt;/a&gt; and join the discussion in our new &lt;a href="https://www.googlecloudcommunity.com/gc/Cloud-Operations/bd-p/cloud-operations" target="_blank"&gt;Cloud Operations page&lt;/a&gt; on the Google Cloud Community site.&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-related_article_tout"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href="https://cloud.google.com/blog/products/operations/gke-operations-magic-alert-resolution-5-steps/"
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('https://storage.googleapis.com/gweb-cloudblog-publish/images/18_-_Infrastructure_urFj7Af.max-500x500.jpg')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;GKE operations magic: From an alert to resolution in 5 steps&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;Teams operating microservices increasingly rely on metrics, logs, and traces to identify and troubleshoot problems. The GKE Dashboard bri...&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Tue, 10 Aug 2021 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/troubleshoot-gke-faster-with-monitoring-data-in-your-logs/</guid><category>GKE</category><category>Compute Engine</category><category>Google Cloud</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Troubleshoot GKE apps faster with monitoring data in Cloud Logging</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/troubleshoot-gke-faster-with-monitoring-data-in-your-logs/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Charles Baer</name><title>Product Manager, Google Cloud</title><department></department><company></company></author></item><item><title>Use log buckets for data governance, now supported in 23 regions</title><link>https://cloud.google.com/blog/products/operations/keep-your-logs-data-compliant-with-regional-log-buckets/</link><description>&lt;div class="block-paragraph"&gt;&lt;p&gt;Logs are an essential part of troubleshooting applications and services. However, ensuring your developers, DevOps, ITOps, and SRE teams have access to the logs they need, while accounting for operational tasks such as scaling up, access control, updates, and keeping your data compliant, can be challenging. To help you offload these operational tasks associated with running your own logging stack, we offer &lt;a href="https://cloud.google.com/logging"&gt;Cloud Logging&lt;/a&gt;. If you don’t need to worry about data residency, Cloud Logging will pick a region to store and process your logs. &lt;/p&gt;&lt;p&gt;If you do have data governance and compliance requirements, we’re excited to share that Cloud Logging now offers even more flexibility and control by providing you a choice of which region to store and process your logging data. 
In addition to the information below, we recently &lt;a href="https://cloud.google.com/resources/data-governance-logs-best-practices-whitepaper"&gt;published a whitepaper&lt;/a&gt; that details compliance best practices for logs data.&lt;/p&gt;&lt;h3&gt;Choose from 23 regions to help keep your logs data compliant&lt;/h3&gt;&lt;p&gt;Log entries from apps and services running on Google Cloud will automatically be received by Cloud Logging within the region where the resource is running. From there, logs will be stored in log buckets. Log buckets have many attributes in common with Cloud Storage buckets, including the ability to:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/logging/docs/storage#logs-retention"&gt;Set retention&lt;/a&gt; from 1 day to 10 years&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/logging/docs/buckets#locking-logs-buckets"&gt;Lock a log bucket&lt;/a&gt; to prevent anyone from deleting logs or reducing the retention period of the bucket&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/logging/docs/regionalized-logs"&gt;Choose a region&lt;/a&gt; for your log bucket. 
We recently introduced support for 23 regions to host your log buckets:&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/about/locations/#europe"&gt;Europe&lt;/a&gt; - &lt;code&gt;europe-central2&lt;/code&gt;, &lt;code&gt;europe-north1&lt;/code&gt;, &lt;code&gt;europe-west1&lt;/code&gt;, &lt;code&gt;europe-west2&lt;/code&gt;, &lt;code&gt;europe-west3&lt;/code&gt;, &lt;code&gt;europe-west4&lt;/code&gt;, &lt;code&gt;europe-west6&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/about/locations/#americas"&gt;Americas&lt;/a&gt; - &lt;code&gt;us-central1&lt;/code&gt;, &lt;code&gt;us-east1&lt;/code&gt;, &lt;code&gt;us-east4&lt;/code&gt;, &lt;code&gt;us-west1&lt;/code&gt;, &lt;code&gt;us-west2&lt;/code&gt;, &lt;code&gt;us-west3&lt;/code&gt;, &lt;code&gt;northamerica-northeast1&lt;/code&gt;, &lt;code&gt;southamerica-east1&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/about/locations/#asia-pacific"&gt;Asia Pacific&lt;/a&gt; - &lt;code&gt;asia-east1&lt;/code&gt;, &lt;code&gt;asia-east2&lt;/code&gt;, &lt;code&gt;asia-northeast1&lt;/code&gt;, &lt;code&gt;asia-northeast2&lt;/code&gt;, &lt;code&gt;asia-northeast3&lt;/code&gt;, &lt;code&gt;asia-south1&lt;/code&gt;, &lt;code&gt;asia-southeast1&lt;/code&gt;, &lt;code&gt;australia-southeast1&lt;/code&gt;    &lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;How to create a log bucket&lt;/h3&gt;&lt;p&gt;You can get started with regionalized log storage in less than five minutes.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Go to the Cloud Console and go to Logging&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Navigate to Logs Storage and click on “Create logs bucket”&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Name the log bucket and choose the desired region. 
&lt;b&gt;Note that the region cannot be changed later.&lt;/b&gt; &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Set the retention period and then click Create Bucket.&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;
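The four Console steps above can also be scripted. A sketch with gcloud, where the project, bucket name, and region are placeholders to substitute with your own:

```shell
PROJECT_ID="my-project"      # placeholder project ID
BUCKET_ID="compliance-logs"  # placeholder bucket name
LOCATION="europe-west1"      # any supported region; cannot be changed later
RETENTION_DAYS=30            # anywhere from 1 day to 10 years (3650 days)

# Create the regional log bucket (requires an authenticated gcloud).
if command -v gcloud >/dev/null 2>&1; then
  gcloud logging buckets create "$BUCKET_ID" \
    --project="$PROJECT_ID" \
    --location="$LOCATION" \
    --retention-days="$RETENTION_DAYS" || true
fi
```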
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/create_log_bucket.gif"
        
          alt="create log bucket.gif"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph"&gt;&lt;p&gt;Once you have created the bucket, you need to point the incoming logs to that bucket. To complete this:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Go to the Logs Router section of the Cloud Console and click on the dots to the right of the _Default sink. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Select “Edit Sink”&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Under Sink Destination, change the log bucket selected from “projects/.../_Default” to “projects/.../ (name of newly created bucket)”. &lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Scroll to the bottom and select “Update sink” to save the changes&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;If you need more detailed information on this topic, please see our &lt;a href="https://services.google.com/fh/files/misc/whitepaper_data_governance_logs_how_to.pdf" target="_blank"&gt;step by step getting started guide&lt;/a&gt; for overcoming common logs data compliance challenges. &lt;/p&gt;&lt;h3&gt;More about data residency in Cloud Logging&lt;/h3&gt;&lt;p&gt;We have covered a lot of information about logs in this blog. For more on this topic and other best practices for compliance with logs data, please download this &lt;a href="https://cloud.google.com/resources/data-governance-logs-best-practices-whitepaper"&gt;whitepaper&lt;/a&gt;. We hope this helps you focus on managing your apps rather than your operations. If you would like to pose a question or join the conversation about Google Cloud operations with other professionals, please visit our &lt;a href="https://www.googlecloudcommunity.com/gc/Cloud-Operations/bd-p/cloud-operations" target="_blank"&gt;new community page&lt;/a&gt;. Happy Logging!&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-related_article_tout_external"&gt;





&lt;div class="uni-related-article-tout h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;a href=""
       data-analytics='{
                       "event": "page interaction",
                       "category": "article lead",
                       "action": "related article - inline",
                       "label": "article: {slug}"
                     }'
       class="uni-related-article-tout__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
        h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3 uni-click-tracker"&gt;
      &lt;div class="uni-related-article-tout__inner-wrapper"&gt;
        &lt;p class="uni-related-article-tout__eyebrow h-c-eyebrow"&gt;Related Article&lt;/p&gt;

        &lt;div class="uni-related-article-tout__content-wrapper"&gt;
          &lt;div class="uni-related-article-tout__image-wrapper"&gt;
            &lt;div class="uni-related-article-tout__image" style="background-image: url('')"&gt;&lt;/div&gt;
          &lt;/div&gt;
          &lt;div class="uni-related-article-tout__content"&gt;
            &lt;h4 class="uni-related-article-tout__header h-has-bottom-margin"&gt;&lt;/h4&gt;
            &lt;p class="uni-related-article-tout__body"&gt;&lt;/p&gt;
            &lt;div class="cta module-cta h-c-copy  uni-related-article-tout__cta muted"&gt;
              &lt;span class="nowrap"&gt;Read Article
                &lt;svg class="icon h-c-icon" role="presentation"&gt;
                  &lt;use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#mi-arrow-forward"&gt;&lt;/use&gt;
                &lt;/svg&gt;
              &lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;</description><pubDate>Mon, 09 Aug 2021 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/operations/keep-your-logs-data-compliant-with-regional-log-buckets/</guid><category>GKE</category><category>Compute</category><category>Google Cloud</category><category>DevOps &amp; SRE</category><category>Cloud Operations</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Use log buckets for data governance, now supported in 23 regions</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/operations/keep-your-logs-data-compliant-with-regional-log-buckets/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mary Koes</name><title>Product Manager, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Andrew Eames</name><title>Software Engineer</title><department></department><company></company></author></item></channel></rss>