Monitor Google Kubernetes Engine (GKE) Clusters in Cloud Monitoring Workspace

Triage Incidents Quickly as a Site Reliability Engineer (SRE)

Question

You are monitoring Google Kubernetes Engine (GKE) clusters in a Cloud Monitoring workspace.

As a Site Reliability Engineer (SRE), you need to triage incidents quickly.

What should you do?

Answers

Explanations


A.

https://cloud.google.com/monitoring/charts/dashboards

An SRE triaging an incident needs to see the state of the system immediately, without first building tooling. In this scenario the GKE clusters are already monitored in a Cloud Monitoring workspace, so the question is which approach gets the SRE to the relevant metrics and alerts with the least delay.

Option A: Navigate the predefined dashboards in the Cloud Monitoring workspace, and then add metrics and create alert policies.

This is the correct option. Cloud Monitoring provides predefined dashboards that already show the key GKE metrics, so the SRE can start locating the source of a problem without any setup. The SRE can add further metrics as needed to build a custom dashboard for specific requirements, and can create alert policies in Cloud Monitoring to be notified of potential problems.
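As a rough illustration of "adding a metric", the sketch below builds a dashboard definition with one chart, in the JSON shape used by the Cloud Monitoring Dashboards API. The helper names, dashboard title, and chosen metric are assumptions made for this example and are not part of the question; the code only builds and prints JSON and does not call any Google Cloud API.

```python
import json


def build_chart_widget(title, metric_type, resource_type="k8s_container"):
    """Return one line-chart widget for a single metric (hypothetical helper)."""
    metric_filter = (
        f'metric.type="{metric_type}" AND resource.type="{resource_type}"'
    )
    return {
        "title": title,
        "xyChart": {
            "dataSets": [
                {
                    "plotType": "LINE",
                    "timeSeriesQuery": {
                        "timeSeriesFilter": {
                            "filter": metric_filter,
                            # Cumulative counters are charted as a per-second rate.
                            "aggregation": {
                                "alignmentPeriod": "60s",
                                "perSeriesAligner": "ALIGN_RATE",
                            },
                        }
                    },
                }
            ]
        },
    }


def build_dashboard(display_name, widgets, columns=2):
    """Wrap widgets in a simple grid-layout dashboard definition."""
    return {
        "displayName": display_name,
        "gridLayout": {"columns": columns, "widgets": widgets},
    }


if __name__ == "__main__":
    cpu_chart = build_chart_widget(
        "Container CPU usage",  # assumed chart title
        "kubernetes.io/container/cpu/core_usage_time",
    )
    dashboard = build_dashboard("GKE triage (example)", [cpu_chart])
    print(json.dumps(dashboard, indent=2))
```

The printed JSON is only a definition; it would still have to be submitted through the console, the API, or another tool to become a dashboard in the workspace.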

Option B: Navigate the predefined dashboards in the Cloud Monitoring workspace, create custom metrics, and install alerting software on a Compute Engine instance.

This option is slower than Option A. Custom metrics can provide additional information, but defining and instrumenting them takes time that is not available during an incident. Installing alerting software on a Compute Engine instance is unnecessary, because Cloud Monitoring has built-in alerting, and it would add another system to operate and maintain.
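To make the built-in alerting concrete, here is a minimal sketch of an alert policy definition in the JSON shape of the Cloud Monitoring AlertPolicy resource. The policy name, threshold, and choice of the container restart metric are assumptions for illustration only; the helper is hypothetical and only produces JSON, it does not create anything in a project.

```python
import json


def build_restart_alert_policy(max_restarts=3, window_seconds=300):
    """Return an alert policy dict that fires when any container restarts
    more than `max_restarts` times within one alignment window."""
    condition = {
        "displayName": f"Container restarts > {max_restarts} in {window_seconds}s",
        "conditionThreshold": {
            "filter": (
                'metric.type="kubernetes.io/container/restart_count" '
                'AND resource.type="k8s_container"'
            ),
            # restart_count is cumulative, so take the change per window.
            "aggregations": [
                {
                    "alignmentPeriod": f"{window_seconds}s",
                    "perSeriesAligner": "ALIGN_DELTA",
                }
            ],
            "comparison": "COMPARISON_GT",
            "thresholdValue": max_restarts,
            # "0s" means a single violating point opens an incident.
            "duration": "0s",
        },
    }
    return {
        "displayName": "GKE container restart alert (example)",
        "combiner": "OR",
        "conditions": [condition],
        # Notification channel resource names would be listed here;
        # left empty because none are given in the question.
        "notificationChannels": [],
    }


if __name__ == "__main__":
    print(json.dumps(build_restart_alert_policy(), indent=2))
```

Keeping the policy as data makes it easy to review and version before it is submitted to Cloud Monitoring by whatever means the team prefers.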

Option C: Write a shell script that gathers metrics from GKE nodes, publish these metrics to a Pub/Sub topic, export the data to BigQuery, and make a Data Studio dashboard.

This option is the slowest and most complex. It rebuilds, by hand, a metrics pipeline that Cloud Monitoring already provides for GKE: the script, the Pub/Sub topic, the BigQuery export, and the dashboard all have to be written and maintained. Data Studio dashboards are useful for reporting, but they are not designed for real-time incident triage.

Option D: Create a custom dashboard in the Cloud Monitoring workspace for each incident, and then add metrics and create alert policies.

This option is also too slow. Building a new dashboard for each incident delays triage and is rarely necessary, because the predefined dashboards already exist and metrics can be added to them as needed. Alert policies are part of this option as well, but they do not make up for the time spent building a dashboard during every incident.

In conclusion, the best approach for triaging incidents quickly is to navigate the predefined dashboards in the Cloud Monitoring workspace, add metrics as needed, and create alert policies.