Post-Mortem for Outage: Best Practices in Site Reliability Engineering

Question

You support a service that recently had an outage.

The outage was caused by a new release that exhausted the service's memory resources.

You rolled back the release successfully to mitigate the impact on users.

You are now in charge of the post-mortem for the outage.

You want to follow Site Reliability Engineering practices when developing the post-mortem.

What should you do?

Answers

Explanations

A. B. C. D.

Correct answer: B.

As a Site Reliability Engineer (SRE), your primary goal is to ensure that the systems and services are reliable, scalable, and efficient. When an outage occurs, it is important to conduct a post-mortem to understand the root cause of the problem and identify ways to prevent similar issues from happening in the future.

Option A is not correct because, after an outage, SRE practice puts reliability ahead of new features. Preventing a similar outage from recurring matters more, even if it means delaying new features.

Option B is the correct answer. The primary goal of a post-mortem is to identify the contributing causes of the incident rather than the individual responsible for them. Keeping the review blameless fosters a culture of learning and continuous improvement: the team finds and fixes the systemic issues that allowed the incident to happen instead of punishing individuals.

Option C is not the best approach since it focuses on individual meetings and assigning blame rather than identifying systemic issues. A post-mortem should include all the engineers who were involved in the incident and should focus on understanding the root cause and on ways to prevent similar incidents.

Option D is not the best approach since it focuses on punishing the engineer who made the commit rather than identifying the root cause of the issue. Mistakes happen, and the goal should be to learn from them. Barring an engineer from working on production services does nothing to address the conditions that let the faulty release reach production, so it is not an effective way to prevent similar incidents.

In conclusion, the best way to follow Site Reliability Engineering practices when developing a post-mortem is to focus on identifying the contributing causes of the incident rather than the individual responsible for them. This avoids the blame game and fosters a culture of learning and continuous improvement.