Implementing SRE Culture and Principles: Explaining a Service Outage

What Happened During the Recent Service Outage?

Question

Your organization wants to implement Site Reliability Engineering (SRE) culture and principles.

Recently, a service that you support had a limited outage.

A manager on another team asks you to provide a formal explanation of what happened so they can action remediations.

What should you do?

Answers

Explanations


A. B. C. D.

Correct answer: B.

In a Site Reliability Engineering (SRE) culture, postmortems are an essential practice to prevent future outages and to continuously improve the reliability of the system. A postmortem is a document that describes what happened during the outage, its root causes, the actions taken to mitigate the problem, and the lessons learned. It also includes a list of action items that should be prioritized based on their impact on the system and the risk of recurrence.
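As a rough illustration of the structure described above, a postmortem can be modeled as a small record whose action items are ordered by priority. This is only a sketch: the class names, the 1-5 scales, and the scoring rule (impact multiplied by recurrence risk) are assumptions made for the example, not part of any SRE standard, and the sample outage is hypothetical. Note that the record describes the system and the follow-up work, not the people involved.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    impact: int           # assumed scale: 1 (low) to 5 (high)
    recurrence_risk: int  # assumed scale: 1 (unlikely) to 5 (likely)

    @property
    def priority(self) -> int:
        # Assumed scoring rule for illustration: higher means more urgent.
        return self.impact * self.recurrence_risk


@dataclass
class Postmortem:
    summary: str
    root_causes: list[str]
    resolution: str
    lessons_learned: list[str]
    action_items: list[ActionItem] = field(default_factory=list)

    def prioritized_action_items(self) -> list[ActionItem]:
        # Highest-priority items first.
        return sorted(self.action_items, key=lambda a: a.priority, reverse=True)


# Hypothetical example data.
pm = Postmortem(
    summary="Limited outage of an example service",
    root_causes=["Configuration change rolled out without a canary stage"],
    resolution="Rolled back the configuration change",
    lessons_learned=["Rollbacks were fast", "Alerting fired too late"],
    action_items=[
        ActionItem("Update the on-call runbook", impact=2, recurrence_risk=3),
        ActionItem("Add alerting on error rate", impact=4, recurrence_risk=5),
        ActionItem("Add a canary stage to rollouts", impact=5, recurrence_risk=3),
    ],
)

ordered = pm.prioritized_action_items()
for item in ordered:
    print(item.priority, item.description)
```

Keeping the priority as a derived property rather than a stored field means the ordering always reflects the current impact and risk estimates.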

In this scenario, a limited outage occurred, and a manager from another team requested a formal explanation of what happened. To address this request, the best approach is to develop a postmortem that includes the root causes, resolution, lessons learned, and a prioritized list of action items.

Option A, which suggests sharing the postmortem with the manager only, is not the best approach. In an SRE culture, postmortems are shared widely to ensure that everyone involved in the system understands what happened and how to prevent similar incidents in the future. It is essential to share the postmortem with the engineering organization, and even stakeholders outside the organization may benefit from reading it.

Option B, which suggests sharing the postmortem on the engineering organization's document portal, is the best approach. It lets anyone in the engineering organization read the postmortem and learn from it, which aligns with the SRE principles of transparency and knowledge sharing.

Options C and D, which suggest listing the people responsible for the outage and assigning action items to each person, are not recommended. SRE postmortems are blameless: blaming individuals creates a culture of fear in which people hide problems instead of reporting them. The focus should instead be on identifying weaknesses in the system and finding ways to fix them.

In summary, the best approach in this scenario is to develop a postmortem that includes the root causes, resolution, lessons learned, and a prioritized list of action items and share it on the engineering organization's document portal.