Incident Summary: Best Practices for Site Reliability Engineering


Question

You experienced a major service outage that affected all users of the service for several hours.

After several hours of incident management, the service returned to normal, and user access was restored.

You need to provide an incident summary to relevant stakeholders, following Site Reliability Engineering (SRE) recommended practices.

What should you do first?

Answers

Explanations


A. B. C. D.

A.

After a major service outage, Site Reliability Engineering (SRE) recommended practices call for a written incident summary for relevant stakeholders. The summary gives stakeholders a transparent account of what happened and helps prevent future occurrences.

The first step in providing an incident summary is to gather as much information as possible about the incident. This includes identifying the root cause of the incident, the actions taken to address it, and the impact it had on users. This information can be gathered from incident reports, system logs, and incident response team debriefs.

Once this information has been gathered, the next step is to develop a post-mortem report that summarizes the incident and its impact. This report should include the following:

  1. Summary of the incident: A brief description of what happened, when it occurred, and the impact it had on users.

  2. Root cause analysis: A detailed analysis of the root cause of the incident, including any contributing factors that may have led to the incident.

  3. Timeline of events: A timeline of the incident, including when it was detected, when it was escalated, and when it was resolved.

  4. Impact analysis: An analysis of the impact the incident had on users, including the number of users affected and the length of the outage.

  5. Lessons learned: A list of lessons learned from the incident, including any changes that need to be made to prevent future occurrences.
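
The five sections above can be sketched as a small data structure. This is only an illustrative sketch, not a format prescribed by SRE: the class names, field names, and the plain-text layout are all assumptions, and the values in the example are placeholders rather than data from a real incident. It assumes the outage duration can be approximated as the span between the first and last timeline events.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class TimelineEvent:
    """One entry in the incident timeline (hypothetical structure)."""
    at: datetime
    description: str


@dataclass
class Postmortem:
    """Hypothetical container mirroring the five report sections."""
    summary: str
    root_cause: str
    users_affected: int
    timeline: List[TimelineEvent] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)

    def outage_duration(self) -> timedelta:
        """Span between the earliest and latest timeline events."""
        if not self.timeline:
            return timedelta(0)
        times = [event.at for event in self.timeline]
        return max(times) - min(times)

    def render(self) -> str:
        """Return the report as plain text, one heading per section."""
        lines = [
            "1. Summary of the incident", self.summary, "",
            "2. Root cause analysis", self.root_cause, "",
            "3. Timeline of events",
        ]
        # Events are printed in chronological order regardless of input order.
        for event in sorted(self.timeline, key=lambda e: e.at):
            lines.append(f"{event.at:%Y-%m-%d %H:%M} {event.description}")
        lines += [
            "", "4. Impact analysis",
            f"Users affected: {self.users_affected}",
            f"Outage duration: {self.outage_duration()}",
            "", "5. Lessons learned",
        ]
        lines += [f"- {lesson}" for lesson in self.lessons_learned]
        return "\n".join(lines)


# Placeholder values for illustration only.
report = Postmortem(
    summary="All users lost access to the service.",
    root_cause="Placeholder: to be filled in from the root cause analysis.",
    users_affected=1000,
    timeline=[
        TimelineEvent(datetime(2024, 1, 1, 9, 0), "Outage detected"),
        TimelineEvent(datetime(2024, 1, 1, 9, 20), "Incident escalated"),
        TimelineEvent(datetime(2024, 1, 1, 12, 30), "Service restored"),
    ],
    lessons_learned=["Placeholder: follow-up action to prevent recurrence."],
)
print(report.render())
```

Keeping the timeline as structured events rather than free text means the outage length in the impact analysis is derived from the same data as the timeline, so the two sections cannot drift apart.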

Once the post-mortem report has been developed, it should be distributed to all relevant stakeholders. This may include senior management, the incident response team, and any other teams that were involved in the incident.

It is not necessary to call individual stakeholders to explain what happened: this is time-consuming and is not an effective way to communicate the incident consistently to everyone. Instead, the post-mortem report should be distributed to all stakeholders, along with any other relevant documentation, such as the Incident State Document.

Requiring the engineer responsible to write an apology email to all stakeholders is also not appropriate. SRE post-mortems are blameless, so the focus should be on identifying the root cause of the incident and preventing future occurrences, not on assigning fault to an individual. However, it may be appropriate for the engineer to provide input into the post-mortem report and to participate in any follow-up actions that are required.