You encounter a large number of outages in the production systems you support.
You receive alerts for all the outages that wake you up at night.
The alerts are due to unhealthy systems that are automatically restarted within a minute.
You want to set up a process that would prevent staff burnout while following Site Reliability Engineering practices.
What should you do?
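A. Eliminate unactionable alerts.
B. Create an incident report for each of the alerts.
C. Distribute the alerts to engineers in different time zones.
D. Redefine the related Service Level Objective (SLO) so that the error budget is not exhausted.
Correct answer: A.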
The situation described in the question is a common one for operations teams. Alerts fire at night for systems that are restarted automatically within a minute, so by the time an engineer responds there is nothing left to do. The goal is a process that follows Site Reliability Engineering (SRE) practices and prevents staff burnout.
Here are the details of the answer options:
A. Eliminate unactionable alerts: An unactionable alert is one that requires no human intervention. Here the unhealthy systems are restarted automatically within a minute, so paging an engineer adds no value and only interrupts sleep. SRE practice holds that a page should demand an immediate human response; anything else belongs in a log or on a dashboard. Removing these alerts cuts noise and lets the team focus on the alerts that matter. The team still needs to review which alerts are truly unactionable, so that removing them does not hide real reliability problems.
B. Create an incident report for each of the alerts: Incident reports document the root cause, the impact, and the steps taken to resolve an incident, and they help the team learn from past failures. Writing one for every alert, however, adds work on top of the interrupted nights. With many alerts it increases the load on staff rather than reducing it, so it does not address burnout.
C. Distribute the alerts to engineers in different time zones: Spreading on-call duty across time zones means nobody is paged at night, but it requires enough engineers in different time zones, which a small team may not have. The same unactionable alerts would also still interrupt whoever is on call, so this option moves the noise instead of removing it.
D. Redefine the related Service Level Objective (SLO) so that the error budget is not exhausted: The error budget is the amount of unreliability, such as downtime, that an SLO permits within a given period of time. Loosening the SLO enlarges the budget on paper, but it does not make the system more resilient and does not stop the alerts from paging. An SLO should reflect the reliability that users need, not be adjusted to hide outages, so this option does not follow SRE practice.
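The error budget mentioned in option D follows from simple arithmetic, sketched below. The function names, the 30-day window, and the assumption that each automatic restart costs about one minute of downtime are illustrative choices, not values given in the question.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO permits over the window.

    slo is a fraction, e.g. 0.999 for "99.9% available".
    The 30-day default window is an assumption for illustration.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def restarts_within_budget(slo: float,
                           minutes_per_restart: float = 1.0,
                           window_days: int = 30) -> int:
    """How many restarts fit in the budget, assuming each one costs
    minutes_per_restart of downtime (hypothetical figure)."""
    return int(error_budget_minutes(slo, window_days) // minutes_per_restart)


if __name__ == "__main__":
    for slo in (0.999, 0.99):
        budget = error_budget_minutes(slo)
        print(f"SLO {slo:.3%}: about {budget:.1f} minutes of budget, "
              f"{restarts_within_budget(slo)} one-minute restarts")
```

Loosening the SLO from 99.9% to 99% makes the budget roughly ten times larger, but the number of restarts and the number of pages stay exactly the same, which is why option D does nothing for burnout.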
In summary, the best answer to this question is A. Because the affected systems recover automatically within a minute, the alerts require no human action, and eliminating them reduces noise and prevents staff burnout without harming reliability. Creating an incident report for each alert (B) adds work, distributing the alerts to engineers in different time zones (C) only moves the noise to other people, and redefining the SLO (D) hides the problem instead of addressing it.
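The idea behind eliminating unactionable alerts can be sketched as a small routing rule: page a human only when automation has not already handled the problem. This is a minimal illustration, not a real monitoring API; the `Alert` class, the `should_page` function, and the five-minute threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta

# Assumed threshold: an unhealthy condition that outlasts this
# is treated as needing a human, even if automation is in place.
PAGE_AFTER = timedelta(minutes=5)


@dataclass
class Alert:
    name: str
    duration: timedelta    # how long the unhealthy condition has lasted
    auto_remediated: bool  # True if automation (e.g. a restart) handles it


def should_page(alert: Alert) -> bool:
    """Return True only for alerts that need a human response."""
    if alert.auto_remediated and alert.duration < PAGE_AFTER:
        # Automation fixed it quickly: record it, but do not wake anyone.
        return False
    return True


if __name__ == "__main__":
    restart = Alert("instance-unhealthy", timedelta(seconds=45), True)
    stuck = Alert("instance-unhealthy", timedelta(minutes=12), True)
    print(should_page(restart), should_page(stuck))
```

In the scenario from the question, every alert looks like `restart` above: the system recovers within a minute on its own, so none of them would page. A condition that persists despite automation, like `stuck`, still reaches the on-call engineer.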