Supporting Service Reliability and SLO Improvement | PCDE Exam Question Answer | Google

How to Shift Development Team's Focus to Improve Service Reliability

Question

You support a large service with a well-defined Service Level Objective (SLO).

The development team deploys new releases of the service multiple times a week.

If a major incident causes the service to miss its SLO, you want the development team to shift its focus from working on features to improving service reliability.

What should you do before a major incident occurs?

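Answers

A. Develop an appropriate error budget policy in cooperation with all service stakeholders.

B. Negotiate with the product team to always prioritize service reliability over releasing new features.

C. Negotiate with the development team to reduce the release frequency to no more than once a week.

D. Add a plugin to your Jenkins pipeline that prevents new releases whenever your service is out of SLO.

Explanations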

The correct answer to this question is A. Develop an appropriate error budget policy in cooperation with all service stakeholders.

An error budget is a tool that lets teams balance feature development against service reliability. It is the amount of unreliability the SLO tolerates over a measurement window: a service with a 99.9% availability SLO, for example, has an error budget of 0.1%. While budget remains, the team can keep shipping features; once it is spent, the service has missed its SLO. The error budget policy, which states what happens at that point, should be developed in cooperation with all service stakeholders, including the development team, product team, and operations team.
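The arithmetic behind an error budget can be sketched roughly as follows. This is a minimal illustration that assumes a request-based availability SLO. The function names and the example figures (a 99.9% target over one million requests) are hypothetical and do not come from the question.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates over the window."""
    # Assumption: the SLO is expressed as a fraction of good requests, e.g. 0.999.
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    budget = error_budget(slo_target, total_requests)
    if budget <= 0:
        raise ValueError("this SLO target leaves no error budget for the window")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% SLO.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

A negative result from `budget_remaining` means the service has failed more requests than the SLO allows, which is the situation the question describes.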

By agreeing on an error budget policy before any incident, the stakeholders decide in advance how effort is split between feature development and reliability work. If a major incident then causes the service to miss its SLO, the policy already commits the development team to shift its focus from features to reliability until the service is back within its SLO, so nothing has to be negotiated in the middle of the incident.
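One way to picture such a policy is as a rule written down ahead of time and then applied mechanically. The sketch below is only an illustration: the `ErrorBudgetPolicy` class, its `freeze_below` threshold, and the `team_focus` function are hypothetical names, and a real policy is a document agreed between people rather than a piece of code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgetPolicy:
    # Hypothetical threshold agreed in advance by all stakeholders:
    # the fraction of remaining budget at or below which the focus shifts.
    freeze_below: float = 0.0


def team_focus(policy: ErrorBudgetPolicy, remaining_fraction: float) -> str:
    """Return the focus the pre-agreed policy prescribes for the development team."""
    if remaining_fraction <= policy.freeze_below:
        return "reliability"
    return "features"


if __name__ == "__main__":
    policy = ErrorBudgetPolicy()
    print(team_focus(policy, 0.4))   # budget left
    print(team_focus(policy, -0.5))  # budget overspent, SLO missed
```

Because the threshold is fixed before any incident, the outcome after an SLO miss does not depend on who argues most persuasively at the time.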

Option B, Negotiate with the product team to always prioritize service reliability over releasing new features, is not the best solution because it is unlikely that the product team will agree to always prioritize service reliability over releasing new features. The product team has to balance business objectives and customer needs, and sometimes, releasing new features is necessary to stay competitive.

Option C, Negotiate with the development team to reduce the release frequency to no more than once a week, is also not the best solution because it does not address the root cause of the problem. A cap on release frequency slows feature delivery all the time, even while the service is comfortably meeting its SLO, and it says nothing about what the team should do after the SLO is missed. Fewer releases may reduce the number of incidents for a while, but they do not shift the team's focus to reliability.

Option D, Add a plugin to your Jenkins pipeline that prevents new releases whenever your service is out of SLO, is also not the best solution because it is too restrictive. A blanket block stops every release while the service is out of SLO, including the fixes needed to restore reliability, and it does nothing to redirect the development team's work. Because it is imposed by tooling rather than agreed with stakeholders, it may also create friction between the development team and the operations team, as the development team may feel that its ability to release is being unfairly restricted.