Supporting Reliable User-Facing Web Applications

Balancing Velocity and Reliability for Improved SLO

Question

You support a user-facing web application.

When analyzing the application's error budget over the previous six months, you notice that the application has never consumed more than 5% of its error budget in any given time window.

You hold a Service Level Objective (SLO) review with business stakeholders and confirm that the SLO is set appropriately.

You want your application's SLO to more closely reflect its observed reliability.

What steps can you take to further that goal while balancing velocity, reliability, and business needs? (Choose two.)

Answers

Explanations


A. B. C. D. E.

AD.

The goal of this question is to determine the steps that can be taken to more closely align a web application's Service Level Objective (SLO) with its observed reliability while balancing velocity, reliability, and business needs.

First, it helps to define the term. An SLO is a measurable target for the level of service that a system should provide, usually expressed as a percentage over a time window. It sets expectations for reliability and availability and is agreed upon with business stakeholders. An SLO is not the same as a Service Level Agreement (SLA), which is a contractual agreement between a provider and a customer that defines the minimum level of service the provider will deliver.

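In this scenario, the application has never consumed more than 5% of its error budget in any time window over the previous six months. An error budget is the amount of unreliability that an SLO permits over a window, that is, 100% minus the SLO target. Consuming so little of the budget means the application has never come close to violating its SLO and has consistently been more reliable than the SLO requires. The stakeholder review confirmed that the current target is appropriate for the business, but a gap remains between that target and the reliability actually observed.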
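To make the error-budget arithmetic concrete, here is a minimal sketch in Python. The 99.9% target, the request counts, and the function name `error_budget_consumed` are hypothetical values chosen for illustration; none of them come from the question.

```python
def error_budget_consumed(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget consumed in one window.

    slo_target is a fraction such as 0.999. The error budget is the number
    of bad events the SLO tolerates: (1 - slo_target) * total_events.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events
    return bad_events / allowed_bad


# Hypothetical window: 1,000,000 requests, 50 failures, 99.9% SLO (all assumed).
consumed = error_budget_consumed(0.999, 1_000_000, 50)
print(f"Error budget consumed: {consumed:.1%}")
```

With these assumed numbers the SLO tolerates about 1,000 failed requests, so 50 failures use roughly 5% of the budget, which mirrors the situation described in the question.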

The question asks how to align the SLO more closely with the application's observed reliability. Two steps further that goal while balancing velocity, reliability, and business needs:

  1. Implement and measure additional Service Level Indicators (SLIs) for the application: SLIs are metrics that quantify the level of service a system provides, and an SLO is a target set on an SLI. Implementing and measuring additional SLIs gives a more comprehensive picture of the application's behavior and reliability, which can then be used to refine the SLO. For example, additional SLIs could measure the response time of certain critical transactions, or the proportion of requests that are served successfully.

  2. Tighten the SLO to match the application's observed reliability: If the application's observed reliability is consistently higher than the current SLO, it may be appropriate to tighten the SLO so that it more closely matches what is observed. This must be balanced against the needs of the business and the impact on velocity. A tighter SLO leaves a smaller error budget, so tightening it too far raises the risk of missing the SLO and reduces the room for releases and other changes. It is important to find the right balance.
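The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed implementation: the latency samples, the 300 ms threshold, the observed availability of 99.995%, the candidate targets, and the helper names `latency_sli` and `budget_consumed` are all hypothetical.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests served at or under the latency threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)


def budget_consumed(observed_sli: float, slo_target: float) -> float:
    """Fraction of the error budget used when observed_sli is held to slo_target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - observed_sli) / (1.0 - slo_target)


# Step 1: an additional SLI, here a latency ratio over assumed samples.
sample_latencies_ms = [120, 80, 450, 95, 300]
print(f"Latency SLI: {latency_sli(sample_latencies_ms, 300):.0%}")

# Step 2: see how much budget the same (assumed) observed availability
# would consume under progressively tighter candidate SLO targets.
observed_availability = 0.99995
for candidate in (0.999, 0.9995, 0.9999):
    used = budget_consumed(observed_availability, candidate)
    print(f"SLO {candidate:.2%}: {used:.0%} of error budget consumed")
```

Under these assumed numbers, the same observed availability consumes about 5% of the budget at a 99.9% target, about 10% at 99.95%, and about 50% at 99.99%. A target set exactly at the observed value would leave no budget at all, which is why tightening has to be weighed against release velocity.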

In summary, to more closely align a web application's SLO with its observed reliability, it is recommended to implement and measure additional SLIs for the application, and potentially tighten the SLO to match the observed reliability, while balancing the needs of the business and the impact on velocity.