Resizing Persistent Disk for Improved Performance: Reliability of High-Volume Enterprise Application

Question

You are responsible for the reliability of a high-volume enterprise application.

A large number of users report that an important subset of the application's functionality, a data-intensive reporting feature, is consistently failing with an HTTP 500 error.

When you investigate your application's dashboards, you notice a strong correlation between the failures and a metric that represents the size of an internal queue used for generating reports.

You trace the failures to a reporting backend that is experiencing high I/O wait times.

You quickly fix the issue by resizing the backend's persistent disk (PD).

Now you need to create an availability Service Level Indicator (SLI) for the report generation feature.

How would you define it?

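Answers

A. As the I/O wait times aggregated across all report generation backends
B. As the proportion of report generation requests that result in a successful response
C. As the application's report generation queue size compared to a known-good threshold
D. As the reporting backend PD throughput capacity compared to a known-good threshold

Explanations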

B.

To create an availability Service Level Indicator (SLI) for the report generation feature, we need a metric that measures whether the feature is working from the user's point of view. An SLI is a quantitative measure of some aspect of the level of service provided; it is tracked over time and is the basis for service level objectives (SLOs). An availability SLI is usually expressed as a ratio of good events to valid events, which here means the proportion of report generation requests that return a successful response.

In the scenario, users experience the failures as HTTP 500 errors. The queue size, the I/O wait times, and the persistent disk are internal details that correlated with or caused this particular incident. They are useful for diagnosis and alerting, but they do not measure whether a user's request succeeds.

Option A, "As the I/O wait times aggregated across all report generation backends," does not directly measure the availability of the report generation feature. High I/O wait times can cause the feature to fail, but requests can also fail for unrelated reasons, and elevated I/O wait does not necessarily mean that requests are failing.

Option B, "As the proportion of report generation requests that result in a successful response," is the best option. It measures exactly what users experience: every HTTP 500 response counts against the SLI, regardless of the underlying cause. It would have captured this incident, and it will capture future failures that have a different root cause.

Option C, "As the application's report generation queue size compared to a known-good threshold," is not a measure of availability. The queue size correlated with the failures in this incident, which makes it a useful signal of risk, but it is only a proxy. The queue can grow while requests still succeed, and requests can fail while the queue is small.

Option D, "As the reporting backend PD throughput capacity compared to a known-good threshold," does not directly measure the availability of the report generation feature. Resizing the PD resolved this incident, but PD throughput capacity describes the infrastructure, not whether requests succeed.

In conclusion, option B, "As the proportion of report generation requests that result in a successful response," is the best option for defining an availability SLI for the report generation feature.
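As an illustration, the request-based definition in option B can be sketched as a ratio of successful requests to all requests in a measurement window. This is a minimal sketch, not the application's actual monitoring code: the `ReportRequest` type, the `availability_sli` function, and the rule that any status below 500 counts as success are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ReportRequest:
    """Hypothetical record of one report generation request."""
    status_code: int  # HTTP status returned to the user


def availability_sli(requests: Sequence[ReportRequest]) -> Optional[float]:
    """Return the fraction of requests that succeeded.

    Assumption: a response with a status code below 500 counts as
    successful. Returns None when the window contains no requests,
    because the ratio is undefined in that case.
    """
    if not requests:
        return None
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)


# Example window: 97 successful responses and 3 HTTP 500 errors.
window = [ReportRequest(200)] * 97 + [ReportRequest(500)] * 3
print(availability_sli(window))  # prints 0.97
```

Because the ratio is computed from user-visible responses, it does not depend on knowing why a request failed, which is what distinguishes it from the queue size, I/O wait, and PD throughput metrics.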