Ensuring Resilient System Performance for Increased User Load

Developing a Resiliency Testing Strategy for Maintaining SLA

Question

Your company's user-feedback portal comprises a standard LAMP stack replicated across two zones.

It is deployed in the us-central1 region and uses autoscaled managed instance groups on all layers, except the database.

Currently, only a small group of select customers have access to the portal.

The portal meets a 99.99% availability SLA under these conditions.

However, next quarter your company will make the portal available to all users, including unauthenticated users.

You need to develop a resiliency testing strategy to ensure the system maintains the SLA once the additional user load is introduced.

What should you do?

Answers

Explanations


A. B. C. D.

B.

The correct answer for this scenario would be D: Capture existing users' input, and replay the captured user load until resource utilization crosses 80%. Also, derive the estimated number of users based on existing users' usage of the app, and deploy enough resources to handle 200% of the expected load.

Explanation: The scenario mentions that the current portal meets a 99.99% availability SLA under the conditions of a small group of select customers having access to it. The challenge is to develop a resiliency testing strategy to ensure that the system maintains the SLA once it is made available to all users, including unauthenticated users.
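To make the 99.99% figure concrete, the sketch below converts an availability percentage into a downtime budget. The function name and the choice of a 30-day month and a 365-day year are illustrative assumptions, not part of the original question.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the maximum downtime, in minutes, that the SLA allows over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Assumed reporting periods: a 30-day month and a 365-day year.
monthly = downtime_budget_minutes(99.99, 30)
yearly = downtime_budget_minutes(99.99, 365)

print(f"Allowed downtime per 30-day month: {monthly:.2f} minutes")
print(f"Allowed downtime per 365-day year: {yearly:.2f} minutes")
```

At 99.99%, the budget is a little over four minutes per month, so any resiliency test has to show that the system recovers from failures within that window.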

Option A suggests capturing the existing user input and replaying the captured user load until autoscale is triggered on all layers. At the same time, resources in one of the zones are terminated. This strategy is not optimal because it does not take into account the increase in user load, and terminating resources in one zone would lead to a partial outage.

Option B suggests creating synthetic random user input and replaying synthetic load until autoscale logic is triggered on at least one layer. At the same time, chaos is introduced to the system by terminating random resources on both zones. This strategy is not ideal as it does not replicate the actual user load and is based on synthetic data. Additionally, introducing chaos could lead to unpredictable behavior, making it difficult to draw meaningful conclusions from the testing.

Option C suggests exposing the new system to a larger group of users and increasing the group size each day until autoscale logic is triggered on all layers. At the same time, random resources on both zones are terminated. This strategy is not optimal as it does not take into account the expected increase in user load, and the approach of terminating resources randomly can lead to unpredictable results.

Option D suggests capturing existing users' input and replaying the captured user load until resource utilization crosses 80%. This approach tests the system with actual user data and increases the load gradually until it reaches a predetermined threshold. It also suggests deriving an estimated number of users from existing users' usage of the app, which gives a basis for estimating the expected increase in user load. Finally, the recommendation to deploy enough resources to handle 200% of the expected load provides headroom for the system to handle the anticipated load.
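The sizing arithmetic described in Option D can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name, the sample numbers, and the assumption that load scales linearly with the number of users are all hypothetical. Unauthenticated users in particular may behave differently from the current select customers.

```python
import math


def instances_needed(
    measured_users: int,
    measured_peak_rps: float,
    expected_users: int,
    rps_per_instance_at_80pct: float,
    headroom: float = 2.0,
) -> int:
    """Estimate the instance count for `headroom` times the expected load.

    Assumes request rate scales linearly with the number of users.
    """
    # Extrapolate the peak request rate from current users to expected users.
    expected_rps = measured_peak_rps * expected_users / measured_users
    # Provision for 200% of the expected load by default.
    provisioned_rps = expected_rps * headroom
    # Each instance is assumed to serve this rate when it reaches 80% utilization.
    return math.ceil(provisioned_rps / rps_per_instance_at_80pct)


# Hypothetical inputs: 500 current users peak at 10 requests/s, 50,000 users
# are expected, and replay tests showed one instance reaches 80% at 120 requests/s.
count = instances_needed(500, 10.0, 50_000, 120.0)
print(f"Instances to provision: {count}")
```

The per-instance rate at 80% utilization would come from the replay test itself; the other inputs come from usage data on the existing user group.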

Therefore, Option D is the best resiliency testing strategy in this scenario as it ensures that the system is tested with actual user data, takes into account the expected increase in user load, and ensures that the system is adequately scaled to handle the anticipated load.