Minimizing Negative Impact of Canary Release on Users

Best Approach to Handle Spike in 500 Errors and Increased Latency

Question

You are running an experiment to see whether your users like a new feature of a web application.

Shortly after deploying the feature as a canary release, you receive a spike in the number of 500 errors sent to users, and your monitoring reports show increased latency.

You want to quickly minimize the negative impact on users.

What should you do first?

Answers

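A. Roll back the experimental canary release.
B. Start monitoring latency, traffic, errors, and saturation.
C. Record data for the postmortem document of the incident.
D. Trace the origin of 500 errors and the root cause of increased latency.

Correct answer: A

Explanations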

https://cloud.google.com/solutions/automated-canary-analysis-kubernetes-engine-spinnaker

Deploying a new feature as a canary release means that it is exposed to only a small subset of users. The purpose of a canary release is to test a change against a limited share of real traffic before releasing it to all users. In this scenario, the number of 500 errors spiked and latency increased shortly after the canary release, which points to the canary as the likely cause.
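To make the "small subset of users" idea concrete, here is a minimal sketch of percentage-based canary routing. It is an illustration only: the function names (bucket, serves_canary) and the hashing scheme are assumptions for this example, not how any particular load balancer or the referenced Spinnaker setup does it.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def serves_canary(user_id: str, canary_percent: int) -> bool:
    """Return True if this user falls inside the canary share (hypothetical helper)."""
    return bucket(user_id) < canary_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(serves_canary(u, 5) for u in users) / len(users)
    print(f"canary share at 5%: {share:.1%}")
    # With the share set to 0, no user is routed to the canary.
    print(any(serves_canary(u, 0) for u in users))
```

Hashing the user ID keeps each user on the same version between requests. In this sketch, rolling back amounts to setting the canary share to 0, which sends every user back to the stable version.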

In such situations, the first priority is to limit the impact on users, which means acting quickly to stop the errors and the added latency. Of the options provided, the best first step is to roll back the experimental canary release (option A).

Rolling back the canary release returns the affected users to the previous version of the application, which was working before the canary was deployed. This removes the new feature as a source of errors and latency, and it is the quickest and most effective way to minimize the negative impact on users.
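The rollback decision described above can be sketched as a comparison between the stable baseline and the canary. This is a simplified illustration: the Snapshot type, the should_roll_back function, and both thresholds are hypothetical values chosen for the example, not figures from the question or from any specific canary-analysis tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Metrics for one version over the same time window (hypothetical type)."""
    requests: int
    errors_5xx: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors_5xx / self.requests if self.requests else 0.0


def should_roll_back(
    baseline: Snapshot,
    canary: Snapshot,
    max_error_rate_delta: float = 0.01,  # assumed threshold: +1 percentage point
    max_latency_ratio: float = 1.5,      # assumed threshold: 1.5x baseline p95
) -> bool:
    """Return True if the canary is clearly worse than the baseline."""
    if canary.requests == 0:
        return False  # no canary traffic yet, nothing to judge
    error_regression = (
        canary.error_rate - baseline.error_rate > max_error_rate_delta
    )
    latency_regression = (
        baseline.p95_latency_ms > 0
        and canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio
    )
    return error_regression or latency_regression


if __name__ == "__main__":
    baseline = Snapshot(requests=10_000, errors_5xx=5, p95_latency_ms=120.0)
    failing = Snapshot(requests=500, errors_5xx=40, p95_latency_ms=310.0)
    healthy = Snapshot(requests=500, errors_5xx=0, p95_latency_ms=125.0)
    print(should_roll_back(baseline, failing))  # True
    print(should_roll_back(baseline, healthy))  # False
```

The check deliberately answers only "is the canary worse?" and not "why?". That mirrors the order of operations in the answer: roll back first, then investigate the cause.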

Option B, starting to monitor latency, traffic, errors, and saturation, is important, but it does nothing by itself to reduce the impact on users. Monitoring these metrics helps you detect problems and identify their root cause, and in this scenario your monitoring already reports the errors and the increased latency. As long as the canary release stays active, it continues to cause issues no matter how closely you watch the metrics.

Option C, recording data for the postmortem document of the incident, is also important, but it should be done after you have resolved the issue. The purpose of the postmortem is to analyze the incident, identify what went wrong, and determine how to prevent it in the future. It should not be your first priority while users are still affected.

Option D, tracing the origin of the 500 errors and the root cause of the increased latency, is also important, but it should be done after rolling back the canary release. The new feature is the most likely cause, and while it stays active, users keep receiving errors for as long as the investigation takes. Once you have rolled back the canary release, you can investigate the root cause without ongoing user impact.