You support a production service that runs on a single Compute Engine instance.
You regularly spend time recreating the service by deleting the crashing instance and creating a new instance based on the relevant image.
You want to reduce the time spent performing manual operations while following Site Reliability Engineering principles.
What should you do?
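A. File a bug with the development team so they can find the root cause of the crashing instance.
B. Create a Managed Instance Group with a single instance and use health checks to determine the system status.
C. Add a Load Balancer in front of the Compute Engine instance and use health checks to determine the system status.
D. Create a Stackdriver Monitoring dashboard with SMS alerts to be able to start recreating the crashed instance promptly after it has crashed.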
The best answer to this question is B. Create a Managed Instance Group with a single instance and use health checks to determine the system status.
Explanation:
Option A - Filing a bug with the development team to find the root cause of the crashing instance is a good practice and worth doing in parallel, but it does not reduce the manual work in the meantime. Until the bug is fixed, every crash still requires someone to delete and recreate the instance by hand.
Option B - Creating a Managed Instance Group (MIG) with a single instance and using health checks to determine the system status is the best solution. With an autohealing policy, the MIG probes the instance with the configured health check and, when the check fails, automatically deletes the instance and recreates it from the instance template. This automates exactly the operation that is currently performed by hand, which matches the Site Reliability Engineering principle of eliminating toil.
Option C - Adding a Load Balancer in front of the Compute Engine instance and using health checks to determine the system status does not solve the problem. A Load Balancer uses health checks only to decide where to route traffic: it stops sending requests to an unhealthy backend, but it does not repair or recreate that backend. With a single instance there is no healthy backend left to route to, and the crashed instance still has to be recreated manually.
Option D - Creating a Stackdriver Monitoring dashboard with SMS alerts shortens the time it takes to notice a crash, but the recovery itself remains manual. Alerting a human to perform a repetitive, automatable task adds to the operational burden rather than reducing it.
In summary, Option B is the best solution for reducing the time spent performing manual operations while following Site Reliability Engineering principles. A Managed Instance Group with a health check detects the failed instance and recreates it automatically, so no manual intervention is needed.
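The health-check-driven recovery discussed above can be illustrated with a small toy model. This is a sketch under stated assumptions, not the Compute Engine API: the class names, the `unhealthy_threshold` parameter, and the `web-template` name are all hypothetical. It only shows the control loop that autohealing in a Managed Instance Group performs on your behalf: probe the instance, count consecutive failures, and recreate the instance from its template once a threshold is reached.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable


@dataclass
class Instance:
    """Hypothetical stand-in for a VM; not a real Compute Engine object."""
    name: str
    healthy: bool = True


class SingleInstanceGroup:
    """Toy model of a one-instance group with health-check-based autohealing."""

    def __init__(self, template: str, unhealthy_threshold: int = 3) -> None:
        self.template = template                    # hypothetical template name
        self.unhealthy_threshold = unhealthy_threshold
        self._ids = count(1)                        # suffix for new instance names
        self.failures = 0                           # consecutive failed probes
        self.recreations = 0                        # how often autohealing fired
        self.instance = self._create()

    def _create(self) -> Instance:
        # Stands in for "create a new instance from the template".
        return Instance(name=f"{self.template}-{next(self._ids)}")

    def probe(self, check: Callable[[Instance], bool]) -> None:
        """Run one health check and recreate the instance if it keeps failing."""
        if check(self.instance):
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.unhealthy_threshold:
            # The step that is otherwise done by hand: delete and recreate.
            self.instance = self._create()
            self.failures = 0
            self.recreations += 1


def is_healthy(instance: Instance) -> bool:
    """Hypothetical health check; a real one would probe an HTTP or TCP port."""
    return instance.healthy


if __name__ == "__main__":
    group = SingleInstanceGroup("web-template", unhealthy_threshold=2)
    group.instance.healthy = False                  # simulate a crash
    for _ in range(2):
        group.probe(is_healthy)
    print(group.instance.name, group.recreations)
```

A load balancer's health check would only implement the detection half of `probe`; the recreate step is what distinguishes autohealing from traffic routing.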