You support a stateless web-based API that is deployed on a single Compute Engine instance in the europe-west2-a zone.
The Service Level Indicator (SLI) for service availability is below the specified Service Level Objective (SLO).
A postmortem has revealed that requests to the API regularly time out.
The timeouts are due to the API receiving a high number of requests and running out of memory.
You want to improve service availability.
What should you do?
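A. Change the specified SLO to match the measured SLI.
B. Move the service to higher-specification compute instances with more memory.
C. Set up additional service instances in other zones and load balance the traffic between all instances.
D. Set up additional service instances in other zones and use them as a failover in case the primary instance is unavailable.
Correct answer: C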
Based on the postmortem, the cause of the timeouts is that the API is running out of memory due to the high number of requests. To improve service availability, we need to address this issue.
Option A, changing the SLO to match the measured SLI, does not address the root cause of the issue. It would be a quick fix to make it seem like the service is meeting its availability goals, but it does not actually improve the service's availability.
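To make the point about option A concrete, here is a minimal sketch of how an availability SLI might be compared against an SLO. The request counts and the 99.9% target are hypothetical values chosen for illustration, and the function names are not taken from any monitoring library.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI as the fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Hypothetical measurement window: 150 of 10,000 requests timed out.
measured_sli = availability_sli(successful_requests=9_850, total_requests=10_000)
original_slo = 0.999  # assumed target, not stated in the question

print(meets_slo(measured_sli, original_slo))  # SLO missed
print(meets_slo(measured_sli, measured_sli))  # "passes" only because the target was lowered
```

Lowering the target to the measured value makes the check pass, but the same share of requests still time out, which is why option A leaves availability unchanged.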
Option B, moving the service to higher-specification compute instances with more memory, could relieve the immediate symptom. With more memory, the API would be able to handle more requests before running out. However, a larger instance costs more, still has a fixed memory ceiling that further traffic growth can exceed, and leaves the service on a single instance in a single zone.
Option C, setting up additional service instances in other zones and load balancing the traffic between all instances, addresses the root cause of the issue. Because the API is stateless, any instance can serve any request, so distributing requests across multiple instances reduces the load on each one and makes it less likely that any single instance runs out of memory. Running the instances in different zones also keeps the service available if one zone fails.
Option D, setting up additional service instances in other zones and using them as a failover in case the primary instance is unavailable, does not address the root cause of the issue. The standby instances serve no traffic while the primary is up, so the primary still receives every request and can still run out of memory. It is a good option for disaster recovery, but it does not improve the service's availability during normal operations.
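The difference between options C and D can be sketched with a simplified model. It assumes each instance can serve a fixed number of requests per second before exhausting memory; the traffic and capacity figures are hypothetical, and the function names are illustrative rather than part of any Google Cloud API.

```python
def per_instance_load(total_rps: int, instances: int, mode: str) -> list[int]:
    """Requests per second that each instance receives under the given mode."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    if mode == "load_balanced":
        # Traffic is spread as evenly as possible across all instances.
        base, extra = divmod(total_rps, instances)
        return [base + (1 if i < extra else 0) for i in range(instances)]
    if mode == "failover":
        # The primary takes everything; standbys idle until the primary is down.
        return [total_rps] + [0] * (instances - 1)
    raise ValueError(f"unknown mode: {mode}")


def overloaded(loads: list[int], capacity_rps: int) -> list[bool]:
    """Flag each instance whose load exceeds its assumed memory-bound capacity."""
    return [load > capacity_rps for load in loads]


TOTAL_RPS = 900      # hypothetical peak traffic
CAPACITY_RPS = 400   # hypothetical per-instance limit before memory runs out

balanced = per_instance_load(TOTAL_RPS, 3, "load_balanced")
failover = per_instance_load(TOTAL_RPS, 3, "failover")

print(balanced, overloaded(balanced, CAPACITY_RPS))
print(failover, overloaded(failover, CAPACITY_RPS))
```

Under these assumptions, the load-balanced setup keeps every instance below its limit, while the failover setup leaves the primary overloaded exactly as before.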
Therefore, the best option is C, setting up additional service instances in other zones and load balancing the traffic between all instances. This solution addresses the root cause of the issue and provides better availability through distributed instances.