Auto Scaling Group Overscaling - Fixing the Issue

Dealing with Auto Scaling Group Overscaling

Question

Your application's Auto Scaling Group scales up too much and stays scaled up even when traffic decreases.

What should you do to fix this?

Answers

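A. Set a longer cooldown period on the Group, so the system stops overshooting the target capacity.
B. Calculate the bottleneck or constraint on the compute layer and select that as the new metric.
C. Raise the CloudWatch Alarms threshold associated with the Auto Scaling Group.
D. Use larger instances instead of lots of smaller ones to reduce the OS overhead.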

Answer - B.

The main problem is that the scaling activities are not driven by the right metric.

Option A is not valid because the metric is still wrong: even with a longer cooldown period, the ASG may keep scaling up while traffic is decreasing.

Option C is not valid because raising the CloudWatch alarm threshold will not ensure that the instances scale down when traffic decreases.

Option D is not valid because the question does not mention any constraints that point to the instance size.

For a use case that scales in and out on custom metrics, see the link below.

https://blog.powerupcloud.com/aws-autoscaling-based-on-database-query-custom-metrics-f396c16e5e6a
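
As a rough illustration of scaling on a custom metric, the sketch below builds the parameters for a pair of CloudWatch alarms, one to scale out and one to scale in. It only constructs dictionaries; the intent is that each could be passed as keyword arguments to boto3's CloudWatch put_metric_alarm call. The group name, namespace, metric name, dimension, thresholds, and policy ARNs are all hypothetical placeholders, not values from the question or the linked post.

```python
def build_scaling_alarms(group_name, namespace, metric_name,
                         scale_out_above, scale_in_below,
                         scale_out_policy_arn, scale_in_policy_arn):
    """Return (scale_out, scale_in) alarm parameter dicts for one custom metric.

    All names and numbers are supplied by the caller; nothing here is an
    AWS default. The dimension name is an assumption about how the metric
    was published.
    """
    if scale_in_below >= scale_out_above:
        raise ValueError("scale-in threshold must sit below the scale-out threshold")

    # Fields shared by both alarms.
    common = {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": group_name}],
        "Statistic": "Average",
        "Period": 60,            # seconds per data point (illustrative)
        "EvaluationPeriods": 3,  # consecutive breaching points (illustrative)
    }

    # Alarm that triggers the scale-out policy when the metric runs high.
    scale_out = dict(
        common,
        AlarmName=f"{group_name}-{metric_name}-high",
        ComparisonOperator="GreaterThanThreshold",
        Threshold=scale_out_above,
        AlarmActions=[scale_out_policy_arn],
    )

    # Alarm that triggers the scale-in policy when the metric drops again.
    scale_in = dict(
        common,
        AlarmName=f"{group_name}-{metric_name}-low",
        ComparisonOperator="LessThanThreshold",
        Threshold=scale_in_below,
        AlarmActions=[scale_in_policy_arn],
    )
    return scale_out, scale_in
```

The pairing matters for this question: a group only returns to its smaller size if some alarm fires on a metric that actually falls when traffic falls.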

The best answer to the problem of an Auto Scaling Group scaling up too much and staying scaled up even when traffic decreases is B. Calculate the bottleneck or constraint on the compute layer and select that as the new metric.

Explanation: An Auto Scaling Group scales up and down according to policies that use CloudWatch metrics as triggers. The group maintains the desired number of instances and adjusts that number as load changes, balancing the need for sufficient capacity against the need to avoid unnecessary cost and excessive oscillation.

The cooldown period is the time the Auto Scaling Group waits after a scaling activity before it starts another one. It gives the previous activity time to take effect, which helps prevent over-provisioning and under-provisioning. The default cooldown period is 300 seconds (5 minutes), and you can lengthen or shorten it to suit your application's characteristics.

In this case, a short cooldown could explain part of the overshoot: if the system measures aggregate load again before new instances have begun servicing requests, it triggers another scaling event too early. It does not explain why the Group stays scaled up when traffic decreases. That symptom points to a metric that does not fall when load falls, so the scale-in condition is never met.

Therefore, to fix this, identify the resource that actually constrains the compute layer and use it as the scaling metric. The Group then scales up only when the real bottleneck is under pressure and scales down when that pressure eases.

Option A is incorrect because it suggests setting a longer cooldown period. While this slows the rate of scaling, the metric is still wrong, so the Group can keep scaling up and still does not scale down when traffic decreases.

Option C is incorrect because it suggests raising the CloudWatch Alarms threshold associated with the Auto Scaling Group. While this may delay scaling events, it does not ensure that the Group scales down when traffic decreases.

Option D is incorrect because it suggests using larger instances instead of lots of smaller ones to reduce the OS overhead. The question mentions no constraint that points to instance size, and larger instances do not change the metric that drives scaling.
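
The difference between a wrong metric and a short cooldown can be seen in a toy simulation. The sketch below is not a model of the real Auto Scaling service: it assumes one scaling decision per time step, a hypothetical per-instance capacity of 100 requests per step, illustrative thresholds of 70 and 30, and a made-up "sticky" metric that holds the highest utilisation it has ever seen (standing in for any metric that does not fall when traffic falls).

```python
CAPACITY = 100.0  # requests per step one instance can serve (assumed)


def tracking_metric(load, size):
    """Utilisation percent; rises and falls with traffic."""
    return 100.0 * load / (size * CAPACITY)


def make_sticky_metric():
    """Build a metric that never decreases once it has peaked (the 'wrong' metric)."""
    peak = 0.0

    def metric(load, size):
        nonlocal peak
        peak = max(peak, tracking_metric(load, size))
        return peak

    return metric


def simulate(traffic, metric, cooldown, high=70.0, low=30.0,
             min_size=2, max_size=20):
    """Run a simple threshold policy; return the group size after each step."""
    size = min_size
    next_allowed = 0  # first step at which a new scaling action may start
    sizes = []
    for step, load in enumerate(traffic):
        value = metric(load, size)
        if step >= next_allowed:
            if value > high and size < max_size:
                size += 1
                next_allowed = step + cooldown
            elif value < low and size > min_size:
                size -= 1
                next_allowed = step + cooldown
        sizes.append(size)
    return sizes


# Quiet period, traffic spike, then a long quiet period.
TRAFFIC = [100] * 5 + [400] * 20 + [100] * 40

if __name__ == "__main__":
    for cooldown in (1, 5):
        good = simulate(TRAFFIC, tracking_metric, cooldown)
        bad = simulate(TRAFFIC, make_sticky_metric(), cooldown)
        print(cooldown, "tracking:", max(good), good[-1],
              "sticky:", max(bad), bad[-1])
```

Under these made-up numbers, the load-tracking metric peaks at 6 instances and settles back to 3 with either cooldown. The sticky metric ends at 20 instances with a cooldown of 1 step and at 14, still climbing, with a cooldown of 5 steps: the longer cooldown slows the overshoot but never brings the group back down.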