A team is building an EMR Cluster in AWS.
The cluster has already been created based on current capacity needs.
After 3 months, new storage requirements show that the cluster no longer has enough storage.
Which of the following can be used to ensure the storage of the cluster meets the new requirements with the least effect on the cluster?
Choose 2 answers from the options given below.
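A. If the replication factor is high, reduce it on the cluster.
B. Add more nodes to the cluster.
C. Recreate the cluster with more EBS volumes.
D. Recreate the cluster with more EC2 instances.
Answer - A and B.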
The AWS Documentation mentions the following.
If the calculated HDFS capacity value is smaller than your data, you can increase the amount of HDFS storage in the following ways:
Creating a cluster with additional EBS volumes or adding instance groups with attached EBS volumes to an existing cluster.
Adding more core nodes.
Choosing an EC2 instance type with greater storage capacity.
Using data compression.
Changing the Hadoop configuration settings to reduce the replication factor.
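To make the "calculated HDFS capacity value" concrete, here is a minimal sketch of the arithmetic. It assumes the calculation described in the same EMR guidelines (per-node storage times core node count, divided by the replication factor) and the default replication factors documented there (3 for 10 or more core nodes, 2 for 4 to 9, 1 for 3 or fewer). The function names and the 800 GiB figure are hypothetical, chosen only for illustration.

```python
from typing import Optional


def default_replication_factor(core_nodes: int) -> int:
    """Default dfs.replication by core node count (assumed from the EMR guidelines)."""
    if core_nodes >= 10:
        return 3
    if core_nodes >= 4:
        return 2
    return 1


def hdfs_capacity_gib(core_nodes: int,
                      storage_per_node_gib: float,
                      replication_factor: Optional[int] = None) -> float:
    """Usable HDFS capacity: total raw core-node storage divided by the replication factor."""
    if core_nodes < 1:
        raise ValueError("core_nodes must be at least 1")
    if replication_factor is None:
        replication_factor = default_replication_factor(core_nodes)
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return core_nodes * storage_per_node_gib / replication_factor


# Hypothetical cluster: 10 core nodes with 800 GiB of storage each.
baseline = hdfs_capacity_gib(10, 800)              # default factor of 3
more_nodes = hdfs_capacity_gib(12, 800)            # add two core nodes
lower_factor = hdfs_capacity_gib(10, 800, replication_factor=2)
print(f"baseline={baseline:.0f} GiB, more nodes={more_nodes:.0f} GiB, "
      f"lower replication={lower_factor:.0f} GiB")
```

Both adding core nodes and lowering the replication factor raise the usable capacity without replacing the cluster, which is why they fit the "least effect" requirement in the question.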
Options C and D are incorrect because recreating the cluster would have a large operational impact.
For more information on EMR Instance guidelines, please refer to the below URL.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html
There are several ways to increase the storage capacity of an EMR cluster in AWS. However, the question asks for the methods that have the least impact on the existing cluster.
A) If the replication factor is high, you can reduce it on the cluster. This is a valid option: the AWS documentation lists changing the Hadoop configuration settings to reduce the replication factor as a way to increase HDFS storage, and it does not require adding or replacing any nodes. The trade-off is reduced data redundancy, because fewer copies of each block remain if a node fails, so it only makes sense when the current factor is higher than the workload needs.
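The Hadoop setting that controls the replication factor is `dfs.replication`, which EMR exposes through the `hdfs-site` configuration classification. The sketch below only builds the configuration object and the `hdfs dfs -setrep` command that would re-replicate files already stored. The helper names are hypothetical, and whether the configuration can be applied to a running cluster depends on the EMR release, so treat this as an illustration rather than a procedure.

```python
import json


def hdfs_site_replication_config(factor: int) -> list:
    """Build an EMR configuration entry that sets dfs.replication."""
    if factor < 1:
        raise ValueError("replication factor must be at least 1")
    return [
        {
            "Classification": "hdfs-site",
            # EMR expects property values as strings.
            "Properties": {"dfs.replication": str(factor)},
        }
    ]


def setrep_command(factor: int, path: str) -> list:
    """Build the command that changes replication for files already in HDFS.

    dfs.replication only affects newly written files, so existing data
    needs an explicit setrep; -w waits for the change to complete.
    """
    if factor < 1:
        raise ValueError("replication factor must be at least 1")
    return ["hdfs", "dfs", "-setrep", "-w", str(factor), path]


print(json.dumps(hdfs_site_replication_config(2), indent=2))
print(" ".join(setrep_command(2, "/")))
```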
B) Add more nodes to the cluster. Adding more nodes to the cluster is a viable solution to increase storage capacity as it does not require any changes to the existing nodes in the cluster. Adding new nodes is a relatively straightforward process and can be done quickly, minimizing the impact on the existing cluster.
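As a rough sketch of what adding nodes looks like programmatically, the snippet below uses the boto3 EMR client's `modify_instance_groups` call to raise the instance count of an existing core instance group. It assumes the cluster uses instance groups rather than instance fleets. The cluster ID, instance group ID, region, and helper names are placeholders, not values from the question.

```python
def build_resize_request(cluster_id: str, instance_group_id: str, new_count: int) -> dict:
    """Build the keyword arguments for modify_instance_groups."""
    if new_count < 1:
        raise ValueError("new_count must be at least 1")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ],
    }


def resize_core_group(cluster_id: str, instance_group_id: str, new_count: int,
                      region: str = "us-east-1") -> None:
    """Ask EMR to resize the given instance group (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the request builder works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    emr.modify_instance_groups(
        **build_resize_request(cluster_id, instance_group_id, new_count)
    )


# Placeholder IDs for illustration only.
request = build_resize_request("j-EXAMPLE", "ig-EXAMPLE", 12)
print(request)
# To apply it against a real cluster:
# resize_core_group("j-EXAMPLE", "ig-EXAMPLE", 12)
```

The existing nodes keep running while the new core nodes are provisioned.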
C) Recreate the cluster with more EBS volumes. Recreating the cluster has a significant impact: it requires launching a new cluster with the additional EBS volumes, copying the data to it, and then terminating the old cluster. This process is time-consuming and can cause significant downtime, which makes it unsuitable here.
D) Recreate the cluster with more EC2 Instances. Recreating the cluster with more EC2 instances would increase storage capacity, but like option C it means launching a new cluster, copying the data to it, and then terminating the old cluster. This process is time-consuming and can cause significant downtime, so it does not meet the least-impact requirement.
Therefore, the correct answers to the question are A) reduce the replication factor and B) add more nodes to the cluster. Both change the existing cluster in place, whereas options C and D require recreating it.