A team is building an Amazon EMR cluster in AWS.
One of the major requirements is to ensure that data is available even after the EMR cluster is torn down.
Which of the following storage options should be used to fulfil this requirement?
A. B. C. D.
Answer - D.
The AWS Documentation mentions the following.
EMRFS is an implementation of the Hadoop file system used for reading and writing regular files from Amazon EMR directly to Amazon S3.
EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency.
All other options are incorrect because they are temporary storage whose contents are lost when the cluster is terminated.
For more information on EMR file systems, refer to the URL below.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
The correct answer for this question is D. EMRFS.
EMRFS (the EMR File System) is an implementation of the Hadoop FileSystem API that allows EMR clusters to use Amazon S3 as a durable and highly available data store. Because the data lives in S3 rather than on the cluster's instances, storage is decoupled from the cluster itself, while the familiar Hadoop tools and APIs can still be used to access and process that data.
When data is stored in S3 using the default S3 Standard storage class, it is stored redundantly across multiple Availability Zones within a region, providing high durability and availability. This means that even if an EMR cluster is terminated or fails, the data stored in S3 will remain intact.
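As a rough illustration of pointing a job at S3 through EMRFS, the sketch below builds a step definition in the shape accepted by boto3's EMR `add_job_flow_steps` call, without actually calling AWS. The bucket name, script path, step name, and the helper `build_spark_step` are all hypothetical, and the check for `s3://` URIs is an assumption added here to make the point explicit, not something EMR enforces.

```python
def build_spark_step(script_uri: str, input_uri: str, output_uri: str) -> dict:
    """Build an EMR step dict that runs a Spark script via command-runner.jar.

    Hypothetical helper: it only constructs the dictionary, it does not
    contact AWS. All three locations must be s3:// URIs so that the script,
    input, and output outlive the cluster.
    """
    for uri in (script_uri, input_uri, output_uri):
        if not uri.startswith("s3://"):
            raise ValueError(f"expected an s3:// URI, got {uri!r}")
    return {
        "Name": "example-spark-step",  # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_uri,
                input_uri,
                output_uri,
            ],
        },
    }


# Hypothetical bucket and paths, for illustration only.
step = build_spark_step(
    "s3://example-bucket/scripts/job.py",
    "s3://example-bucket/input/",
    "s3://example-bucket/output/",
)
```

Because the output location is an `s3://` URI, whatever the job writes there stays in S3 after the cluster is gone.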
By contrast, options A and B do not keep data available after the EMR cluster is torn down. Instance store (option A) is temporary storage tied to the life of an EC2 instance: when the instance is terminated or fails, any data on its instance store volumes is lost. EBS volumes (option B) are persistent when used with standalone EC2 instances, but EBS volumes attached to EMR cluster instances are ephemeral: they are deleted when the cluster or instance terminates, so data on them must be copied to S3 or another storage service to remain available.
Option C, the local file system, is also unsuitable for data that needs to be available after the EMR cluster is terminated. The local file system is the storage directly attached to the EC2 instances in the cluster, so its contents are lost along with those instances.
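The distinction between the options can be sketched as a small check on where a path points. This is a minimal illustration, not an AWS API: the function name and the scheme sets are assumptions. It treats `s3://` (served by EMRFS on EMR) as outliving the cluster, and `hdfs://`, `file://`, and bare local paths as cluster-bound, since HDFS on EMR sits on instance store or EBS volumes.

```python
from urllib.parse import urlparse

# Assumed classification for illustration; not an exhaustive list of schemes.
DURABLE_SCHEMES = {"s3"}                # EMRFS: data is stored in Amazon S3
CLUSTER_BOUND_SCHEMES = {"hdfs", "file"}  # backed by instance store / EBS


def survives_cluster_termination(uri: str) -> bool:
    """Return True if data at `uri` remains after the EMR cluster is terminated."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in DURABLE_SCHEMES:
        return True
    # An empty scheme means a bare path on the node's local file system.
    if scheme in CLUSTER_BOUND_SCHEMES or scheme == "":
        return False
    raise ValueError(f"unrecognised storage scheme in {uri!r}")


for location in ("s3://example-bucket/output/", "hdfs:///user/hadoop/output", "/mnt/tmp/output"):
    print(location, survives_cluster_termination(location))
```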
Therefore, the best option for ensuring data availability even after the EMR cluster is torn down is EMRFS, which stores the data in highly available S3 storage that is independent of the cluster's lifetime.