Tiger Capital: EMR Cluster for Persistent Data Storage & Analytics Support

Question

Tiger Investments (TI) is a private equity trust manager specializing in border market investments.

The Group is considered a pioneer investor in Southeast Asia's Greater Sub-region and the Caribbean.

Tiger Capital creates private equity funds targeting pre-emerging, post-conflict or post-disaster economies that are undergoing transition and are poised for rapid growth. The funds invest commercially in basic businesses, targeting attractive economic and social returns.

Tiger Capital invests through a variety of financial instruments, including equity and debt. TI is planning to launch an EMR cluster to complement its ETL workloads running on Data Pipeline.

The team wants to store persistent data with server-side encryption, read-after-write consistency, and list consistency, and to enable a data lake that supports analytics across the enterprise. Which storage option should it use?

Select 1 option.

Answers

Explanations

A. B. C. D.

Answer: B.

Option A is incorrect - HDFS provides only ephemeral storage; its data is lost when the cluster is terminated.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html

Option B is correct - Provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html

Option C is incorrect - Each node is created from an EC2 instance that comes with a preconfigured block of pre-attached disk storage called an instance store.

Data on instance store volumes persists only during the life of its EC2 instance.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html

Option D is incorrect - This is the same storage as described under option C.

The local file system refers to a locally connected disk.

When you create a Hadoop cluster, each node is created from an Amazon EC2 instance that comes with a preconfigured block of pre-attached disk storage called an instance store.

Data on instance store volumes persists only during the lifecycle of its Amazon EC2 instance.
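The distinction between the options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `STORAGE_BY_SCHEME` table and the `describe_storage` helper are hypothetical names, and the scheme-to-storage mapping follows common Hadoop URI conventions (`s3://` for EMRFS, `hdfs://` for HDFS, `file://` for the local file system) rather than anything stated in the question.

```python
from typing import Tuple
from urllib.parse import urlparse

# Hypothetical lookup table: URI scheme -> (storage name, persists after cluster ends).
# The scheme names are an assumption based on common Hadoop conventions.
STORAGE_BY_SCHEME = {
    "s3": ("EMRFS (Amazon S3)", True),
    "hdfs": ("HDFS on cluster nodes", False),
    "file": ("local file system", False),
}


def describe_storage(uri: str) -> Tuple[str, bool]:
    """Return (storage name, survives cluster termination) for a storage URI."""
    scheme = urlparse(uri).scheme.lower()
    if scheme not in STORAGE_BY_SCHEME:
        raise ValueError(f"unrecognized scheme in {uri!r}")
    return STORAGE_BY_SCHEME[scheme]


if __name__ == "__main__":
    # The bucket name and paths below are placeholders.
    for uri in ("s3://example-bucket/data/", "hdfs:///user/hadoop/data", "file:///mnt/tmp/data"):
        name, persistent = describe_storage(uri)
        print(f"{uri}: {name}, persistent={persistent}")
```

Only the `s3://` entry is marked persistent, which is why the HDFS and local-disk options do not meet the requirement.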

The best option for Tiger Investments to store persistent data with server-side encryption, read-after-write consistency, and list consistency, and to enable a data lake that supports enterprise analytics, is:

B. EMRFS implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3

Explanation:

EMRFS (EMR File System) is an implementation of the Hadoop file system that Amazon EMR clusters use to read and write regular files directly to Amazon S3. It allows for seamless integration between Hadoop applications running on EMR and data stored on S3. Using EMRFS, Tiger Investments can store its persistent data in Amazon S3, which provides low-cost and durable storage. Amazon S3 also supports server-side encryption to protect the data at rest.

Storing data through EMRFS also gives read-after-write consistency and list consistency: once data is written to S3, subsequent read operations return the updated data, and listing operations show all the data that was written.

Using EMRFS to access S3 data also provides several benefits, including:

  • Ability to scale storage independently from compute resources: Data can be stored in S3 without having to worry about the size or capacity of the EMR cluster. This allows for more efficient resource utilization and cost savings.
  • Reduced data movement: EMRFS allows for data to be processed directly from S3 without the need to copy it to local disk storage first. This reduces data movement and speeds up processing.
  • Support for different file formats: S3 stores objects in any format, so formats such as Parquet, ORC, and Avro can be used for efficient data storage and processing.

In summary, using EMRFS to access S3 data provides a scalable, durable, and cost-effective storage solution for Tiger Investments' persistent data needs while also providing the desired consistency and encryption options.
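As a rough illustration of the encryption option, the sketch below builds an `emrfs-site` configuration that turns on S3 server-side encryption for data written through EMRFS. The helper name `emrfs_sse_configuration` is hypothetical, and the property names (`fs.s3.enableServerSideEncryption`, `fs.s3.serverSideEncryption.kms.keyId`) are taken from the EMR documentation as best recalled here; check them against the documentation for the EMR release in use. An EMR security configuration is an alternative way to set up encryption.

```python
import json
from typing import Dict, List, Optional


def emrfs_sse_configuration(kms_key_id: Optional[str] = None) -> List[Dict]:
    """Build an emrfs-site configuration enabling S3 server-side encryption.

    With no key ID, this requests encryption with S3-managed keys.
    Passing kms_key_id (an assumption: a KMS key ID or ARN) adds the KMS property.
    """
    properties = {"fs.s3.enableServerSideEncryption": "true"}
    if kms_key_id is not None:
        properties["fs.s3.serverSideEncryption.kms.keyId"] = kms_key_id
    return [{"Classification": "emrfs-site", "Properties": properties}]


if __name__ == "__main__":
    # Print the configuration as JSON, the form a cluster definition would take.
    print(json.dumps(emrfs_sse_configuration(), indent=2))
```

The resulting list has the shape of the configurations supplied when a cluster is created; the function only builds the data structure and makes no AWS calls.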