Tiger Investments: Tracking Consistency in EMRFS for Big Data Analytics

Question

Tiger Investments (TI) is a private equity trust manager specializing in border market investments.

The Group is considered a pioneer investor in Southeast Asia's Greater Sub-region and the Caribbean.

Tiger Capital creates private equity funds targeting pre-emerging, post-conflict or post-disaster economies that are undergoing transition and are poised for rapid growth.

The funds invest commercially in basic businesses, targeting attractive economic and social returns.

Tiger Capital invests through a diversity of financial instruments, including equity and debt.

TI launched EMR 3.2.1 using EMRFS storage to support their real-time data analytics.

The IT team observed that when objects are added to EMRFS in one operation and then listed immediately in a subsequent operation, the list, and therefore the set of objects processed, is incomplete most of the time.

This is a recurring problem that the TI team faces mostly when running multi-step extract-transform-load (ETL) data processing pipelines.

How can the team track consistency?

Select 2 options.

Answers

Explanations

A. B. C. D. E.

Answer: C, E.

Option A is incorrect - ephemeral storage is not used for long-running EMR clusters.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html

Option B is incorrect - the local file system is not used for long-running EMR clusters.

Option C is correct - Consistent view allows EMR clusters to check for list and read-after-write consistency for Amazon S3 objects written by or synced with EMRFS.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html

Option D is incorrect - ephemeral storage does not provide this facility.

Option E is correct - when you create a cluster with consistent view enabled, Amazon EMR uses an Amazon DynamoDB database to store object metadata and track consistency with Amazon S3.

Options A and B are not viable ways to track consistency in EMRFS storage because they involve changing the storage type, which may not suit the requirements of Tiger Investments (TI).

Option C, "Enable Consistent View," is an EMRFS feature that helps track consistency in the storage. It lets EMR clusters check for list and read-after-write consistency for Amazon S3 objects written by or synced with EMRFS. When a listing or read does not yet reflect an object that EMRFS has recorded, EMRFS retries the operation according to configurable rules rather than silently processing an incomplete set of objects.
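As a rough illustration of what enabling consistent view can look like in configuration terms, the sketch below builds an `emrfs-site` configuration classification as a plain Python structure. The property names (`fs.s3.consistent`, `fs.s3.consistent.retryCount`, `fs.s3.consistent.retryPeriodSeconds`, `fs.s3.consistent.metadata.tableName`) are EMRFS settings; the helper function name and the retry values are assumptions chosen for illustration, not recommended settings. Configuration classifications belong to later EMR releases, so for the EMR 3.2.1 cluster in the question this should be read as a sketch of the idea rather than the exact mechanism.

```python
import json


def emrfs_consistent_view_config(retry_count, retry_period_seconds,
                                 table_name="EmrFSMetadata"):
    """Build an emrfs-site classification that turns on consistent view.

    Hypothetical helper: the function name and argument values are
    illustrative; only the property keys are EMRFS settings.
    """
    return [
        {
            "Classification": "emrfs-site",
            "Properties": {
                # Enable consistent view for EMRFS.
                "fs.s3.consistent": "true",
                # How many times to retry when S3 looks inconsistent.
                "fs.s3.consistent.retryCount": str(retry_count),
                # How long to wait between retries, in seconds.
                "fs.s3.consistent.retryPeriodSeconds": str(retry_period_seconds),
                # DynamoDB table that holds the object metadata.
                "fs.s3.consistent.metadata.tableName": table_name,
            },
        }
    ]


if __name__ == "__main__":
    # Illustrative values only.
    print(json.dumps(emrfs_consistent_view_config(5, 10), indent=2))
```

The resulting list has the shape a cluster-creation request accepts for its configurations, so it could be serialized to JSON and supplied when the cluster is launched.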

Options D and E both suggest using Amazon DynamoDB to track consistency, with ephemeral storage and with EMRFS storage respectively. Amazon DynamoDB is a highly scalable NoSQL database, and with consistent view enabled Amazon EMR uses it to store object metadata and track consistency with Amazon S3. Only Option E is therefore viable for Tiger Investments (TI); Option D fails because ephemeral storage does not provide this facility.
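The idea of tracking consistency through a separate metadata store can be sketched with a toy model. Nothing here is EMRFS code or an AWS API: `MetadataStore` is a hypothetical in-memory stand-in for the DynamoDB table, `LaggyObjectStore` is a hypothetical object store whose listings lag behind writes, and `consistent_list` compares each listing against the recorded metadata and retries until they agree or the retries run out.

```python
import time


class MetadataStore:
    """In-memory stand-in for the metadata table (hypothetical)."""

    def __init__(self):
        self._keys = set()

    def record_write(self, key):
        self._keys.add(key)

    def expected_keys(self, prefix):
        return {k for k in self._keys if k.startswith(prefix)}


class LaggyObjectStore:
    """Toy object store: a new key stays hidden for `lag_listings` list calls."""

    def __init__(self, lag_listings=2):
        self._remaining = {}  # key -> list calls left before it becomes visible
        self._lag = lag_listings

    def put(self, key):
        self._remaining[key] = self._lag

    def list(self, prefix):
        visible = set()
        for key, remaining in self._remaining.items():
            if not key.startswith(prefix):
                continue
            if remaining == 0:
                visible.add(key)
            else:
                self._remaining[key] = remaining - 1
        return visible


def consistent_list(store, metadata, prefix, retry_count=5,
                    retry_period_seconds=0.0):
    """List `prefix`, retrying while keys recorded in metadata are missing."""
    expected = metadata.expected_keys(prefix)
    missing = set()
    for attempt in range(retry_count + 1):
        listed = store.list(prefix)
        missing = expected - listed
        if not missing:
            return listed
        if attempt < retry_count:
            time.sleep(retry_period_seconds)
    raise RuntimeError(
        f"listing still missing {sorted(missing)} after {retry_count} retries"
    )


if __name__ == "__main__":
    store, metadata = LaggyObjectStore(lag_listings=2), MetadataStore()
    for key in ("etl/step1/part-0", "etl/step1/part-1"):
        store.put(key)
        metadata.record_write(key)
    print("naive listing:", sorted(store.list("etl/step1/")))
    print("checked listing:", sorted(consistent_list(store, metadata, "etl/step1/")))
```

In this model the naive listing comes back incomplete, as in TI's ETL pipeline, while the checked listing either returns the full set or raises an error so the next step never runs on partial input.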

In conclusion, the two viable options for Tiger Investments (TI) to track consistency in EMRFS storage are:

  1. Enable Consistent View, which allows EMR clusters to check for list and read-after-write consistency.
  2. Use Amazon DynamoDB to store object metadata and track consistency with EMRFS storage.