Setting Up an EMR Cluster in AWS: Storing Metadata in a Central Repository

Central Repository for Storing EMR Cluster Metadata

Question

A company is planning on setting up an EMR cluster in AWS.

They need to store the metadata in a central repository.

How can they achieve this? Choose 2 answers from the options given below.

Answers

Explanations


A. Create a bucket in S3.
B. Create a MySQL database in AWS RDS.
C. Modify the hiveConfiguration.json file and reference it when you create the cluster.
D. Modify the Hive setup on the cluster.

Answer - B and C.

The AWS Documentation mentions the following.

To use an external MySQL database or Amazon Aurora as your Hive metastore, you override the default configuration values for the metastore in Hive to specify the external database location, either on an Amazon RDS MySQL instance or an Amazon Aurora instance.

By default, Hive records metastore information in a MySQL database on the master node's file system.

However, when a cluster terminates, all cluster nodes shut down, including the master node.

When this happens, local data is lost because node file systems use ephemeral storage.

If you need the metastore to persist, you must create an external metastore that exists outside the cluster.

You have two options for an external metastore:

AWS Glue Data Catalog

Amazon RDS or Amazon Aurora.
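As a sketch of the first option, Hive on EMR is pointed at the AWS Glue Data Catalog through the hive-site classification supplied at cluster creation. The file name and the validation step below are illustrative choices for this example, not part of the AWS documentation:

```shell
# Sketch: classification that tells Hive on EMR to use the AWS Glue Data
# Catalog as its metastore. The file name glue-hive-config.json is a
# placeholder chosen for this example.
cat > glue-hive-config.json <<'EOF'
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
EOF

# Quick sanity check that the classification file is valid JSON.
python3 -m json.tool glue-hive-config.json > /dev/null && echo "valid JSON"
```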

So, to use an external MySQL database as your Hive metastore, you override the default configuration values for the metastore in Hive to specify the external database location on an Amazon RDS MySQL instance.
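As a sketch of this override, the hive-site classification in hiveConfiguration.json carries the JDBC connection details of the external database. The endpoint, user name, and password below are placeholders, and the create-cluster invocation is shown for illustration only:

```shell
# Sketch: hiveConfiguration.json pointing the Hive metastore at an external
# Amazon RDS MySQL instance. Hostname, user name, and password are
# placeholders, not real resources. EMR's bundled MariaDB driver is used
# for MySQL connections.
cat > hiveConfiguration.json <<'EOF'
[
  {
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionURL": "jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
      "javax.jdo.option.ConnectionUserName": "username",
      "javax.jdo.option.ConnectionPassword": "password"
    }
  }
]
EOF

# The file is then referenced when the cluster is created. This call needs
# real AWS credentials and resources, so it is shown but not executed here:
# aws emr create-cluster --release-label emr-6.15.0 \
#   --applications Name=Hive \
#   --instance-type m5.xlarge --instance-count 3 \
#   --use-default-roles \
#   --configurations file://hiveConfiguration.json
```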

Since the documentation clearly describes how to accomplish this, all other options are incorrect.

For more information on how to store the Hive metadata, please refer to the below URL.

https://aws.amazon.com/premiumsupport/knowledge-center/export-metastore-from-emr-to-rds/

The metadata in an EMR cluster refers to information about the data stored and processed in the cluster. It includes details about the structure, schema, format, and location of the data, as well as the operations performed on the data. A central repository is required so that this metadata persists beyond the life of any single cluster and can be shared by different applications and users. The two correct options work together to achieve this:

B. Create a MySQL database in AWS RDS: The Hive metastore is backed by a relational database, and by default that database lives on the master node's ephemeral storage, so it is lost when the cluster terminates. Creating a MySQL database on Amazon RDS provides a managed, durable, and highly available database that exists outside the cluster and can serve as the central metadata repository for any number of EMR clusters.

C. Modify the hiveConfiguration.json file and reference it when you create the cluster: Hive is a data warehousing tool in EMR that provides a SQL-like interface to query and analyze data stored in Hadoop. It keeps the structure and schema of the data in the Hive metastore. To point the metastore at the external database created in option B, you override the default hive-site configuration values in hiveConfiguration.json to specify the connection details of the RDS MySQL instance, and you reference that file when you create the cluster. Every cluster launched with this configuration then shares the same central metastore.

A. Create a bucket in S3 and D. Modify the Hive setup on the cluster are not correct. S3 is a highly scalable and durable object store and is commonly used to hold the data itself, but it cannot act as the relational database that backs the Hive metastore. Modifying the Hive setup on a running cluster does not make the metadata persistent either, because the default metastore still resides on the master node's file system and is lost when the cluster terminates.

In summary, the two correct answers for this question are B. Create a MySQL database in AWS RDS and C. Modify the hiveConfiguration.json file and reference it when you create the cluster.