A team has a 10 PB data store in EMR.
All the data is stored in S3.
There is a need for a data engineer to perform interactive analysis.
The data engineer already has access to the EMR cluster via the AWS Console.
Which of the following can be used by the data engineer for interactive analysis?
A. Presto
B. Apache Sqoop
C. Apache Oozie
D. Apache Pig
Answer - A.
Presto is a fast SQL query engine designed for interactive analytic queries over large datasets from multiple sources.
Option B is incorrect since Apache Sqoop is a tool for transferring data between Amazon S3, Hadoop, HDFS, and RDBMS databases.
Option C is incorrect since Apache Oozie is used to manage and coordinate Hadoop jobs.
Option D is incorrect since Apache Pig is used to transform large data sets without having to write complex code in a lower-level language such as Java.
For more information on Presto on Amazon EMR, refer to the URL below.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html
The best option for the data engineer to perform interactive analysis on the 10 PB of data that the EMR cluster reads from S3 is Presto.
Presto is an open-source, distributed SQL query engine that enables interactive analysis of data from various sources such as Hadoop Distributed File System (HDFS), Apache Cassandra, and Amazon S3. Presto is designed for high performance and scalability, including datasets in the petabyte range.
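As a rough illustration of what interactive analysis with Presto can look like, the sketch below builds a one-off query for the Presto command-line client (`presto-cli`). It rests on assumptions that are not part of the question: the table name `clickstream` and its `event_type` column are hypothetical, and the `hive` catalog with the `default` schema is assumed to be where the S3-backed tables are registered. The helper `build_presto_command` is an illustrative name, not part of any AWS or Presto API.

```python
import shlex


def build_presto_command(sql, catalog="hive", schema="default"):
    """Return the argument list for running one SQL statement with presto-cli.

    The catalog and schema defaults are assumptions for illustration.
    """
    return [
        "presto-cli",
        "--catalog", catalog,
        "--schema", schema,
        "--execute", sql,
    ]


# Hypothetical table and column names; the question does not name any.
QUERY = (
    "SELECT event_type, count(*) AS events "
    "FROM clickstream "
    "GROUP BY event_type "
    "ORDER BY events DESC "
    "LIMIT 10"
)

command = build_presto_command(QUERY)

# Print the shell-quoted command instead of running it, so the sketch
# also works on a machine where presto-cli is not installed.
print(shlex.join(command))
```

On the cluster's master node, the same list could be passed to `subprocess.run(command, check=True)` to execute the query. Building the command as a list rather than a single string avoids shell-quoting problems when the SQL contains quotes or parentheses.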
Apache Sqoop is a tool used to transfer bulk data between Hadoop and structured data stores such as relational databases. It is not designed for interactive analysis.
Oozie is a workflow scheduler system used to manage Hadoop jobs. It is used to automate and coordinate Hadoop jobs and is not designed for interactive analysis.
Apache Pig is a high-level platform for creating MapReduce programs that analyze large data sets. Its language, Pig Latin, is a procedural language used to transform and analyze data, but it is not designed for interactive analysis.
Therefore, the correct answer is A, Presto: of the four options, it is the only one designed for interactive analysis, which is what the data engineer needs for the 10 PB of data in S3.