EMR Hadoop Ecosystem for Interactive Data Exploration | AFS Case Study

Question

Allianz Financial Services (AFS) is a banking group that has served the financial community in South East Asia for the past five decades. It offers end-to-end banking and financial solutions through its consumer banking, business banking, Islamic banking, investment finance, and stock broking businesses, as well as unit trust and asset administration. AFS launched an EMR cluster to support its big data analytics requirements.

AFS has multiple data sources, including S3, SQL databases, MongoDB, Redis, RDS, and other file systems.

AFS is looking for a service that supports interactive data exploration and can be accessed through a web interface using an SSH tunnel to the EMR master node and a proxy connection. Which EMR Hadoop ecosystem component fulfills these requirements?

Answers

A. Apache Hive
B. Apache HBase
C. Apache HCatalog
D. Presto

Answer: D

Explanations

Option A is incorrect.

Hive is an open-source data warehouse and analytics package that runs on top of a Hadoop cluster.

Hive scripts use an SQL-like language called HiveQL (Hive Query Language) that abstracts programming models and supports typical data warehouse interactions.

Hive enables you to avoid the complexities of writing Tez jobs based on directed acyclic graphs (DAGs) or MapReduce programs in a lower-level language, such as Java.

Hive extends the SQL paradigm by including serialization formats.

You can also customize query processing by creating table schema that matches your data, without touching the data itself.

In contrast to SQL (which only supports primitive value types such as dates, numbers, and strings), values in Hive tables are structured elements, such as JSON objects, any user-defined data type, or any function written in Java.
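To make the "schema that matches your data" idea concrete, here is a minimal sketch that assembles a HiveQL `CREATE EXTERNAL TABLE` statement over JSON files. The table name, column names, and the `s3://example-bucket/events/` location are hypothetical, and the helper `build_external_table_ddl` is introduced only for illustration; the statement defines a schema over existing files and leaves the files themselves unchanged.

```python
def build_external_table_ddl(table: str, columns: dict, location: str) -> str:
    """Build a HiveQL CREATE EXTERNAL TABLE statement for JSON files.

    The data at `location` is only described by the schema, not rewritten.
    """
    column_lines = ",\n".join(
        f"  {name} {hive_type}" for name, hive_type in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        ")\n"
        # JSON SerDe shipped with Hive's HCatalog module.
        "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical table: one primitive column and one structured column.
    ddl = build_external_table_ddl(
        "events",
        {
            "event_id": "STRING",
            "payload": "STRUCT<user_id:STRING, amount:DOUBLE>",
        },
        "s3://example-bucket/events/",
    )
    print(ddl)
```

The `payload` column shows the structured value types the paragraph above refers to: a nested JSON object is mapped to a `STRUCT` rather than flattened into primitive columns.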

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html

Option B is incorrect.

Apache HBase is an open-source, non-relational, distributed database developed as part of the Apache Software Foundation's Hadoop project.

HBase runs on top of Hadoop Distributed File System (HDFS) to provide non-relational database capabilities for the Hadoop ecosystem.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hbase.html

Option C is incorrect.

Apache HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.

HCatalog has a REST interface and a command-line client that allow you to create tables or perform other operations.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hcatalog.html

Option D is correct.

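Presto is a fast, distributed SQL query engine designed for interactive analytic queries over large datasets from multiple sources. (Presto is not an Apache Software Foundation project, although it is often referred to as "Apache Presto".)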

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html

Note:

Interactive data exploration is typically performed with SQL query engines such as Presto or Apache Drill.

This scenario is not about data science.

It is about analyzing data from multiple data sources.

If the workload involved running machine learning, Apache Zeppelin would come into the picture.

For interactive data exploration on EMR, the usual choice is Presto.

In this scenario, AFS is looking for a service that supports interactive data exploration and can be accessed through a web interface using an SSH tunnel to the EMR master node and a proxy connection. Given the data sources used by AFS, the EMR Hadoop ecosystem component that fulfills these requirements is Presto (Option D).

Presto is a distributed SQL query engine designed for interactive querying of large datasets. It is optimized for ad hoc queries and, through its connectors, can query a wide range of data sources, including data in S3 (via the Hive connector), SQL databases (such as those hosted on RDS), MongoDB, and Redis. A single query can combine data from several of these sources, which makes it a good fit for the data sources used by AFS.
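As a rough illustration of querying multiple sources at once, the sketch below builds one Presto SQL statement that joins a table from a Hive catalog (for example, data stored in S3) with a collection exposed through a MongoDB catalog. All catalog, schema, table, and column names here (`hive.sales.transactions`, `mongodb.crm.customers`, `customer_id`, and so on) are hypothetical; the real names depend on which connectors are configured on the cluster.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query(limit: int = 10) -> str:
    """Build a Presto query that joins tables from two different catalogs."""
    # Hypothetical names: a Hive table (e.g. backed by S3) and a MongoDB collection.
    transactions = qualified("hive", "sales", "transactions")
    customers = qualified("mongodb", "crm", "customers")
    return (
        "SELECT c.name, SUM(t.amount) AS total_amount\n"
        f"FROM {transactions} AS t\n"
        f"JOIN {customers} AS c ON t.customer_id = c.customer_id\n"
        "GROUP BY c.name\n"
        "ORDER BY total_amount DESC\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(build_federated_query())
```

The point of the sketch is the `catalog.schema.table` naming: each catalog corresponds to a configured connector, so one statement can span sources without first copying the data into a single store.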

Moreover, the Presto coordinator running on the EMR master node serves a web interface. Web interfaces hosted on the master node are not publicly reachable by default, so they are typically accessed by opening an SSH tunnel to the master node and routing the browser through a proxy, which is exactly the access pattern AFS describes.
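The tunnel itself is an ordinary `ssh` invocation with dynamic port forwarding. The sketch below only assembles that command; the key file name and master DNS name are placeholders, local port 8157 is simply an unused port chosen for the example, and `hadoop` is assumed to be the login user on the master node.

```python
import shlex


def build_tunnel_command(key_file, master_dns, local_port=8157, user="hadoop"):
    """Return the ssh argument list for a SOCKS tunnel to the EMR master node.

    -N: do not run a remote command, only forward ports.
    -D: dynamic (SOCKS) port forwarding on the given local port.
    """
    return [
        "ssh",
        "-i", key_file,
        "-N",
        "-D", str(local_port),
        f"{user}@{master_dns}",
    ]


if __name__ == "__main__":
    # Placeholder key file and master public DNS name.
    cmd = build_tunnel_command(
        "mykeypair.pem",
        "ec2-203-0-113-10.compute-1.amazonaws.com",
    )
    # Print the command; it could be run with subprocess.run(cmd) instead.
    print(shlex.join(cmd))
```

With the tunnel open, a browser proxy configured to use the SOCKS proxy at localhost on the chosen port can reach web interfaces on the master node by its DNS name. On EMR the Presto web interface is usually on port 8889, but treat that as an assumption and check it against your release.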

On the other hand, Apache Hive (Option A) is a data warehouse package with an SQL-like query language for Hadoop. It is geared toward batch processing rather than interactive querying. Apache HBase (Option B) is a NoSQL database optimized for real-time read/write access to large datasets, but it is not designed for ad hoc analytic queries across multiple sources. Apache HCatalog (Option C) is a table and metadata management layer for Hadoop that provides a table abstraction for data stored in the Hadoop Distributed File System (HDFS) and other data stores, but it is not itself an interactive query engine.

Therefore, based on the information provided, the best option for AFS to fulfill its requirements is Presto (Option D).