EMR Hadoop Ecosystem for Creating and Sharing Documents with Live Code and Visualizations

EMR Hadoop Ecosystem

Prev Question Next Question

Question

Allianz Financial Services (AFS) is a banking group offering end-to-end banking and financial solutions in South East Asia through its consumer banking, business banking, Islamic banking, investment finance and stock broking businesses as well as unit trust and asset administration, having served the financial community over the past five decades. AFS launched EMR cluster to support their big data analytics requirements.

AFS has multiple data sources built out of S3, SQL databases, MongoDB, Redis, RDS, other file systems.

AFS is looking for a web application to create and share documents that contain live code, equations, visualizations, and narrative text. Which EMR Hadoop ecosystem fulfills the requirements? select 1 option.

Answers

A. Apache Hive

B. Apache Zepplin

C. Jupyter Notebook

D. Apache Presto.

Show Answer

Explanations

Click on the arrows to vote for the correct answer

A. B. C. D.

Answer : C.

Option A is incorrect - Hive is an open-source, data warehouse, and analytic package that runs on top of a Hadoop cluster.

Hive scripts use an SQL-like language called Hive QL (query language) that abstracts programming models and supports typical data warehouse interactions.

Hive enables you to avoid the complexities of writing Tez jobs based on directed acyclic graphs (DAGs) or MapReduce programs in a lower level computer language, such as Java.

Hive extends the SQL paradigm by including serialization formats.

You can also customize query processing by creating table schema that matches your data, without touching the data itself.

In contrast to SQL (which only supports primitive value types such as dates, numbers, and strings), values in Hive tables are structured elements, such as JSON objects, any user-defined data type, or any function written in Java.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html

Option B is incorrect -Use Apache Zeppelin as a notebook for interactive data exploration.

Zepplin can be accessed through web interface using a SSH tunnel to the EMR master node and a proxy connection

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-zeppelin.html

Option C is correct -Jupyter Notebook is an open-source web application that you can use to create and share documents that contain live code, equations, visualizations, and narrative text.

Amazon EMR offers you two options to work with Jupyter notebooks:

EMR Notebook.

JupyterHub.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyter.html

Option D is incorrect -Presto is a fast SQL query engine designed for interactive analytic queries over large datasets from multiple sources

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html

The requirement of AFS is to have a web application that allows creating and sharing documents that contain live code, equations, visualizations, and narrative text. This requirement can be fulfilled by using a popular data science notebook interface that supports these features. Among the given options, the notebook interface that fulfills this requirement is Jupyter Notebook.

Jupyter Notebook is an open-source web application that allows creating and sharing documents that contain live code, equations, visualizations, and narrative text. It supports many programming languages, including Python, R, and Julia, and provides a rich set of tools for data analysis, visualization, and machine learning. With Jupyter Notebook, users can write and execute code, create and share visualizations, and document their work using markdown.

Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis. It uses a SQL-like language called HiveQL to process structured and semi-structured data. While it can be used for data analysis and reporting, it does not support the features required for the web application described in the question.

Apache Zeppelin is a web-based notebook interface for data exploration, visualization, and collaboration. It supports multiple interpreters for data processing and provides a rich set of visualizations and reporting tools. However, it does not support the live code execution feature required by the question.

Apache Presto is a distributed SQL query engine that allows querying data across multiple data sources, including Hadoop, S3, and relational databases. While it can be used for data analysis and reporting, it does not provide the live code execution and visualization features required by the question.

In conclusion, the EMR Hadoop ecosystem component that fulfills the requirement of AFS to have a web application to create and share documents that contain live code, equations, visualizations, and narrative text is Jupyter Notebook.

Prev Question Next Question