EMR Hadoop Ecosystem for Big Data Analytics | AFS Case Study


Question

Allianz Financial Services (AFS) is a banking group offering end-to-end banking and financial solutions in South East Asia through its consumer banking, business banking, Islamic banking, investment finance, and stock-broking businesses, as well as unit trust and asset administration. Having served the financial community for over five decades, AFS launched an EMR cluster to support its big data analytics requirements.

AFS has multiple data sources built on S3, SQL databases, MongoDB, Redis, RDS, and other file systems.

AFS is looking for a web application to create and share documents that contain live code, equations, visualizations, and narrative text.

AFS also needs to host multiple instances of a single-user notebook server, where Amazon EMR creates a Docker container on the cluster's master node and sparkmagic runs within the container as a key component. Which EMR Hadoop ecosystem components fulfill the requirements? Select 2 options.

Answers

A. Hive
B. EMR Notebooks
C. D3.js
D. JupyterHub
E. Zeppelin

Answer: B, D.

Explanations

Option A is incorrect.

Hive is an open-source data warehouse and analytics package that runs on top of a Hadoop cluster.

Hive scripts use an SQL-like language called Hive QL (query language) that abstracts programming models and supports typical data warehouse interactions.

Hive enables you to avoid the complexities of writing Tez jobs based on directed acyclic graphs (DAGs) or MapReduce programs in a lower level computer language, such as Java.

Hive extends the SQL paradigm by including serialization formats.

You can also customize query processing by creating table schema that matches your data, without touching the data itself.

In contrast to SQL (which only supports primitive value types such as dates, numbers, and strings), values in Hive tables are structured elements, such as JSON objects, any user-defined data type, or any function written in Java.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html
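As a sketch of how a Hive script is typically run on EMR, the snippet below builds the request payload for submitting a HiveQL file as an EMR step with `command-runner.jar`. The bucket names, script path, and cluster ID are illustrative placeholders, and the actual boto3 call is shown but not executed:

```python
# Sketch: payload for running a HiveQL script as an EMR step.
# Bucket names, paths, and the cluster ID are placeholders.
hive_step = {
    "Name": "Run HiveQL report",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hive-script",
            "--run-hive-script",
            "--args",
            "-f", "s3://example-bucket/scripts/report.q",   # HiveQL file
            "-d", "INPUT=s3://example-bucket/input",        # script variables
            "-d", "OUTPUT=s3://example-bucket/output",
        ],
    },
}

# To submit it (requires AWS credentials and a running cluster):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[hive_step])
```

Because the step is just a data structure until submitted, it can be assembled and validated locally before any AWS call is made.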

Option B is correct.

EMR Notebooks is a Jupyter Notebook environment built into the Amazon EMR console that allows you to quickly create Jupyter notebooks, attach them to Spark clusters, and then open the Jupyter Notebook editor in the console to remotely run queries and code.

An EMR notebook is saved in Amazon S3 independently from clusters for durable storage, quick access, and flexibility.

You can have multiple notebooks open, attach multiple notebooks to a single cluster, and re-use a notebook on different clusters.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyter-emr-managed-notebooks.html
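To illustrate the cluster-independent nature of EMR notebooks, the sketch below builds the parameters for running a notebook headlessly against an attached cluster via the EMR `StartNotebookExecution` API. The notebook ID, cluster ID, and role name are placeholders, and the boto3 call is shown but not executed:

```python
# Sketch: parameters for a headless EMR notebook run.
# The editor ID, cluster ID, and service role are placeholders.
notebook_run = {
    "EditorId": "e-XXXXXXXXXXXXXXXXXXXXXXXXX",    # EMR notebook (editor) ID
    "RelativePath": "demo.ipynb",                 # path within the notebook
    "ExecutionEngine": {"Id": "j-XXXXXXXXXXXXX"}, # attached EMR cluster
    "ServiceRole": "EMR_Notebooks_DefaultRole",
    "NotebookExecutionName": "nightly-report",
}

# To start the execution (requires AWS credentials):
# import boto3
# emr = boto3.client("emr")
# response = emr.start_notebook_execution(**notebook_run)
```

The same notebook, stored in S3, can be pointed at different clusters simply by changing the `ExecutionEngine` ID, which reflects the re-use described above.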

Option C is incorrect.

D3.js (or just D3 for Data-Driven Documents) is a JavaScript library for producing dynamic, interactive data visualizations in web browsers.

It makes use of the widely implemented SVG, HTML5, and CSS standards.

D3.js provides excellent visualizations, but it is not an EMR ecosystem component and requires a JavaScript environment to execute the code.

https://d3js.org/

Option D is correct.

Jupyter Notebook is an open-source web application that you can use to create and share documents that contain live code, equations, visualizations, and narrative text.

JupyterHub allows you to host multiple instances of a single-user Jupyter notebook server.

When you create a cluster with JupyterHub, Amazon EMR creates a Docker container on the cluster's master node.

JupyterHub, all the components required for Jupyter, and Sparkmagic run within the container.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub.html
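As a sketch of how JupyterHub is enabled on an EMR cluster, the fragment below shows part of a `run_job_flow` request that installs JupyterHub as an application and persists user notebooks to S3 via the documented `jupyter-s3-conf` configuration classification. The cluster name, release label, and bucket are placeholders, and the launch call itself is only indicated in a comment:

```python
# Sketch: fragment of an EMR cluster spec that installs JupyterHub and
# persists notebooks to S3. Names and the bucket are placeholders; the
# "jupyter-s3-conf" classification enables S3 persistence for JupyterHub.
cluster_spec = {
    "Name": "afs-jupyterhub-cluster",
    "ReleaseLabel": "emr-5.36.0",
    "Applications": [{"Name": "Spark"}, {"Name": "JupyterHub"}],
    "Configurations": [
        {
            "Classification": "jupyter-s3-conf",
            "Properties": {
                "s3.persistence.enabled": "true",
                "s3.persistence.bucket": "example-notebook-bucket",
            },
        }
    ],
}

# To launch (also requires instance configuration and IAM roles):
# import boto3
# boto3.client("emr").run_job_flow(**cluster_spec, Instances=..., ...)
```

When a cluster is created from a spec like this, EMR starts the JupyterHub Docker container on the master node, with sparkmagic and the Jupyter components running inside it.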

Option E is incorrect: Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, and more.

In the case of Zeppelin, Docker containers are not created for each user on the cluster's master node.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-zeppelin.html

To fulfill AFS's requirements for a web application to create and share documents containing live code, equations, visualizations, and narrative text, as well as allowing for the hosting of multiple instances of a single-user notebook server, we need to consider EMR Hadoop ecosystem components that support these features.

Option A: Apache Hive is a data warehouse system built on top of Hadoop. It is primarily used to query and analyze large datasets stored in the Hadoop Distributed File System (HDFS) or S3. While it can be used to generate reports, it is not designed to create and share documents containing live code, equations, and narrative text. Therefore, it is not a suitable option for this requirement.

Option B: EMR Notebooks is a web-based notebook application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. It supports multiple programming languages and libraries, including Spark, Python, R, and Scala, and provides a pre-configured environment for data exploration and analysis. Therefore, EMR Notebooks is a suitable option for this requirement.

Option C: D3.js is a JavaScript library for creating interactive data visualizations in web browsers. While it is a useful tool for creating visualizations, it does not provide support for creating and sharing documents containing live code, equations, and narrative text. Therefore, it is not a suitable option for this requirement.

Option D: JupyterHub is a multi-user server for Jupyter notebooks that allows for the hosting of multiple instances of a single-user notebook server. It provides a web-based interface for creating and sharing documents that contain live code, equations, visualizations, and narrative text. Jupyter notebooks support multiple programming languages and libraries, including Spark, Python, R, and Scala. On Amazon EMR, JupyterHub, the Jupyter components, and sparkmagic all run inside a Docker container on the cluster's master node, which matches the stated requirement. Therefore, JupyterHub is a suitable option.

Option E: Zeppelin is a web-based notebook application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text, and it supports multiple languages including Spark, SQL, and Scala. However, when Zeppelin is used on EMR, the service does not create a Docker container with sparkmagic on the cluster's master node for each user. Because it does not satisfy that stated requirement, Zeppelin is not a suitable option.

Therefore, the two suitable options to fulfill AFS's requirements are EMR Notebooks and JupyterHub.