Setting Up an EMR Cluster with Machine Learning Capabilities for Fraud Detection | AWS Certified Big Data - Specialty Exam

Ideal Application for an EMR Cluster with Machine Learning Capabilities

Question

A company is planning on setting up an EMR Cluster in AWS.

They need to ensure that the cluster has machine learning capabilities.

The data being ingested will be from various log files from EC2 Instances located in AWS.

The data will be analyzed to detect fraud.

Which of the following would be the ideal application to use along with the underlying EMR Cluster?

Answers

Explanations


A. B. C. D.

Answer - C.

The AWS Documentation mentions the following.

Spark natively supports applications written in Scala, Python, and Java.

It also includes several tightly integrated libraries for SQL (Spark SQL), machine learning (MLlib), stream processing (Spark Streaming), and graph processing (GraphX).

These tools make it easier to leverage the Spark framework for a wide variety of use cases.

The ideal application with integrated machine learning capabilities is Apache Spark, so the other options are less suitable.

For more information on Spark on EMR, refer to the URL below.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html

Spark is the ideal application for adding machine learning capabilities to an EMR cluster. It is an open-source distributed computing system for fast, general-purpose processing of large data sets, and its MLlib library provides machine learning algorithms that run on the same cluster as the data processing.
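As a rough illustration of what "using Spark along with the EMR cluster" means in practice, the sketch below builds the keyword arguments for boto3's EMR `run_job_flow` call with Spark listed as the cluster application. The function name `build_emr_request`, the release label, the instance types and counts, and the default role names are assumptions chosen for the example, not values taken from the question.

```python
def build_emr_request(cluster_name, log_uri, release_label="emr-6.15.0"):
    """Return keyword arguments for the boto3 EMR client's run_job_flow call.

    All concrete values (release label, instance types, counts, role names)
    are illustrative assumptions; adjust them for a real account.
    """
    return {
        "Name": cluster_name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,
        # Listing Spark here installs Spark, and with it MLlib, on the cluster.
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 2,
                },
            ],
            # Keep the cluster running so Spark jobs can be submitted later.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With boto3 installed and credentials configured, the request could be submitted as `boto3.client("emr").run_job_flow(**build_emr_request("fraud-detection", "s3://my-bucket/emr-logs/"))`, where the cluster name and bucket are placeholders.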

In this scenario, the data being ingested is from various log files from EC2 instances located in AWS, which is a big data use case. By using Spark on the EMR cluster, the company can process the large-scale data efficiently and apply machine learning algorithms to detect any fraud patterns.
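A minimal sketch of that workflow in PySpark follows. It assumes a made-up log format (`user=... amount=... failed_logins=... label=...`) with a known fraud label on each line; the question does not describe the real log format, so the pattern, the feature names, and the choice of logistic regression are all assumptions for illustration.

```python
import re

# Hypothetical log format, assumed for illustration only:
#   "<timestamp> user=<id> amount=<number> failed_logins=<int> label=<0|1>"
LOG_PATTERN = re.compile(
    r"user=(?P<user>\S+) amount=(?P<amount>\d+(?:\.\d+)?) "
    r"failed_logins=(?P<failed>\d+) label=(?P<label>[01])"
)


def parse_log_line(line):
    """Return (user, amount, failed_logins, label), or None if the line does not match."""
    m = LOG_PATTERN.search(line)
    if m is None:
        return None
    return (
        m.group("user"),
        float(m.group("amount")),
        float(m.group("failed")),
        float(m.group("label")),
    )


def train_fraud_model(spark, log_path):
    """Fit a logistic regression on parsed log lines using Spark MLlib."""
    # Imported here so the parsing helper can be used without PySpark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Read raw log lines, parse them in parallel, and drop lines that do not match.
    rows = (
        spark.sparkContext.textFile(log_path)
        .map(parse_log_line)
        .filter(lambda r: r is not None)
    )
    df = spark.createDataFrame(rows, ["user", "amount", "failed_logins", "label"])

    # Combine the numeric columns into the single vector column MLlib expects.
    assembler = VectorAssembler(
        inputCols=["amount", "failed_logins"], outputCol="features"
    )
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    return Pipeline(stages=[assembler, lr]).fit(df)
```

On the cluster, `train_fraud_model` would be called with a `SparkSession` (for example from `SparkSession.builder.getOrCreate()`) and the location of the collected log files. The returned model's `transform` method then scores new log records.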

Hive is a data warehousing system for querying and analyzing large datasets stored in distributed storage systems. It is not a machine learning application and does not provide the necessary capabilities to enable machine learning on the EMR cluster.

Presto is an open-source distributed SQL query engine designed for running interactive analytic queries against data sources of all sizes. While Presto is fast and efficient at processing large amounts of data, it is not a machine learning application and does not provide the necessary libraries and tools for machine learning.

HBase is a NoSQL database designed to handle large volumes of structured data across many commodity servers. It is not a machine learning application and does not provide the necessary libraries and tools for machine learning.

Therefore, Spark is the best option for this scenario: it provides the necessary machine learning libraries and tools and is well suited to processing large-scale data sets.