AWS EMR Cluster Configuration for Allianz Financial Services (AFS) | Optimize TCO

Question

Allianz Financial Services (AFS) is a banking group offering end-to-end banking and financial solutions in South East Asia through its consumer banking, business banking, Islamic banking, investment finance, and stock broking businesses, as well as unit trust and asset administration, having served the financial community over the past five decades. AFS is planning to host an EMR cluster to run its analytical workloads, complementing its existing ETL jobs built on AWS Data Pipeline.

AFS worked with you to identify and understand the configuration needed to support their workload:

  • A cluster to coordinate the distribution of data and tasks among the other nodes for processing, for daily, weekly, and monthly jobs.
  • A fixed workload of 4 compute-and-storage nodes to address the daily workload.
  • 2 more compute nodes to support the weekly and monthly workloads.
  • These nodes support both Data Pipeline and EMR processing and storage.

Please identify the minimum artifacts in EMR that also optimize overall TCO.

Select 2 options.

Answers

Explanations

A. B. C. D. E. F.

Answer: A and E.

Option A is correct - The master node manages the cluster by running software components that coordinate the distribution of data and tasks among the other nodes for processing.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html#emr-overview-clusters

Option B is incorrect - A primary node is not a component of EMR; it is a component of standard Hadoop.

Option C is incorrect - 4 storage-and-compute nodes are mandatory as per the question, so a range of 2-4 core nodes cannot guarantee the daily workload is covered.

Option D is incorrect - A task node runs software components that only execute tasks and does not store data in HDFS.

Task nodes cannot meet the data requirements because they provide only compute, not storage.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html#emr-overview-clusters

Option E is correct - Core nodes run software components that both process tasks and store data in HDFS, so they satisfy the combined compute-and-storage requirement.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html#emr-overview-clusters

Option F is incorrect - Data Pipeline uses the configuration of the EMR cluster that is already running.

Data Pipeline cannot launch core and task nodes on its own when it is configured to run against an existing EMR cluster.

https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-launch-emr-jobflow.html

For hosting the EMR cluster for AFS, the minimum artifacts required to optimize overall TCO are:

  • Option C: 2-4 core nodes with software components that run tasks and store data, to support partial workloads (daily/weekly/monthly).
  • Option E: 4-6 core nodes overall with software components that run tasks and store data, to support the full workload.

Explanation:

  • Core nodes are responsible for storing and processing data in Hadoop Distributed File System (HDFS).
  • Task nodes are responsible for running tasks assigned to them by the EMR cluster.
  • The master node coordinates the distribution of data and tasks among other nodes for processing.
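As a minimal sketch of the node roles above (assuming illustrative instance types and group names, which are not part of the question), the 1 master + 4 daily core + 2 periodic core layout could be expressed as an EMR instance-groups structure in the shape accepted by boto3's `run_job_flow`:

```python
# Hypothetical sketch of the AFS cluster layout as EMR instance groups.
# Instance types and group names are illustrative assumptions.
def afs_instance_groups(daily_core: int = 4, periodic_core: int = 2) -> list[dict]:
    """Instance-group layout: 1 master plus core nodes (compute + HDFS storage)."""
    return [
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        {"Name": "DailyCore", "InstanceRole": "CORE",
         "InstanceType": "m5.xlarge", "InstanceCount": daily_core},
        {"Name": "PeriodicCore", "InstanceRole": "CORE",
         "InstanceType": "m5.xlarge", "InstanceCount": periodic_core},
    ]

groups = afs_instance_groups()
core_total = sum(g["InstanceCount"] for g in groups if g["InstanceRole"] == "CORE")
print(core_total)  # 6 core nodes: 4 daily + 2 weekly/monthly
```

Because every worker here is a core node, each one provides both task processing and HDFS storage, matching the compute-and-storage requirement in the question.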

Based on the requirement mentioned, the following artifacts are needed:

  1. A cluster to coordinate the distribution of data and tasks among other nodes for processing, for daily, weekly, and monthly jobs. This suggests that a master node is required to coordinate the distribution of data and tasks among the other nodes for processing.

  2. Fixed workload of 4 compute and storage nodes to address their daily workload. This suggests that 4 core nodes are required to store and process data for daily jobs.

  3. 2 more compute nodes to support their weekly and monthly workloads. This suggests that 2 more core nodes are required to store and process data for weekly and monthly jobs.

  4. These nodes support both Data Pipeline and EMR processing and storage. This suggests that the same nodes can be used for both Data Pipeline and EMR processing and storage.

Therefore, to optimize overall TCO, Option C and Option E are the minimum artifacts required. Option C provides the necessary number of core nodes for the partial workload, and Option E provides the additional core nodes for the full workload.
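One way to read the TCO angle, as a hedged sketch: keep the 4 fixed core nodes running for daily jobs and size up to 6 only around the weekly and monthly runs (the actual resize would go through EMR's ModifyInstanceGroups API; the helper below, an illustration not taken from the question, only computes the target count):

```python
# Illustrative helper (an assumption, not part of the question): compute the
# target core-node count so the 2 extra nodes run only for weekly/monthly jobs.
def desired_core_nodes(job_type: str, base: int = 4, burst: int = 2) -> int:
    if job_type not in ("daily", "weekly", "monthly"):
        raise ValueError(f"unknown job type: {job_type}")
    return base + (burst if job_type in ("weekly", "monthly") else 0)

print(desired_core_nodes("daily"))    # 4
print(desired_core_nodes("monthly"))  # 6
```

Running only the base 4 core nodes most of the time, and the full 6 only when the periodic jobs need them, is what keeps the configuration at minimum cost.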

Option A and B are incorrect, as both suggest only a single master node for coordinating the distribution of data and tasks, which is not enough to support the given workload.

Option D is incorrect, as it suggests 4-6 task nodes, which are not required based on the given requirement.

Option F is incorrect, as it suggests that Data Pipeline launches its own nodes, which is not required based on the given requirement.