You have developed an application that processes a massive volume of logs generated by a website and a mobile app.
This application requires the ability to analyze petabytes of unstructured data using Amazon Elastic MapReduce.
The resultant data is stored on Amazon S3.
You have deployed the c4.8xlarge instance type, whose CPUs are mostly idle during the data processing.
Which of the below options would be the most cost-efficient way to reduce the log processing job's runtime?
A. B. C. D.
Answer - C.
Option A is incorrect even though storing the files on an S3 storage class such as RRS would reduce the cost.
The problem in the scenario is that the large provisioned instance is wasted because it is idle most of the time.
Option B is incorrect because adding more c4.8xlarge instances in a task instance group would create more idle resources, which would in fact cost more.
Option C is CORRECT because mostly idle CPUs indicate that you have provisioned a larger instance than the workload needs, so it is under-utilized.
A more cost-efficient solution would be to use smaller instances.
For batch processing jobs such as the one in this scenario, you can use multiple t2 instances, which support CPU bursting and are well suited to workloads that need high CPU only during certain periods.
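The reasoning behind Option C can be illustrated with a rough back-of-the-envelope model. The sketch below assumes the job is I/O-bound (which is what mostly idle CPUs suggest), so runtime depends on aggregate I/O throughput rather than CPU capacity. The class name, instance counts, throughput figures, prices, and data size are all hypothetical values chosen for illustration; they are not AWS specifications or prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterOption:
    """Hypothetical cluster layout; none of these figures are AWS specs or prices."""
    name: str
    instance_count: int
    io_mb_per_s: float   # assumed sustained I/O throughput per instance
    hourly_price: float  # assumed price per instance-hour, arbitrary units

    def runtime_hours(self, data_mb: float) -> float:
        # For an I/O-bound job, runtime follows aggregate throughput, not CPU.
        aggregate_mb_per_s = self.instance_count * self.io_mb_per_s
        return data_mb / aggregate_mb_per_s / 3600

    def job_cost(self, data_mb: float) -> float:
        # Every instance is billed for the whole runtime, busy or idle.
        return self.runtime_hours(data_mb) * self.instance_count * self.hourly_price


DATA_MB = 7_200_000  # made-up input size

few_large = ClusterOption("few large", 4, 500.0, 2.0)
many_small = ClusterOption("many small", 16, 250.0, 0.5)

for option in (few_large, many_small):
    print(option.name, option.runtime_hours(DATA_MB), option.job_cost(DATA_MB))
```

With these made-up inputs, the fleet of smaller instances has twice the aggregate throughput, so it finishes in half the time at half the cost. Whether that holds for a real cluster depends on measured throughput and actual pricing.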
Option D is incorrect even though storing the files on an S3 storage class such as RRS would reduce the cost.
The problem in the scenario is that the large provisioned instance is wasted because it is idle most of the time.
For more information on resizing EC2 instances, see the URL below:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-resize.html
The most cost-efficient way to reduce the log processing job's runtime in this scenario would be to add additional c4.8xlarge instances by introducing a task instance group, as mentioned in option B.
Option A suggests creating log files with smaller sizes and storing them on Amazon S3. Although this would help with data organization and lifecycle management, it would not necessarily reduce the processing time of the data. Additionally, moving the files to RRS and then to Amazon Glacier vaults might affect the data availability and increase retrieval time.
Option C suggests using smaller instances that have higher aggregate I/O performance. While smaller instances might have higher I/O performance, they may not have enough resources to process the massive amount of data efficiently. Using more instances with the current instance type would be a more practical solution.
Option D suggests creating fewer, larger log files and compressing and storing them on the Amazon S3 bucket. While compressing the files may reduce storage costs, it may increase processing time since the system would need to decompress the files before analyzing them. Additionally, this approach could lead to data availability issues and longer retrieval times when accessing compressed files from Glacier.
Therefore, option B is the most cost-efficient way to reduce the log processing job's runtime in this scenario. Adding additional c4.8xlarge instances would increase the processing speed and reduce the load on each node of the EMR cluster. The 10 Gigabit network performance of each c4.8xlarge instance would further enhance data processing speed. This would allow the application to analyze petabytes of unstructured data faster, which is necessary for efficient log processing.