Predicting Global Purchasing Patterns with AWS Machine Learning - MLS-C01 Exam Prep

Simplifying Data Transformations and Attribute Combination for Daily Model Training

Question

You work for a machine learning team at a global retail auto parts chain.

Your team ingests purchasing data from its 100,000 global auto parts stores to S3 using Kinesis Data Firehose.

You are now ready to start training an improved machine learning model that will be used to predict purchasing patterns by global region.

The training data requires additional simple transformations.

Also, you will need to combine some data attributes.

Finally, your team expects to train the model on a daily basis. Given the large number of stores and the constantly changing ingested data, which of the following options requires the least administration and development effort?

Answers

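A. Have the stores capture their purchasing data locally on Storage Gateway, load the data into S3, and transform the data using Glue.

B. Create an EMR cluster with Apache Spark installed to perform the transformation logic.

C. Create a fleet of EC2 instances that run the transformation logic, transforming the incremental data records on S3 and writing the transformed records to S3.

D. Create a Kinesis Data Analytics stream and use it as the destination of the Kinesis Data Firehose stream. Use Kinesis Data Analytics to transform the raw purchasing data attributes into transformed values using SQL and write the transformed data to S3.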

Answer: D.

Option A is incorrect.

Having 100,000 stores use Storage Gateway to move the data to S3 would require a tremendous administrative effort.

Option B is incorrect.

Using EMR for this solution would incur the administrative cost of building and maintaining an EMR cluster.

Also, your development team would have to write the Apache Spark code to perform the transformations.

Option C is incorrect.

Using a fleet of EC2 instances would incur the administrative cost of creating and maintaining those instances.

Also, your development team would have to write the transformation logic that runs on the EC2 instances.

Option D is CORRECT.

Kinesis Data Analytics can receive your data from Kinesis Data Firehose, transform it, and then write it to S3.

The SQL needed to perform the transformations in Kinesis Data Analytics would be much simpler than the code required by the other options.

Your machine learning model can then use the transformed data in S3 for training.
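Because training runs daily, the training job needs to locate one day's worth of transformed records in S3. The following is a minimal sketch, assuming the delivery stream that writes the transformed data uses Firehose's default UTC YYYY/MM/dd/HH key layout. The function name and the `transformed/` base prefix are hypothetical and not part of the original question.

```python
from datetime import date


def daily_training_prefix(day: date, base_prefix: str = "transformed/") -> str:
    """Return the S3 key prefix that holds one UTC day of transformed records."""
    # Assumption: objects are stored under base_prefix using Firehose's
    # default YYYY/MM/dd/HH layout, so one day maps to one YYYY/MM/dd/ prefix.
    return f"{base_prefix}{day:%Y/%m/%d}/"


if __name__ == "__main__":
    print(daily_training_prefix(date(2024, 3, 5)))  # transformed/2024/03/05/
```

A daily training job could pass this prefix as its S3 input location, so each run reads only that day's objects.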

Reference:

Please see the Amazon Kinesis Data Analytics developer guide titled Example: Writing to an Amazon S3 Bucket.

Please refer to the Amazon Kinesis Data Firehose developer guide titled Using Amazon Kinesis Data Analytics.

The option that would require the least amount of administration and development effort is D. Create a Kinesis Data Analytics stream and use it as the destination of the Kinesis Data Firehose stream. Use Kinesis Data Analytics to transform the raw purchasing data attributes into transformed values using SQL and write the transformed data to S3.

Here's why:

Option A, having the stores capture their purchasing data locally on Storage Gateway, load the data into S3, and transform it using Glue, requires each store to capture its own data, which could lead to inconsistencies in data quality and format. Additionally, using Glue to transform the data requires development effort to create the transformations and to verify that they run correctly.

Option B, creating an EMR cluster with Apache Spark installed to perform the transformation logic, is a more complex option that would require a significant amount of administration and development effort. The team would need to set up and maintain an EMR cluster, develop the transformation logic in Spark, and ensure the cluster is running correctly and efficiently.

Option C, creating a fleet of EC2 instances that run the transformation logic, transforming the incremental data records on S3 and writing the transformed records back to S3, is similar to Option B: it requires significant administration and development effort to set up and maintain the EC2 instances, develop the transformation logic, and keep everything running correctly and efficiently.

Option D, creating a Kinesis Data Analytics stream and using it as the destination of the Kinesis Data Firehose stream, is the best option because it requires the least administration and development effort. The team can use SQL to perform the necessary transformations on the incoming data, and Kinesis Data Analytics automatically scales its processing capacity to handle the data volume. Because Kinesis Data Analytics is a managed service, there is also little administration needed to keep it running correctly.
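To give a sense of how little code Option D needs, here is a minimal sketch of what the SQL application code might look like, held in a Python string, together with a plain Python function that mirrors the same logic for local checking. Every column name (`store_id`, `region`, `part_category`, `part_number`, `quantity`, `unit_price`) is an illustrative assumption, since the question does not describe the data schema. `SOURCE_SQL_STREAM_001` is the default in-application input stream name, and the output stream and pump names are arbitrary.

```python
from typing import Any, Dict

# Hypothetical Kinesis Data Analytics SQL application code.
# One simple transformation (upper-casing the region) and two combined
# attributes (a part key and a total price). All columns are assumptions.
KDA_SQL = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "store_id"    VARCHAR(16),
    "region"      VARCHAR(32),
    "part_key"    VARCHAR(64),
    "total_price" DOUBLE
);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM
        "store_id",
        UPPER("region"),
        "part_category" || '-' || "part_number",
        "quantity" * "unit_price"
    FROM "SOURCE_SQL_STREAM_001";
"""


def transform_record(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the same per-record logic as the SQL above, for local checks."""
    return {
        "store_id": raw["store_id"],
        # Simple transformation: normalize the region to upper case.
        "region": raw["region"].upper(),
        # Combined attribute: category and part number joined with a hyphen.
        "part_key": f'{raw["part_category"]}-{raw["part_number"]}',
        # Combined attribute: quantity multiplied by unit price.
        "total_price": raw["quantity"] * raw["unit_price"],
    }
```

For comparison, Options B and C would need the equivalent logic plus cluster or instance provisioning, job scheduling, and scaling code around it.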