You work as a machine learning specialist at a retail clothing chain.
Your team is building a model and uses Kinesis Data Firehose to ingest transaction records from the chain's 50,000 stores throughout the country.
You are building your training data store from the streaming transactions received through Kinesis Data Firehose.
The transaction records used for training require only simple transformations: you need to combine some attributes and drop others.
Also, you need to retrain the model daily.
Which option will meet your requirements with the least effort?
A. Stream the transaction records from Kinesis Data Firehose into a Kinesis Data Analytics for Apache Flink application, use Flink's built-in operators to transform and combine/drop attributes, and write the transformed records to S3.
B. Stream the transaction records from Kinesis Data Firehose to S3, then launch an ECS cluster that runs tasks to transform and combine/drop attributes on the data records in S3.
C. Stream the transaction records from Kinesis Data Firehose to S3, then run an EMR cluster with Apache Hadoop and Apache Presto to transform and combine/drop attributes on the data records in S3.
D. Use Kinesis Data Streams instead of Kinesis Data Firehose, write a Kinesis Client Library application that runs on EC2 instances to transform and combine/drop attributes, and write the transformed data to S3.
Correct Answer: A.
Option A is correct.
With this option, you can transform and combine/drop attributes on your records in Kinesis Data Analytics for Apache Flink using Flink's built-in operators.
Using the file sink integration, Kinesis Data Analytics can then write the transformed records to S3 for use in your model training.
This option requires the least amount of effort on your part.
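For illustration, here is a minimal PyFlink Table API sketch of this approach. The stream name, S3 path, region, and record schema are hypothetical values invented for the example; they are not part of the scenario. The application reads transaction records from a Kinesis stream, uses built-in operators to combine quantity and unit price into a single revenue attribute while dropping an internal-only attribute, and writes the result to S3 through Flink's filesystem sink.

from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming Table API environment; in Kinesis Data Analytics the job
# lifecycle and scaling are managed by the service.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table over the incoming transaction records
# (stream name, region, and schema are assumptions for this sketch).
t_env.execute_sql("""
    CREATE TABLE transactions (
        store_id STRING,
        item_id STRING,
        quantity INT,
        unit_price DOUBLE,
        internal_note STRING,
        tx_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'retail-transactions',
        'aws.region' = 'us-east-1',
        'format' = 'json'
    )
""")

# Sink table: an S3 prefix that becomes the daily training data store.
t_env.execute_sql("""
    CREATE TABLE training_records (
        store_id STRING,
        item_id STRING,
        revenue DOUBLE,
        tx_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'filesystem',
        'path' = 's3://my-training-bucket/transactions/',
        'format' = 'csv'
    )
""")

# The simple transformation: combine quantity and unit_price into revenue,
# drop internal_note, and keep only the attributes needed for training.
t_env.execute_sql("""
    INSERT INTO training_records
    SELECT store_id, item_id, quantity * unit_price AS revenue, tx_time
    FROM transactions
""").wait()

Because Kinesis Data Analytics runs and scales the Flink job for you, the only logic you own is the table definitions and the SELECT statement, which is why this option involves the least operational effort.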
Option B is incorrect.
This option requires the effort of writing transformation code to run in your ECS containers, as well as the administrative effort of launching and maintaining the ECS cluster, containers, and tasks.
Option C is incorrect.
This option requires significantly more effort because you have to launch and maintain the EMR cluster.
You would also have to write the Apache Hadoop MapReduce logic and Apache Presto queries to perform the transformations.
Option D is incorrect.
Using Kinesis Data Streams instead of Kinesis Data Firehose would require you to write a Kinesis Client Library application and run it on EC2 instances that you would have to launch and maintain yourself in order to transform the data and write the transformed records to S3.
This would be significantly more effort than using a Kinesis Data Analytics for Apache Flink application with Flink's built-in operators.
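To make the comparison concrete, below is a simplified Python stand-in for that consumer, using boto3 directly rather than the actual Kinesis Client Library, with hypothetical stream, bucket, and field names. Even this stripped-down version has to fetch records, apply the transformation, batch the output to S3, and track shard iterators itself; a production KCL application would additionally have to handle shard discovery, resharding, checkpointing, scaling, and the EC2 fleet it runs on.

import json
import time

import boto3

# Hypothetical names used only for this sketch.
STREAM_NAME = "retail-transactions"
BUCKET = "my-training-bucket"

kinesis = boto3.client("kinesis", region_name="us-east-1")
s3 = boto3.client("s3")

# Read from a single shard; a real consumer must enumerate all shards,
# checkpoint its position, and survive resharding.
shard_id = kinesis.describe_stream(StreamName=STREAM_NAME)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

batch, part = [], 0
while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=500)
    for record in resp["Records"]:
        tx = json.loads(record["Data"])
        # The same simple transformation: combine two attributes, drop the rest.
        batch.append({
            "store_id": tx["store_id"],
            "item_id": tx["item_id"],
            "revenue": tx["quantity"] * tx["unit_price"],
        })
    if len(batch) >= 1000:
        part += 1
        s3.put_object(
            Bucket=BUCKET,
            Key=f"transactions/part-{part:06d}.json",
            Body="\n".join(json.dumps(r) for r in batch).encode("utf-8"),
        )
        batch = []
    iterator = resp.get("NextShardIterator")
    time.sleep(1)  # stay under the per-shard GetRecords limits

All of this code, plus the EC2 instances it runs on, would be yours to operate, whereas the Flink option leaves that work to the managed service.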
References:
Please see the Amazon Kinesis Data Analytics FAQs (refer to the question “What integrations are supported in a Kinesis Data Analytics for Apache Flink application?”) (https://aws.amazon.com/kinesis/data-analytics/faqs/),
the Amazon Kinesis Data Analytics developer guide titled Example: Writing to an Amazon S3 Bucket (https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-s3.html),
the Amazon EMR page titled Apache Hadoop on Amazon EMR (https://aws.amazon.com/emr/features/hadoop/),
and the Amazon EMR management guide titled What is EMR? (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html).