A data analytics team wishes to develop a solution for real-time data export from a DynamoDB table into an S3 bucket in Parquet file format for further data analytics processing.
Which solution implements these requirements?
A. Use the AWS Data Pipeline service to manage the Amazon EMR jobs for data export, conversion, and data import.
B. Use AWS Step Functions to manage the data export, conversion, and data import workflow. Use an AWS Glue job to perform the data export, conversion, and data import.
C. Enable DynamoDB Streams. Create a Lambda function to poll the DynamoDB stream, perform data conversion, and deliver batch records to S3.
D. Enable DynamoDB Streams. Create a Lambda function to poll the DynamoDB stream and push items to Kinesis Data Firehose. Use Data Firehose to perform data conversion and store the data in S3.

Answer: D.
Option A is incorrect because using AWS Data Pipeline with EMR is a batch solution.
This does not meet the real-time streaming requirements.
Option B is incorrect because using AWS Step Functions with AWS Glue is a batch solution. This does not meet the real-time streaming requirements.
Option C is incorrect because more complex code is required to implement the data conversion in the Lambda function. Additionally, writing records to S3 in batches does not meet the real-time streaming requirements.
Option D is correct because a Lambda function can be used to read data from the stream and write it to Kinesis Data Firehose. Kinesis Data Firehose can perform the conversion to the Parquet data format and natively writes the output to S3.
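A minimal sketch of the Lambda side of this pipeline, assuming a Python function invoked by a DynamoDB Streams event source mapping; the delivery stream name is hypothetical, and in practice the DynamoDB-typed attributes would be unmarshalled to match the target schema before conversion:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream name; replace with your own.
DELIVERY_STREAM_NAME = "dynamodb-to-s3-parquet"


def lambda_handler(event, context):
    """Forward DynamoDB stream records to Kinesis Data Firehose."""
    records = []
    for record in event.get("Records", []):
        # NewImage holds the item after an INSERT or MODIFY event.
        new_image = record.get("dynamodb", {}).get("NewImage")
        if new_image is None:
            continue
        # In practice, unmarshal the DynamoDB-typed JSON into plain JSON
        # that matches the Glue table schema used for Parquet conversion.
        records.append({"Data": (json.dumps(new_image) + "\n").encode("utf-8")})

    if records:
        # PutRecordBatch accepts up to 500 records per call.
        firehose.put_record_batch(
            DeliveryStreamName=DELIVERY_STREAM_NAME,
            Records=records,
        )
    return {"forwarded": len(records)}
```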
Reference:
https://aws.amazon.com/blogs/big-data/how-factset-automated-exporting-data-from-amazon-dynamodb-to-amazon-s3-parquet-to-build-a-data-analytics-platform/
https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html

The correct answer for this scenario is Option D: Enable DynamoDB Streams. Create a Lambda function to poll the DynamoDB stream and push items to Kinesis Data Firehose. Use Data Firehose to perform data conversion and store the data in S3.
Explanation:
Option A - Use the AWS Data Pipeline service to manage the Amazon EMR jobs for data export, conversion, and data import.
AWS Data Pipeline is a web service that enables you to schedule regular data movement and data processing activities in the AWS Cloud. It is used to automate the movement and transformation of data from one location to another. While this solution could export the data from DynamoDB, convert it to Parquet, and import it into S3, it runs as a scheduled batch process rather than in real time, so it does not meet the data analytics team's requirement.
Option B - Use AWS Step Functions to manage the data export, conversion, and data import workflow. Use an AWS Glue job to perform the data export, conversion, and data import.
AWS Step Functions is a web service that enables you to coordinate distributed applications using visual workflows. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. While both services are very useful, this solution would require considerable configuration, management, and monitoring, and it is still a batch workflow, so it does not meet the data analytics team's real-time requirement.
Option C - Enable DynamoDB Streams. Create a Lambda function to poll the DynamoDB stream, perform data conversion, and deliver batch records to S3.
DynamoDB Streams is a feature that allows you to capture data modification events in DynamoDB tables in near real-time. By enabling DynamoDB Streams and creating a Lambda function to poll the stream, you can capture the changes to the DynamoDB table and export them as they occur. However, the Lambda function itself would have to implement the Parquet conversion, which requires more complex code, and delivering records to S3 in batches does not meet the real-time streaming requirement, as the sketch below illustrates.
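To make the drawback concrete, here is a rough sketch of what the Option C Lambda would have to do itself, assuming a pyarrow layer is attached to the function and using a hypothetical bucket name; note that each invocation also produces its own small batch file in S3:

```python
import io
import boto3
import pyarrow as pa            # assumes a pyarrow Lambda layer is attached
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BUCKET = "analytics-export-bucket"  # hypothetical bucket name


def lambda_handler(event, context):
    """Convert DynamoDB stream records to Parquet and write a batch file to S3."""
    rows = []
    for record in event.get("Records", []):
        image = record.get("dynamodb", {}).get("NewImage")
        if image:
            # DynamoDB-typed JSON has to be unmarshalled by hand here.
            rows.append({k: list(v.values())[0] for k, v in image.items()})
    if not rows:
        return {"written": 0}

    # The function owns the whole conversion: build a table, serialize it
    # to Parquet, and upload one file per invocation.
    table = pa.Table.from_pylist(rows)
    buf = io.BytesIO()
    pq.write_table(table, buf)
    s3.put_object(
        Bucket=BUCKET,
        Key=f"export/{context.aws_request_id}.parquet",
        Body=buf.getvalue(),
    )
    return {"written": len(rows)}
```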
Option D - Enable DynamoDB Streams. Create a Lambda function to poll the DynamoDB stream and push items to Kinesis Data Firehose. Use Data Firehose to perform data conversion and store the data in S3.
This solution builds on the same stream-and-Lambda pattern as Option C, but adds Kinesis Data Firehose to perform the data conversion and the S3 delivery. Kinesis Data Firehose is a fully managed service that can capture and transform streaming data in real time. By using Kinesis Data Firehose, you can convert the data to Parquet format and store it in S3 for further analysis without writing conversion code yourself. This is the most appropriate solution for the data analytics team's requirement; a configuration sketch follows.
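The record format conversion itself is configured on the delivery stream. A sketch using boto3, with placeholder ARNs and a hypothetical AWS Glue Data Catalog table that defines the target Parquet schema (format conversion requires a buffering size of at least 64 MB):

```python
import boto3

firehose = boto3.client("firehose")

# All names and ARNs below are placeholders for this sketch.
firehose.create_delivery_stream(
    DeliveryStreamName="dynamodb-to-s3-parquet",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::analytics-export-bucket",
        "Prefix": "dynamodb-export/",
        # Format conversion requires a buffer size of at least 64 MB.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # Incoming JSON records are deserialized ...
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            # ... and serialized to Parquet before delivery to S3.
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            # The target schema comes from an AWS Glue Data Catalog table.
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
                "DatabaseName": "analytics",
                "TableName": "dynamodb_items",
                "Region": "us-east-1",
                "VersionId": "LATEST",
            },
        },
    },
)
```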
In summary, Option D is the best solution for real-time data export from a DynamoDB table into an S3 bucket in Parquet file format for further data analytics processing.