Creating a Scalable and Cost-Effective Data Repository for Machine Learning | Answer to AWS Certified Machine Learning Specialty Exam

Scalable and Cost-Effective Data Repository for Machine Learning

Question

You work for a global consulting company as a machine learning specialist.

You work with a team of data scientists that continually create datasets for your consultancy's analysis and trend prediction work using machine learning.

You have been assigned the job of creating a data repository to store the large amount of training data generated by your data scientists for use in your machine learning models.

You do not know how many new datasets your data scientists will create each day, so your solution must scale automatically, and your management team wants the storage solution to be cost-effective.

Also, the data scientists and machine learning specialists must be able to query the data using SQL.

Which option is the best solution to meet your requirements?

Answers

Explanations


A. Have your data scientists store their new datasets in DynamoDB using global tables.

B. Have your data scientists store their new datasets as tables in a Redshift cluster using RA3 nodes with managed storage and Redshift Spectrum.

C. Have your data scientists store their new datasets as files in an EFS volume attached to EC2 instances.

D. Have your data scientists store their new datasets as files in S3.

Correct Answer: D.

Option A is incorrect.

DynamoDB is not the most cost-effective option when compared to S3, and nothing in the requirements suggests a need for global data distribution.

Also, when you need to use the data in your machine learning models, you will first have to extract it from DynamoDB and load it into S3.

Option B is incorrect.

Redshift using RA3 node types with managed storage gives you very fast query access to your data, but it is not cost-effective when compared to S3.

Also, when you need to use the data in your machine learning models, you would have to unload the data from Redshift into S3, or keep the data in S3 and query it through Redshift Spectrum.

Either way, this adds another layer of complexity and cost.

Option C is incorrect.

Using EC2 instances with EFS volumes to store your data is far less cost-effective than using S3.

Also, when you need to use the data in your machine learning models, you will have to move it from the EFS volumes into S3.

Option D is correct.

S3 is cost-effective, scales automatically to any dataset volume and size, and is where your data needs to be when you use it in your machine learning models.

References:

Please see the Amazon Machine Learning blog titled Building secure machine learning environments with Amazon SageMaker (https://aws.amazon.com/blogs/machine-learning/building-secure-machine-learning-environments-with-amazon-sagemaker/),

the Amazon Redshift product page titled Amazon Redshift Pricing (https://aws.amazon.com/redshift/pricing/),

and the Amazon DynamoDB product page titled Amazon DynamoDB Pricing (https://aws.amazon.com/dynamodb/pricing/).

The best solution to meet the given requirements is option D - Have your data scientists store their new datasets as files in S3.

Amazon Simple Storage Service (S3) is an object storage service that provides scalable, durable, and secure storage for objects (which can be any type of data, such as documents, images, videos, or machine learning datasets). S3 is designed to provide high durability and availability for stored objects, and it can automatically scale to accommodate any amount of data.
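As a sketch of how new datasets might land in S3, the snippet below builds a date-partitioned object key (the `year=/month=/day=` convention that Athena and Glue can use for partition pruning) and shows the single boto3 call an upload would take. The bucket, prefix, and dataset names are placeholder assumptions, not anything from the question.

```python
from datetime import date


def dataset_key(team: str, dataset: str, created: date, filename: str) -> str:
    """Build a date-partitioned S3 object key so query engines such as
    Athena can prune partitions instead of scanning every object."""
    return (
        f"training-data/{team}/{dataset}/"
        f"year={created.year}/month={created.month:02d}/day={created.day:02d}/"
        f"{filename}"
    )


key = dataset_key("forecasting", "sales", date(2023, 5, 1), "part-0000.parquet")
print(key)
# training-data/forecasting/sales/year=2023/month=05/day=01/part-0000.parquet

# Uploading is then one boto3 call (requires AWS credentials and a real bucket):
# import boto3
# boto3.client("s3").upload_file("part-0000.parquet", "my-datalake-bucket", key)
```

Because S3 is an object store, this naming scheme is purely a convention, but it is what lets SQL engines later treat the flat key space as a partitioned table.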

Option A - Have your data scientists store their new datasets in DynamoDB using global tables: DynamoDB is a NoSQL database service that provides low latency and high performance at any scale. However, DynamoDB is not optimized for storing large volumes of unstructured data such as machine learning datasets. Also, using global tables to replicate data across regions can increase costs.

Option B - Have your data scientists store their new datasets as tables in a Redshift cluster using RA3 nodes with managed storage and Redshift Spectrum: Amazon Redshift is a fully managed data warehouse service that can handle petabyte-scale data warehouses. RA3 nodes with managed storage and Redshift Spectrum can support complex queries and enable you to directly query data stored in S3. However, Redshift can be expensive and may not be cost-effective for storing large volumes of unstructured data such as machine learning datasets.

Option C - Have your data scientists store their new datasets as files in an EFS attached to EC2 instances: Amazon Elastic File System (EFS) is a fully managed, scalable file storage service that can be attached to Amazon Elastic Compute Cloud (EC2) instances. However, EFS can be expensive and may not be cost-effective for storing large volumes of data such as machine learning datasets. Also, querying data in EFS can be more complex and require additional configuration.

Option D - Have your data scientists store their new datasets as files in S3: S3 is a highly durable and scalable storage service that is designed to store and retrieve any amount of data from anywhere. S3 can also be cost-effective, with pricing based on usage and storage tiers. Additionally, S3 supports SQL queries through services like Amazon Athena, which can allow data scientists and machine learning specialists to easily query and analyze the data stored in S3.
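To make the Athena-over-S3 query path concrete, here is a minimal sketch of defining an external table over an S3 prefix and submitting a SQL query through boto3. The bucket, database, table, and column names are illustrative assumptions; a real setup also needs the table registered in the Glue Data Catalog, which the DDL below would do when run in Athena.

```python
# Hypothetical names -- substitute your own bucket, database, and table.
DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS ml_datasets.sales (
    order_id   string,
    amount     double,
    order_date date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-datalake-bucket/training-data/forecasting/sales/'
"""

QUERY = (
    "SELECT order_date, SUM(amount) AS revenue "
    "FROM ml_datasets.sales GROUP BY order_date"
)


def run_athena_query(sql: str, output_s3: str) -> str:
    """Submit a query to Athena; results are written as CSV to output_s3.
    Requires AWS credentials plus Athena, Glue, and S3 permissions."""
    import boto3

    client = boto3.client("athena")
    resp = client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]


# run_athena_query(QUERY, "s3://my-datalake-bucket/athena-results/")
```

Athena queries the files in place, so the data never has to leave S3 before it is used for model training, which is exactly why option D satisfies the SQL requirement without an extra database.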

Therefore, Option D is the best solution to meet the given requirements of scalability, cost-effectiveness, and ability to query data using SQL.