Custom Audio Recommendation Model for Criminal Investigation | AWS ML Specialty Exam

Alternative Approach for Loading Large Datasets in SageMaker Notebook Instances

Question

You work as a machine learning specialist for an audio processing and distribution company.

You are currently working on a custom audio recommendation model for a criminal investigation application that recommends which audio file to use based on investigation details.

The dataset you are attempting to use to train the model is extremely large, containing millions of data points.

You are storing the dataset in an S3 bucket.

You need to find an alternative to loading all of the data into a SageMaker notebook instance, because doing so would take too long and would exceed the 50 GB EBS volume attached to the notebook instance.

Which approach should you select so that you can load all the data to train the model?

Answers

Explanations


A. Split the training dataset using scikit-learn or pandas to create a subset of your training data. Load the subset of the training data into the SageMaker notebook and train the model in your notebook instance. Verify that the model trained accurately and that the model parameters produce reasonable results. Use a Deep Learning AMI to start an EC2 instance and attach the S3 bucket to train the full dataset.

B. Split the training dataset using scikit-learn or pandas to create a subset of your training data. Use Glue to load your data into your SageMaker notebook, using your subset of the training data to verify that the model trained accurately and that the model parameters produce reasonable results. Then run a training job using the entire dataset from the S3 bucket using Pipe input mode.

C. Use a Deep Learning AMI to start an EC2 instance and attach the S3 bucket. Split the training dataset using scikit-learn or pandas to create a subset of your training data. Train using the subset of the training data to verify the training code and hyperparameters. Use SageMaker to train using the full dataset.

D. Split the training dataset using scikit-learn or pandas to create a subset of your training data. Load the subset of the training data into the SageMaker notebook and train in your notebook. Verify that the model trained accurately and that the model parameters produce reasonable results. Run a SageMaker training job loading the complete dataset from the S3 bucket using Pipe input mode.

Correct Answer: D.

Option A is incorrect.

Using a Deep Learning AMI for your EC2 instance will not help with loading the extremely large training dataset from S3.

Option B is incorrect.

Glue cannot be used to load data into your SageMaker notebook to train a machine learning model.

Glue could be used to place your data into an S3 bucket, from which you could then load the data into your SageMaker notebook.

Option C is incorrect.

This approach does not address the loading of the extremely large dataset onto the local storage of the EC2 instance.

Option D is correct.

With this option, we use Pipe input mode to stream the training data from S3 directly to the training instance instead of downloading it to EBS storage first.

This solves the problem of training on an extremely large dataset without having to load it onto the notebook instance.
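As a reference point, here is a minimal sketch of requesting Pipe input mode with the SageMaker Python SDK; the image URI, role ARN, and bucket name below are placeholders, not values from the scenario:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role ARN

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/audio-recommender:latest",  # placeholder image
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    input_mode="Pipe",          # stream data from S3 instead of downloading it to the EBS volume
    sagemaker_session=session,
)

# The training channel streams objects under this (hypothetical) prefix directly to the algorithm.
train_input = TrainingInput(
    s3_data="s3://example-audio-bucket/training/",
    input_mode="Pipe",
)

estimator.fit({"train": train_input})
```

Setting input_mode to Pipe on both the estimator and the channel keeps the configuration explicit; algorithms that support Pipe mode consume the stream as it arrives rather than waiting for a full download.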

References:

Please see the AWS Machine Learning blog titled Using Pipe input mode for Amazon SageMaker algorithms (https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/),

the AWS Machine Learning blog titled Accelerate model training using faster Pipe mode on Amazon SageMaker (https://aws.amazon.com/blogs/machine-learning/accelerate-model-training-using-faster-pipe-mode-on-amazon-sagemaker/),

the AWS announcement titled Amazon SageMaker Now Supports an Improved Pipe Mode Implementation (https://aws.amazon.com/about-aws/whats-new/2018/10/amazon-sagemaker-now-supports-an-improved-pipe-mode-implementati/),

and the Amazon SageMaker developer guide titled Use Scikit-learn with Amazon SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/sklearn.html).

Option D is the correct answer to this question.

Explanation:

The scenario requires an alternative to loading all of the data into a SageMaker notebook instance because the dataset is extremely large, containing millions of data points, and would exceed the 50 GB EBS volume attached to the notebook instance. We therefore need a way to train the model on the entire dataset without loading all of it into the notebook instance.

Option A suggests splitting the training dataset with scikit-learn or pandas to create a subset of the training data, loading that subset into the SageMaker notebook, and training the model in the notebook instance to verify that it trains accurately and that the model parameters produce reasonable results. It then uses a Deep Learning AMI to start an EC2 instance and attaches the S3 bucket to train on the full dataset.

This approach is incorrect: moving the final training run to an EC2 instance launched from a Deep Learning AMI does not solve the core problem, because the extremely large dataset would still have to be loaded from S3 onto that instance's local storage, and it is not a particularly cost-effective way to train the model.
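For context, the subset-creation step that several of these options describe can be sketched with pandas and scikit-learn roughly as follows; the S3 path, row cap, and split ratio are illustrative assumptions, not part of the question:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Read only a manageable sample of the large dataset for local experimentation.
# (Reading s3:// paths with pandas requires the s3fs package; the path is a placeholder.)
sample = pd.read_csv(
    "s3://example-audio-bucket/training/investigations.csv",
    nrows=100_000,  # cap rows so the subset fits comfortably on the notebook instance
)

# Hold out part of the subset to sanity-check model accuracy and parameters.
train_subset, validation_subset = train_test_split(
    sample, test_size=0.2, random_state=42
)
```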

Option B suggests splitting the training dataset with scikit-learn or pandas to create a subset of the training data, then using Glue to load the data into the SageMaker notebook, using the subset to verify that the model trains accurately and that the model parameters produce reasonable results. It then runs a training job over the entire dataset from the S3 bucket using Pipe input mode.

This approach is incorrect because Glue cannot be used to load data into a SageMaker notebook for model training. Glue is an ETL service: it could be used to transform the data and write it to an S3 bucket, but it does not get the data into the notebook, so the verification step described in this option would not work as stated.

Option C suggests using a Deep Learning AMI to start an EC2 instance and attaching the S3 bucket, splitting the training dataset with scikit-learn or pandas to create a subset of the training data, training on the subset to verify the training code and hyperparameters, and then using SageMaker to train on the full dataset.

This approach is incorrect because, like option A, it never addresses how the extremely large dataset would be loaded onto the local storage of the EC2 instance started from the Deep Learning AMI. Attaching the S3 bucket does not eliminate the need to move the data onto the instance, and the EC2 step adds cost and effort without solving the loading problem.

Option D suggests splitting the training dataset with scikit-learn or pandas to create a subset of the training data, loading that subset into the SageMaker notebook, and training in the notebook to verify that the model trains accurately and that the model parameters produce reasonable results. It then runs a SageMaker training job that loads the complete dataset from the S3 bucket using Pipe input mode.

This is the correct approach. Validating the training code and model parameters on a subset in the notebook instance is fast and inexpensive, and the subsequent SageMaker training job uses Pipe input mode to stream the complete dataset from S3 directly to the training instance, so the full dataset never has to fit on the notebook's 50 GB EBS volume.
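For custom training code, the Pipe-mode blog referenced above describes each channel being exposed as a sequence of named pipes under /opt/ml/input/data, one per epoch. A rough sketch of consuming one epoch's stream, with an illustrative channel name and chunk size, could look like this:

```python
import os

# In Pipe mode, SageMaker exposes each channel as FIFOs named
# /opt/ml/input/data/<channel>_<epoch>; each pass over the data
# opens the FIFO for the next epoch and reads it sequentially.
CHANNEL = "train"
DATA_DIR = "/opt/ml/input/data"

def stream_epoch(epoch: int, chunk_size: int = 1 << 20):
    """Yield raw byte chunks streamed from S3 for one pass over the channel."""
    fifo_path = os.path.join(DATA_DIR, f"{CHANNEL}_{epoch}")
    with open(fifo_path, "rb") as fifo:
        while True:
            chunk = fifo.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Example: count the bytes streamed during the first epoch.
total_bytes = sum(len(chunk) for chunk in stream_epoch(epoch=0))
print(f"Streamed {total_bytes} bytes from the '{CHANNEL}' channel")
```

Built-in SageMaker algorithms that support Pipe mode handle this streaming internally; the sketch only illustrates what a custom container would do with the stream.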

In conclusion, Option D is the best approach: split the dataset, verify the model on a subset within the SageMaker notebook instance, and then run a SageMaker training job that streams the complete dataset from the S3 bucket using Pipe input mode.