
How to Prepare Historical Trading Data for Training a Production Model

Question

You work as a machine learning specialist in a financial services company in their asset management division.

You have completed a pilot of a Random Cut Forest algorithm-based model that you plan to use to find anomalies in your trading data.

You have tested your model using a small data sample and are ready to implement your production model using SageMaker.

The historical trading data that you will use for training is stored in an RDS Microsoft SQL Server database.

How should you prepare your historical trading data to train your production model?

Answers

A. Use the AWS Database Migration Service (DMS) to move your trading data from the Microsoft SQL Server database to ElastiCache, then load the data from ElastiCache into your SageMaker notebook.

B. Use a Lambda function to load your trading data into DynamoDB tables, then load the data from DynamoDB into your SageMaker notebook.

C. Use the Data Pipeline service to move your trading data from the Microsoft SQL Server database to S3, then use the S3 bucket within your SageMaker notebook to load your trading data.

D. Use Direct Connect to connect directly to the Microsoft SQL Server database from your SageMaker notebook and load your trading data.

Explanations

Correct Answer: C.

Option A is incorrect.

You cannot load data directly from ElastiCache into a SageMaker notebook.

Option B is incorrect.

You cannot load data into a SageMaker notebook directly from DynamoDB without first staging the data in S3.

Option C is correct.

Loading your training data from S3 is the best approach for getting your data into your SageMaker notebook.

Also, the Data Pipeline service is the preferred method of loading data from Microsoft SQL Server to S3.
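
As a hedged illustration of that last step, the sketch below loads a staged CSV export from S3 into a pandas DataFrame inside a SageMaker notebook using boto3. The bucket name and object key are placeholders, not values given in the question.

```python
# Minimal sketch: read a staged trading-data export from S3 inside a
# SageMaker notebook. Bucket and key are hypothetical placeholders.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

bucket = "my-trading-data-bucket"      # placeholder bucket name
key = "exports/trades/trades.csv"      # placeholder object key

# Fetch the object and parse it into a DataFrame for inspection and feature prep.
obj = s3.get_object(Bucket=bucket, Key=key)
trades = pd.read_csv(io.BytesIO(obj["Body"].read()))

print(trades.shape)
print(trades.head())
```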

Option D is incorrect.

You cannot directly load data from an RDS instance into a SageMaker notebook instance without first staging the data in S3.

References:

Please see the AWS re:Invent 2018 presentation titled Train Models on Amazon SageMaker Using Data Not from Amazon S3 (AIM419) (https://www.slideshare.net/AmazonWebServices/train-models-on-amazon-sagemaker-using-data-not-from-amazon-s3-aim419-aws-reinvent-2018),

The Amazon SageMaker developer guide titled Random Cut Forest (RCF) Algorithm (https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html),

The AWS Data Pipeline developer guide titled What is AWS Data Pipeline? (https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html),

The Amazon SageMaker developer guide titled Download, Prepare, and Upload Training Data (https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-data.html).

The correct answer is C. Use the Data Pipeline service to move your trading data from the Microsoft SQL Server database to S3, then use the S3 bucket within your SageMaker notebook to load your trading data.

Explanation: Before training a model, the data should be pre-processed and transformed to the desired format. In this case, the historical trading data is stored in an RDS Microsoft SQL Server database. To prepare the data for training the model using Amazon SageMaker, we need to move the data from the database to an S3 bucket.
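
As a rough, hedged sketch of that movement step, the snippet below creates and activates a Data Pipeline with boto3. The pipeline definition is deliberately abridged: a working CopyActivity from an RDS SQL Server table to S3 also needs a database object, a schedule, an EC2 resource, IAM roles, and a log URI (see the Data Pipeline developer guide referenced above), and every name and value here is a placeholder.

```python
# Abridged sketch: drive AWS Data Pipeline from Python with boto3.
# All identifiers, table names, and S3 paths are hypothetical, and the
# object definitions omit required fields (schedule, database, roles, logs).
import boto3

dp = boto3.client("datapipeline")

# Register an empty pipeline shell; uniqueId makes the call idempotent.
pipeline_id = dp.create_pipeline(
    name="trading-data-to-s3",
    uniqueId="trading-data-to-s3-v1",
)["pipelineId"]

# Abridged objects: a SQL input node, an S3 output node, and a CopyActivity.
objects = [
    {
        "id": "SqlInput",
        "name": "SqlInput",
        "fields": [
            {"key": "type", "stringValue": "SqlDataNode"},
            {"key": "table", "stringValue": "trades"},  # hypothetical table
            {"key": "selectQuery", "stringValue": "select * from trades"},
        ],
    },
    {
        "id": "S3Output",
        "name": "S3Output",
        "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath",
             "stringValue": "s3://my-trading-data-bucket/exports/trades/"},
        ],
    },
    {
        "id": "CopyTradesToS3",
        "name": "CopyTradesToS3",
        "fields": [
            {"key": "type", "stringValue": "CopyActivity"},
            {"key": "input", "refValue": "SqlInput"},
            {"key": "output", "refValue": "S3Output"},
        ],
    },
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)
```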

Option A suggests using AWS DMS to move the data to ElastiCache. While ElastiCache can reduce latency when accessing frequently used data, it is an in-memory cache rather than a long-term data store, so the data would need to be regularly refreshed. Moving data from RDS to ElastiCache also adds unnecessary complexity, and SageMaker cannot load training data directly from ElastiCache.

Option B suggests using a Lambda function to load the data into DynamoDB tables. While this is technically possible, it is not the best choice: DynamoDB is a key-value and document store rather than an analytical store, so it is a poor fit for data that requires advanced querying or analysis, and SageMaker cannot read training data directly from DynamoDB without first staging it in S3.

Option D suggests connecting directly to the SQL Server database from the SageMaker notebook using Direct Connect. Although this is possible, it is not recommended because it increases the complexity of the solution and requires opening ports to the database, which is a security risk. Moreover, transferring large amounts of data over a Direct Connect connection may introduce network latency issues.

Option C suggests using the Data Pipeline service to move the data from the Microsoft SQL Server database to S3, then using the S3 bucket within the SageMaker notebook to load the trading data. This is the best option because Data Pipeline can move data between different storage and compute services without the need for custom code, and it has built-in support for RDS and S3. The pipeline can be scheduled to run at specific intervals to update the data, and the data can be easily accessed by the SageMaker notebook for training the model.
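
To illustrate the training step, here is a hedged sketch that points SageMaker's built-in Random Cut Forest estimator at the staged trading data using the SageMaker Python SDK. The bucket, object key, feature column, instance type, and hyperparameter values are assumptions made only for this example.

```python
# Sketch: train the built-in Random Cut Forest algorithm on trading data
# that Data Pipeline has already staged in S3. Names and values are
# placeholders; adjust them to your own bucket, columns, and sizing.
import pandas as pd
import sagemaker
from sagemaker import RandomCutForest, get_execution_role

session = sagemaker.Session()
role = get_execution_role()

# Read the staged CSV straight from S3 (requires the s3fs package).
trades = pd.read_csv("s3://my-trading-data-bucket/exports/trades/trades.csv")

# RCF takes numeric feature vectors; assume a numeric "volume" column here.
train_data = trades[["volume"]].to_numpy(dtype="float32")

rcf = RandomCutForest(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_samples_per_tree=512,   # example hyperparameters, not tuned values
    num_trees=50,
    sagemaker_session=session,
)

# record_set converts the array to RecordIO-protobuf and uploads it to the
# session's default S3 bucket before launching the training job.
rcf.fit(rcf.record_set(train_data))
```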

In summary, the correct option is C. Use the Data Pipeline service to move your trading data from the Microsoft SQL Server database to S3, then use the S3 bucket within your SageMaker notebook to load your trading data.