Question

You work for a rideshare software company as a machine learning specialist.

You are working on a model to predict driver capacity based on several factors, such as location, time of day, weather, population density, and age of the car.

You have several million observations stretching back over 5 years across several geographic locations worldwide.

You have performed feature engineering on your data and transformed it into 5 CSV files (one for each year), which you have uploaded to the training prefix of your S3 bucket. Because of the large number of observations, your management team anticipates that training this model could get costly, so they have asked you to keep the costs of your project as low as possible. You have written the following Python code using the SageMaker Python SDK in your SageMaker Jupyter notebook:

    s3_train = sagemaker.s3_input(s3_data='s3://{}/{}'.format(bucket, path_train),
                                  content_type='csv',
                                  distribution='ShardedByS3Key')

    my_container = get_image_uri(boto3.Session().region_name, 'xgboost')
    my_session = sagemaker.Session()
    role = get_execution_role()

    xgb = sagemaker.estimator.Estimator(my_container,
                                        role,
                                        train_instance_count=5,
                                        train_instance_type='ml.m4.xlarge',
                                        output_path=output_path,
                                        sagemaker_session=my_session)

    xgb.set_hyperparameters(max_depth=10,
                            eta=0.2,
                            gamma=4,
                            min_child_weight=40,
                            subsample=0.8,
                            silent=0,
                            objective='reg:linear',
                            early_stopping_rounds=10,
                            num_round=200)

    xgb.fit({'train': s3_train, 'validation': s3_input_validation})

Using this code, how does SageMaker replicate your dataset to your machine learning instances for training?

Exam-Answer · Accepted Answer

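SageMaker replicates a subset of your dataset on each of the 5 ML instances that are launched for training. The code sets distribution='ShardedByS3Key' on the training input, so the S3 objects under the training prefix are split across the instances, and each instance receives approximately 1/5 of the objects (here, roughly one yearly CSV file each). With the default setting, FullyReplicated, every instance would instead receive a full copy of the dataset.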
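To make the difference between the two distribution settings concrete, here is a minimal sketch in plain Python. It does not call SageMaker. The function `simulate_distribution`, the round-robin assignment, and the file names are assumptions made for illustration only; the SageMaker documentation states just that each instance gets approximately 1/n of the S3 objects under ShardedByS3Key, not how they are assigned.

```python
from typing import Dict, List


def simulate_distribution(keys: List[str],
                          instance_count: int,
                          distribution: str) -> Dict[int, List[str]]:
    """Toy model of which S3 objects each training instance receives.

    This is a hypothetical helper, not part of the SageMaker SDK.
    """
    if distribution == "FullyReplicated":
        # Every instance gets a full copy of the dataset.
        return {i: list(keys) for i in range(instance_count)}
    if distribution == "ShardedByS3Key":
        # Each instance gets roughly 1/n of the objects.
        # Round-robin is an assumption used here for illustration.
        shards: Dict[int, List[str]] = {i: [] for i in range(instance_count)}
        for idx, key in enumerate(sorted(keys)):
            shards[idx % instance_count].append(key)
        return shards
    raise ValueError("unknown distribution: " + distribution)


# Hypothetical object keys standing in for the 5 yearly CSV files.
yearly_files = ["train/year_{}.csv".format(n) for n in range(1, 6)]

sharded = simulate_distribution(yearly_files, 5, "ShardedByS3Key")
replicated = simulate_distribution(yearly_files, 5, "FullyReplicated")

for instance, files in sharded.items():
    print("instance", instance, "receives", files)
```

Under this toy model, each of the 5 instances receives one file with ShardedByS3Key and all five files with FullyReplicated, which is why the accepted answer describes a subset per instance.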

AWS Certified Machine Learning - Specialty Exam: Replicating Dataset in SageMaker
