AWS ML Specialty Exam: XGBoost Model Deployment with Amazon SageMaker


Question

You work as a machine learning specialist for an eyewear manufacturing plant.

You have used XGBoost to train a model that categorizes contact lenses as malformed or correctly formed based on assembly-line image data.

You have engineered your data and used CSV as your Training ContentType.

You are now ready to deploy your model using the Amazon SageMaker hosting service. Assuming you used the default configuration settings, which of the following are true statements about your hosted model? (Select THREE)

A. The training instance class is multiple-instance GPU.
B. The algorithm is not parallelizable for distributed training.
C. The training data target value should be in the first column of the CSV with no header.
D. The training data target value should be in the last column of the CSV with no header.
E. The inference data target value should be in the first column of the CSV with no header.
F. The inference CSV data has no label column.
G. The training instance class is CPU.

Answers: C, F, G.

Explanations

Option A is incorrect.

SageMaker XGBoost currently supports training only on CPU instances or on a single GPU instance; it does not train across multiple GPU instances.

(See the Amazon SageMaker developer guide titled XGBoost Algorithm, particularly the EC2 Instance Recommendation for the XGBoost Algorithm section)

Option B is incorrect.

The XGBoost algorithm is parallelizable and can therefore run distributed training across multiple CPU instances.

(See the Amazon SageMaker developer guide titled Common Parameters for Built-in Algorithms)
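As an illustrative sketch only (not part of the question), a distributed CPU training job for the built-in XGBoost algorithm might be set up with the SageMaker Python SDK roughly as follows; the role ARN, S3 paths, container version, and hyperparameter values are all assumed placeholders:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

# Resolve the built-in XGBoost container image for this region
# (the version tag is an assumption for illustration).
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_count=2,              # distributed training across two CPU instances
    instance_type="ml.m5.xlarge",  # CPU instance type
    output_path="s3://example-bucket/xgboost/output",  # placeholder bucket
    sagemaker_session=session,
)

estimator.set_hyperparameters(
    objective="binary:logistic",  # malformed vs. correctly formed
    num_round=100,
)

# CSV training channel: target in the first column, no header record.
estimator.fit({"train": TrainingInput("s3://example-bucket/xgboost/train/",
                                      content_type="text/csv")})
```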

Option C is correct.

From the Amazon SageMaker developer guide titled XGBoost Algorithm, “For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record”.
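For illustration, a minimal sketch of writing training data in that layout with pandas; the feature names and values here are invented:

```python
import pandas as pd

# Invented engineered features plus a binary label (1 = malformed).
df = pd.DataFrame({
    "malformed": [0, 1, 0],
    "edge_curvature": [0.12, 0.87, 0.33],
    "thickness_mm": [0.09, 0.14, 0.10],
})

# Put the target first, then write with no header and no index,
# as the built-in XGBoost algorithm expects for CSV training.
ordered = ["malformed"] + [c for c in df.columns if c != "malformed"]
df[ordered].to_csv("train.csv", header=False, index=False)
```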

Option D is incorrect.

From the Amazon SageMaker developer guide titled Common Data Formats for Training, “Amazon SageMaker requires that a CSV file doesn't have a header record and that the target variable is in the first column”.

Option E is incorrect. From the Amazon SageMaker developer guide titled XGBoost Algorithm, “For CSV inference, the algorithm assumes that CSV input does not have the label column”. Because the inference payload carries no label at all, the target cannot appear in the first column.

Option F is correct. From the Amazon SageMaker developer guide titled XGBoost Algorithm, “For CSV inference, the algorithm assumes that CSV input does not have the label column”.
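As a hedged example, invoking a hosted endpoint with CSV input using boto3 might look like this; the endpoint name and feature values are placeholders:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Feature values only -- the inference payload has no label column.
payload = "0.12,0.09\n0.87,0.14"

response = runtime.invoke_endpoint(
    EndpointName="xgboost-lens-classifier",  # placeholder endpoint name
    ContentType="text/csv",
    Body=payload,
)

# One prediction score per input row.
print(response["Body"].read().decode())
```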

Option G is correct. With the default configuration, SageMaker XGBoost trains on a CPU instance type.

(See the Amazon SageMaker developer guide titled XGBoost Algorithm, particularly the EC2 Instance Recommendation for the XGBoost Algorithm section)

Reference:

Please see the Amazon SageMaker developer guide titled Deploy a Model on Amazon SageMaker Hosting Services for an overview of the deployment of a SageMaker model.
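Continuing the sketch above, deploying the trained estimator to a real-time endpoint with the SageMaker Python SDK could look roughly like this (the instance type and endpoint name are again assumptions):

```python
from sagemaker.serializers import CSVSerializer

# Deploy the trained model to a real-time hosted endpoint.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="xgboost-lens-classifier",  # placeholder name
    serializer=CSVSerializer(),  # send text/csv payloads
)

# Features only, no label column; returns the raw prediction bytes.
print(predictor.predict("0.12,0.09"))
```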

Based on the information provided in the question, each option can be evaluated as follows:

A. The training instance class is multiple-instance GPU:

False. SageMaker XGBoost trains on CPU instances (or, in recent versions, a single GPU instance); it does not train across multiple GPU instances.

B. The algorithm is not parallelizable for distributed training:

False. XGBoost is parallelizable and supports distributed training across multiple CPU instances.

C. The training data target value should be in the first column of the CSV with no header:

True. For CSV training, the algorithm assumes the target variable is in the first column and that the file has no header record.

D. The training data target value should be in the last column of the CSV with no header:

False. The target must be in the first column, not the last.

E. The inference data target value should be in the first column of the CSV with no header:

False. The inference input carries no target value at all; the prediction is what the endpoint returns.

F. The inference CSV data has no label column:

True. For CSV inference, the algorithm assumes the input does not include the label column.

G. The training instance class is CPU:

True. With the default configuration, training runs on a CPU instance type.

In summary, the true statements are C, F, and G.