You work for a city electric scooter rental company.
Your company supplies a fleet of electric scooters to different cities around the country.
These scooters need to be managed with respect to their location, their rental miles, their need for maintenance, and so on.
The company accumulates hundreds of data points on each scooter every day.
You are on your company's machine learning team, and you have been assigned the job of building a machine learning model that tracks each scooter and decides when it is ready for maintenance.
One would assume the decision for maintenance would be based predominantly on miles accumulated.
Since so many features are captured for each scooter, you have decided to identify the most predictive features for your model to avoid poor performance due to collinearity. You have built your model in SageMaker using the built-in XGBoost algorithm.
Using the XGBoost Python API package, which type of booster and which API call would you use to select the most predictive features based on the total gain across all splits in which the feature is used?
A. booster = gblinear using get_fscore with the importance_type parameter set to total_gain
B. booster = gblinear using get_score with the importance_type parameter set to gain
C. booster = gbtree using get_score with the importance_type parameter set to total_gain
D. booster = gbtree using get_fscore with the importance_type parameter set to gain
E. booster = dart using get_fscore with the importance_type parameter set to gain
F. booster = dart using get_score with the importance_type parameter set to total_gain

Answer: C.
Option A is incorrect.
Feature importance is defined only when a tree model is the base learner; it is not defined for linear learners such as gblinear.
In addition, the importance_type parameter is defined for the get_score API call, not the get_fscore API call.
Option B is incorrect.
Feature importance is not defined for linear learners such as gblinear.
Furthermore, the importance_type parameter would need to be set to total_gain: an importance_type of gain gives the average gain across all splits in which the feature is used, not the total gain.
Option C is correct.
To get the features ranked by the total gain across all splits in which each feature is used, use the gbtree booster and call get_score, passing the importance_type parameter set to total_gain.
Option D is incorrect.
The importance_type parameter is defined for the get_score API call, not the get_fscore API call, and it needs to be set to total_gain: an importance_type of gain gives the average gain across all splits in which the feature is used.
Option E is incorrect.
Feature importance is defined only for the gbtree base learner; it is not defined for dart boosters.
The importance_type parameter would also need to be set to total_gain, since an importance_type of gain gives the average gain across all splits in which the feature is used.
Option F is incorrect.
Feature importance is defined only for the gbtree base learner; it is not defined for dart boosters.
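As a quick sanity check of the gain versus total_gain distinction above, here is a minimal sketch (the data and feature names are made-up stand-ins) that trains a small gbtree model and verifies that total_gain is just the average gain multiplied by the number of splits (weight) for each feature:

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-in data: 4 features, label driven by feature f0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y, feature_names=[f"f{i}" for i in range(4)])
bst = xgb.train({"booster": "gbtree", "objective": "binary:logistic"},
                dtrain, num_boost_round=10)

weight = bst.get_score(importance_type="weight")          # number of splits
gain = bst.get_score(importance_type="gain")              # average gain per split
total_gain = bst.get_score(importance_type="total_gain")  # summed over splits

for feat in total_gain:
    # total gain = average gain x number of splits
    assert np.isclose(total_gain[feat], gain[feat] * weight[feat])
```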
Reference:
Please see the Amazon SageMaker developer guide titled XGBoost Algorithm, the Amazon SageMaker developer guide titled XGBoost Hyperparameters, and the XGBoost Python API Reference.
The correct answer is C: booster = gbtree, using get_score with the importance_type parameter set to total_gain.
Explanation:
XGBoost is a popular algorithm for building machine learning models, particularly on tabular data. It is an ensemble learning method that combines multiple decision trees to make predictions. XGBoost offers three booster types: gbtree and dart, which are tree-based, and gblinear, which is linear.
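As a minimal sketch of the distinction (synthetic data; the exact exception behavior is version-dependent, as noted in the comments), requesting feature importance from a gblinear booster fails, while a gbtree booster returns per-feature scores:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

tree = xgb.train({"booster": "gbtree", "objective": "binary:logistic"}, dtrain, 10)
print(tree.get_score(importance_type="total_gain"))  # per-feature scores

linear = xgb.train({"booster": "gblinear", "objective": "binary:logistic"}, dtrain, 10)
try:
    linear.get_score(importance_type="total_gain")
except ValueError as err:
    # Recent XGBoost versions raise here: feature importance is not
    # defined for the gblinear booster.
    print(err)
```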
To select the most predictive features, we need to calculate feature importance scores. Feature importance indicates how much a feature contributes to the model's predictions. In XGBoost, feature importance can be calculated in several ways, selected with the importance_type parameter:
Weight: The number of times a feature is used to split the data across all trees.
Gain: The average gain across all splits in which the feature is used.
Cover: The average coverage across all splits in which the feature is used.
Total_gain: The total gain across all splits in which the feature is used.
Total_cover: The total coverage across all splits in which the feature is used.
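The sketch below (again using made-up synthetic data) prints all five measures for the same trained booster, so the differences are easy to compare side by side:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y, feature_names=[f"f{i}" for i in range(4)])
bst = xgb.train({"booster": "gbtree", "objective": "binary:logistic"},
                dtrain, num_boost_round=10)

# One dictionary of per-feature scores per importance type.
for imp_type in ("weight", "gain", "cover", "total_gain", "total_cover"):
    print(imp_type, bst.get_score(importance_type=imp_type))
```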
To calculate the feature importance scores, we can use the get_score() or get_fscore() function in the XGBoost Python API package. The get_score() function returns a dictionary of feature importance scores for the requested importance type, while get_fscore() takes no importance_type parameter: it is equivalent to get_score(importance_type='weight') and returns the number of times each feature is used to split the data.
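A small illustration of that equivalence, using the same synthetic-data setup as the earlier sketches:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({"booster": "gbtree", "objective": "binary:logistic"}, dtrain, 10)

# get_fscore() only ever returns split counts; it matches
# get_score(importance_type='weight') exactly.
assert bst.get_fscore() == bst.get_score(importance_type="weight")
```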
The importance_type parameter of get_score() selects which of these measures to return. In this case, we want the total gain across all splits, so we set importance_type to total_gain.
Therefore, the correct API call to select the most predictive features based on the total gain across all splits in which the feature is used is:
```python
booster = 'gbtree'               # set as a training parameter
importance_type = 'total_gain'
bst.get_score(importance_type=importance_type)  # bst is the trained Booster
```
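Putting it together, a runnable sketch might look like the following; the telemetry here is synthetic, and the feature names are invented stand-ins for the scooter data:

```python
import numpy as np
import xgboost as xgb

# Invented stand-ins for scooter telemetry; in this synthetic data,
# miles and battery_cycles actually drive the maintenance label.
feature_names = ["miles", "battery_cycles", "avg_speed",
                 "trips_per_day", "idle_days", "temperature"]
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))
y = ((X[:, 0] + 0.5 * X[:, 1]) > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y, feature_names=feature_names)
bst = xgb.train({"booster": "gbtree", "objective": "binary:logistic"},
                dtrain, num_boost_round=20)

# Rank features by total gain across all splits; the highest-scoring
# features are the most predictive candidates to keep.
scores = bst.get_score(importance_type="total_gain")
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```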
Option A (booster = gblinear using get_fscore with the importance_type parameter set to total_gain) is incorrect because gblinear is a linear booster, and feature importance is not defined for linear learners; in addition, get_fscore does not accept an importance_type parameter.
Option B (booster = gblinear using get_score with the importance_type parameter set to gain) is incorrect for the same reason: feature importance is not defined for gblinear. Even for a tree booster, gain would return the average gain per split rather than the total gain.
Option D (booster = gbtree using get_fscore with the importance_type parameter set to gain) is incorrect because get_fscore does not accept an importance_type parameter; it returns the number of times each feature is used to split the data, not the total gain across all splits.
Option E (booster = dart using get_fscore with the importance_type parameter set to gain) is incorrect for the same reasons as Option D, and it also uses the dart booster rather than gbtree.
Option F (booster = dart using get_score with the importance_type parameter set to total_gain) is incorrect because, as explained above, feature importance is not defined for the dart booster; the gbtree booster is the one to use here.