You are a machine learning specialist for a research firm.
Your team uses Amazon SageMaker and its built-in scikit-learn library for feature transformation in your machine learning process.
When using the SimpleImputer transformer to replace missing values in your observations, which strategy is the default strategy that your SageMaker scikit-learn code will use if you don't explicitly pass a strategy parameter?
A. constant
B. most_frequent
C. median
D. mean
E. mode
Answer: D.
Option A is incorrect.
The default strategy is mean.
The constant strategy replaces the missing values with a constant you supply.
Option B is incorrect.
The default strategy is mean.
The most_frequent strategy replaces the missing values with the most frequent value along each column.
Option C is incorrect.
The default strategy is mean.
The median strategy replaces the missing values with the median along each column.
Option D is correct.
The default strategy is mean.
The mean strategy replaces the missing values with the mean along each column.
Option E is incorrect.
There is no mode strategy in the SimpleImputer scikit-learn transformer.
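A quick way to confirm this is to try it. The sketch below assumes a recent scikit-learn release, where the strategy value is validated when `fit` is called and an invalid value raises `ValueError` (or a subclass of it). The toy array `X` and the flag `mode_accepted` are illustrative names only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature data with one missing value (illustrative only).
X = np.array([[1.0], [np.nan], [3.0]])

try:
    # "mode" is not one of the strategies SimpleImputer accepts.
    SimpleImputer(strategy="mode").fit(X)
    mode_accepted = True
except ValueError as err:
    mode_accepted = False
    print(f"Rejected: {err}")
```

To impute with the statistical mode, pass strategy="most_frequent" instead.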
Reference:
Please see the Amazon Machine Learning blog post titled "Preprocess input data before making predictions using Amazon SageMaker inference pipelines and Scikit-learn".
SimpleImputer is a scikit-learn transformer (sklearn.impute.SimpleImputer), available in Amazon SageMaker through its scikit-learn support, that replaces missing values in observations using a strategy chosen by the user.
If you do not explicitly pass a strategy parameter, the default strategy that SageMaker's scikit-learn code will use is "mean".
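The default can be checked with a minimal sketch like the one below. It runs scikit-learn locally rather than inside a SageMaker job, on the assumption that the SageMaker scikit-learn container uses the same `sklearn.impute.SimpleImputer` class. The array `X` is made-up sample data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two features, each with one missing value (illustrative data).
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# No strategy argument is passed, so the default applies.
imputer = SimpleImputer()
print(imputer.strategy)  # mean

# Each NaN is replaced by the mean of the observed values in its column:
# column 0 -> (1 + 3) / 2 = 2.0, column 1 -> (10 + 20) / 2 = 15.0
X_filled = imputer.fit_transform(X)
print(X_filled)
```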
The "mean" strategy replaces missing values using the mean of the non-missing values of the same feature in the training data. This strategy is suitable for continuous numerical data.
The "median" strategy replaces missing values using the median of the non-missing values of the same feature in the training data. This strategy is also suitable for continuous numerical data and is more robust to the presence of outliers than the "mean" strategy.
The "most_frequent" strategy replaces missing values using the most frequent value of the same feature in the training data. This strategy is suitable for categorical data.
The "constant" strategy replaces missing values with a user-defined constant value, supplied through the fill_value parameter. This strategy is suitable for both categorical and numerical data.
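The four strategies can be compared side by side on a single feature that contains an outlier. This is only a sketch: the data and the `fill_value` of -1.0 are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature with an outlier (100.0) and one missing value in the last row.
X = np.array([[1.0], [1.0], [2.0], [100.0], [np.nan]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # Record the value that was imputed into the last row.
    results[strategy] = float(imputer.fit_transform(X)[-1, 0])

# The constant strategy takes its replacement value from fill_value.
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1.0)
results["constant"] = float(constant_imputer.fit_transform(X)[-1, 0])

print(results)
# mean: 26.0 (pulled up by the outlier), median: 1.5,
# most_frequent: 1.0, constant: -1.0
```

The gap between the mean (26.0) and the median (1.5) shows why the median is the more robust choice when a feature has outliers.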
Finally, SimpleImputer has no "mode" strategy. The statistical mode is what the "most_frequent" strategy computes, so passing strategy="mode" is rejected as an invalid value.
Therefore, the correct answer to the question is D. mean.