
Feature Engineering for Missing Data

Question

You work on a machine learning team at an online reseller of consumer products.

You are performing feature engineering of your product data where you have a large, multi-column dataset with one column missing 40% of its data.

Your team lead thinks that you can use some of the columns in the dataset to create the missing data. Which feature engineering is the best approach to create approximate replacements for the missing data while also preserving the integrity of the dataset?

Answers

A. Binning
B. Yeo-Johnson transformation
C. Multivariate imputation
D. Mean imputation

Answer: C.

Explanations

Option A is incorrect.

Binning is used to group values into ranges, primarily to minimize the impact of observation errors.

Binning would not help you create approximate replacements for missing values.

Option B is incorrect.

The Yeo-Johnson transformation is used to give your data a more Gaussian-like distribution.

It is not used to create approximate replacements for the missing data.

Option C is CORRECT.

With multivariate imputation, you use the other variables in the dataset to predict the missing values.

Option D is incorrect.

Mean imputation replaces the missing values with the mean of observed values of that variable.

This is the simplest method of imputing missing values.

Multivariate imputation is generally much more accurate because it takes the relationships between columns into account.

Reference:

Please see the article titled Binning in Data Mining.

Please refer to the Machine Learning Mastery article titled How to Use Power Transforms for Machine Learning.

Please see the article titled Multiple Imputation in a Nutshell.

The best approach to create approximate replacements for the missing data while preserving the integrity of the dataset is Multivariate imputation (Option C).

Multivariate imputation is a technique that replaces missing values with estimates based on the correlations observed between variables. It involves building a model for each variable that has missing data, using the other, more complete variables in the dataset as predictors.
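As a rough illustration, the sketch below performs multivariate imputation with scikit-learn's IterativeImputer, which models each column with missing values as a function of the other columns; the column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# IterativeImputer is still flagged experimental, so this enabling import is required.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical product data: "weight_kg" is missing a large share of its values.
df = pd.DataFrame({
    "price":      [10.0, 25.0, 8.0, 40.0, 15.0, 30.0],
    "units_sold": [120, 60, 200, 15, 90, 45],
    "weight_kg":  [0.5, np.nan, 0.3, np.nan, 0.6, 1.2],
})

# Each column with missing entries is modeled from the other columns,
# and the predictions fill in the gaps over several refinement rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)
```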

This approach is preferable to mean imputation (Option D) because mean imputation fills every gap with the same value, the mean of the observed data, which distorts the variable's distribution, weakens its correlations with the other columns, and can reduce the accuracy of the model.
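For contrast, a mean-imputation baseline is essentially a one-liner with scikit-learn's SimpleImputer; the tiny frame below is again made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with missing entries.
df = pd.DataFrame({"weight_kg": [0.5, np.nan, 0.3, np.nan, 0.6, 1.2]})

# Every missing entry is replaced with the same column mean,
# ignoring whatever the other features might say about that row.
mean_imputer = SimpleImputer(strategy="mean")
df_mean = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)

print(df_mean)
```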

Binning (Option A) is a method used to reduce noise in continuous data by grouping values into categories (bins). While this can be useful in some situations, it is not an appropriate approach for replacing missing values in a dataset.
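For reference, a minimal binning sketch with scikit-learn's KBinsDiscretizer is shown below (the prices are invented); note that it only groups observed values into ranges and does nothing to fill in missing entries.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical prices: binning groups observed values into ranges,
# which smooths noise but cannot create replacements for missing data.
prices = np.array([[10.0], [25.0], [8.0], [40.0], [15.0], [30.0]])

binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
price_bins = binner.fit_transform(prices)

print(price_bins.ravel())  # bin index (0, 1, or 2) for each price
```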

The Yeo-Johnson transformation (Option B) is a power transform that reshapes non-normally distributed data toward a more Gaussian distribution. While this can be helpful in some cases, it is not a method for replacing missing data.
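A minimal sketch of the transform, using scikit-learn's PowerTransformer on an invented, right-skewed sales column, is shown below.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed sales counts: the transform reshapes the
# distribution toward Gaussian; it does not create values for missing entries.
sales = np.array([[1.0], [2.0], [2.0], [3.0], [50.0], [120.0]])

pt = PowerTransformer(method="yeo-johnson", standardize=True)
sales_transformed = pt.fit_transform(sales)

print(sales_transformed.ravel())
```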

In summary, multivariate imputation is the most appropriate method for replacing missing data in a large, multi-column dataset, because it uses the correlations between variables to estimate the missing values while preserving the integrity of the dataset.