Data Cleansing Techniques for Machine Learning: Avoiding Biased Importance in Model Training

Data Cleansing Techniques

Question

You work as a machine learning specialist for an online retailer that is expanding into fresh produce as one of its new product categories.

You and your machine learning team have been tasked with creating a model to classify each of your new fresh produce products.

Examples of features in your data source include weight, price, country of origin, food group (fruit, vegetable, etc.), and other numeric and categorical features.

You plan on using either k-nearest neighbors (KNN) or support vector machines (SVM) to classify your fresh produce products.

Which data cleansing technique should you use on your data so that your features with potentially large values, such as weight, don't take on exaggerated importance in the model when compared to features with potentially smaller values, such as price per unit?

Answers

Explanations


A. Scale your data using scikit-learn MinMaxScaler.
B. Normalize your data.
C. Bin your data.
D. Use quantile binning on your data.

Correct Answer: A.

Option A is correct.

When using classification algorithms such as KNN or SVM, you need to scale your data so that every feature is on a comparable scale.

Using the scikit-learn MinMaxScaler, you can make your features span the same range of values (typically 0 to 1).

This gives your features comparable importance in the model's outcome, instead of letting one feature dominate simply because of its units.
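As a minimal sketch of this approach (the produce values below are made up for illustration), scaling with MinMaxScaler might look like:

# Minimal sketch of option A: min-max scaling with scikit-learn.
# The produce values are made up for illustration.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: weight in grams, price per unit in dollars.
X = np.array([
    [150.0, 0.99],
    [900.0, 2.49],
    [4500.0, 5.99],
])

scaler = MinMaxScaler()            # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)
print(X_scaled)                    # each column now spans [0, 1]

After this transformation, weight and price contribute comparably to KNN distance calculations and SVM margins.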

Option B is incorrect.

When you normalize your data, you transform it so that it is distributed around the mean.

This reshapes the distribution but does not guarantee that features on very different scales, like weight and unit price, end up spanning the same bounded range.
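Assuming "normalize" here means standardization to zero mean and unit variance, a sketch with scikit-learn's StandardScaler shows that the result is centered but not confined to a fixed range:

# Sketch of option B, assuming "normalize" means standardization
# (zero mean, unit variance per feature) via StandardScaler.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([
    [150.0, 0.99],
    [900.0, 2.49],
    [4500.0, 5.99],
])

X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))        # approximately 0 for each feature
print(X_std.std(axis=0))         # approximately 1 for each feature
print(X_std.min(), X_std.max())  # values are not confined to [0, 1]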

Option C is incorrect.

Binning is used to change continuous features into categories.

This will not help with features that are on different scales, like weight and unit price.
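A sketch of option C, using scikit-learn's KBinsDiscretizer with equal-width bins (the bin count of 3 and the weight values are arbitrary choices for illustration):

# Sketch of option C: equal-width binning turns a continuous
# feature into bin indices; it does not equalize feature scales.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

weights = np.array([[150.0], [900.0], [1200.0], [4500.0]])

binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(binner.fit_transform(weights).ravel())  # bin index per row: [0. 0. 0. 2.]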

Option D is incorrect.

Quantile binning is used to change continuous features into bins that each contain roughly the same number of observations.

This will not help with features that are on different scales, like weight and unit price.
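A sketch of option D, again with KBinsDiscretizer but with strategy="quantile", so each bin receives roughly the same number of observations (the values are made up):

# Sketch of option D: quantile binning; each bin holds roughly the
# same number of observations, but the output is still categorical.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

weights = np.array([[150.0], [900.0], [1200.0], [4500.0]])

qbinner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile")
print(qbinner.fit_transform(weights).ravel())  # split at the median: [0. 0. 1. 1.]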

References:

Please see the following:

The Towards Data Science article titled All about Feature Scaling (https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35)

The Kaggle page titled Scaling and Normalization (https://www.kaggle.com/alexisbcook/scaling-and-normalization)

The Wikipedia page titled Support-vector machine (https://en.wikipedia.org/wiki/Support-vector_machine)

The Wikipedia page titled k-nearest neighbors algorithm (https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)

The Towards Data Science article titled Continuous Numeric Data (https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b)

The scikit-learn user guide section titled 6.3 Preprocessing data (https://scikit-learn.org/stable/modules/preprocessing.html)

The scikit-learn API page titled sklearn.preprocessing.KBinsDiscretizer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html)

When dealing with machine learning models, it is often necessary to pre-process the data before training the model. One common issue that can arise is when some features in the dataset have values with significantly different scales. For instance, weight can have values ranging from a few grams to several kilograms, whereas the price per unit may be expressed in dollars and cents.

If we do not account for this difference in scale, features with larger values will dominate the distance and margin calculations in models such as KNN and SVM, degrading predictions. Therefore, one data cleansing technique that we can use to mitigate this issue is data scaling. Scaling involves transforming the features so that they have the same scale or range of values. Two commonly used techniques are normalization and min-max scaling.
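To see why this matters for a distance-based model like KNN, compare Euclidean distances before and after min-max scaling (the values below are made up for illustration):

# Sketch: why scale matters for distance-based models such as KNN.
# The produce values are made up for illustration.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: weight in grams, price per unit in dollars.
X = np.array([
    [150.0, 0.99],   # item 0
    [160.0, 5.99],   # item 1: similar weight, very different price
    [900.0, 0.99],   # item 2: very different weight, same price
])

def dist(a, b):
    return np.linalg.norm(a - b)

# Unscaled: the weight difference dwarfs the price difference.
print(dist(X[0], X[1]), dist(X[0], X[2]))  # about 11.2 vs 750.0

Xs = MinMaxScaler().fit_transform(X)
# Scaled: both features contribute comparably (both distances near 1.0).
print(dist(Xs[0], Xs[1]), dist(Xs[0], Xs[2]))

Unscaled, item 2 looks far from item 0 purely because of its weight in grams; after scaling, the price difference to item 1 carries comparable influence.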

Normalization, in the sense used in option B, transforms the data so that it is distributed around the mean. It can be applied without knowing the range of values in advance, but it may reshape the distribution of the data and does not confine every feature to a common bounded range.

On the other hand, min-max scaling transforms the features so that their values fall within a specified range, usually between 0 and 1. Min-max scaling preserves the shape of each feature's distribution, although it is sensitive to outliers, since a single extreme value stretches the range; in exchange, it guarantees that every feature spans the same bounded interval.

Therefore, in this scenario, the appropriate data cleansing technique to use would be MinMaxScaler, which is available in the scikit-learn library. MinMaxScaler scales the data by subtracting the minimum value of the feature and dividing by the range of the feature. This technique ensures that all features are on the same scale and keeps features with large values from overpowering those with smaller values.
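As a quick worked check of that formula, x' = (x - min) / (max - min), applied to a made-up weight column:

# Worked check of min-max scaling: x' = (x - min) / (max - min).
weights = [150.0, 900.0, 4500.0]
lo, hi = min(weights), max(weights)
print([(w - lo) / (hi - lo) for w in weights])  # [0.0, 0.172..., 1.0]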

Hence, the correct answer is A. Scale your data using scikit-learn MinMaxScaler.