Categorical Feature Encoding Techniques for Machine Learning Models

Feature Engineering Techniques for Categorical Variables

Question

You work as a machine learning specialist for an online real estate software company.

Your company produces real estate listings with descriptions of properties, such as lot size, number of bedrooms, number of bathrooms, etc.

You have been tasked with building a model to predict the value of the property.

This value will be the estimated value displayed on the property listing.

You are performing feature engineering of your data and you need to encode your categorical features to use in scikit-learn regression algorithms.

You have dozens of categorical features, with many of the features having from 30 to 75 categories.

Which encoding technique should you use for your categorical features?

Answers

Explanations


A. One-hot encoding
B. Label encoding
C. Target encoding with a mean transform
D. Target encoding with a mean transform plus smoothing

Correct Answer: D.

Option A is incorrect.

One-hot encoding dozens of categorical features, some of which have 30 to 75 categories, will explode your feature space.

Option B is incorrect.

Since label encoding assigns each category an incremental integer code, it runs the risk of your regression algorithm treating the arbitrary order of the codes as meaningful.

Option C is incorrect.

Target encoding (sometimes referred to as mean encoding) with a mean transform works well but has issues with infrequent categories, i.e., categories that appear only a few times in your data source.

Calculating a mean for a very small number of observations provides little differentiation.

Option D is correct.

Combining target encoding using a mean transform with smoothing mitigates the disadvantages of plain target encoding: each category's mean is blended with the overall target mean, so encodings for rare categories are pulled toward the global average.

Reference:

Please see the Towards Data Science article titled All about Categorical Variable Encoding (https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02) and the Kaggle page titled Target Encoding (https://www.kaggle.com/ryanholbrook/target-encoding).

When dealing with categorical features, it is important to transform them into a numerical representation that can be used in machine learning algorithms. There are several encoding techniques available, and the most appropriate one depends on the characteristics of the data.

In this case, the data has dozens of categorical features, with many of them having from 30 to 75 categories. One-hot-encoding and label encoding are two of the most common encoding techniques, but they might not be the most suitable for this particular case.

One-hot encoding creates a binary column for each category: a sample has a 1 in the column for its own category and 0 in all the others. This technique can work well for categorical features with few categories, but for features with many categories it creates a large number of new columns, which increases the dimensionality of the data and makes model training more complex.
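As a rough sketch of this expansion, using scikit-learn's OneHotEncoder (the "neighborhood" feature and its values are invented for illustration):

```python
# Illustrative only: a single categorical feature with three categories.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["downtown"], ["suburb"], ["downtown"], ["rural"]])

# handle_unknown="ignore" encodes unseen categories as all zeros at
# transform time instead of raising an error.
enc = OneHotEncoder(handle_unknown="ignore")
X_encoded = enc.fit_transform(X).toarray()

# Three categories -> three binary columns; exactly one 1 per row.
# A feature with 75 categories would expand into 75 columns.
print(X_encoded.shape)
```

With dozens of such features at 30 to 75 categories each, the encoded matrix quickly grows to thousands of mostly-zero columns, which is the "feature space explosion" described above.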

Label encoding, on the other hand, assigns a unique integer value to each category, ranging from 0 to the number of categories minus one. This technique can work well for features with a small number of categories, but for features with many categories, it can create a numerical ordering that does not necessarily reflect any meaningful relationship between the categories.
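A minimal sketch of that spurious ordering, using scikit-learn's OrdinalEncoder (which is the feature-column counterpart of LabelEncoder, intended for targets); the category values are invented:

```python
# Illustrative only: integer-coding a categorical feature column.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([["downtown"], ["suburb"], ["rural"], ["downtown"]])

enc = OrdinalEncoder()
X_encoded = enc.fit_transform(X)

# Categories are assigned codes in sorted order:
# downtown -> 0, rural -> 1, suburb -> 2.
# A linear regression would treat "suburb" as twice "rural",
# an ordering that reflects alphabetical position, not meaning.
print(X_encoded.ravel())
```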

Given these considerations, the most appropriate encoding technique for this case might be target encoding with mean transform, or target encoding with mean transform plus smoothing, depending on the size of the dataset and the risk of overfitting.

Target encoding is a technique that replaces each category with the mean target value of the samples that belong to that category. This technique can capture the relationship between the categories and the target variable, and can work well for features with many categories.
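A minimal pandas sketch of plain target encoding (the "neighborhood" and "price" data are invented; real estate prices would be the regression target here):

```python
# Illustrative only: replace each category with the mean target value.
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["a", "a", "b", "b", "c"],
    "price": [100.0, 200.0, 300.0, 500.0, 250.0],
})

# Per-category mean of the target.
means = df.groupby("neighborhood")["price"].mean()
df["neighborhood_encoded"] = df["neighborhood"].map(means)

# a -> 150.0, b -> 400.0, c -> 250.0.
# Note that "c" rests on a single observation, so its encoding is
# just that one row's target value -- the weakness smoothing addresses.
print(df["neighborhood_encoded"].tolist())
```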

However, target encoding can lead to overfitting when categories have few samples, since the encoded value then effectively leaks those few samples' target values into the feature. To mitigate this risk, smoothing can be applied: a regularization parameter shrinks each category's target mean toward the overall mean of the target variable.
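One common smoothing form blends the category mean with the global mean, weighted by category count. A sketch, assuming the same invented data as above and a hypothetical smoothing weight m (the pseudo-count is a tunable choice, not a prescribed value):

```python
# Illustrative only: target encoding with additive (pseudo-count) smoothing.
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["a", "a", "b", "b", "c"],
    "price": [100.0, 200.0, 300.0, 500.0, 250.0],
})

m = 10  # smoothing weight: acts like m extra observations at the global mean
overall = df["price"].mean()  # 270.0

agg = df.groupby("neighborhood")["price"].agg(["mean", "count"])

# smoothed = (n * category_mean + m * overall_mean) / (n + m)
smoothed = (agg["count"] * agg["mean"] + m * overall) / (agg["count"] + m)
df["neighborhood_encoded"] = df["neighborhood"].map(smoothed)

# The single-sample category "c" (raw mean 250.0) is pulled close to
# the overall mean 270.0, so one noisy observation no longer dominates.
print(df["neighborhood_encoded"].tolist())
```

Larger m shrinks rare categories harder toward the global mean; as a category's count grows, its encoding approaches its raw mean, which is the behavior the correct answer (D) relies on.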

Therefore, in summary, for the given scenario, target encoding with mean transform or target encoding with mean transform plus smoothing might be the most appropriate encoding techniques for the categorical features.