You work as a machine learning specialist for a polling company.
For the upcoming election, you need to classify the over 500,000 registered voters in your voter database by age for a campaign your team is about to launch.
Your data is structured as follows:

| voter_id | voter_age | voter_occupation | voter_income | … |
| 1 | 21 | student | 0 | … |
| 2 | 35 | nurse | 25000 | … |
| 3 | 49 | manager | 150000 | … |
| 4 | 63 | truck driver | 45000 | … |
| 5 | 55 | teacher | 65000 | … |
| … |

Because voter_age is a continuous feature, classifying your observations by exact age would result in too many classifications, i.e., one for every possible voter age from 21 through probably over 90.
You need to have uniform classifications that are limited in number to make the best use of your data in your machine learning model. What numerical feature engineering technique will give you the best distribution of classifications?
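A. Cartesian Product Transformation
B. N-Gram Transformation
C. Orthogonal Sparse Bigram (OSB) Transformation
D. Normalization Transformation
E. Quantile Binning Transformation

Answer: E.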
Option A is incorrect.
From the Amazon Machine Learning developer guide titled Data Transformations Reference, “The Cartesian product transformation takes categorical variables or text as input, and produces new features that capture the interaction between these input variables.” Because this transformation combines categorical or text variables rather than grouping the values of a numeric one, it would not give you uniform age classifications that are limited in number.
Option B is incorrect.
From the Amazon Machine Learning developer guide titled Data Transformations Reference, “The n-gram transformation takes a text variable as input and produces strings corresponding to sliding a window of (user-configurable) n words, generating outputs in the process.” Because this transformation is also for transforming text, it would not give you uniform age classifications that are limited in number.
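To make the mismatch concrete, here is a minimal sketch of a sliding window over words. The function name `ngrams` and the sample sentence are made up for this illustration, and the sketch only emits windows of exactly n words; it is not the Amazon ML implementation.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Slide a window of n words over the text and return each window."""
    words = text.split()
    # One window per starting position that still leaves room for n words.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical free-text input; voter_age has no words to slide over.
bigrams = ngrams("voters over fifty turned out", 2)
```

The input has to be a sequence of words, which is why a numeric column such as voter_age gains nothing from this transformation.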
Option C is incorrect.
From the Amazon Machine Learning developer guide titled Data Transformations Reference, “The OSB transformation is intended to aid in text string analysis and is an alternative to the bi-gram transformation (n-gram with window size 2).
OSBs are generated by sliding the window of size n over the text and outputting every pair of words that includes the first word in the window.” Because this transformation is also for transforming text, it would not give you uniform age classifications that are limited in number.
Option D is incorrect.
From the Amazon Machine Learning developer guide titled Data Transformations Reference, “The normalization transformer normalizes numeric variables to have a mean of zero and variance of one.
Normalization of numeric variables can help the learning process if there are very large range differences between numeric variables because variables with the highest magnitude could dominate the ML model, no matter if the feature is informative with respect to the target or not.” Because this transformation only rescales continuous data and leaves it continuous, it would not give you uniform age classifications that are limited in number.
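A minimal sketch of normalizing to zero mean and unit variance, applied to the five ages from the sample table, shows why this does not help here. The helper name `z_score` is an assumption for illustration, and the population standard deviation is used as one reasonable choice.

```python
from statistics import fmean, pstdev


def z_score(values):
    """Rescale values to a mean of zero and a variance of one."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (assumed choice)
    if std == 0:
        # All values identical: nothing to scale.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


ages = [21, 35, 49, 63, 55]  # voter_age values from the sample rows
scaled = z_score(ages)
```

The scaled column still has as many distinct values as the original, so the number of possible age classifications is unchanged.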
Option E is correct.
From the Amazon Machine Learning developer guide titled Data Transformations Reference, “The quantile binning processor takes two inputs, a numerical variable and a parameter called bin number, and outputs a categorical variable.
The purpose is to discover non-linearity in the variable's distribution by grouping observed values together.” Because quantile binning converts a numeric variable into a fixed number of bins that each hold roughly the same number of observations, it is the right choice to give you uniform age classifications that are limited in number.
For example, you could create classification bins such as: Under 30, 30 to 50, Over 50. With quantile binning, the exact boundaries come from the quantiles of the age data.
Or even better: Millennial, Generation X, Baby Boomer, etc.
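The binning described above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the Amazon ML processor: the function name `quantile_bin`, the sample ages, and the choice of four bins are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles


def quantile_bin(values, n_bins):
    """Return (bin index per value, cut points) using quantile boundaries."""
    # quantiles() yields the n_bins - 1 cut points that split the data
    # into groups holding roughly equal numbers of observations.
    edges = quantiles(values, n=n_bins, method="inclusive")
    # bisect_right finds which interval each value falls into.
    return [bisect_right(edges, v) for v in values], edges


# Hypothetical sample of voter ages (not taken from the real database).
ages = [21, 23, 25, 28, 31, 35, 38, 42, 49, 55, 63, 71]
bins, edges = quantile_bin(ages, n_bins=4)
counts = Counter(bins)  # number of voters per bin
```

Each bin index can then be mapped to a readable label before the feature is passed to the model.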
Reference:
Please see the Amazon Machine Learning developer guide titled Data Transformations for Machine Learning and the article Feature Engineering in Machine Learning (Part 1) Handling Numeric Data with Binning.
The best numerical feature engineering technique for this problem is E. Quantile Binning Transformation.
Quantile binning is a numerical feature engineering technique used to reduce the number of distinct values in a continuous variable by grouping similar values together. This is done by dividing the range of values into a set of intervals or bins, such that each bin contains a roughly equal number of observations.
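In the context of this problem, we can use quantile binning to group similar ages together and reduce the number of possible classifications. For example, we could divide the ages into quartiles (e.g., under 35, 35-49, 50-64, 65 and over), with the boundaries set by the data so that each bin holds about a quarter of the voters. Fixed bins of 10 years each (e.g., 20-29, 30-39, etc.) are equal-width binning rather than quantile binning, and they can leave some bins far more populated than others.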
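To compare fixed 10-year bins with quartiles, the sketch below counts how many voters land in each bin under both schemes. The age sample is hypothetical and deliberately skewed toward younger voters; the variable names are assumptions for this illustration.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

# Hypothetical, deliberately skewed sample of voter ages.
ages = [21, 22, 22, 23, 24, 25, 26, 27, 28, 29,
        31, 33, 36, 41, 47, 58, 66, 74, 83, 92]

# Equal-width binning: every bin spans 10 years (20-29, 30-39, ...).
decade_counts = Counter((age // 10) * 10 for age in ages)

# Quantile binning: cut points chosen so each bin holds a similar share of rows.
edges = quantiles(ages, n=4, method="inclusive")
quartile_counts = Counter(bisect_right(edges, age) for age in ages)
```

On this sample the decade bins range from ten voters down to one, while each quartile bin holds five, which is the even distribution the question asks for.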
The advantage of using quantile binning is that it can help to reduce overfitting in the model by reducing the noise in the data caused by having too many distinct values for a continuous variable. Additionally, it can simplify the model by reducing the number of distinct values it has to consider for that variable, which can help to improve the interpretability of the model.
The other answer choices are not appropriate for this problem:
A. Cartesian Product Transformation: This is a method for creating new features by combining two or more categorical variables. It is not applicable for this problem, which involves a continuous variable.
B. N-Gram Transformation: This is a text feature engineering technique used for natural language processing. It is not applicable for this problem, which involves a numerical variable.
C. Orthogonal Sparse Bigram (OSB) Transformation: This is a feature engineering technique used for natural language processing. It is not applicable for this problem, which involves a numerical variable.
D. Normalization Transformation: This is a feature scaling technique that rescales the values of a continuous variable to have a mean of zero and a variance of one. It is not applicable for this problem, which involves grouping similar values together.
Therefore, the best answer is E. Quantile Binning Transformation.