You are a machine learning specialist working for a large insurance company.
You are building a machine learning model to predict the likelihood of an insured customer committing insurance fraud.
Your training dataset has many attributes about the insured, the insurance policy, and their insurance claims.
As its prediction, your model needs to produce a continuous value of the probability of fraud for any given customer claim.
Your training data includes labeled outcomes for a set of 100,000 insurance claim observations.
When you visualize the training dataset, you see that out of the 100,000 insurance claims, 24,350 claim records show a policy term length of 0 years.
The remaining features for these observations show no anomalies.
Which feature engineering option will give you the best dataset for your model training?
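A. Use k-means clustering to impute the missing policy length features.
B. Use KNN to impute the missing policy length features.
C. Populate the 0 policy length feature value with the mean or median value of the feature.
D. Drop the records from the dataset where policy length is 0.
Correct Answer: B.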
Option A is incorrect.
The k-means algorithm is an unsupervised learning algorithm that works on unlabeled data.
It is used for clustering, that is, for grouping similar observations, not for predicting the value of a specific feature.
It is not the best choice here, and it is not a method commonly used by practicing machine learning specialists for feature imputation.
Cluster membership alone gives only a coarse estimate of a missing value, typically a cluster-level average, which is usually less precise than an estimate based on each record's closest neighbors.
Option B is correct.
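The K Nearest Neighbor (KNN) algorithm is a supervised learning algorithm that works with labeled data and can be used for classification or regression.
Using KNN, you can impute a missing value from feature similarity: for each record with a missing policy term length, find the k records that are most similar on the remaining features and use the policy term length of those neighbors (for example, their average) as the estimate.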
This is a very common approach used by machine learning specialists to impute missing values.
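The idea behind KNN imputation can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the `knn_impute` helper, the toy `claims` rows, and their column meanings (two already-scaled numeric features followed by the policy term length) are all assumptions made for the example. In practice a library imputer would normally be used instead.

```python
import math

def knn_impute(rows, target, k=3, missing=0):
    """Replace rows[i][target] == missing with the mean of that feature
    over the k nearest complete rows (Euclidean distance on the other features)."""
    others = [j for j in range(len(rows[0])) if j != target]
    complete = [r for r in rows if r[target] != missing]
    result = []
    for r in rows:
        new_row = list(r)
        if r[target] == missing:
            # Rank complete rows by distance on the non-target features only.
            nearest = sorted(
                complete,
                key=lambda c: math.dist([r[j] for j in others],
                                        [c[j] for j in others]),
            )[:k]
            new_row[target] = sum(c[target] for c in nearest) / len(nearest)
        result.append(new_row)
    return result

# Hypothetical, pre-scaled toy data: [feature_1, feature_2, policy_term_years]
# A policy term of 0 is treated as a missing value.
claims = [
    [1.0, 1.0, 1],
    [1.1, 0.9, 1],
    [0.9, 1.1, 2],
    [5.0, 5.0, 10],
    [5.1, 4.9, 10],
    [4.9, 5.2, 9],
    [1.0, 1.05, 0],  # resembles the short-term policies
    [5.0, 5.1, 0],   # resembles the long-term policies
]

filled = knn_impute(claims, target=2, k=3)
print(filled[6][2], filled[7][2])
```

Each missing value is filled from the records that look most like it, so the two incomplete rows receive very different estimates. Because the distance is Euclidean, the features should be on comparable scales before imputing.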
Option C is incorrect.
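While it is common to replace missing feature values with the simple mean or median of the feature, this method is generally less accurate than using the KNN approach, because it assigns the same value to every affected record and ignores the relationship between the policy term length and the other features.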
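A small sketch shows what a single-value fill does to a feature. The `terms` list below is made-up toy data (0 marks a missing policy term), not values from the scenario.

```python
import statistics

# Hypothetical policy term lengths in years; 0 marks a missing value.
terms = [1, 1, 2, 10, 10, 9, 0, 0]

observed = [t for t in terms if t != 0]
median_term = statistics.median(observed)

# Every missing entry receives the same value, whatever the record looks like.
median_filled = [t if t != 0 else median_term for t in terms]

spread_before = statistics.pstdev(observed)
spread_after = statistics.pstdev(median_filled)
print(median_term, spread_before, spread_after)
```

In this toy data the median lies between the short-term and long-term policies, so the filled value resembles neither group, and the spread of the feature shrinks after imputation. The effect grows with the share of missing records.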
Option D is incorrect.
Dropping the records with the missing values is another common approach for dealing with missing feature values.
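However, this approach reduces your training set significantly in this scenario.
The policy term length is missing (recorded as 0) in approximately 24% of your training data: 24,350 of 100,000 claims.
Dropping that many records discards the valid information in their remaining features and will likely reduce the accuracy of your predictions.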
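The cost of dropping the affected records follows directly from the counts given in the question, as this short calculation shows (the variable names are illustrative only):

```python
total_claims = 100_000
zero_term_claims = 24_350

# Share of the training data that would be discarded, and what is left.
missing_share = zero_term_claims / total_claims
remaining_claims = total_claims - zero_term_claims

print(f"{missing_share:.2%} of records dropped, {remaining_claims:,} remain")
```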
References:
Amazon SageMaker Developer Guide, K-Nearest Neighbors (k-NN) Algorithm (https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html)
Amazon SageMaker Developer Guide, K-Means Algorithm (https://docs.aws.amazon.com/sagemaker/latest/dg/k-means.html)
Towards Data Science, 6 Different Ways to Compensate for Missing Values In a Dataset (Data Imputation with examples) (https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779)
The best feature engineering option in this scenario is option B - Use KNN to impute the missing policy length features.
Explanation:
The given dataset has 24,350 records in which the policy length is recorded as 0, which should be treated as a missing value. It is essential to handle these missing values to avoid bias in the model training. Here are the possible feature engineering options and their respective advantages and disadvantages:
Option A - Use k-means clustering to impute the missing policy length features: K-means clustering is a method that groups similar data points together. It is designed for clustering rather than for predicting the value of a specific feature, so at best it fills each gap with a coarse cluster-level estimate. It is not a common choice for imputation.
Option B - Use KNN to impute the missing policy length features: K-Nearest Neighbors (KNN) is a non-parametric method that imputes each missing value from the values of the K most similar records. Because the remaining features of the affected records show no anomalies, they provide a sound basis for finding similar records. KNN requires more computation than a simple mean or median, but that cost is generally manageable for a dataset of 100,000 records.
Option C - Populate the 0 policy length feature value with the mean or median value of the feature: Populating the missing values with the mean or median value of the feature is a common and straightforward method of handling missing data. It is a good option when the amount of missing data is small and does not significantly affect the overall data distribution. In this case, about 24% of the records are affected, so assigning one value to all of them would distort the distribution of the policy length feature and ignore its relationship with the other features.
Option D - Drop the records from the dataset where policy length is 0: Dropping the records where policy length is 0 is not an efficient option, as it will significantly reduce the data size, and we may lose valuable information that is present in the other features. Also, this option may introduce bias in the data if the missing data is not missing at random.
Conclusion:
In summary, the best feature engineering option in this scenario is option B - Use KNN to impute the missing policy length features. This method keeps all 100,000 records and estimates each missing value from similar records, which is generally more accurate than a single mean or median value.