You work as a machine learning specialist at a retail shoe manufacturer.
Your marketing department wants to do a promotion for a new running shoe they are about to release into their product pipeline.
They need a model to predict sales of the new shoe using the purchase history of their registered customers based on past releases of new shoes. You have decided to use a linear regression algorithm for your model.
Your data has thousands of observations and 35 numeric features.
While doing analysis to understand your data better, you find 25 observations that have what looks like outlier data points.
After speaking to your marketing department, you learn that these values are valid.
You also find several hundred observations that have some blank feature values. How should you correct the outlier and blank feature problems?
Answer: C.
Option A is incorrect.
Linear regression calculations cannot handle null values, so nulls in an observation must themselves be replaced.
Therefore, replacing the empty fields with null does not solve the problem.
Option B is incorrect.
Removing the observations with blank values will likely reduce the accuracy of your model's predictions, since you would be discarding several hundred observations from the training dataset.
Option C is correct.
You should remove the outlier observations.
You should also replace the blank values with a meaningful value.
The mean value is the best option of those listed.
Option D is incorrect.
You should remove the outlier observations.
You should also replace the blank values with a meaningful value.
The 0 value is not the best option of those listed because the mean is generally a better approximation than 0 for a continuous numeric value.
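As a rough illustration of why the mean is usually the safer fill value, the sketch below imputes one feature column both ways and compares the resulting column means. The column name `spend`, its values, and the helper `impute` are hypothetical and are not taken from the question.

```python
from statistics import mean


def impute(values, fill):
    """Return a copy of `values` with None (blank) entries replaced by `fill`."""
    return [fill if v is None else v for v in values]


# Hypothetical numeric feature column; None marks a blank value.
spend = [120.0, 80.0, None, 100.0, None, 100.0]

# Mean of the observed (non-blank) values only.
observed = [v for v in spend if v is not None]
col_mean = mean(observed)

mean_filled = impute(spend, col_mean)
zero_filled = impute(spend, 0.0)

# Mean imputation leaves the column mean where it was;
# zero imputation pulls it toward 0.
print("observed mean:      ", col_mean)
print("after mean-filling: ", mean(mean_filled))
print("after zero-filling: ", mean(zero_filled))
```

With zero-filling, the regression sees several customers with an apparent spend of 0, which is a value none of the observed customers actually has.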
Reference:
Please see the Amazon Machine Learning developer guide titled Feature Processing.
When building a machine learning model, it's crucial to preprocess the data correctly to support the model's accuracy and efficiency. In this scenario, you have decided to use a linear regression algorithm for the model. Linear regression is a supervised learning algorithm used to predict a continuous output variable from the input features. Before training the linear regression model, the data needs to be preprocessed, and the problems with the outlier and blank feature data need to be addressed.
Outlier data points are the data points that lie far away from the other data points in the dataset. These data points can negatively impact the linear regression model's accuracy as they can influence the model's coefficients and bias the model's predictions. In this scenario, you have found 25 observations with outlier data points, but after speaking to the marketing department, you learned that these values are valid.
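The effect of a single extreme point on the fitted coefficients can be sketched with a one-feature least-squares fit. The data and the helper `fit_line` below are hypothetical and written only for illustration; a real model with 35 features would use a library implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical clean data that follows y = 2x exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
slope_clean, intercept_clean = fit_line(xs, ys)

# The same data plus one extreme observation.
xs_out = xs + [6.0]
ys_out = ys + [60.0]
slope_out, intercept_out = fit_line(xs_out, ys_out)

print("slope without outlier:", slope_clean)
print("slope with outlier:   ", slope_out)
```

One added point is enough to move the slope well away from 2, because least squares penalizes squared errors and so weights distant points heavily.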
In such cases, instead of removing the observations with outlier data points, you can consider a few options:
On the other hand, blank feature values can cause issues when building a linear regression model. These missing values need to be handled before feeding the data into the model. There are several ways to handle missing data:
Based on the available options in the exam, the best approach for correcting the outlier and blank feature problems is option C: Remove the observations with the outlier data points and replace the blank values with the mean value. This approach removes the outlier data points that could bias the model's predictions and imputes the missing values with the mean value, which can work well for continuous features that follow a normal distribution.
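A minimal sketch of the option C workflow, assuming outliers are flagged with a z-score threshold (the question does not say how the 25 observations were identified). The column layout, the threshold of 2.0, and the helper names are all hypothetical.

```python
from statistics import mean, pstdev


def drop_outlier_rows(rows, col, z_max=2.0):
    """Drop rows whose value in `col` lies more than `z_max` standard
    deviations from the column mean. Rows with a blank (None) are kept."""
    vals = [r[col] for r in rows if r[col] is not None]
    mu, sigma = mean(vals), pstdev(vals)
    if sigma == 0:
        return list(rows)
    return [r for r in rows
            if r[col] is None or abs(r[col] - mu) / sigma <= z_max]


def impute_mean(rows, col):
    """Replace blanks (None) in `col` with the mean of the observed values."""
    mu = mean(r[col] for r in rows if r[col] is not None)
    return [[mu if (i == col and v is None) else v for i, v in enumerate(r)]
            for r in rows]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical rows: [past_purchases, units_sold]; None marks a blank feature.
rows = [
    [1.0, 10.0], [2.0, 20.0], [None, 30.0], [4.0, 40.0], [5.0, 50.0],
    [3.0, 30.0], [2.0, 20.0], [4.0, 40.0],
    [60.0, 35.0],  # extreme feature value
]

# Step 1: remove outlier observations. Step 2: mean-impute the blanks.
cleaned = impute_mean(drop_outlier_rows(rows, col=0), col=0)

# Step 3: fit the regression on the cleaned data.
slope, intercept = fit_line([r[0] for r in cleaned], [r[1] for r in cleaned])
print(len(cleaned), "rows kept; slope =", slope, "intercept =", intercept)
```

Removing the outliers before computing the mean keeps the extreme value from distorting the fill value used for the blanks.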