You are working for a consulting firm in its machine learning practice.
Your current client is a sports equipment manufacturer.
You are building a linear regression model to predict ski and snowboard sales based on the daily snowfall in various regions around the country.
After you have cleaned and performed feature engineering on your CSV data, which of the following tasks would you perform next?
A. B. C. D.
Answer: D.
Option A is incorrect.
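The scikit-learn cross_validate function evaluates a model by cross-validation, returning test scores along with fit and score times. That is a model evaluation step, so it comes after the data has been prepared for training, not immediately after feature engineering.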
(See Scikit-Learn user guide titled cross_validate)
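As a rough illustration of what cross_validate does once a model is ready to be evaluated, the sketch below scores a LinearRegression with 5-fold cross-validation. The snowfall and sales figures are synthetic stand-ins invented for this example, not the client's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (hypothetical): daily snowfall in cm vs. units sold.
rng = np.random.default_rng(seed=0)
snowfall_cm = rng.uniform(0, 50, size=200).reshape(-1, 1)
units_sold = 3.0 * snowfall_cm.ravel() + 20.0 + rng.normal(0, 5, size=200)

# 5-fold cross-validation, scored with R^2 and negated mean absolute error.
results = cross_validate(
    LinearRegression(),
    snowfall_cm,
    units_sold,
    cv=5,
    scoring=("r2", "neg_mean_absolute_error"),
)

# One score per fold; scikit-learn negates error metrics, so flip the sign.
mean_r2 = results["test_r2"].mean()
mean_mae = -results["test_neg_mean_absolute_error"].mean()
print(f"mean R^2: {mean_r2:.3f}, mean MAE: {mean_mae:.2f}")
```

The returned dictionary also holds fit_time and score_time arrays, one entry per fold.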
Option B is incorrect.
Using a pandas DataFrame to remove superfluous rows and features is part of cleaning and feature engineering, which you have already done.
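For context, the kind of cleanup this option describes might look like the sketch below. The CSV contents and column names (region, snowfall_cm, units_sold, internal_note) are hypothetical and chosen only for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV contents; the column names are illustrative only.
raw_csv = io.StringIO(
    "region,snowfall_cm,units_sold,internal_note\n"
    "north,12.5,140,ok\n"
    "north,12.5,140,ok\n"
    "west,,95,check\n"
    "east,30.0,210,ok\n"
)

df = pd.read_csv(raw_csv)

cleaned = (
    df.drop(columns=["internal_note"])   # drop a feature with no predictive use
      .drop_duplicates()                 # drop repeated rows
      .dropna(subset=["snowfall_cm"])    # drop rows missing the predictor
      .reset_index(drop=True)
)
print(cleaned)
```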
Option C is incorrect.
One-hot encoding is another way to do feature engineering on your data in preparation for training.
You have already completed the cleaning and feature engineering of your data.
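A minimal sketch of one-hot encoding with pandas, assuming a hypothetical region column, shows why it counts as feature engineering: it changes the feature set itself rather than the order of the rows.

```python
import pandas as pd

# Hypothetical sample; "region" is the categorical feature to encode.
df = pd.DataFrame(
    {
        "region": ["north", "west", "east", "west"],
        "snowfall_cm": [12.5, 8.0, 30.0, 15.0],
    }
)

# Replace "region" with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["region"], dtype=int)
print(encoded)
```

For a linear regression with an intercept, passing drop_first=True to get_dummies is a common way to avoid perfectly collinear indicator columns.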
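Option D is correct.
Once you have cleaned and engineered your data for a linear regression model, you shuffle it before splitting it into training and evaluation sets. If the CSV is ordered, for example by region or date, an unshuffled split gives the model an unrepresentative sample to learn from, which encourages overfitting to that ordering and increases error on unseen data.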
(See Amazon Machine Learning developer guide titled The Amazon Machine Learning Process)
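One way to shuffle is sketched below with pandas, assuming a hypothetical DataFrame that arrives sorted by region. The column names, the 75/25 split, and the seed are illustrative choices rather than anything prescribed by the question.

```python
import pandas as pd

# Hypothetical data, ordered by region the way an exported CSV might be.
df = pd.DataFrame(
    {
        "region": ["north"] * 4 + ["west"] * 4,
        "snowfall_cm": [10, 12, 14, 16, 2, 3, 4, 5],
        "units_sold": [120, 130, 150, 160, 40, 45, 50, 60],
    }
)

# Shuffle all rows; a fixed seed keeps the result reproducible.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Split the shuffled rows into training (75%) and evaluation (25%) sets.
split_at = int(len(shuffled) * 0.75)
train_df = shuffled.iloc[:split_at]
eval_df = shuffled.iloc[split_at:]
print(len(train_df), len(eval_df))
```

scikit-learn's train_test_split shuffles by default (shuffle=True), so it can combine the shuffle and the split in one call.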
Reference:
Please see the Amazon Machine Learning developer guide titled Machine Learning Concepts, and the Amazon Machine Learning developer guide titled The Amazon Machine Learning Process.
After cleaning and performing feature engineering on the CSV data, the next step would be to prepare the data for model training.
Option A: Using the scikit-learn cross_validate function to evaluate the estimation precision of the model is a valid step, but it belongs to model evaluation, which happens after the data has been prepared for training.
Option B: Loading the data into a pandas DataFrame and removing header rows and any superfluous features is a valid data preparation step. It reads the data into a tabular format that is easy to manipulate and drops features that do not contribute to the model's accuracy. However, it is part of cleaning and feature engineering, which the question states is already complete.
Option C: Using one-hot encoding to convert categorical values, such as 'region of the country', to numerical values is also a valid data preparation step. A linear regression model needs numerical inputs, and one-hot encoding represents each category as its own binary feature. However, this too is feature engineering, which the question states is already complete.
Option D: Shuffling the data is the step that remains. It randomly reorders the rows to remove any inherent structure that may bias the model, and it is particularly important when the data is ordered by some criterion, such as time.
Therefore, the correct answer is D. Options B and C describe cleaning and feature engineering work that has already been completed, and option A is a valid step that should be performed only after the data preparation step.