Linear Regression Model Cross-Validation for Unbiased Training Data

Assessing Generalization of a Linear Regression Model with k-fold Cross-Validation

Question

You work as a machine learning specialist for a clothing manufacturer.

You have built a linear regression model using SageMaker's built-in linear learner algorithm to predict sales for a given year.

Your training dataset observations are based on several features such as marketing dollars spent, number of active stores, traffic per store, online traffic to the company website, overall market indicators, etc.

You have decided to use the k-fold method of cross-validation to assess how the results of your model will generalize beyond your training data. Which of these will indicate that you don't have biased training data?

Explanations

Answer: D.

Option A is incorrect.

When using k-fold for cross-validation, the variance of the estimate is reduced as you increase k.

So 10-fold cross-validation should have a lower variance than 5-fold cross-validation.

Option B is incorrect.

The k-fold error function only reports the error rate for each cross-validation round.

It does not remove bias from the training data.

Option C is incorrect.

The goal of k-fold cross-validation is to produce relatively equal error rates for each round (indicating proper randomization of the data), not to reduce the error rate for each round.

Option D is correct.

If you have relatively equal error rates across all k-fold rounds, it is an indication that you have properly randomized your data before splitting it, therefore reducing the chance of bias.

Option E is incorrect.

The k-fold cross-validation technique is commonly used with linear regression analysis.

Reference:

Please see the Amazon Machine Learning developer guide titled Evaluating ML Models, and the Amazon Machine Learning developer guide titled Cross-Validation.

The correct answer is D. Every k-fold cross-validation round has a very similar error rate to the rate of all the other rounds.

Explanation:

The k-fold method of cross-validation involves dividing the dataset into k equal-sized subsets. One subset is used as a validation set, and the other k-1 subsets are used for training. This process is repeated k times, with each subset used once as the validation set. The results from each of the k runs are then averaged to produce a single performance estimate.
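
As a rough sketch of that procedure (not the SageMaker Linear Learner workflow from the question), the following Python example runs 10-fold cross-validation on an ordinary linear regression model; it assumes scikit-learn is available and uses synthetic placeholder data rather than the real sales dataset:

# Minimal k-fold cross-validation sketch. Assumes scikit-learn is installed;
# the features and target are synthetic stand-ins for the sales dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))     # e.g. marketing spend, store count, traffic, ...
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

k = 10
kf = KFold(n_splits=k, shuffle=True, random_state=42)  # shuffle to randomize the folds
fold_errors = []

for train_idx, val_idx in kf.split(X):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])       # train on the k-1 folds
    preds = model.predict(X[val_idx])           # validate on the held-out fold
    fold_errors.append(mean_squared_error(y[val_idx], preds))

print("Per-fold MSE:", np.round(fold_errors, 3))
print("Average MSE :", np.mean(fold_errors))    # the single performance estimate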

Cross-validation helps to assess how well a model will generalize to new data. If a model is overfitting the training data, it will perform well on the training set but poorly on the validation set, resulting in a high variance estimate. Conversely, if a model is underfitting, it will perform poorly on both the training and validation sets, resulting in a high bias estimate.
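
To illustrate that diagnostic, the sketch below records both the training and validation error for each fold; again it assumes scikit-learn and synthetic data rather than the SageMaker setup described in the question:

# Sketch: per-fold training vs. validation error to spot over- or underfitting.
# scikit-learn returns negated MSE for this scorer; it is flipped to positive below.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

results = cross_validate(
    LinearRegression(), X, y,
    cv=10,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)

train_mse = -results["train_score"]
val_mse = -results["test_score"]
print("Mean training MSE  :", train_mse.mean())
print("Mean validation MSE:", val_mse.mean())
# Validation MSE far above training MSE -> likely overfitting (high variance).
# High MSE on both                      -> likely underfitting (high bias).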

Answer A is incorrect because increasing k reduces the variance of the cross-validation estimate, so 10-fold cross-validation should show lower variance than 5-fold. An increase in variance would instead suggest a problem such as overfitting; it says nothing about the training data being unbiased.

Answer B is incorrect because the error function used in training the model does not necessarily remove bias in the data. Bias can still be present in the training data, and cross-validation can help to detect it.

Answer C is incorrect because the goal of k-fold cross-validation is not to drive the error rate down from round to round; a decreasing error rate does not by itself indicate unbiased data. Uniformly high error rates in every round would instead point to underfitting, while what matters for detecting bias is that the error rates across rounds are roughly equal.

Answer E is incorrect because k-fold cross-validation can be used with any type of model, including linear regression.

Therefore, the correct answer is D. Every k-fold cross-validation round has a very similar error rate to the rate of all the other rounds, indicating that the model is not overfitting or underfitting the data and that the training data is not biased.
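
As a concrete, hedged version of the check behind answer D, the sketch below compares the per-fold error rates; the relative-spread cutoff of 0.25 is an arbitrary illustration, not an established rule, and the data are again synthetic:

# Sketch: check whether the per-fold error rates are roughly equal.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

fold_mse = -cross_val_score(
    LinearRegression(), X, y, cv=10, scoring="neg_mean_squared_error"
)

spread = fold_mse.std() / fold_mse.mean()   # relative spread of the fold errors
print("Per-fold MSE   :", np.round(fold_mse, 3))
print("Relative spread:", round(spread, 3))

# The 0.25 threshold is purely illustrative.
if spread < 0.25:
    print("Fold errors look similar -- consistent with well-randomized data.")
else:
    print("Fold errors differ a lot -- check shuffling/ordering before splitting.")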