You are working for an insurance company that has recently introduced an ML model to predict the risk level of new customers.
During the training phase, the model showed a very high accuracy of 99.5%, and its test accuracy was above 96%. However, in production, the overall accuracy on real cases is only around 60%, which is rather low.
You know that the Azure AutoML service can help you solve the problem.
Which best practices should you follow?
A. B. C. D.
Answer: C.
Option A is incorrect because one of the possible reasons for a model's overfitting is that there are too many features in the training dataset.
In this case, adding further features to the dataset doesn't solve the problem.
Option B is incorrect because an underfitting model performs poorly on the training data, which means that it fails to capture the actual relationships between variables.
Your model's performance on training data is excellent, so this is not a case of underfitting.
Option C is CORRECT because if there are too many features in the training dataset, the model tends to “memorize” some patterns which are not necessarily valid for real-life cases.
By decreasing the number of features you can decrease the model's complexity.
This, together with AutoML's cross-validation functionality, might help prevent overfitting.
Option D is incorrect because an underfitting model performs poorly on the training data, which means that it fails to capture the relationships among variables.
Your model's performance on training data is excellent, so this is not a case of underfitting.
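The reasoning used above to tell overfitting from underfitting can be sketched as a small check. This is only an illustration: the function name `diagnose_fit` and both thresholds are assumptions made for this example, not values taken from the question or from Azure AutoML.

```python
def diagnose_fit(train_acc, unseen_acc, gap_threshold=0.10, low_threshold=0.70):
    """Classify a model's fit from its accuracy on training and unseen data.

    Both thresholds are illustrative assumptions, not standard values.
    """
    # Poor accuracy on the training data itself points to underfitting.
    if train_acc < low_threshold:
        return "underfitting"
    # Good training accuracy with a large drop on unseen data points to overfitting.
    if train_acc - unseen_acc > gap_threshold:
        return "overfitting"
    return "ok"


# Figures from the scenario: 99.5% on training data, about 60% in production.
print(diagnose_fit(0.995, 0.60))  # prints "overfitting"
```

Training accuracy is far above the hypothetical `low_threshold`, so underfitting is ruled out first, which mirrors the argument against options B and D.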
Reference:
The scenario described in the question implies that the ML model is performing well during training but not in the production environment. This is a classic case of overfitting. Overfitting occurs when the model fits too closely to the training data and learns to capture the noise in the data. As a result, the model becomes too complex and fails to generalize to new data.
Therefore, the best answer is C. The reasons are as follows:
The model is probably overfitting: The high training accuracy and low production accuracy suggest that the model is overfitting to the training data. Overfitting can be controlled by regularization, a technique that adds a penalty term to the loss function to discourage overly complex models.
Decrease the number of features in your dataset: With too many features, the model can memorize patterns that do not hold for real-life cases. Removing irrelevant or redundant features reduces the model's complexity, whereas adding further features tends to make overfitting worse.
Use AutoML's regularization to control overfitting: Azure AutoML has built-in regularization techniques that can be used to control overfitting. L1 and L2 regularization, for example, add a penalty term to the loss function.
Cross-validation: Cross-validation is a technique used to evaluate the performance of a model. It involves splitting the dataset into several parts and using each part in turn as a validation set while training on the remaining parts. Cross-validation can help detect overfitting and underfitting problems in a model.
Therefore, in summary, the best approach is to reduce the number of features in the dataset, control overfitting using AutoML's regularization, and use cross-validation during training so that the model generalizes well to new data.
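To make the regularization and cross-validation ideas above concrete, here is a minimal sketch in plain Python. It is not the Azure AutoML API: the helper names (`k_fold_indices`, `fit_ridge_1d`, `cv_mse`), the single-weight linear model, and the synthetic data are all assumptions chosen to keep the example short. It uses a regression task with an L2 penalty rather than the classification task from the question.

```python
import random


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def fit_ridge_1d(xs, ys, lam):
    """Minimise sum((y - w*x)^2) + lam * w^2 for a single weight w (no intercept)."""
    # Closed-form solution; a larger lam shrinks w towards zero.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_mse(xs, ys, lam, k=5):
    """Mean validation MSE over k folds for a given L2 penalty strength."""
    fold_errors = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        w = fit_ridge_1d([xs[i] for i in train_idx], [ys[i] for i in train_idx], lam)
        mse = sum((ys[i] - w * xs[i]) ** 2 for i in val_idx) / len(val_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)


# Synthetic data (an assumption for illustration): y = 2x plus Gaussian noise.
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + random.gauss(0, 0.3) for x in xs]

for lam in (0.0, 1.0, 10.0):
    print(f"lambda={lam}: cross-validated MSE = {cv_mse(xs, ys, lam):.3f}")
```

Comparing the cross-validated error across penalty strengths is one way to choose how much regularization to apply. A model is only ever scored on data it did not see during fitting, which is what allows this procedure to expose overfitting.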