You are working for a bank which has recently introduced an ML inference service to predict the churn of its customers.
During the training phase, the model showed an excellent accuracy of 99.5%, and its test accuracy was above 96%. However, in production the overall accuracy on real cases is around 60%, which is rather low.
It seems that the model is overfitting the data.
Which of the following actions are recommended practices?
A. Decrease the number of features used for training; increase the model's complexity.
B. Decrease the number of observations used for training; add more features to the training data.
C. Get more observations for training; apply cross-validation.
D. Use regularization with hyperparameter tuning; add more features to the dataset.

Answer: C.
Option A is incorrect because, while decreasing the number of features can help, increasing the model's complexity makes overfitting worse. It is limiting the complexity of the algorithm, a feature offered by auto ML services, that prevents models from becoming overly complicated and overfitting.
The technique is usually applied to decision-tree algorithms by limiting the depth of the trees.
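As a sketch of how a depth limit constrains tree growth, the toy recursive tree builder below (pure Python, with hypothetical function names chosen for illustration) stops splitting once `max_depth` is reached, so the tree can never grow deeper than that bound:

```python
def build_tree(points, depth=0, max_depth=3):
    """Recursively split 1-D labelled (x, y) points; max_depth caps growth."""
    labels = [y for _, y in points]
    # Stop when the node is pure or the depth limit is reached: emit a leaf.
    if depth >= max_depth or len(set(labels)) <= 1:
        return max(set(labels), key=labels.count)  # majority label
    xs = sorted(x for x, _ in points)
    threshold = xs[len(xs) // 2]  # naive median split
    left = [p for p in points if p[0] < threshold]
    right = [p for p in points if p[0] >= threshold]
    if not left or not right:
        return max(set(labels), key=labels.count)
    return {
        "threshold": threshold,
        "left": build_tree(left, depth + 1, max_depth),
        "right": build_tree(right, depth + 1, max_depth),
    }

def tree_depth(node):
    """Depth of the toy tree: leaves count as 0."""
    if not isinstance(node, dict):
        return 0
    return 1 + max(tree_depth(node["left"]), tree_depth(node["right"]))
```

A shallow cap like this trades a little training accuracy for much better generalization, which is exactly the remedy the option alludes to.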
Option B is incorrect because the number of observations (data points) should be increased, and the number of features decreased, to prevent overfitting; the option proposes the opposite.
Option C is CORRECT because, in general, getting more data for training is the simplest and best way to prevent overfitting, and it typically also increases accuracy.
Cross-validation, a technique that runs the training process on several subsets of the training data, also helps avoid overfitting.
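The k-fold splitting behind cross-validation can be sketched in a few lines. The helper below is a simplified pure-Python illustration (in practice a library routine such as scikit-learn's `KFold` would be used); it yields train/validation index pairs in which every observation serves as validation data exactly once:

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    # Distribute n samples across k folds as evenly as possible.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        # Training indices are everything outside the current fold.
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size
```

Training and evaluating the model once per fold, then averaging the validation scores, gives a far more honest estimate of production performance than a single train/test split.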
Option D is incorrect because the number of features should be decreased, not increased, to prevent overfitting.
Adding more features increases model complexity and has an adverse effect.
Reference:
The scenario presented suggests that the model is overfitting to the training data, which means that the model performs very well on the data it has already seen during training but does not generalize well to new and unseen data. This is evidenced by the high accuracy during training (99.5%) and on the test set (above 96%) versus the much lower accuracy in production use (around 60%).
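A minimal sketch of this kind of diagnosis, comparing offline accuracy against live accuracy (the threshold and function name are assumptions chosen for illustration):

```python
def accuracy_gap_alert(train_acc, live_acc, max_gap=0.05):
    """Return True when the train/live accuracy gap suggests overfitting."""
    return (train_acc - live_acc) > max_gap

# The scenario's numbers: 99.5% in training vs roughly 60% in production.
likely_overfitting = accuracy_gap_alert(0.995, 0.60)
```

A monitoring job running a check like this against production metrics would have flagged the churn model long before anyone inspected it manually.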
To address this issue, it is recommended to take the following actions:
C. Get more observations for training; apply cross-validation: The accuracy of a machine learning model can be improved by training it on more data. Therefore, acquiring more observations for training can help the model to generalize better. Cross-validation is another useful technique that can be applied to evaluate the performance of the model and tune its hyperparameters.
D. Use regularization with hyperparameter tuning; add more features to the dataset: Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function, which limits the complexity of the model and improves its generalization, and hyperparameter tuning can find a good value for the regularization strength. However, adding more features to the dataset increases model complexity and tends to worsen overfitting, which is why this option is not recommended as a whole.
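To illustrate how an L2 (ridge) penalty limits complexity, the sketch below fits a one-feature linear model by gradient descent on a loss of mean squared error plus `lam * w**2`; the toy data, learning rate, and step count are assumptions. A larger penalty shrinks the learned weight toward zero:

```python
def fit_ridge_1d(xs, ys, lam, lr=0.01, steps=2000):
    """Gradient descent for 1-feature ridge regression: MSE + lam * w**2."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error plus the L2 penalty term.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n + 2 * lam * w
        w -= lr * grad
    return w
```

With `lam=0` the fit recovers the unpenalized slope; with a positive `lam` the weight is pulled below it. In practice the regularization strength would be chosen by hyperparameter tuning over cross-validation folds.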
A. Decrease the number of features used for training; increase the model's complexity: Decreasing the number of features can reduce the complexity of the model and help prevent overfitting, but increasing the model's complexity exacerbates the problem, so this option is not recommended.
B. Decrease the number of observations used for training; add more features to the training data: Reducing the number of observations deprives the model of data and does not address overfitting, and adding more features increases the complexity of the model and worsens the problem.
In summary, the recommended actions are those in option C: get more observations for training and apply cross-validation. Regularization with hyperparameter tuning can also help, but adding more features would not.