Predicting Score Difference Outcomes in Online Gambling | Multicollinearity Solutions

Addressing Multicollinearity in Predictive Models

Question

You work as a machine learning specialist for an online gambling software company.

Your online app allows users to gamble on the outcomes of sporting matches, such as football, basketball, cricket, etc.

Your machine learning team is responsible for predicting the score difference outcomes of these matches so your company can set the betting line.

For example, team A will beat team B by 7.5 wickets, where 7.5 is the betting line.

Your data sources for your models contain many features, such as team power ranking, previous match score differences, player injury reports, etc.

You have transformed your data to make all features numeric (either counts or continuous values).

However, through your data discovery you have noticed that some of your features are multicollinear.

How can you address the multicollinearity of your features?

Answers

Explanations


Correct Answer: C.

Option A is incorrect.

Linear Discriminant Analysis (LDA) is used to reduce dimensionality in multi-class classification problems that predict a categorical target.

Here, however, we are predicting a continuous target: the match score difference, or spread.

Also, even with a more appropriate dimensionality reduction algorithm, you would target the components with very low variance for removal, not the high-variance ones.

Option B is incorrect.

Linear Discriminant Analysis (LDA) is used to reduce dimensionality in multi-class classification problems that predict a categorical target.

Here, however, we are predicting a continuous target: the match score difference, or spread.

Option C is correct.

Using Principal Component Analysis (PCA) to reduce the dimensionality of your feature set and dropping the components that have very low variance removes the multicollinearity of your features, because the resulting principal components are orthogonal (uncorrelated) with one another.

Option D is incorrect.

Principal Component Analysis (PCA) is the correct choice of algorithm to remove the multicollinearity of your features.

However, you want to drop the components that have very low variance, not the components that have high variance.

Reference:

Towards Data Science, "Multicollinearity - How does it create a problem?" (https://towardsdatascience.com/https-towardsdatascience-com-multicollinearity-how-does-it-create-a-problem-72956a49058)

Kaggle, "Principal Component Analysis" (https://www.kaggle.com/ryanholbrook/principal-component-analysis)

Towards Data Science, "A beginner's guide to dimensionality reduction in Machine Learning" (https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e)

Machine Learning Mastery, "Linear Discriminant Analysis for Machine Learning" (https://machinelearningmastery.com/linear-discriminant-analysis-for-machine-learning/)

StatsTest.com, "Linear Discriminant Analysis" (https://www.statstest.com/linear-discriminant-analysis/)

Machine Learning Mastery, "Linear Discriminant Analysis for Dimensionality Reduction in Python" (https://machinelearningmastery.com/linear-discriminant-analysis-for-dimensionality-reduction-in-python/)

Multicollinearity is a common problem in statistical modeling, where two or more predictor variables are highly correlated. In this situation, it becomes difficult to isolate the effect of each variable on the target variable, as their effects become intertwined. In the context of machine learning, multicollinearity can lead to overfitting and reduced model performance.
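As an illustration only (the feature names and data below are synthetic stand-ins, not the company's actual feature set), a common way to confirm suspected multicollinearity before choosing a remedy is to inspect pairwise correlations and variance inflation factors (VIF):

```python
# Illustrative sketch: detecting multicollinearity with pairwise correlations
# and variance inflation factors (VIF). Features and data are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
power_ranking = rng.normal(size=500)
# Nearly a linear copy of power_ranking, i.e. deliberately collinear
prev_score_diff = 0.9 * power_ranking + rng.normal(scale=0.1, size=500)
injuries = rng.poisson(2, size=500).astype(float)

X = pd.DataFrame({
    "power_ranking": power_ranking,
    "prev_score_diff": prev_score_diff,
    "injuries": injuries,
})

# Pairwise correlations: values near +/-1 flag collinear pairs
print(X.corr().round(2))

# VIF per feature (computed with an intercept term added);
# values well above roughly 5-10 indicate problematic multicollinearity
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif.round(1))
```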

To address multicollinearity in the features, dimensionality reduction techniques can be applied. Two such techniques are Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA).

LDA is a technique that aims to find a linear combination of features that characterizes or separates two or more classes of objects or events. LDA is a supervised learning method that maximizes the separation between classes while minimizing the variance within classes. In the context of dimensionality reduction, LDA can be used to project the original features into a lower-dimensional space while preserving the most important discriminatory information.
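For contrast, here is a minimal sketch of LDA used as a dimensionality reducer on synthetic data. Note that it must be given discrete class labels, which is exactly why it does not fit this regression problem:

```python
# Illustrative sketch: LDA as a supervised dimensionality reducer.
# It requires discrete class labels (y), so it suits classification,
# not a continuous spread target. Data here is synthetic.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))        # 6 numeric features
y = rng.integers(0, 3, size=300)     # 3 discrete classes -- required by LDA

# With k classes, LDA can keep at most k - 1 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)                   # (300, 2)
```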

PCA, on the other hand, is an unsupervised learning technique that identifies the principal components of the data. Principal components are linear combinations of the original features that capture the most variance in the data. PCA aims to reduce the dimensionality of the data by projecting it onto a lower-dimensional space while preserving as much of the original variance as possible.
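A minimal sketch of that idea with scikit-learn (synthetic data, and the 95% variance threshold is an illustrative choice): fit PCA on standardized features, inspect how much variance each component explains, and keep only the informative ones.

```python
# Illustrative sketch: PCA on standardized features. The components are
# mutually orthogonal (uncorrelated), which is what removes multicollinearity.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 4))
# Append two near-duplicate columns to create collinearity
X = np.hstack([base, base[:, :2] + rng.normal(scale=0.05, size=(500, 2))])

X_std = StandardScaler().fit_transform(X)   # PCA is scale-sensitive
pca = PCA().fit(X_std)
print(pca.explained_variance_ratio_.round(3))

# Keep the high-variance components; the near-zero ones carry the redundancy
pca_reduced = PCA(n_components=0.95)        # retain 95% of the variance
X_reduced = pca_reduced.fit_transform(X_std)
print(X_reduced.shape)
```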

Both LDA and PCA reduce dimensionality, but because LDA requires a categorical target it does not apply to this regression problem; PCA is the appropriate choice here. Once the data has been projected into a lower-dimensional space, the resulting components can be examined to see how much variance each one explains. The components with high variance capture most of the information in the data and should be retained, while the components with very low variance are largely redundant and can be dropped.

In the context of the given scenario, the best approach is to use PCA to reduce the dimensionality of the data and then drop the resulting components that have very low variance. This preserves the most important information in the data while removing the redundant directions that cause multicollinearity.

Therefore, the correct answer is option C: use Principal Component Analysis (PCA) to reduce your model's dimensionality, then drop the resulting components that have very low variance.
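Putting it together for the scenario, here is a hypothetical end-to-end sketch with synthetic data (not the company's actual features): standardize, apply PCA so that only the high-variance, mutually uncorrelated components are kept, and feed them to a regressor that predicts the continuous spread.

```python
# Illustrative end-to-end sketch: PCA inside a regression pipeline that
# predicts a continuous score difference (spread). Data is synthetic.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
# Make the last five features near copies of the first five (collinear)
X[:, 5:] = X[:, :5] + rng.normal(scale=0.02, size=(1000, 5))
# Continuous target: the spread, driven by the informative features
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=1000)

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),      # keep high-variance components, drop the rest
    LinearRegression(),
)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean().round(3))
```

Because the PCA step hands the regressor orthogonal components, the coefficient instability that multicollinearity would otherwise cause is avoided.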