Feature Selection and Dimensionality Reduction for Police Officer Allocation Model | MLS-C01 Exam Answer


Question

You work for the city planning department of a major metropolitan city in the United States.

You are on the city's machine learning team where you are responsible for creating a model that assists in the resource planning for police officers in the city.

Each day the city has to assign police officers to each precinct according to varying parameters.

You have data from the past several years for your city and other US cities of similar makeup.

You are in the process of deciding which algorithm to use for your police officer allocation model.

Your goal is to predict the police officer allocation size for a given shift based on your dataset features. Your city dataset has the following features:

  1. Infrastructure average age
  2. Square feet
  3. Citizens
  4. Precincts
  5. Residences
  6. Population density
  7. Police officers

Before you select an algorithm, you need to perform feature selection and dimensionality reduction of your features.

You want to keep only the features that are relevant for training your model, which reduces the dimensionality of your dataset.

This process will help you prevent overfitting and increase computation efficiency through simplification of the feature set.

You have chosen to use visualization techniques to decide which of your 7 features are the most important or most relevant, in other words, which of your 7 features are needed to train your model properly.

Which visualization techniques are the best to use for this purpose? (Choose TWO)

Answers

A. Catplot
B. Swarm plot
C. Pairs plot
D. Covariance matrix
E. Entropy

Explanations

Answers: C, D.

Option A is incorrect.

A catplot is used to show the relationship between a numerical value and one or more categorical variables using a visualization such as violinplot, boxenplot, etc.

But you are trying to show relationships between pairs of data, such as police officers to population density or police officers to precincts.

Option B is incorrect.

A swarm plot is a categorical scatter plot that shows the distribution of values within each category, with the points adjusted so that they do not overlap.

But you are trying to show relationships between pairs of data, such as police officers to population density or police officers to precincts.

Option C is correct.

A pairs plot (also called a scatterplot matrix) shows the relationship between each pair of features as a scatter plot, and the distribution of each individual feature along the diagonal.

This is what you need to analyze.

You want to see which features correlate well with your target feature, police officers.

Option D is correct.

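A covariance matrix shows how each pair of features varies together. When its entries are normalized by the features' standard deviations, it becomes a correlation matrix, which expresses the degree of correlation on a scale from -1 to 1.

This visualization gives you a numerical representation of the relationship, whereas the pairs plot gives you a visual representation as points plotted in two-dimensional space.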
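A minimal sketch of the numerical view, assuming pandas is available and using synthetic data with column names adapted from the question. It computes both the covariance matrix and the Pearson correlation matrix, then ranks the candidate features by the absolute value of their correlation with the target.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in data (assumption): values are invented for illustration.
citizens = rng.normal(50_000, 10_000, n)
df = pd.DataFrame({
    "citizens": citizens,
    "residences": citizens / 2.5 + rng.normal(0, 1_000, n),
    "infrastructure_avg_age": rng.normal(40, 8, n),
    "police_officers": citizens / 400 + rng.normal(0, 10, n),
})

cov = df.cov()    # scale-dependent: each entry carries the units of both features
corr = df.corr()  # Pearson correlation: covariance rescaled into [-1, 1]

# Rank candidate features by the strength of their linear relationship to the target.
target = "police_officers"
ranking = corr[target].drop(target).abs().sort_values(ascending=False)
print(ranking)
```

Because covariance entries depend on each feature's units, the correlation matrix is usually the easier one to compare across pairs.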

Option E is incorrect.

Entropy represents the measure of randomness in your features.

This measure would not help you find the correlation between your target feature, police officers, and the potential training features.
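To make this concrete, the sketch below computes Shannon entropy for two hypothetical features using only the Python standard library. The function name and the data are invented for illustration. By construction, one feature tracks the target exactly and the other is independent of it, yet both have the same entropy, because entropy is computed from a single column's value distribution.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical target: cycles through four allocation sizes.
police_officers = [10, 20, 30, 40] * 16

# Tracks the target exactly (1 -> 10, 2 -> 20, ...).
feature_related = [1, 2, 3, 4] * 16

# Independent of the target by construction: every (feature, target)
# combination occurs the same number of times.
feature_unrelated = [1] * 16 + [2] * 16 + [3] * 16 + [4] * 16

entropy_related = shannon_entropy(feature_related)
entropy_unrelated = shannon_entropy(feature_unrelated)
print(entropy_related, entropy_unrelated)  # 2.0 2.0
```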

Reference:

Please see the articles titled "Feature Selection and Dimensionality Reduction Using Covariance Matrix Plot", "Visualizing Data with Pairs Plots in Python", and "What is Entropy and why Information gain matter in Decision Trees?"

For feature selection and dimensionality reduction, visualization techniques can be very useful as they allow you to explore relationships between variables and identify patterns or trends in the data. The following are two visualization techniques that are commonly used for feature selection and dimensionality reduction:

  1. Pairs plot: A pairs plot, also known as a scatterplot matrix, allows you to visualize the relationship between pairs of variables in your dataset. It is especially useful for identifying correlations between variables, which can help you decide which features to include in your model. In a pairs plot, each variable is plotted against every other variable in a scatterplot, and the diagonal of the plot shows a histogram of each variable. This type of visualization can quickly reveal which variables are most closely related to the target variable, and which variables may be redundant.

  2. Covariance matrix: A covariance matrix is a matrix that shows the covariance between pairs of variables in your dataset. Covariance is a measure of how two variables vary together. A positive covariance indicates that the two variables tend to increase or decrease together, while a negative covariance indicates that one variable tends to increase as the other decreases. Because the magnitude of a covariance depends on the units of the variables, the strength of a relationship is easier to compare in the normalized form, the correlation matrix. By examining these values, you can identify which variables are strongly related to each other, and which variables may be redundant or irrelevant to your model.
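For reference, these are the standard sample definitions of covariance and Pearson correlation. The notation is chosen here and does not come from the source: x_i and y_i are the n observed values of two features, with sample means \bar{x} and \bar{y} and sample standard deviations s_X and s_Y.

```latex
\operatorname{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
r_{XY} = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}, \quad -1 \le r_{XY} \le 1
```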

Therefore, the best visualization techniques to use for feature selection and dimensionality reduction in this case are the pairs plot and covariance matrix. These techniques will help you identify the most important and relevant features in your dataset, which can improve the accuracy and efficiency of your model.
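One possible way to turn these observations into a selection step is sketched below. The helper `select_features` and both thresholds are hypothetical choices made for illustration, not a method prescribed by the source, and the data is synthetic. A feature is kept if it correlates with the target and is not nearly redundant with a feature that has already been kept.

```python
import numpy as np
import pandas as pd

def select_features(df, target, min_target_corr=0.3, max_pair_corr=0.9):
    """Hypothetical helper: keep features related to the target, drop near-duplicates."""
    corr = df.corr().abs()
    # Candidates ordered from strongest to weakest correlation with the target.
    candidates = corr[target].drop(target).sort_values(ascending=False)
    selected = []
    for name, strength in candidates.items():
        if strength < min_target_corr:
            continue  # too weakly related to the target
        if any(corr.loc[name, kept] > max_pair_corr for kept in selected):
            continue  # nearly redundant with a feature already kept
        selected.append(name)
    return selected

rng = np.random.default_rng(2)
n = 500

# Synthetic stand-in data (assumption): values are invented for illustration.
citizens = rng.normal(50_000, 10_000, n)
df = pd.DataFrame({
    "citizens": citizens,
    "residences": citizens / 2.5 + rng.normal(0, 1_000, n),
    "infrastructure_avg_age": rng.normal(40, 8, n),
    "police_officers": citizens / 400 + rng.normal(0, 10, n),
})

selected = select_features(df, "police_officers")
print(selected)
```

In this synthetic data, `citizens` and `residences` are nearly duplicates of each other, so only one of them survives, and `infrastructure_avg_age` is dropped because it is unrelated to the target. Correlation only captures linear relationships, so the pairs plot remains a useful check before discarding a feature.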