You work as a machine learning specialist for a real estate company.
You are using the Kaggle housing prices data to experiment with and tune your model before applying it to the real estate data for your area of the country.
You have a hypothesis that you can predict the price of a real estate property based on the foundation type.
You have your data from Kaggle, but you want to make sure your model is not overly influenced by outliers.
What is the quickest way to identify outliers in your data?
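A. Arrange the data points from lowest to highest and calculate the median.
B. Calculate the z-score for your data points.
C. Visualize your data using scatter plots and/or box plots.
D. Visualize your data using network and correlation matrices.
Answer: C.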
Option A is incorrect.
You can find your outliers using a quantitative assessment, but it involves more effort and, therefore, more time than visualizing your data.
Option B is incorrect.
The z-score of a data point shows how many standard deviations the data point is from the mean.
This would help you find your outliers, but it involves more effort and, therefore, more time than visualizing your data.
Option C is correct.
With large datasets, such as the real estate data you are using in this problem, the quickest way to find outliers is to visualize your data.
The best plots for this task are the scatter plot and the box plot.
(See the article titled "How to Make your Machine Learning Models Robust to Outliers".)
Option D is incorrect.
Visualization is the quickest and easiest way to find outliers, but network charts and correlation matrices will not show them.
A network chart represents entities as nodes and their relationships as edges, and a correlation matrix summarizes how strongly pairs of variables move together.
Neither gives you any information about how extreme an individual data point is.
Reference:
Please see the article titled "How to Make your Machine Learning Models Robust to Outliers" and the article titled "A Brief Overview of Outlier Detection Techniques".
The quickest way to identify outliers in your data is to visualize your data using scatter plots and/or box plots (Option C). Outliers are data points that are significantly different from the majority of the data points and can have a significant impact on the statistical analysis and modeling of the data.
Scatter plots can help identify outliers by plotting the data points in a two-dimensional space, where each axis represents a different variable. Outliers can be identified as points that are far away from the cluster of other data points. Scatter plots can also help identify any patterns or relationships between variables.
Box plots provide a visual representation of the distribution of the data. They show the median, quartiles, and any outliers in the data set. By the usual convention, each whisker extends to the most extreme data point that lies within 1.5 times the interquartile range (IQR) of the box, that is, no lower than Q1 - 1.5 * IQR and no higher than Q3 + 1.5 * IQR. Points beyond the whiskers are drawn individually and treated as outliers. Box plots can also help identify any skewness or symmetry in the data distribution.
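The 1.5 * IQR rule that a box plot draws can also be sketched numerically. The following is a minimal illustration using only the Python standard library; the function name `iqr_outliers` and the sample prices are hypothetical, and the "inclusive" quartile method is an assumption, since plotting libraries differ in how they compute quartiles.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sale prices in thousands of dollars; 755 is deliberately extreme.
prices = [120, 135, 140, 150, 155, 160, 170, 185, 755]
flagged = iqr_outliers(prices)
print(flagged)  # [755]
```

Here Q1 is 140 and Q3 is 170, so the fences sit at 95 and 215, and only the 755 value falls outside them. On a box plot of the same data, that value would appear as a lone point beyond the upper whisker.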
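Arranging the data points from lowest to highest and calculating the median (Option A) is not a quick way to identify outliers. The median is barely affected by extreme values, so on its own it says nothing about how far any point lies from the rest of the data. You would still need further steps, such as computing the quartiles and the IQR, before you could flag a point as an outlier.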
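To illustrate how little the median reacts to an extreme value, here is a small sketch with hypothetical prices (in thousands of dollars). It compares the median and the mean before and after one extreme value is added.

```python
import statistics

# Hypothetical sale prices in thousands of dollars.
typical = [120, 135, 140, 150, 155, 160, 170, 185]
with_outlier = typical + [755]  # add one deliberately extreme price

median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)
mean_before = statistics.mean(typical)
mean_after = statistics.mean(with_outlier)

print("median shift:", median_after - median_before)
print("mean shift:", mean_after - mean_before)
```

The median moves by only 2.5 while the mean moves by more than 60, so inspecting the median alone would not reveal that an extreme point is present.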
Calculating the z-score for your data points (Option B) can identify data points that are far from the mean of the data set. The z-score measures how many standard deviations a data point is away from the mean. However, this method requires calculating the mean and standard deviation first and then choosing a cutoff, which takes more steps than a plot. The mean and standard deviation are also pulled toward the outliers themselves, which can hide extreme points in small samples.
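For comparison, the z-score approach can be sketched as follows. This is a minimal illustration with the standard library; the function name `zscore_outliers`, the sample prices, and the cutoffs are assumptions chosen for the example, not values given in the question.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be flagged
    return [v for v in values if abs((v - mean) / std) > threshold]

# Hypothetical sale prices in thousands of dollars; 755 is deliberately extreme.
prices = [120, 135, 140, 150, 155, 160, 170, 185, 755]

print(zscore_outliers(prices, threshold=2.5))  # [755]
print(zscore_outliers(prices))                 # []
```

With a cutoff of 2.5 the extreme price is flagged, but with the common cutoff of 3 nothing is flagged: the extreme value inflates the standard deviation so much that its own z-score is only about 2.8. This sensitivity to the cutoff and sample size is one reason the method takes more care than a quick plot.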
Visualizing your data using network and correlation matrices (Option D) can help identify relationships between variables. However, these charts summarize relationships rather than display individual values, so they do not reveal outliers.
In conclusion, visualizing your data using scatter plots and/or box plots is the quickest and most reliable way to identify outliers in your data set.