You work as a machine learning specialist for a publishing company.
The company has labeled a historical dataset of publication sales data.
Using this labeled data, you need to predict how many copies of a given publication should be produced each month. Which machine learning algorithm type should you use to generate your predictions?
A. Linear regression
B. Principal component analysis (PCA)
C. Random cut forest (RCF)
D. Logistic regression
Answer: A.
Option A is CORRECT.
You are trying to solve a “how many” question, and your data is labeled.
These two factors lead to the choice of linear regression as the best option from those given.
Option B is incorrect.
Principal component analysis is used for dimensionality reduction, not for answering “how many” questions.
It is also an unsupervised algorithm, and because the data is labeled, a supervised algorithm is the appropriate choice.
Option C is incorrect.
Random cut forest is primarily an unsupervised algorithm for detecting anomalous data points within a data set.
Since we have labeled data, we should use a supervised algorithm.
Option D is incorrect.
Logistic regression is used to solve “yes/no” or binary predictions, not “how many” predictions.
Reference:
Please see the Amazon Machine Learning developer guide titled Regression Model Insights.
Please refer to the Amazon SageMaker developer guide titled Random Cut Forest (RCF) Algorithm.
Please refer to the Amazon SageMaker developer guide titled Principal Component Analysis (PCA) Algorithm.
The algorithm type best suited to predicting how many copies of a given publication to produce each month is Linear Regression (Option A).
Linear Regression is a supervised learning algorithm that is used to predict a continuous output value (also known as a dependent variable) based on one or more input features (also known as independent variables). In this case, the input features could be attributes related to the publication, such as the author, genre, price, and marketing budget, while the output variable would be the number of copies sold each month.
Linear Regression works by finding the best-fit line (with several input features, a hyperplane) that describes the relationship between the input features and the output variable. The fit is determined by minimizing the sum of the squared differences between the actual values and the predicted values. Once the fit is found, it can be used to predict the number of copies sold from the input features.
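As a minimal sketch of the least-squares fit described above, the following fits a line to a single input feature using the closed-form solution. The data (a marketing budget and monthly copies sold) and the function names `fit_line` and `predict` are hypothetical, chosen only for illustration. A real model would likely use several features and a library implementation.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x/y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict a continuous value for a new input."""
    return slope * x + intercept


# Hypothetical labeled data: marketing budget (in thousands) and copies sold.
budgets = [1.0, 2.0, 3.0, 4.0, 5.0]
copies = [1200.0, 1900.0, 3100.0, 3900.0, 5100.0]

slope, intercept = fit_line(budgets, copies)
forecast = predict(slope, intercept, 6.0)
print(slope, intercept, forecast)
```

The prediction is an unbounded numeric quantity, which is what a "how many" question calls for.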
Principal Component Analysis (Option B) is an unsupervised learning algorithm that is used for dimensionality reduction, data visualization, and feature extraction. It is not suitable for predicting a continuous output value.
Random Cut Forest (Option C) is an anomaly detection algorithm that is used to detect outliers in a dataset. It is not suitable for predicting a continuous output value.
Logistic Regression (Option D) is a supervised learning algorithm that is used to predict a binary output value (i.e., 0 or 1) based on one or more input features. It is not suitable for predicting a continuous output value.
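To illustrate why logistic regression does not fit a "how many" question, the sketch below passes the same linear score through both kinds of model. The weight, bias, and input values are hypothetical. The linear model returns the score itself, while the logistic model squashes it into a probability between 0 and 1, which is then thresholded into a yes/no label.

```python
import math


def linear_output(weight, bias, x):
    """Linear regression: the prediction is the raw score (any real number)."""
    return weight * x + bias


def logistic_output(weight, bias, x):
    """Logistic regression: the score is mapped to a probability in (0, 1)."""
    score = weight * x + bias
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical parameters and input, for illustration only.
weight, bias, x = 0.8, -2.0, 3.0

quantity = linear_output(weight, bias, x)
probability = logistic_output(weight, bias, x)
label = 1 if probability >= 0.5 else 0  # the 0.5 threshold is a common default
print(quantity, probability, label)
```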