You are a machine learning specialist working for a state government water safety department.
The state needs to monitor water quality across all of its counties to ensure water contamination levels remain within acceptable thresholds.
Your machine learning team is responsible for producing a forecasting report of water contaminants in parts per million for the next month, every month, across the state.
Your team has daily data from the last year available as a starting point for your model. Which SageMaker model will give you the best results for your monthly forecasting report?
A. B. C. D.
Correct Answer: B.
Option A is incorrect.
This problem requires a regression algorithm because we are predicting a continuous, real-valued label: water contaminants in parts per million.
We are not producing a classification of unacceptable versus acceptable.
Also, we should use a single time series for this type of regression, not multiple time series.
Option B is correct.
This option is correct because the problem requires a regression algorithm: we are predicting a continuous, real-valued label, water contaminants in parts per million.
The SageMaker Linear Learner algorithm is one of the go-to algorithms for regression on the Machine Learning exam.
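As a rough illustration of what "a predictor_type of regressor" means in practice, the sketch below builds the hyperparameter setting as a plain Python dictionary. The helper name linear_learner_hyperparameters is hypothetical and the code does not call the SageMaker SDK; it only shows the three predictor_type values Linear Learner accepts and selects the regression one.

```python
# Values the SageMaker Linear Learner algorithm accepts for predictor_type.
LINEAR_LEARNER_PREDICTOR_TYPES = {
    "regressor",
    "binary_classifier",
    "multiclass_classifier",
}


def linear_learner_hyperparameters(predictor_type):
    """Hypothetical helper: validate predictor_type and return a hyperparameter dict."""
    if predictor_type not in LINEAR_LEARNER_PREDICTOR_TYPES:
        raise ValueError(
            f"predictor_type must be one of {sorted(LINEAR_LEARNER_PREDICTOR_TYPES)}, "
            f"got {predictor_type!r}"
        )
    return {"predictor_type": predictor_type}


# Predicting parts per million is a continuous target, so we ask for a regressor.
hyperparameters = linear_learner_hyperparameters("regressor")
print(hyperparameters)
```

In a real training job this dictionary would be passed to the estimator alongside other settings; those are left out here because the question only turns on predictor_type.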
Option C is incorrect.
This option is incorrect because the Random Cut Forest algorithm is primarily used as an unsupervised algorithm for detecting anomalous data points within a data set.
You would not use an RCF algorithm to solve a regression problem.
Option D is incorrect.
This option is incorrect because, while you can use the k-NN algorithm for regression problems, this option specifies a predictor_type of classifier.
This problem requires a regression algorithm.
So you would need to use a predictor_type of regressor.
Also, we should use a single time series for this type of regression, not multiple time series.
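The difference between the two k-NN predictor types can be sketched in a few lines of plain Python. This is not the SageMaker k-NN implementation, only a minimal illustration: the readings and the 3.2 ppm threshold are made up, and the function names are hypothetical. A regressor averages the neighbors' numeric labels, while a classifier takes a majority vote over category labels.

```python
from collections import Counter


def k_nearest_labels(train, query, k):
    """Return the labels of the k training points closest to query (one feature)."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - query))
    return [label for _, label in ranked[:k]]


def knn_regress(train, query, k=3):
    """Regressor behavior: average the numeric labels of the neighbors."""
    labels = k_nearest_labels(train, query, k)
    return sum(labels) / len(labels)


def knn_classify(train, query, k=3):
    """Classifier behavior: majority vote over the neighbors' class labels."""
    labels = k_nearest_labels(train, query, k)
    return Counter(labels).most_common(1)[0][0]


# Made-up readings: (day index, contaminant level in ppm).
ppm_by_day = [(1, 2.0), (2, 2.5), (3, 3.0), (4, 3.5), (5, 4.0)]
# The same days labeled against a hypothetical 3.2 ppm threshold.
class_by_day = [
    (day, "unacceptable" if ppm > 3.2 else "acceptable") for day, ppm in ppm_by_day
]

regression_output = knn_regress(ppm_by_day, query=3, k=3)
classification_output = knn_classify(class_by_day, query=3, k=3)
print(regression_output, classification_output)
```

The regressor returns a ppm value, which is what the monthly report needs; the classifier returns only a category, which is why a predictor_type of classifier does not fit this question.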
References:
The Amazon SageMaker developer guide titled K-Nearest Neighbors (k-NN) Algorithm (https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html)
The Amazon SageMaker developer guide titled Linear Learner Algorithm (https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html)
The Amazon SageMaker developer guide titled Random Cut Forest (RCF) Algorithm (https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html)
The best SageMaker model to use for the monthly forecasting report of water contaminants in parts per million across the state is option B: use a single time series of the full previous year of data as your input to a SageMaker Linear Learner built-in algorithm with a predictor_type of regressor.
Here's why:
Linear Learner built-in algorithm: The Linear Learner algorithm is a popular choice for regression problems because it is fast, efficient, and easy to use. It scales to large datasets and also trains quickly on smaller ones, such as the year of daily data that the machine learning team has available. Linear regression models are also interpretable, meaning you can inspect the learned weights to see which features contribute most to the predictions.
Single time series of the full previous year of data: The problem of forecasting water contaminants in parts per million is a time series problem, which means that the data is ordered by time. In this case, the data is daily data from the last year. Using a single time series of the full previous year of data as input to the model is the most appropriate approach, as it allows the model to learn the patterns and trends in the data over time. Using multiple time series would not make sense in this case, as there is only one variable (water contaminants) that needs to be predicted.
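To make the idea of learning a trend from one year of daily readings concrete, here is a minimal sketch that fits a straight line with ordinary least squares and averages the predictions over the following month. It is plain Python on synthetic data, not the SageMaker Linear Learner itself; the function name fit_line, the generated readings, and the 30-day month are all assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic single time series: 365 daily readings with a steady upward trend.
days = list(range(365))
ppm = [2.0 + 0.01 * day for day in days]

slope, intercept = fit_line(days, ppm)

# Forecast each day of the next month (assumed to be 30 days) and average them.
next_month_days = range(365, 395)
daily_forecast = [intercept + slope * day for day in next_month_days]
monthly_forecast_ppm = sum(daily_forecast) / len(daily_forecast)
print(round(monthly_forecast_ppm, 3))
```

Real contaminant data would carry noise and seasonality that a single trend line cannot capture, so this should be read only as an illustration of regressing a continuous value against time.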
predictor_type of regressor: The predictor_type hyperparameter determines whether the model is trained for classification or regression. Since this is a regression problem (predicting a continuous value, the water contaminant level), a predictor_type of regressor is appropriate. The options that set predictor_type to a classifier, including the k-NN option, are not appropriate for this problem.
Random Cut Forest (RCF) built-in algorithm: The RCF algorithm is a good choice for anomaly detection, which is not the goal of this problem. RCF would be more appropriate if the goal were to detect unusual spikes in water contaminant levels.
In summary, using a single time series of the full previous year of data as input to a SageMaker Linear Learner built-in algorithm with a predictor_type of regressor is the best choice for this problem.