You work as a machine learning specialist for a city that wants to monitor air quality to address air pollution.
You and your team of machine learning specialists need to forecast the city's air quality, in parts per million of contaminants, over the next week, taking into account weather, traffic conditions, and other pollutant contributors.
You are building your model using daily data from the last year as your data source.
Your team has decided to use SageMaker Studio to leverage its collaborative notebooks feature. Which model and SageMaker Studio image will provide the best results for your team in the most efficient manner?
Answer: C.
Option A is incorrect.
The problem of forecasting the air quality is a regression problem best solved by the Linear Learner built-in algorithm, not the k-Nearest Neighbors SageMaker built-in algorithm.
Also, SageMaker Studio does not offer an image named Base TensorFlow; its prebuilt TensorFlow images are listed under different names.
To use the image this option names, you would have to create a custom image.
This would be far less efficient than using an image that is already available with SageMaker Studio.
Option B is incorrect.
The problem of forecasting the air quality is a regression problem best solved by the Linear Learner built-in algorithm, not the Random Cut Forest SageMaker built-in algorithm.
Also, there is no Base R image available in SageMaker Studio.
You would have to create a custom image to use R in SageMaker Studio.
This would be far less efficient than using an image that is already available with SageMaker Studio.
Option C is CORRECT.
The Linear Learner SageMaker built-in algorithm is the best choice from the options given to solve this regression problem.
Also, the SageMaker Studio Base Python [python-3.6] image is a valid choice of an available SageMaker Studio image.
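As a rough illustration of the regression setup in Option C, the sketch below builds a hyperparameter dictionary for Linear Learner. The names predictor_type, feature_dim, mini_batch_size, and epochs are Linear Learner hyperparameters; the helper function, the three-feature layout, and the specific values are assumptions made for this example and do not come from the question. A small mini-batch size is assumed because a year of daily data has only about 365 rows.

```python
def linear_learner_hyperparameters(feature_dim, mini_batch_size=32, epochs=15):
    """Build a hyperparameter dict for a Linear Learner regression job.

    Hypothetical helper: the values are illustrative, not tuned.
    """
    if feature_dim < 1:
        raise ValueError("feature_dim must be at least 1")
    return {
        # "regressor" predicts a continuous value such as ppm of contaminants
        "predictor_type": "regressor",
        # number of input columns, e.g. weather, traffic, other pollutants
        "feature_dim": str(feature_dim),
        # assumed small because the dataset has roughly 365 daily rows
        "mini_batch_size": str(mini_batch_size),
        "epochs": str(epochs),
    }


hyperparameters = linear_learner_hyperparameters(feature_dim=3)
print(hyperparameters)
```

A dictionary like this would typically be handed to the training job configuration; how that is wired up depends on the SDK version your team uses.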
Option D is incorrect.
While the Linear Learner SageMaker built-in algorithm is the best choice from the options given to solve this regression problem, there is no Base Scala image available in SageMaker Studio.
You would have to create a custom image to use Scala in SageMaker Studio.
This would be far less efficient than using an image that is already available with SageMaker Studio.
Reference:
Please see the Amazon SageMaker developer guide page titled Available Amazon SageMaker Images.
Please refer to the Amazon SageMaker developer guide page titled Bring your own SageMaker image.
Choosing the model and SageMaker Studio image for a one-week air quality forecast, built from a year of daily data with weather, traffic conditions, and other pollutant contributors as inputs, depends on the characteristics of the data, the model performance required, and the team's expertise.
Option A suggests using the SageMaker Studio Base TensorFlow [tensorflow-2.3.0] image and the k-Nearest-Neighbors algorithm on the single time series consisting of the full year of data with a predictor_type of regressor. k-NN is simple and interpretable, but it is a weaker fit for this forecasting problem. It is a memory-based algorithm: it keeps the training dataset and compares each new point against it at prediction time, so it scales poorly as the data grows, and it does not model the temporal dependencies that time series forecasting relies on. The TensorFlow image is also a poor match, since TensorFlow is generally used for more complex deep learning models than this problem calls for.
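To make the memory-based behaviour of k-NN concrete, here is a minimal pure-Python sketch of k-NN regression. It is not the SageMaker built-in implementation; the function name, the two feature columns (temperature and a traffic index), and all values are hypothetical. Note that every prediction scans the full stored training set and that the order of the days plays no role.

```python
def knn_regress(train_features, train_targets, query, k=3):
    """Predict a continuous value as the mean target of the k nearest rows."""
    scored = []
    # Every stored training row is compared with the query at prediction time.
    for row, target in zip(train_features, train_targets):
        distance = sum((a - b) ** 2 for a, b in zip(row, query)) ** 0.5
        scored.append((distance, target))
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical daily rows: [temperature in C, traffic index]
train_features = [[10.0, 0.2], [12.0, 0.4], [20.0, 0.9], [22.0, 1.0]]
# Hypothetical contaminant levels in ppm for the same days
train_targets = [0.030, 0.034, 0.060, 0.064]

prediction = knn_regress(train_features, train_targets, [11.0, 0.3], k=2)
print(round(prediction, 3))
```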
Option B suggests using the SageMaker Studio Base R [r-4.0.3] image and the Random Cut Forest algorithm on the single time series consisting of the full year of data. Random Cut Forest is an unsupervised algorithm that assigns anomaly scores to data points. It does not predict a continuous target value, so it is not suited to forecasting air quality in ppm. In addition, R is less widely used than Python among machine learning practitioners, and its tooling for data preprocessing and feature engineering in SageMaker is less convenient.
Option C suggests using the SageMaker Studio Base Python [python-3.6] image and the Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of regressor. Linear Learner is a common choice for regression problems and handles large datasets with high-dimensional features. Python is widely used in the machine learning community and has a rich ecosystem of libraries for data manipulation, visualization, and modeling. One caveat is that a linear model trained on a single time series may not capture the seasonal and temporal patterns of air quality, which may call for more sophisticated models such as ARIMA, LSTM, or Prophet.
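One common way to give a linear regressor some view of temporal patterns, offered here as an assumption rather than something the question prescribes, is to reshape the daily series into lagged features so that each row contains the previous few days' readings and the target is the next day's value. The function name and the sample readings below are hypothetical.

```python
def make_lagged_rows(series, n_lags=7):
    """Turn a daily series into (previous n_lags values, next value) pairs."""
    rows = []
    for t in range(n_lags, len(series)):
        features = series[t - n_lags:t]  # readings from the preceding days
        target = series[t]               # the value to predict
        rows.append((features, target))
    return rows


# Hypothetical daily contaminant readings in ppm
daily_ppm = [0.030, 0.032, 0.035, 0.033, 0.031,
             0.036, 0.038, 0.034, 0.032, 0.037]

rows = make_lagged_rows(daily_ppm, n_lags=3)
for features, target in rows[:2]:
    print(features, "->", target)
```

Weather and traffic columns for the same days could be appended to each feature row in the same way. Whether lagged features are enough, or a dedicated time series model is needed, would have to be checked against held-out data.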
Option D suggests using the SageMaker Studio Base Scala [scala-2.13.3] image and the Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of classifier. Linear Learner can be used for classification tasks, but a classifier is the wrong setting here because the goal is to predict continuous values of air quality in ppm, not discrete categories. In addition, Scala is less commonly used for machine learning, and setting up the environment and implementing the models in it would require more expertise.
Overall, Option C is the most reasonable choice for this problem: it combines Python, a widely used language for machine learning, with Linear Learner, a scalable and efficient algorithm for regression problems. In practice, the final choice would also depend on the size and complexity of the dataset, the required level of accuracy and interpretability, and the team's expertise and preference.