Machine Learning with AWS Glue and Spark ML: IoT Sensor Data Transformation | Exam Prep

Building IoT Sensor Data Transformer Tasks with Spark ML in AWS Glue

Question

You work for a retail athletic footwear company.

Your company has just completed production of a new running shoe with embedded IoT sensors.

These sensors enhance the runner's experience by giving detailed data about foot plant, distance, acceleration, gait, and other data points for use in personal running performance analysis. You are on the machine learning team assigned the task of building a machine learning model that uses the shoe IoT sensor data to predict shoe life expectancy based on user wear and tear.

Instead of just using raw running miles as the predictor of shoe life, your model will use all of the IoT sensor data to produce a much more accurate prediction of the remaining life of the shoes. You are in the process of building your dataset for training your model and running inferences from your model.

You need to clean the IoT sensor data before using it for training and before sending it to your inference endpoint for predictions.

You have decided to use Spark ML jobs within AWS Glue to build your feature transformation code.

Which machine learning packages/engines are the best choices for building your IoT sensor data transformer tasks in the simplest way possible? (Select THREE)

Answers

Explanations


A. MLeap
B. MLlib
C. SparkML Serving Container
D. SparkML Batch Transform
E. MLTransform
F. SparkML MapReduce

Answers: A, B, C.

Option A is correct.

AWS Glue runs your Spark ML feature-processing jobs and serializes the fitted pipelines into the MLeap format.

You then add these MLeap-serialized models to your SageMaker inference pipeline.
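As a rough illustration only, the sketch below shows how a fitted Spark ML pipeline might be serialized to an MLeap bundle from within a Glue PySpark job. It assumes the mleap-pyspark package is available in the job environment, and the sensor column names, values, and output path are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
# This import monkey-patches serializeToBundle() onto fitted Spark ML models
# (assumes mleap-pyspark is installed in the Glue job environment).
from mleap.pyspark.spark_support import SimpleSparkSerializer

spark = SparkSession.builder.appName("shoe-sensor-mleap").getOrCreate()

# Hypothetical IoT sensor readings; column names are illustrative only.
sensor_df = spark.createDataFrame(
    [(3.1, 0.82, 0.61), (5.4, 1.10, 0.70), (2.2, 0.95, 0.66)],
    ["distance", "acceleration", "gait_score"])

# Assemble the numeric sensor columns and standardize them.
assembler = VectorAssembler(
    inputCols=["distance", "acceleration", "gait_score"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")

pipeline_model = Pipeline(stages=[assembler, scaler]).fit(sensor_df)
transformed_df = pipeline_model.transform(sensor_df)

# Serialize the fitted pipeline to an MLeap bundle that the SageMaker
# SparkML Serving Container can later execute in an inference pipeline.
pipeline_model.serializeToBundle(
    "jar:file:/tmp/shoe-feature-pipeline.zip", transformed_df)
```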

Option B is correct.

Apache Spark MLlib is a machine learning library that lets you build machine learning pipeline components to transform your data using the full suite of standard transformers such as tokenizers, OneHotEncoders, normalizers, etc.
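For illustration, here is a minimal PySpark sketch of the kind of MLlib transformer pipeline described above, applied to hypothetical sensor columns; the column names and sample values are assumptions, not part of the scenario.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, Normalizer

spark = SparkSession.builder.appName("sensor-transformers").getOrCreate()

# Hypothetical sensor data: a categorical running surface plus numeric readings.
sensor_df = spark.createDataFrame(
    [("trail", 3.1, 0.82), ("road", 5.4, 1.10), ("track", 2.2, 0.95)],
    ["surface", "distance", "acceleration"])

# Index and one-hot encode the categorical column, assemble features, normalize.
indexer = StringIndexer(inputCol="surface", outputCol="surface_idx")
encoder = OneHotEncoder(inputCols=["surface_idx"], outputCols=["surface_vec"])
assembler = VectorAssembler(
    inputCols=["surface_vec", "distance", "acceleration"], outputCol="raw_features")
normalizer = Normalizer(inputCol="raw_features", outputCol="features", p=2.0)

pipeline_model = Pipeline(stages=[indexer, encoder, assembler, normalizer]).fit(sensor_df)
pipeline_model.transform(sensor_df).select("features").show(truncate=False)
```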

Option C is correct.

The SparkML Serving Container allows you to deploy an Apache Spark ML pipeline in SageMaker.
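As a hedged sketch of that deployment step, the snippet below uses the SageMaker Python SDK's SparkMLModel to host an MLeap-serialized pipeline; the S3 locations, schema, and instance type are illustrative assumptions.

```python
import sagemaker
from sagemaker.sparkml.model import SparkMLModel

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Points at the MLeap bundle produced by the Glue Spark ML job (hypothetical path).
sparkml_model = SparkMLModel(
    model_data="s3://my-bucket/mleap/shoe-feature-pipeline.tar.gz",
    role=role,
    sagemaker_session=session,
    # Schema telling the serving container how incoming CSV fields map to the pipeline.
    env={"SAGEMAKER_SPARKML_SCHEMA": (
        '{"input":[{"name":"distance","type":"double"},'
        '{"name":"acceleration","type":"double"}],'
        '"output":{"name":"features","type":"double","struct":"vector"}}')},
)

# Deploy as a real-time endpoint (or combine with a trained model in an inference pipeline).
predictor = sparkml_model.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```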

Option D is incorrect.

Batch Transform is a feature of SageMaker that allows you to get inferences for an entire dataset.

Batch Transform is not an Apache SparkML feature.
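For context only, a SageMaker Batch Transform job might look like the sketch below, reusing the hypothetical sparkml_model from the previous snippet; the bucket names are assumptions. This illustrates a SageMaker inference feature, not a Spark ML package.

```python
# Create a batch transformer from the hypothetical model defined earlier.
transformer = sparkml_model.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-bucket/batch-output/",
)

# Run inference over an entire dataset stored in S3, rather than per-request.
transformer.transform(
    data="s3://my-bucket/batch-input/sensor-readings.csv",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```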

Option E is incorrect.

There is no Apache SparkML feature called MLTransform.

Option F is incorrect.

There is no Apache SparkML feature called MapReduce.

Reference:

Please see the Amazon SageMaker developer guide titled Feature Processing with Spark ML and Scikit-learn, the MLeap page, the SageMaker SparkML Serving Container GitHub repo, the Apache Spark MLlib overview page, the Apache Spark MLlib docs page titled Extracting, transforming, and selecting features, the Amazon SageMaker developer guide titled Deploy a Model on Amazon SageMaker Hosting Services, and the Amazon SageMaker developer guide titled Get Inferences for an Entire Dataset with Batch Transform.

In this scenario, the retail athletic footwear company has built a running shoe that contains IoT sensors to provide detailed data about foot plant, distance, acceleration, gait, and other data points. The machine learning team has been assigned the task of building a model to predict the shoe life expectancy based on the user's wear and tear of the shoes.

The team has decided to use Spark ML jobs within AWS Glue to build the feature transformation code. The question asks which machine learning packages/engines are the best choices for building the IoT sensor data transformer tasks in the simplest way possible, and the correct answers are A, B, and C: MLeap, MLlib, and the SparkML Serving Container.

Here's a brief explanation of each of the choices:

A. MLeap: MLeap is a serialization format and execution engine for machine learning pipelines that supports Spark as well as other frameworks such as TensorFlow and Scikit-learn. AWS Glue can serialize the fitted Spark ML pipeline into an MLeap bundle that runs at inference time without a Spark cluster, which makes it an excellent choice for building IoT sensor data transformer tasks.

B. MLlib: MLlib is a distributed machine learning library for Spark. It provides a set of machine learning algorithms for classification, regression, clustering, and collaborative filtering, among others. MLlib also includes feature transformers and pipelines, which can be used to transform and preprocess the IoT sensor data.

C. SparkML Serving Container: The SparkML Serving Container lets you deploy an MLeap-serialized Spark ML pipeline behind a SageMaker endpoint or as a container in an inference pipeline, so the same feature transformations built in the Glue job are applied at inference time. This makes it one of the correct choices for the IoT sensor data transformer tasks.

D. SparkML Batch Transform: There is no Spark ML package or engine called Batch Transform. Batch Transform is a SageMaker feature for getting inferences on an entire dataset, so it is not a package you would use to build the transformer tasks.

E. MLTransform: There is no Apache Spark ML package or engine called MLTransform, so it is not a valid choice for building the IoT sensor data transformer tasks.

F. SparkML MapReduce: SparkML MapReduce is not a standard tool or package. MapReduce is a programming model for processing large datasets in parallel, typically associated with Hadoop; Spark uses its own in-memory execution engine, so there is no Spark ML MapReduce package to use here.

In summary, the best choices for building the IoT sensor data transformer tasks in the simplest way possible are MLeap, MLlib, and the SparkML Serving Container. Together they let you build the feature transformations with Spark MLlib in AWS Glue, serialize the fitted pipeline with MLeap, and deploy it with the SparkML Serving Container so the same transformations are applied to the IoT sensor data before training and when serving predictions.