You work as a machine learning specialist for an online flight booking service that finds the lowest cost flights based on user input such as flight dates, origin, destination, number of layovers, and other factors.
Your machine learning team gathers data from many sources, including airline flight databases, credit agencies, etc., to use in your model.
You need to transform this data for your model training and in real-time for your model inference requests.
What is the most efficient way to build these transformations into your model workflow?
A. Use Apache Spark Streaming to build the transformations for both training and inference.
B. Use the SageMaker SparkML Serving container to reuse the data transformers developed for training in your inference requests.
C. Use Apache Spark MLlib to build the transformations for both training and inference.
D. Use a SageMaker lifecycle configuration to build the transformations for both training and inference.
Correct Answer: B
Option A is incorrect.
Apache Spark Streaming supports real-time processing of streaming data, but it does not provide a way to reuse the data transforms you developed for model training in your inference requests.
Option B is correct.
You can use the SageMaker SparkML Serving container, which serves a serialized Spark ML pipeline so that the data transformers developed for model training are reused, unchanged, in your inference requests.
Option C is incorrect.
Apache Spark MLlib is a machine learning library with algorithms designed to scale across clusters for classification, regression, clustering, and collaborative filtering. It cannot be used to reuse the data transforms you developed for model training in your inference requests.
Option D is incorrect.
SageMaker lifecycle configurations are used to install packages or sample notebooks on your notebook instances, configure notebook instance networking and security, or run a shell script to customize notebook instances. They cannot be used to reuse the data transforms you developed for model training in your inference requests.
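The principle behind the correct answer, fitting a transform once on training data and then applying that exact fitted transform at inference time, can be illustrated with a minimal, self-contained sketch. This is a conceptual analogy only: the `MinMaxScaler` class and pickle round-trip below are illustrative stand-ins, not the SparkML Serving mechanism itself (which serializes a Spark ML pipeline rather than a Python object).

```python
import pickle

class MinMaxScaler:
    """Illustrative transformer: fit on training data, reuse at inference."""
    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.hi - self.lo) or 1.0
        return [(v - self.lo) / span for v in values]

# --- Training time: fit the transform on the training data ---
train_prices = [120.0, 480.0, 300.0]
scaler = MinMaxScaler().fit(train_prices)
blob = pickle.dumps(scaler)          # serialize alongside the model artifacts

# --- Inference time: reload the *same* fitted transform ---
serving_scaler = pickle.loads(blob)
features = serving_scaler.transform([300.0])
print(features)                      # scaled with the training-time min/max: [0.5]
```

Reloading the fitted transform (rather than re-fitting on inference data) is what keeps training and serving consistent; re-fitting at inference would compute different min/max statistics and skew the features.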
References:
Please see the Amazon Machine Learning Lens AWS Well-Architected Framework titled Feature Engineering (https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/feature-engineering.html),
The Amazon SageMaker developer guide titled Deploy an Inference Pipeline (https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html),
The Amazon SageMaker developer guide titled Use SparkML Serving with Amazon SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/sparkml-serving.html),
The Amazon SageMaker developer guide titled Customize a Notebook Instance Using a Lifecycle Configuration Script (https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html),
The Apache Spark page titled Streaming (https://spark.apache.org/streaming/), and
The Apache Spark page titled MLlib (https://spark.apache.org/mllib/).
The most efficient way to build data transformations into a machine learning workflow that serves both training and real-time inference is to reuse the same fitted transforms in both places rather than reimplementing them for each stage. In this scenario, that means using the SageMaker SparkML Serving container.
SageMaker is a fully managed Amazon Web Services (AWS) offering that provides developers and data scientists with tools for building, training, and deploying machine learning models. With SparkML Serving, the Spark ML pipeline of data transformations you develop for training is serialized (using MLeap) and deployed behind a SageMaker endpoint, typically as the first container in a SageMaker inference pipeline, so every real-time request passes through exactly the same transformations that were applied to the training data.
The advantages of this approach include:
Consistency: applying the identical fitted transforms at training and inference time avoids training/serving skew, a common source of silent model degradation.
Efficiency: the transforms are written and validated once, then reused, reducing the time and cost of building and maintaining the workflow.
Scalability: SageMaker provides managed, scalable infrastructure for hosting the pipeline, so the endpoint can handle high volumes of real-time inference requests.
By contrast, Apache Spark Streaming and Apache Spark MLlib are part of the open-source Apache Spark ecosystem, a distributed computing framework for large-scale data processing. They can process streaming data and train models at scale, but they provide no managed mechanism for reusing your training-time transforms inside SageMaker inference requests. SageMaker lifecycle configurations only run scripts to customize notebook instances and play no role in the inference path.
In conclusion, the most efficient way to build these transformations into the model workflow is to use the SageMaker SparkML Serving container, typically within a SageMaker inference pipeline, to reuse the data transforms developed during training in your inference requests.
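The inference-pipeline idea, containers running in sequence with each stage's output becoming the next stage's input, can be sketched as plain functions. The stage names and the toy feature/weight values below are hypothetical illustrations, not real SageMaker APIs or a real pricing model.

```python
# Sketch of an inference pipeline: stages run in order, and each stage's
# output is the next stage's input. All names and numbers are illustrative.

def sparkml_transform(request):
    """Stand-in for the SparkML Serving container: featurize raw input."""
    layover_weight = {0: 0.0, 1: 0.4, 2: 0.8}
    return [request["miles"] / 1000.0, layover_weight[request["layovers"]]]

def price_model(features):
    """Stand-in for the trained model container: score the feature vector."""
    weights = [250.0, 90.0]
    return 80.0 + sum(w * f for w, f in zip(weights, features))

def inference_pipeline(request, stages):
    payload = request
    for stage in stages:
        payload = stage(payload)   # each stage consumes the previous output
    return payload

predicted = inference_pipeline(
    {"miles": 2400, "layovers": 1},
    stages=[sparkml_transform, price_model],
)
print(round(predicted, 2))
```

Because the transform stage is the same code path for every request, the features reaching the model at inference time match the features the model saw during training, which is exactly the guarantee the SparkML Serving container provides in a managed way.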