Streaming Data Ingestion for SageMaker Feature Store | AWS ML Specialty Exam Prep

Approaches for Streaming Data Ingestion into SageMaker Feature Store

Question

You work as a machine learning specialist for a research department at a large university.

Your team of machine learning specialists is responsible for all aspects of the machine learning lifecycle, including creating the data repositories used by your research scientists for their data science work.

Your team has built a SageMaker infrastructure for your data scientists where you stream in data from many sources, such as satellite feeds, IoT devices like underwater sensors, and many others.

You have recently implemented SageMaker Feature Store, and you are now implementing the ingestion of data from your streaming data sources.

Which of the following options are viable approaches to streaming data into your SageMaker Feature Store? (Select TWO)

Answers

A. Stream your data sources through Apache Kafka into Feature Store
B. Stream your data sources through Kinesis Data Analytics and a Lambda function into Feature Store
C. Stream your data sources through Apache Spark Streaming into Feature Store
D. Stream your data sources through Apache Spark ML Serving into Feature Store
E. Stream your data sources through Apache Flink into Feature Store

Explanations

Correct Answers: A and B.

Option A is correct.

Apache Kafka can be used as a streaming data source where features are directly fed to the online feature store for feature creation.

Option B is correct.

Kinesis Data Analytics, together with a Lambda function, can be used as a streaming data source whose features are fed directly to the online feature store for feature creation.

Option C is incorrect.

Apache Spark Streaming is not supported as a direct streaming feed into SageMaker Feature Store.

Option D is incorrect.

Apache Spark ML Serving is not supported as a direct streaming feed into SageMaker Feature Store.

Option E is incorrect.

Apache Flink is not supported as a direct streaming feed into SageMaker Feature Store.

References:

Please see the Amazon SageMaker developer guide titled Data Sources and Ingestion (https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-ingest-data.html) and the AWS Machine Learning blog titled Using streaming ingestion with Amazon SageMaker Feature Store to make ML-backed decisions in near-real time (https://aws.amazon.com/blogs/machine-learning/using-streaming-ingestion-with-amazon-sagemaker-feature-store-to-make-ml-backed-decisions-in-near-real-time/).

SageMaker Feature Store is a fully managed feature store service that allows data scientists and developers to securely store, update, retrieve, and share machine learning features. It helps teams build and scale machine learning models faster by letting them reuse curated features consistently across training and inference, instead of duplicating feature engineering pipelines for every model.
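Streaming ingestion writes into the online store of an existing feature group, so a feature group must be defined first. As a minimal sketch (the bucket, feature group, and column names below are hypothetical, and the role lookup assumes the code runs inside SageMaker), a feature group might be created with the SageMaker Python SDK like this:

import time

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes execution inside SageMaker

# Hypothetical schema: one row per sensor reading.
df = pd.DataFrame(
    {
        "sensor_id": ["sensor-001"],
        "temperature": [11.4],
        "salinity": [35.1],
        "event_time": [time.time()],
    }
)
df["sensor_id"] = df["sensor_id"].astype("string")  # so the SDK infers a String feature type

feature_group = FeatureGroup(name="ocean-sensor-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=df)  # infer feature definitions from the DataFrame

feature_group.create(
    s3_uri="s3://my-bucket/feature-store-offline",  # offline store location (assumed bucket)
    record_identifier_name="sensor_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,  # streaming ingestion targets the online store
)

With enable_online_store=True, records written by the streaming approaches described below become available for low-latency reads, while SageMaker also persists them to the offline store in Amazon S3.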

There are multiple viable approaches for streaming data from these sources into SageMaker Feature Store. The two correct options are:

A. Stream your data sources through Apache Kafka into Feature Store: Apache Kafka is a distributed streaming platform used to publish and subscribe to streams of records. Kafka provides a scalable, fault-tolerant way to stream data, making it a popular choice for moving large volumes of streaming data. In this approach, your producers publish messages to Kafka topics, and a consumer application reads those records and writes each one to the online store using the Feature Store PutRecord API, available through the SageMaker Python SDK or Boto3, as sketched below.
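A minimal sketch of such a consumer, assuming a hypothetical topic name, broker endpoint, feature group name, and JSON message format; the only Feature Store call it relies on is PutRecord on the sagemaker-featurestore-runtime client:

import json

import boto3
from kafka import KafkaConsumer  # pip install kafka-python

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

consumer = KafkaConsumer(
    "sensor-readings",                   # assumed topic name
    bootstrap_servers=["broker1:9092"],  # assumed broker endpoint (e.g. Amazon MSK)
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    reading = message.value  # e.g. {"sensor_id": "...", "temperature": 11.4, "event_time": 1700000000.0}
    # Write the record to the online store of the (assumed) feature group.
    featurestore_runtime.put_record(
        FeatureGroupName="ocean-sensor-features",
        Record=[
            {"FeatureName": name, "ValueAsString": str(value)}
            for name, value in reading.items()
        ],
    )

The consumer could run anywhere that can reach both the Kafka cluster and the SageMaker Feature Store runtime endpoint, for example on a container or an AWS Lambda function triggered by the Kafka topic.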

B. Stream your data sources through Kinesis Data Analytics and a Lambda function into Feature Store: Amazon Kinesis Data Analytics is a fully managed service for analyzing real-time streaming data with SQL or Apache Flink. Its integration with AWS Lambda lets you run code on the processed stream: the Kinesis Data Analytics application delivers its output to a Lambda function, which reads each record and writes it to the online store with PutRecord, as sketched below.
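A minimal sketch of such a Lambda function, assuming the standard Kinesis Data Analytics output-delivery event shape and a hypothetical feature group name:

import base64
import json

import boto3

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")
FEATURE_GROUP_NAME = "ocean-sensor-features"  # assumed feature group name


def lambda_handler(event, context):
    results = []
    for record in event["records"]:
        # Each Kinesis Data Analytics output record carries a base64-encoded payload.
        payload = json.loads(base64.b64decode(record["data"]))
        featurestore_runtime.put_record(
            FeatureGroupName=FEATURE_GROUP_NAME,
            Record=[
                {"FeatureName": name, "ValueAsString": str(value)}
                for name, value in payload.items()
            ],
        )
        # Report per-record delivery status back to Kinesis Data Analytics.
        results.append({"recordId": record["recordId"], "result": "Ok"})
    return {"records": results}

Returning a result of Ok for each recordId tells Kinesis Data Analytics that the record was delivered; records reported as failed are retried.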

C, D, and E are not viable approaches to streaming data into SageMaker Feature Store for the following reasons:

C. Stream your data sources through Apache Spark Streaming into Feature Store: Apache Spark Streaming is a scalable, fault-tolerant stream processing system built on top of Apache Spark. While Spark Streaming is a viable choice for processing large amounts of streaming data, it is not supported as a direct streaming feed into SageMaker Feature Store.

D. Stream your data sources through Apache Spark ML Serving into Feature Store: Apache Spark ML Serving is a machine learning model serving approach built on top of Apache Spark. While it is a viable choice for serving machine learning models, it is not supported as a direct streaming feed into SageMaker Feature Store.

E. Stream your data sources through Apache Flink into Feature Store: Apache Flink is a distributed stream processing framework for processing large amounts of data in real time. While Flink is a viable choice for stream processing, and is available as the processing engine inside Kinesis Data Analytics (option B), Flink on its own is not supported as a direct streaming feed into SageMaker Feature Store.

In summary, the two viable approaches to streaming data into SageMaker Feature Store are streaming data through Apache Kafka and streaming data through Kinesis Data Analytics and a Lambda function.