Efficient Feature Engineering for Real-Time Streaming Data in Machine Learning

Question

You work as a machine learning specialist for a software company that offers a real-time interactive sports viewing app for mobile phones and tablets.

You gather real-time streaming sports statistics and game action data and use the streaming data to produce real-time analytics and active predictions of the likely outcome of the game.

To produce your prediction, you need to use several machine learning models that use the real-time streaming data as their training and inference data sources.

Since the real-time streaming game data is delivered from several different sources, the format and schema of the data need transformation and sanitation.

Which option is the most efficient way to perform the feature engineering of your real-time streaming data for use in your training and inference requests?

Answers

Explanations

A. B. C. D.

Correct Answer: C.

Option A is incorrect.

You can ingest your streaming data using Kinesis Data Firehose and use the kinesis-firehose-process-record Lambda blueprint for transformation.

However, this option streams the output of your Kinesis Data Firehose only into the offline feature store FeatureGroup. You need to stream it into both the offline and online feature store FeatureGroups, since you wish to train using your offline Feature Store FeatureGroup and produce real-time inferences using your online Feature Store FeatureGroup.

Option B is incorrect.

You can ingest your streaming data using Kinesis Data Firehose, but the kinesis-process-record Lambda blueprint is written for Kinesis Data Streams events; the blueprint intended for Firehose data transformation is kinesis-firehose-process-record.

This option also streams the output of your Kinesis Data Firehose only into the offline feature store FeatureGroup. You need both the offline and online feature store FeatureGroups, since you wish to train using your offline Feature Store FeatureGroup and produce real-time inferences using your online Feature Store FeatureGroup.

Option C is correct.

You can ingest your streaming data using Kinesis Data Firehose and use the kinesis-firehose-process-record Lambda blueprint for transformation.

You will also want to stream the output of your Kinesis Data Firehose into both the offline and online feature store FeatureGroups since you wish to train using your offline Feature Store groups and produce real-time inferences using your online Feature Store FeatureGroup.
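As a rough illustration of the transformation step, here is a minimal sketch of a Lambda handler that follows the record contract used by Firehose data transformation: each incoming record carries a recordId and base64-encoded data, and each returned record carries the same recordId, a result of Ok, Dropped, or ProcessingFailed, and the re-encoded data. The transform function and the assumption that payloads are JSON objects are hypothetical; this is not the blueprint's actual code.

```python
import base64
import json


def transform(payload: dict) -> dict:
    """Hypothetical sanitation step: trim strings and lowercase the keys."""
    return {
        key.strip().lower(): value.strip() if isinstance(value, str) else value
        for key, value in payload.items()
    }


def lambda_handler(event, context):
    """Transform each Firehose record and report a per-record result."""
    output = []
    for record in event["records"]:
        try:
            # Firehose delivers each payload base64-encoded.
            payload = json.loads(base64.b64decode(record["data"]))
            cleaned = transform(payload)
            line = json.dumps(cleaned) + "\n"
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": data}
            )
        except (ValueError, AttributeError):
            # Malformed input: hand back the original data and flag the failure.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

Every input record must appear exactly once in the response with its original recordId, which is why failures are flagged rather than silently skipped.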

Option D is incorrect.

You can ingest your streaming data using Kafka, but the kinesis-process-record Lambda blueprint is written for Kinesis Data Streams events, not for Kafka or Firehose records.

This option also streams the transformed output only into the online feature store FeatureGroup. You need both the offline and online feature store FeatureGroups, since you wish to train using your offline Feature Store FeatureGroup and produce real-time inferences using your online Feature Store FeatureGroup.

References:

The AWS Machine Learning blog titled Understanding the key capabilities of Amazon SageMaker Feature Store (https://aws.amazon.com/blogs/machine-learning/understanding-the-key-capabilities-of-amazon-sagemaker-feature-store/)

The Amazon SageMaker page titled Amazon SageMaker Feature Store (https://aws.amazon.com/sagemaker/feature-store/)

The Amazon SageMaker developer guide titled Create, Store, and Share Features with Amazon SageMaker Feature Store (https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html)

The Amazon SageMaker developer guide titled Create Feature Groups (https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-create-feature-group.html)

The Amazon SageMaker Examples page titled Fraud Detection with Amazon SageMaker FeatureStore (https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/sagemaker_featurestore_fraud_detection_python_sdk.html)

The Amazon Kinesis Data Firehose developer guide titled Amazon Kinesis Data Firehose Data Transformation (https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html)

To perform feature engineering of the real-time streaming data, we need to transform and sanitize the data to be used in machine learning models. AWS offers several services to ingest, process, and store real-time streaming data.
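To make "transform and sanitize" concrete, the following is a small sketch of how records from two differently shaped feeds could be mapped onto one shared schema. The feed names (feed_a, feed_b), all field names, and the rule that naive timestamps are treated as UTC are assumptions made for illustration; the scenario does not specify the actual schemas.

```python
from datetime import datetime, timezone

# Hypothetical mapping from each feed's field names to one shared schema.
FIELD_MAPS = {
    "feed_a": {
        "gameId": "game_id",
        "homeScore": "home_score",
        "awayScore": "away_score",
        "ts": "event_time",
    },
    "feed_b": {
        "match": "game_id",
        "score_home": "home_score",
        "score_away": "away_score",
        "timestamp": "event_time",
    },
}


def to_epoch_seconds(value) -> float:
    """Accept epoch seconds or an ISO-8601 string; return epoch seconds."""
    if isinstance(value, (int, float)):
        return float(value)
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def normalize(source: str, raw: dict) -> dict:
    """Rename fields to the shared schema and coerce their types."""
    mapping = FIELD_MAPS[source]
    record = {target: raw[original] for original, target in mapping.items()}
    record["game_id"] = str(record["game_id"]).strip()
    record["home_score"] = int(record["home_score"])
    record["away_score"] = int(record["away_score"])
    record["event_time"] = to_epoch_seconds(record["event_time"])
    return record
```

After normalization, records from either feed have identical keys and types, so the same feature definitions can be applied to both.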

In this scenario, we have real-time streaming sports statistics and game action data from several sources, and we need to use this data to produce real-time analytics and active predictions of the likely outcome of the game. We also need to use several machine learning models that use the real-time streaming data as their training and inference data sources.

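To efficiently perform feature engineering of the real-time streaming data, we can use Amazon Kinesis Data Firehose, a fully managed service that can capture, transform, and load streaming data into destinations such as Amazon S3, Amazon Redshift, and Amazon OpenSearch Service (formerly Amazon Elasticsearch Service). SageMaker is not a native Firehose destination; one common pattern is for the transformation Lambda function to write the engineered records into SageMaker Feature Store.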

Option A: In this option, we ingest the real-time streaming data using Kinesis Data Firehose and use the kinesis-firehose-process-record Lambda blueprint for transformation. This blueprint lets us process and transform the data record by record. The output of the Kinesis Data Firehose is streamed only into a SageMaker offline feature store FeatureGroup. The offline store is a managed, scalable, and durable repository of curated feature sets, suited to model training and batch inference, but it does not provide the low-latency lookups that real-time inference needs.

Option B: In this option, we also ingest the real-time streaming data using Kinesis Data Firehose, but use the kinesis-process-record Lambda blueprint for transformation. The output of the Kinesis Data Firehose is again streamed only into a SageMaker offline feature store FeatureGroup. This option is similar to option A, except that it uses a different Lambda blueprint, one written for Kinesis Data Streams events rather than for Firehose data transformation.

Option C: In this option, we again use Kinesis Data Firehose with the kinesis-firehose-process-record Lambda blueprint for transformation. However, the output of the Kinesis Data Firehose is streamed into both SageMaker offline and online feature store FeatureGroups. The online store is a low-latency store that serves the latest feature values for real-time inference, while the offline store keeps the historical records used for training. This option therefore covers both the training and the real-time inference requirements.
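To show what "both offline and online" could look like in configuration, here is a hedged sketch that builds the arguments for the SageMaker CreateFeatureGroup and Feature Store PutRecord calls as plain dictionaries. The feature names are hypothetical, the bucket URI and role ARN are supplied by the caller, and the boto3 calls appear only in comments because they require AWS credentials.

```python
def build_feature_group_request(name: str, s3_uri: str, role_arn: str) -> dict:
    """Build keyword arguments for a FeatureGroup with both stores enabled."""
    return {
        "FeatureGroupName": name,
        "RecordIdentifierFeatureName": "game_id",
        "EventTimeFeatureName": "event_time",
        # Hypothetical features for the sports scenario.
        "FeatureDefinitions": [
            {"FeatureName": "game_id", "FeatureType": "String"},
            {"FeatureName": "event_time", "FeatureType": "Fractional"},
            {"FeatureName": "home_score", "FeatureType": "Integral"},
            {"FeatureName": "away_score", "FeatureType": "Integral"},
        ],
        # Online store: low-latency reads for real-time inference.
        "OnlineStoreConfig": {"EnableOnlineStore": True},
        # Offline store: historical records in S3 for training.
        "OfflineStoreConfig": {"S3StorageConfig": {"S3Uri": s3_uri}},
        "RoleArn": role_arn,
    }


def to_put_record(features: dict) -> list:
    """Convert a flat feature dict into the Record list used by PutRecord."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in features.items()
    ]


# Usage with boto3 (not executed here; needs AWS credentials):
#   import boto3
#   request = build_feature_group_request(name, s3_uri, role_arn)
#   boto3.client("sagemaker").create_feature_group(**request)
#   boto3.client("sagemaker-featurestore-runtime").put_record(
#       FeatureGroupName=name, Record=to_put_record(features))
```

With both stores enabled on one FeatureGroup, a single PutRecord call updates the online store and the same record is later made available in the offline store, so the ingestion path does not need two separate writes.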

Option D: In this option, we ingest the real-time streaming data using Kafka, a distributed streaming platform, and use the kinesis-process-record Lambda blueprint for transformation. The transformed output is streamed into a SageMaker online feature store FeatureGroup. This option uses Kafka instead of Kinesis Data Firehose for data ingestion and provides only an online feature store, so there is no offline store to train from.

Based on the requirements of the scenario, the most efficient way to perform feature engineering of the real-time streaming data is Option C. It uses Kinesis Data Firehose with the kinesis-firehose-process-record Lambda blueprint for transformation and streams the output into both an offline feature store for model training and an online feature store for real-time inference.