Amazon BDS-C00: AWS Certified Big Data - Specialty Exam: Understanding Aggregation Queries with Amazon Kinesis Data Analytics

Capturing Aggregates Based on Each Client for Every 15 Minutes using Amazon Kinesis Data Analytics

Question

HikeHills.com (HH) is an online specialty retailer that sells clothing and outdoor recreation gear for trekking, camping, road biking, mountain biking, rock climbing, ice mountaineering, skiing, avalanche protection, snowboarding, fly fishing, kayaking, rafting, road and trail running, and more. HH runs its entire online infrastructure on multiple Java-based web applications and other web framework applications on AWS.

HH captures clickstream data and uses a custom-built recommendation engine to recommend products, which improves sales and helps HH understand customer preferences. HH already uses Amazon Kinesis Data Streams (KDS) to collect events and transaction logs and to process the stream.

Multiple departments at HH use different streams for real-time integration and to embed analytics in their applications, with Kinesis as the backbone of real-time data integration across the enterprise. HH uses a VPC to host all its applications and is looking at integrating Kinesis into its web application.

To understand network flow behavior in 15-minute intervals, HH wants to aggregate data from the VPC Flow Logs for analytics.

VPC Flow Logs have a capture window of approximately 10 minutes.

What kind of query can be used to capture aggregates for each client every 15 minutes using Amazon Kinesis Data Analytics?

Select 1 option.

Answers

Explanations

A. Stagger windows query
B. Tumbling windows query
C. Sliding windows query
D. Continuous query

Answer: A.

Option A is correct. Stagger windows query: a query that aggregates data using keyed time-based windows that open as data arrives.

The keys allow for multiple overlapping windows.

This is the recommended way to aggregate data using time-based windows, because stagger windows reduce late or out-of-order data compared to tumbling windows.

VPC Flow Logs have a capture window of approximately 10 minutes.

But they can have a capture window of up to 15 minutes if you're aggregating data on the client.

Stagger windows are ideal for aggregating these logs for analysis.

https://docs.aws.amazon.com/kinesisanalytics/latest/dev/stagger-window-concepts.html

Option B is incorrect. Tumbling windows query: a query that aggregates data using distinct time-based windows that open and close at regular intervals.

https://docs.aws.amazon.com/kinesisanalytics/latest/dev/tumbling-window-concepts.html

Option C is incorrect. Sliding windows query: a query that aggregates data continuously, using a fixed time or rowcount interval.

https://docs.aws.amazon.com/kinesisanalytics/latest/dev/sliding-window-concepts.html

Option D is incorrect. Continuous query: a query over a stream that executes continuously over streaming data.

This continuous execution enables scenarios, such as the ability for applications to continuously query a stream and generate alerts.

https://docs.aws.amazon.com/kinesisanalytics/latest/dev/continuous-queries-concepts.html

The scenario described in the question involves capturing and analyzing VPC logs to understand network flow behavior. To achieve this, HH plans to use Amazon Kinesis Data Analytics. The requirement is to aggregate data based on each client for every 15 minutes.

To perform the desired aggregation, we need a windowed query in Kinesis Data Analytics. The answer options cover three window types (stagger, tumbling, and sliding windows) plus the general notion of a continuous query.

Stagger Windows queries: Stagger windows are keyed, time-based windows. A window opens when the first event matching the partition key arrives, not on a fixed clock boundary, and it closes once the specified age has passed, measured from the time the window opened. Because each key gets its own window, windows for different keys can overlap. For example, with a 15-minute stagger window keyed on the client, a window for a given client opens at that client's first record and closes 15 minutes later. Stagger windows are useful when related records arrive at inconsistent times, because late or out-of-order records are less likely to be split across two windows.
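The keyed behavior of stagger windows can be illustrated with a small Python simulation. This is only a sketch of the idea, not Kinesis Data Analytics SQL: the function name `stagger_windows`, the `(key, timestamp)` event shape, and the sample data are all assumptions made for illustration, and timestamps are plain seconds.

```python
from collections import defaultdict

WINDOW_SECONDS = 15 * 60  # window age, measured from when the window opened


def stagger_windows(events, window_seconds=WINDOW_SECONDS):
    """Count events per (key, window_start) using keyed windows.

    A window for a key opens at the first event seen for that key and
    accepts events until window_seconds have passed since it opened.
    Events are assumed to be sorted by timestamp.
    """
    open_window = {}           # key -> start time of its current window
    counts = defaultdict(int)  # (key, window_start) -> event count
    for key, ts in events:
        start = open_window.get(key)
        if start is None or ts >= start + window_seconds:
            start = ts         # this event opens a new window for the key
            open_window[key] = start
        counts[(key, start)] += 1
    return dict(counts)


# Hypothetical sample: two clients whose first records arrive at different times.
events = [
    ("client-a", 0),
    ("client-b", 200),
    ("client-a", 600),
    ("client-b", 1000),
    ("client-a", 1000),
]
stagger_counts = stagger_windows(events)
print(stagger_counts)
```

In this sample, the window for `client-b` starts at second 200 and therefore still covers its record at second 1000, while `client-a` opens a second window at second 1000 because its first window (opened at second 0) has already aged out.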

Tumbling Windows queries: Tumbling Windows divide the input stream into non-overlapping windows of a fixed duration. For example, if we use a tumbling window of 15 minutes, the first window will contain events from the first minute to the 15th minute, the second window will contain events from the 16th minute to the 30th minute, and so on. Tumbling Windows are useful when we want to analyze events in fixed time intervals and do not want to maintain overlap between the windows.
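The fixed, non-overlapping boundaries of tumbling windows can be sketched in Python as follows. The function name `tumbling_windows`, the `(key, timestamp)` event shape, and the sample data are assumptions for illustration only; timestamps are plain seconds, and this is not Kinesis Data Analytics SQL.

```python
from collections import defaultdict


def tumbling_windows(events, window_seconds=15 * 60):
    """Count events per (key, window_start) using fixed clock-aligned windows."""
    counts = defaultdict(int)
    for key, ts in events:
        # Every timestamp maps to exactly one window: [start, start + window_seconds)
        window_start = (ts // window_seconds) * window_seconds
        counts[(key, window_start)] += 1
    return dict(counts)


# Hypothetical sample: seconds 0 and 899 share a window, second 900 starts the next one.
tumbling_counts = tumbling_windows(
    [("client-a", 0), ("client-a", 899), ("client-a", 900), ("client-b", 1000)]
)
print(tumbling_counts)
```

The window boundaries depend only on the clock, never on when a particular client's records start arriving.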

Sliding Windows queries: Sliding windows aggregate data continuously over a fixed time or row-count interval that moves forward as new records arrive. For example, with a 15-minute sliding window, each incoming record produces an aggregate over the records from the preceding 15 minutes, so consecutive windows overlap heavily. Sliding windows are useful for rolling metrics such as moving averages, but they do not produce a single result per client for each 15-minute period.
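As a rough illustration of per-record sliding behavior, where each new record yields an aggregate over the trailing interval, consider the Python sketch below. The function name `sliding_counts`, the event shape, and the choice to treat the window as the half-open interval `(ts - window, ts]` are assumptions for illustration, not a statement of the service's exact boundary semantics.

```python
from collections import defaultdict, deque


def sliding_counts(events, window_seconds=15 * 60):
    """For each event, emit the number of events for the same key
    in the trailing window (ts - window_seconds, ts].

    Events are assumed to be sorted by timestamp.
    """
    recent = defaultdict(deque)  # key -> timestamps still inside the window
    output = []
    for key, ts in events:
        timestamps = recent[key]
        timestamps.append(ts)
        # Drop timestamps that have fallen out of the trailing window.
        while timestamps[0] <= ts - window_seconds:
            timestamps.popleft()
        output.append((key, ts, len(timestamps)))
    return output


# Hypothetical sample: one output row per input row.
rolling = sliding_counts(
    [("client-a", 0), ("client-a", 600), ("client-a", 1000), ("client-a", 1500)]
)
print(rolling)
```

Note that the output has one row per input record, which is why this style suits rolling metrics more than periodic per-client summaries.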

Continuous queries: A continuous query executes continuously over streaming data as it arrives, which enables scenarios such as generating alerts in real time. On its own it does not define a window for grouping records, so it does not describe how to aggregate per client every 15 minutes.

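In the given scenario, we need to aggregate data for each client every 15 minutes, and the VPC Flow Logs records arrive at inconsistent times because of their capture window. A 15-minute tumbling window would produce non-overlapping results on fixed clock boundaries, but records belonging to the same client could be split across two windows when they arrive late or out of order. A 15-minute stagger window keyed on the client opens when that client's first record arrives and keeps the related records together, which is why a Stagger Windows query (option A) is the most appropriate choice.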
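To see how the choice of window affects records that straddle a clock boundary, the following Python sketch runs the same burst of records through a simplified tumbling window and a simplified keyed stagger window. Both helper functions, the event shape, and the sample timestamps (in seconds) are hypothetical and exist only to illustrate the difference.

```python
from collections import defaultdict

WINDOW = 15 * 60  # 900 seconds


def tumbling(events, window=WINDOW):
    """Count events per (key, clock-aligned window start)."""
    counts = defaultdict(int)
    for key, ts in events:
        counts[(key, (ts // window) * window)] += 1
    return dict(counts)


def stagger(events, window=WINDOW):
    """Count events per (key, window start), where a key's window opens
    at its first event. Events are assumed sorted by timestamp."""
    starts, counts = {}, defaultdict(int)
    for key, ts in events:
        if key not in starts or ts >= starts[key] + window:
            starts[key] = ts
        counts[(key, starts[key])] += 1
    return dict(counts)


# A burst of four related records that straddles the 900-second boundary.
burst = [("client-a", 880), ("client-a", 890), ("client-a", 910), ("client-a", 920)]

tumbling_result = tumbling(burst)
stagger_result = stagger(burst)
print(tumbling_result)
print(stagger_result)
```

In this sample the tumbling window splits the burst into two results of two records each, while the stagger window reports all four records in a single window that opened at second 880.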