Gluebush.com Architecture for Log and Event Data Collection, Processing, and Analytics

AWS Certified Big Data - Specialty Exam - BDS-C00

Question

Gluebush.com is a British online classified advertisement and public website.

Classified ads are either free or paid for, depending on the product category and the geographical market. While the largest category of advertisements on Gluebush.com is "goods for sale", the site also supports around 100,000 motors listings across the UK at any one time. It has an extensive social media presence on Twitter and Facebook, with 22,000 and 471,000 followers, respectively.

Gluebush.com uses social media for communications and information about the brand, as well as for competitions and campaigns. Gluebush.com runs multiple business applications, both web and mobile based, on AWS.

Gluebush.com wants to collect log and event data from web servers and mobile devices, pre-process the data, process it to feed live dashboards, and load it into a data warehouse built on Redshift and into S3 for long-term storage.

The DWH processes the data for further analytics.

Gluebush.com also wants to extend its capabilities (search, document management, integration into a Data Lake built on EMR, etc.) using the same stream, without impacting the performance of the three purposes mentioned above.

Please advise the key artifacts of the end-to-end architecture.

Select 1 option.

Answers

Explanations


A. B. C. D. E.

Answer: C.

Option A is incorrect - the Kinesis Producer Library (KPL) cannot be used to read log files.

We need Kinesis Agent to collect the information from the log files.

Enhanced fan-out consumers are required to ensure that performance is not degraded by additional consumers.

Shared fan-out divides the fixed per-shard read capacity among all consumers, so each additional consumer reduces the throughput available to the others.

A Kinesis Data Streams producer is any application that puts user data records into a Kinesis data stream.

The KPL is an easy-to-use, highly configurable library that helps you write to a Kinesis data stream and acts as an intermediary between your producer application code and the Kinesis Data Streams API actions.

The KPL does not continuously monitor a set of files and send new data to your stream.

https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-kpl.html

An Amazon Kinesis Data Streams application is a consumer of a stream that commonly runs on a fleet of EC2 instances.

The shard read throughput is fixed at a total of 2 MiB/sec per shard.

If there are multiple consumers reading from the same shard, they all share the throughput.

https://docs.aws.amazon.com/streams/latest/dev/building-consumers.html
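The effect of sharing the per-shard read throughput can be illustrated with a little arithmetic. The sketch below is only an approximation: it assumes the 2 MiB/sec is split evenly among shared consumers, and the function name and consumer counts are hypothetical.

```python
SHARD_READ_MIB_PER_SEC = 2.0  # fixed read throughput per shard, as stated above


def per_consumer_throughput(num_consumers: int, enhanced_fan_out: bool) -> float:
    """Approximate read throughput (MiB/sec) one consumer can expect from one shard."""
    if num_consumers < 1:
        raise ValueError("num_consumers must be at least 1")
    if enhanced_fan_out:
        # Each registered consumer gets its own throughput, up to 2 MiB/sec per shard.
        return SHARD_READ_MIB_PER_SEC
    # Shared fan-out: all consumers draw from the same 2 MiB/sec
    # (an even split is assumed here for illustration).
    return SHARD_READ_MIB_PER_SEC / num_consumers


if __name__ == "__main__":
    # 3 consumers today (dashboards, Redshift, S3), 6 after adding new capabilities.
    for count in (3, 6):
        shared = per_consumer_throughput(count, enhanced_fan_out=False)
        enhanced = per_consumer_throughput(count, enhanced_fan_out=True)
        print(f"{count} consumers: shared={shared:.2f} MiB/s, enhanced={enhanced:.2f} MiB/s")
```

Under these assumptions, adding consumers shrinks each shared consumer's portion, while the enhanced fan-out figure stays constant.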

Option B is incorrect - the Kinesis Producer Library (KPL) cannot be used to read log files.

We need Kinesis Agent to collect the information from the log files.

Enhanced fan-out consumers are required to ensure that performance is not degraded by additional consumers.

A Kinesis Data Streams producer is any application that puts user data records into a Kinesis data stream.

The KPL is an easy-to-use, highly configurable library that helps you write to a Kinesis data stream and acts as an intermediary between your producer application code and the Kinesis Data Streams API actions.

The KPL does not continuously monitor a set of files and send new data to your stream.

https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-kpl.html

An Amazon Kinesis Data Streams application is a consumer of a stream that commonly runs on a fleet of EC2 instances.

Read throughput scales as consumers register to use enhanced fan-out.

Each consumer registered as enhanced fan-out gets its own read throughput per shard, up to 2 MiB/sec, independently of other consumers.

https://docs.aws.amazon.com/streams/latest/dev/building-consumers.html

Option C is correct - we need Kinesis Agent to collect the information from the log files.

Enhanced fan-out consumers are required to ensure that performance is not degraded by additional consumers.

Kinesis Agent is a stand-alone Java application that collects data and sends it to Kinesis Data Streams.

The agent continuously monitors a set of files and sends new data to your stream.

https://docs.aws.amazon.com/streams/latest/dev/writing-with-agents.html
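As an illustration of how the agent is told which files to monitor, the sketch below builds an agent configuration in Python and serializes it to JSON. The stream name and log path are hypothetical, and the option values should be checked against the agent documentation linked above before use.

```python
import json


def build_agent_config(stream_name: str, file_pattern: str) -> dict:
    """Build a Kinesis Agent configuration with a single flow (illustrative only)."""
    return {
        "cloudwatch.emitMetrics": True,
        "flows": [
            {
                # Files matching this pattern are tailed by the agent.
                "filePattern": file_pattern,
                # Target Kinesis data stream for new lines.
                "kinesisStream": stream_name,
                # Optional pre-processing: convert Apache common log lines to JSON.
                "dataProcessingOptions": [
                    {"optionName": "LOGTOJSON", "logFormat": "COMMONAPACHELOG"}
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical stream name and log location.
    config = build_agent_config("gluebush-events", "/var/log/httpd/access_log*")
    print(json.dumps(config, indent=2))
```

The printed JSON would typically be saved as the agent's configuration file on each web server.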

An Amazon Kinesis Data Streams application is a consumer of a stream that commonly runs on a fleet of EC2 instances.

Read throughput scales as consumers register to use enhanced fan-out.

Each consumer registered as enhanced fan-out gets its own read throughput per shard, up to 2 MiB/sec, independently of other consumers.

https://docs.aws.amazon.com/streams/latest/dev/building-consumers.html
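Registering each downstream application as its own enhanced fan-out consumer might look like the sketch below. It assumes a boto3 Kinesis client is passed in. The `DryRunClient` class, stream ARN, and consumer names are hypothetical stand-ins so the example runs without AWS access.

```python
from typing import Any, Dict, List


def register_consumers(client: Any, stream_arn: str, names: List[str]) -> Dict[str, str]:
    """Register one enhanced fan-out consumer per name and return their ARNs."""
    arns: Dict[str, str] = {}
    for name in names:
        response = client.register_stream_consumer(
            StreamARN=stream_arn, ConsumerName=name
        )
        arns[name] = response["Consumer"]["ConsumerARN"]
    return arns


class DryRunClient:
    """Hypothetical stand-in that mimics the response shape without calling AWS."""

    def register_stream_consumer(self, StreamARN: str, ConsumerName: str) -> dict:
        fake_arn = f"{StreamARN}/consumer/{ConsumerName}:0"  # not a real ARN
        return {"Consumer": {"ConsumerName": ConsumerName, "ConsumerARN": fake_arn}}


if __name__ == "__main__":
    # With real credentials: client = boto3.client("kinesis")
    stream = "arn:aws:kinesis:eu-west-2:123456789012:stream/gluebush-events"
    print(register_consumers(DryRunClient(), stream, ["dashboards", "search"]))
```

Each registered consumer then reads with its own per-shard throughput, so adding the search or document-management consumer does not take capacity from the dashboards.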

Option D is incorrect - we need Kinesis Agent to collect the information from the log files.

Enhanced fan-out consumers are required to ensure that performance is not degraded by additional consumers.

Shared fan-out divides the fixed per-shard read capacity among all consumers, so each additional consumer reduces the throughput available to the others.

Kinesis Agent is a stand-alone Java application that collects data and sends it to Kinesis Data Streams.

The agent continuously monitors a set of files and sends new data to your stream.

https://docs.aws.amazon.com/streams/latest/dev/writing-with-agents.html

An Amazon Kinesis Data Streams application is a consumer of a stream that commonly runs on a fleet of EC2 instances.

The shard read throughput is fixed at a total of 2 MiB/sec per shard.

If there are multiple consumers reading from the same shard, they all share the throughput.

https://docs.aws.amazon.com/streams/latest/dev/building-consumers.html

Option E is incorrect - we need Kinesis Agent to collect the information from the log files.

The API on its own does not read data from log files.

Enhanced fan-out consumers are required to ensure that performance is not degraded by additional consumers.

Shared fan-out divides the fixed per-shard read capacity among all consumers, so each additional consumer reduces the throughput available to the others.

We can develop producers using the Amazon Kinesis Data Streams API with the AWS SDK for Java, but the API and SDK do not continuously monitor a set of files and send new data to your stream.

https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-sdk.html
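To see why the API alone is not enough, consider what a producer written against the SDK has to do itself: read the log lines, turn them into records, and batch them. The sketch below uses Python rather than the Java SDK mentioned above. The function name and partition key are hypothetical, and the file tailing, retries, and checkpointing that the agent provides are deliberately absent.

```python
from typing import Dict, Iterable, Iterator, List

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def to_put_records_batches(
    lines: Iterable[str], partition_key: str
) -> Iterator[List[Dict[str, object]]]:
    """Group log lines into batches shaped like the Records argument of PutRecords."""
    batch: List[Dict[str, object]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        batch.append({"Data": line.encode("utf-8"), "PartitionKey": partition_key})
        if len(batch) == MAX_RECORDS_PER_CALL:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    sample = ["GET /ads/1 200\n", "\n", "GET /ads/2 404\n"]
    for records in to_put_records_batches(sample, partition_key="web-1"):
        # With a real client: client.put_records(StreamName="...", Records=records)
        print(len(records), "records ready to send")
```

Everything around this function (detecting new lines, handling rotation, resuming after restarts) would still have to be written by hand.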

An Amazon Kinesis Data Streams application is a consumer of a stream that commonly runs on a fleet of EC2 instances.

The shard read throughput is fixed at a total of 2 MiB/sec per shard.

If there are multiple consumers reading from the same shard, they all share the throughput.

https://docs.aws.amazon.com/streams/latest/dev/building-consumers.html

The optimal end-to-end architecture for Gluebush.com to collect log and event data from web servers and mobile devices, pre-process and process the data to feed live dashboards, load data into a data warehouse built on Redshift and into S3 for long-term storage, and extend capabilities such as search, document management, and integration into a Data Lake built on EMR, is the one described by option C:

C. Data collection from the log files using Kinesis Agent, and reading of data using enhanced fan-out consumers, so that additional consumers of the same stream do not degrade the performance of the existing ones.

Explanation:

The architecture should consist of the following components:

  1. Data Collection: To collect log and event data from web servers and mobile devices, Kinesis Data Streams is an appropriate service, as it can ingest large amounts of data in real time from various sources. Kinesis Data Streams provides a durable, scalable, real-time data streaming platform. Kinesis Agent continuously monitors the log files and sends new data to the stream. The Kinesis Producer Library (KPL) helps producer application code write to a stream, but it does not monitor log files.

  2. Pre-processing: Once the data is collected, it needs to be pre-processed to perform various data transformations, filtering, and aggregation. AWS Lambda is a compute service that enables serverless event-driven computing. It can be used to perform pre-processing of data in real-time as it arrives in Kinesis Data Streams.

  3. Data Storage: To store the pre-processed data, Amazon S3 would be a suitable service for long-term storage. It provides durability, scalability, and high availability.

  4. Data Processing: To process the data further for analytics, Amazon Redshift is a suitable service. It is a fully managed, petabyte-scale data warehouse service that can handle massive amounts of data and run complex queries.

  5. Extending capabilities: To extend capabilities such as search, document management, and integration into a Data Lake built on EMR, each new capability is added as a further consumer of the same stream. Kinesis Data Firehose can deliver data to destinations such as Elasticsearch and Amazon S3, from where a Data Lake built on EMR can read it. Kinesis Data Analytics can be used to perform real-time analytics on data streams.

  6. Data consumption: To read data from Kinesis Data Streams, the enhanced fan-out consumer model should be used. Each consumer registered for enhanced fan-out gets its own read throughput per shard, up to 2 MiB/sec, independently of other consumers, so adding consumers does not slow down the existing ones. With shared fan-out, by contrast, all consumers share the fixed 2 MiB/sec per shard. The Kinesis Client Library (KCL) is a software library that makes it easier to build such consumers in Java, Python, and other languages.

  7. Downstream Applications: Connector libraries can be used to integrate Kinesis Data Streams with downstream applications like Redshift, Elasticsearch, and Amazon S3. Existing consumer applications should be sufficient as they can be easily integrated with connector libraries.
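The pre-processing step in item 2 can be sketched as a Lambda handler that decodes, filters, and counts records arriving from the stream. This is a minimal sketch under stated assumptions: the records are JSON objects, and the `level` field and the rule of dropping DEBUG entries are hypothetical examples of a filter.

```python
import base64
import json


def handler(event, context):
    """Decode Kinesis records from a Lambda event and drop unwanted entries."""
    kept = []
    for record in event.get("Records", []):
        # Kinesis record payloads arrive base64-encoded in the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        try:
            item = json.loads(payload)
        except ValueError:
            continue  # skip payloads that are not valid JSON
        if not isinstance(item, dict):
            continue  # only JSON objects are expected in this sketch
        if item.get("level") == "DEBUG":
            continue  # hypothetical filter: drop debug noise
        kept.append(item)
    return {"kept": len(kept), "items": kept}


if __name__ == "__main__":
    sample = {"level": "INFO", "msg": "ad viewed"}
    encoded = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
    print(handler({"Records": [{"kinesis": {"data": encoded}}]}, None))
```

In a real deployment the kept items would be forwarded to the next stage rather than returned, but the decode-and-filter loop stays the same.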

Therefore, option C is the correct answer: data collection from the log files using Kinesis Agent, and reading of data using enhanced fan-out consumers, so that additional consumers of the same stream do not degrade performance.