A team currently uses Kinesis streams to stream web clicks for an application.
There is now a requirement to enable a data analyst team to perform SQL queries on the live data for analytical purposes.
Which of the following can be added to the architecture to achieve this requirement?
Answer - A.
An example of this is given in the AWS documentation.
Options B and C are incorrect because neither the KCL nor the KPL provides a way to run ad hoc SQL queries against the stream for analysis.
Option D is incorrect because AWS RDS is meant to host OLTP databases, not to serve analytics on live streaming data.
For more information on this use case, please refer to the URL below.
https://aws.amazon.com/blogs/big-data/querying-amazon-kinesis-streams-directly-with-sql-and-spark-streaming/
The best option to achieve this requirement is to create an EMR Cluster with Spark, stream the data from Kinesis streams to Spark, and use Spark to perform the queries. Here's why:
A. Create an EMR Cluster with Spark. Stream the data from Kinesis streams to Spark. Use Spark to perform the queries.
This option best fits the requirement. Spark can ingest data from Kinesis streams through Spark Streaming and expose it to SQL queries through Spark SQL, so analysts can query records in near real time and run complex analytics over large volumes of data. The Spark application on the EMR cluster runs in streaming mode, continuously processing data as it arrives from the Kinesis streams.
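The micro-batch idea behind this option can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Spark code: it uses the standard library's sqlite3 module as a stand-in for Spark SQL and a hard-coded list as a stand-in for records arriving from the Kinesis stream. The event fields (`ts`, `page`), the helper names, and the 10-second batch interval are all assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical click events; in the real architecture these would arrive
# from the Kinesis stream. Field names are assumptions for illustration.
CLICKS = [
    {"ts": 1, "page": "/home"},
    {"ts": 4, "page": "/cart"},
    {"ts": 7, "page": "/home"},
    {"ts": 12, "page": "/home"},
    {"ts": 15, "page": "/checkout"},
]

BATCH_SECONDS = 10  # assumed micro-batch interval


def micro_batches(events, interval):
    """Group events into fixed-length time windows keyed by window start."""
    batches = defaultdict(list)
    for event in events:
        window_start = (event["ts"] // interval) * interval
        batches[window_start].append(event)
    return dict(sorted(batches.items()))


def query_batch(events, sql):
    """Load one micro-batch into an in-memory table and run SQL over it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE clicks (ts INTEGER, page TEXT)")
        conn.executemany(
            "INSERT INTO clicks (ts, page) VALUES (?, ?)",
            [(e["ts"], e["page"]) for e in events],
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# The analyst's query; it can be changed without touching the ingest code.
SQL = "SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page ORDER BY page"

results = {
    start: query_batch(batch, SQL)
    for start, batch in micro_batches(CLICKS, BATCH_SECONDS).items()
}

for start, rows in results.items():
    print(f"window starting at {start}s: {rows}")
```

In this sketch the query string is separate from the ingest logic, which is the property the analyst team needs: a new question means a new SQL statement, not a new consumer application.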
B. Use the KCL library to directly perform the SQL queries on the incoming data.
This option is not recommended, as the KCL (Kinesis Client Library) is not designed for SQL querying, but rather for consuming and processing data from Kinesis streams. While it is possible to use a third-party SQL library to query the data, this approach may not be as efficient or scalable as using Spark.
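To see why this is awkward, consider what consumer-side code tends to look like. The sketch below is plain Python built around a hypothetical `ClickCounter` class. It is not the KCL API, only an imitation of the per-batch callback style a stream consumer uses, and the record shape (`page`) is an assumption.

```python
from collections import Counter


class ClickCounter:
    """Hypothetical consumer-style processor (not the real KCL interface)."""

    def __init__(self):
        self.hits_per_page = Counter()

    def process_records(self, records):
        # The aggregation is fixed in code; a different question
        # (for example, hits per user) would need new code and a redeploy.
        for record in records:
            self.hits_per_page[record["page"]] += 1


processor = ClickCounter()
processor.process_records([{"page": "/home"}, {"page": "/cart"}])
processor.process_records([{"page": "/home"}])
print(dict(processor.hits_per_page))
```

Every aggregation here has to be hand-coded by a developer, whereas a SQL engine lets analysts change the query themselves.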
C. Embed the SQL queries while developing the application using the KPL Library.
The KPL (Kinesis Producer Library) is designed for publishing data to Kinesis streams, and not for querying data. Therefore, this option is not recommended.
D. Use the Data Pipeline service to transfer the data to AWS RDS. Use normal SQL queries for the analysis.
While this option may work, it adds an unnecessary step of transferring the data to an RDS database before it can be queried. This can add latency and increase costs. Additionally, the live nature of the data may not be preserved if it is stored in a database before querying.
In summary, the best option to achieve the requirement of enabling a data analyst team to perform SQL queries on live data from Kinesis streams is to create an EMR cluster with Spark and stream the data from Kinesis streams to Spark for processing and querying.