You are a machine learning specialist for a manufacturing company that has ingested structured and semi-structured manufacturing process data into their S3 buckets in their corporate data lake.
Your data scientists now want to use SQL to run queries on this data to build manufacturing process KPI dashboards using a business intelligence tool. Which option gives your data scientists the analysis and visualization capabilities they need most efficiently?
Answer: B.
Option A is incorrect.
Using AWS Data Pipeline to transform and load the data into RDS is not the most efficient option listed.
Also, Kibana is best used as a visualization tool with Amazon Elasticsearch Service (since renamed Amazon OpenSearch Service), not RDS.
Option B is CORRECT.
The AWS Glue crawler is the best option listed for making your manufacturing data available to a query tool like Athena by cataloging the data in your Glue data catalog.
Athena is built to leverage the Glue data catalog to enable simple, efficient query capabilities for data stored in S3.
Finally, QuickSight integrates directly with Athena through its Athena dataset connector.
QuickSight has KPI dashboard capabilities built into it, making it the best BI visualization tool for your data scientists.
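As a rough illustration of the cataloging step in Option B, the sketch below uses the boto3 Glue client calls `create_crawler` and `start_crawler` to point a crawler at an S3 prefix. The crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders, not values taken from the question. The client is passed in as a parameter so the logic can be exercised without AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def register_and_start_crawler(glue_client, request):
    """Create the crawler, then start it so it populates the Glue data catalog."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])
    return request["Name"]


# Hypothetical values for illustration only (assumptions, not from the question).
EXAMPLE_REQUEST = build_crawler_request(
    name="manufacturing-process-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="manufacturing_lake",
    s3_path="s3://example-data-lake/manufacturing/",
)

# With boto3 installed and AWS credentials configured, this could be run as:
#   import boto3
#   register_and_start_crawler(boto3.client("glue"), EXAMPLE_REQUEST)
```

Once the crawler finishes, the tables it infers should appear in the named catalog database, where Athena can query them without any separate schema or database setup.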
Option C is incorrect.
Using Amazon Aurora as the data store for your data scientists' visualization work adds unnecessary complexity.
You would have to create the Aurora schema and database implementation.
The Glue data catalog and Athena option is much more efficient.
Option D is incorrect.
Using Lambda and Kinesis Data Analytics as the data source for your data scientists' visualization work adds unnecessary complexity.
You would have to write the Lambda function code to process your manufacturing data.
The Glue data catalog and Athena option is much more efficient.
Reference:
Please see the Towards Data Science article titled Getting Started with Data Analysis on AWS.
Please refer to the AWS Big Data blog titled Analyzing Data in S3 using Amazon Athena.
Please review the AWS Big Data blog titled Build a Data Lake Foundation with AWS Glue and Amazon S3.
Option B is the most efficient solution for the given scenario.
Explanation:
The manufacturing company has ingested structured and semi-structured manufacturing process data into their S3 buckets in their corporate data lake. The data scientists need to use SQL to run queries on this data to build manufacturing process KPI dashboards using a business intelligence tool.
Option A suggests transforming the data into the Parquet format using AWS Data Pipeline and then loading it into RDS. However, RDS is not the most suitable database for running analytical queries on large datasets. Kibana is built to visualize data indexed in Elasticsearch, so it is a poor fit for running SQL queries against RDS and building KPI dashboards.
Option C suggests transforming the data and then loading it into Aurora using an AWS Batch ETL job. Aurora is a good option for storing large datasets, but it is a managed relational database service designed for OLTP (Online Transaction Processing) workloads, not OLAP (Online Analytical Processing) workloads. QuickSight is a good data visualization tool, but using it with Aurora requires additional data preparation and management, which is not efficient for the given scenario.
Option D suggests transforming the data into the Parquet format using a Lambda function and using Kinesis Data Analytics to run queries and build KPI dashboard visualizations. However, Kinesis Data Analytics is designed for processing and analyzing streaming data, not static data stored in S3 buckets. This option also introduces additional complexity by requiring the use of Lambda functions and Kinesis Data Analytics, which is not necessary for the given scenario.
Option B suggests cataloging the data using a Glue crawler to populate the Glue data catalog. Glue is a fully managed extract, transform, and load (ETL) service that can crawl data sources, infer schema, and create ETL jobs based on the discovered schema. Once the crawler has populated the catalog, the data scientists can use Athena to run SQL queries on the manufacturing data stored in S3. Athena is a serverless interactive query service that makes it easy to analyze data directly in S3 using SQL. Finally, the data scientists can build their KPI dashboards using the QuickSight Athena dataset feature, which allows them to connect to Athena and visualize the data using a variety of chart types and custom visualizations. This option is the most efficient and cost-effective solution for the given scenario because it requires minimal data preparation and no database setup or management, and it provides an easy-to-use and powerful data visualization tool.
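The query step can be sketched in the same spirit. The example below is a minimal illustration, assuming a cataloged table named `process_events` with `line_id` and `cycle_time_seconds` columns (hypothetical names, as are the database and results location in the usage comment). It relies on the boto3 Athena client calls `start_query_execution`, `get_query_execution`, and `get_query_results`, and takes the client as a parameter so it can be tried with a stub.

```python
import time


def run_athena_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Submit a SQL query to Athena, wait for it to finish, and return result rows."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state is reached.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    # Only the first page of results is fetched here; larger result sets need pagination.
    results = athena_client.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]


# Hypothetical KPI query; table and column names are assumptions for illustration.
KPI_SQL = """
SELECT line_id, AVG(cycle_time_seconds) AS avg_cycle_time
FROM process_events
GROUP BY line_id
ORDER BY line_id
"""

# With boto3 installed and AWS credentials configured, this could be run as:
#   import boto3
#   rows = run_athena_query(boto3.client("athena"), KPI_SQL,
#                           "manufacturing_lake", "s3://example-athena-results/")
```

In practice the data scientists would not need this code for dashboards, since a QuickSight Athena dataset issues the queries itself. The sketch only shows that the SQL runs directly against the data in S3, with no database to provision.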