Querying Structured and Unstructured Data for Feature Correlation and Dimensionality Analysis | MLS-C01 Exam Answer

Querying Data for Feature Correlation and Dimensionality Analysis

Question

You work on a machine learning team at a manufacturing company that produces fire detection products.

You are building a fire detection analytics model, the source data store of which has structured and unstructured data stored in an S3 bucket.

You are in the data engineering and data analysis phase of the machine learning lifecycle.

At this point, you need to use SQL to run queries on your source data to determine feature correlation and dimensionality.

Which option allows you to query the data with the least amount of effort?
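Before looking at the answers, it helps to be concrete about what "feature correlation" means here. The sketch below computes the Pearson correlation coefficient between two hypothetical fire-sensor features in plain Python; the readings are made-up sample data, not part of the exam scenario.

```python
# Hypothetical example: Pearson correlation between two sensor features.
# The readings below are illustrative sample data only.

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

temperature = [20.0, 25.0, 30.0, 35.0, 40.0]
smoke_density = [0.1, 0.2, 0.3, 0.4, 0.5]  # perfectly linear in temperature

print(round(pearson(temperature, smoke_density), 4))  # → 1.0
```

In practice you would run this kind of computation in SQL against the source data rather than pulling it into Python, which is exactly what the question is asking you to enable.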

Answers

Explanations


A. Transform the data using a Lambda function, then query the data in Kinesis Data Analytics.
B. Use AWS Batch to perform ETL on the data, then load the data into an ElasticSearch cluster and query it there.
C. Crawl the data using a Glue crawler, then query the data directly on S3 using Athena.
D. Create an RDS database, load the data into the RDS instance using Data Pipeline, then query it there.

Correct Answer: C.

Option A is incorrect.

Transforming the data with a Lambda function and then querying it in Kinesis Data Analytics requires you to write and maintain a Lambda function.

This is more effort than crawling the data with a Glue crawler and then using Athena to query the data directly on S3.

Option B is incorrect.

Using AWS Batch to perform ETL on the data and then loading it into an ElasticSearch cluster requires more effort than crawling the data with a Glue crawler and then using Athena to query the data directly on S3.

Option C is correct.

Crawling the data using a Glue crawler and then querying the data directly on S3 using Athena only requires you to write your SQL code.

You don't need to write a Lambda function or create an Aurora or RDS database.
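As a sketch of what "just writing your SQL code" could look like: Athena's Presto-based engine provides the `corr(x, y)` aggregate, so pairwise feature correlations can come from a single query. The table and column names below are hypothetical, and the helper that assembles the query is purely illustrative.

```python
# Sketch: assembling an Athena SQL query for pairwise feature correlations.
# Table and column names are hypothetical; Athena's Presto-based engine
# supplies the corr(x, y) aggregate function used here.
from itertools import combinations

def correlation_query(table, features):
    """Return one SELECT computing corr() for every pair of features."""
    exprs = [
        f"corr({a}, {b}) AS corr_{a}_{b}"
        for a, b in combinations(features, 2)
    ]
    return f'SELECT {", ".join(exprs)} FROM {table}'

sql = correlation_query("sensor_readings",
                        ["temperature", "smoke_density", "co_level"])
print(sql)
```

You would then submit the generated query from the Athena console (or programmatically) against the table that the Glue crawler registered in the Data Catalog.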

Option D is incorrect.

With this option, you have to create an RDS database and then load your data into the RDS instance using Data Pipeline.

This requires more effort than crawling the data with a Glue crawler and then using Athena to query the data directly on S3.

References:

Please see the AWS Data Pipeline developer guide titled What is AWS Data Pipeline? (https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html),

The AWS Glue developer guide titled Defining Crawlers (https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html),

The AWS Batch user guide titled What Is AWS Batch? (https://docs.aws.amazon.com/batch/latest/userguide/what-is-batch.html),

and the Amazon Kinesis Data Analytics for SQL Applications developer guide titled What Is Amazon Kinesis Data Analytics for SQL Applications? (https://docs.aws.amazon.com/kinesisanalytics/latest/dev/what-is.html)

The most efficient option for querying structured and unstructured data stored in an S3 bucket with SQL is to use a combination of AWS Glue and Athena. Therefore, the correct answer is option C.

AWS Glue is a fully managed extract, transform, and load (ETL) service that allows you to easily prepare and load data for analytics. AWS Glue can automatically generate ETL code to transform and load data from various sources, including S3, into target data stores such as Redshift, RDS, and others. Glue crawlers can automatically discover the schema and metadata of your data stored in S3 and then create Glue data catalogs to store this information.

Athena is an interactive query service that allows you to easily analyze data stored in S3 using standard SQL. You can use Athena to run ad-hoc queries or build complex analysis workflows. Athena is serverless, which means that you don't need to manage any infrastructure, and you only pay for the queries that you run.

Therefore, by using AWS Glue to prepare the data and create a data catalog, you can easily query the structured and unstructured data in S3 using Athena. This option allows you to query the data with the least amount of effort.
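For the dimensionality side of the analysis, one simple first pass (sketched below on hypothetical query results) is to drop near-constant features: a column with effectively zero variance carries no information, so removing it reduces dimensionality at no cost. The column names and threshold are illustrative assumptions, not part of the exam scenario.

```python
# Sketch: a minimal dimensionality check on feature columns pulled back
# from an Athena query. Near-constant (zero-variance) columns add no
# information and can be dropped. Names and threshold are illustrative.

def variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def informative_features(columns, threshold=1e-6):
    """Keep only columns whose variance exceeds the threshold."""
    return [name for name, values in columns.items()
            if variance(values) > threshold]

columns = {
    "temperature": [20.0, 25.0, 30.0, 35.0],
    "sensor_model": [1.0, 1.0, 1.0, 1.0],  # constant → uninformative
}
print(informative_features(columns))  # → ['temperature']
```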

Option A involves using Kinesis Data Analytics, which is a real-time data analytics service, and Lambda to transform the data. While this option can work, it is not the most efficient way to query data in S3 with SQL.

Option B involves using AWS Batch to extract, transform, and load (ETL) the data and an ElasticSearch cluster to run the queries. While this option can work, it is more complex than option C and may require more effort to set up and manage.

Option D involves using RDS and Data Pipeline to transform and load the data. While this option can work, it requires more infrastructure and may be more complex to set up and manage than option C. Additionally, RDS is not an ideal target data store for unstructured data.