You have a large amount of data stored in an S3 bucket, and the data arrives at a fixed time.
You want to analyze the data directly using standard SQL.
You also need a service to help you to create databases and tables.
The service should be able to categorize the data and automatically infer database and table schema.
Which combinations of services would you choose? (Select TWO.)
A. B. C. D. E.
Correct Answer: B, E.
There are two requirements for the question.
First, the data should be analyzed with standard SQL.
Second, a service is required to create the databases and tables that the SQL queries run against, and to infer their schema automatically.
Amazon Athena and AWS Glue should be used.
You can configure AWS Glue to connect to the data sources in Amazon S3 and prepare the databases for Athena.
Details can be found at https://docs.aws.amazon.com/athena/latest/ug/data-sources-glue.html.
Option A is incorrect: Amazon QuickSight enables you to visualize the data and share it through data dashboards.
This is not required in this question.
Option B is CORRECT: Athena analyzes unstructured, semi-structured, and structured data stored in Amazon S3 using ANSI SQL.
Option C is incorrect: Tools such as SQL Workbench are SQL clients used to run further queries from SQL scripts.
It is not required in this scenario.
Option D is incorrect: Amazon EMR is a platform for processing and analyzing vast amounts of data.
It is not needed because the question only asks for standard SQL queries.
Option E is CORRECT: You can set up a crawler in AWS Glue to retrieve schema information automatically.
Once the crawler has populated the AWS Glue Data Catalog, Athena can query the resulting tables using standard SQL, with results typically returned in seconds.
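Because the data arrives at a fixed time, the crawler in Option E can be put on a schedule. The following is a minimal sketch, assuming the boto3 Glue client and its `create_crawler` call; the crawler name, role ARN, database name, bucket path, and cron expression are hypothetical placeholders, not values from the question.

```python
def build_crawler_request(name, role_arn, database, s3_path, cron_expr):
    """Build the keyword arguments for a Glue CreateCrawler request.

    cron_expr uses the six-field Glue format:
    minutes hours day-of-month month day-of-week year.
    """
    return {
        "Name": name,
        "Role": role_arn,                 # IAM role the crawler assumes
        "DatabaseName": database,         # Data Catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": f"cron({cron_expr})",
    }


def create_scheduled_crawler(glue_client, name, role_arn, database,
                             s3_path, cron_expr):
    """Create a scheduled crawler and return the request that was sent.

    glue_client is expected to behave like boto3.client("glue").
    """
    request = build_crawler_request(name, role_arn, database,
                                    s3_path, cron_expr)
    glue_client.create_crawler(**request)
    return request
```

With real credentials, this could be called as `create_scheduled_crawler(boto3.client("glue"), ...)` with a cron expression such as `0 2 * * ? *`, chosen to run shortly after the daily data drop. Passing the client in as a parameter keeps the function easy to exercise without AWS access.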
To analyze large amounts of data stored in an S3 bucket using standard SQL, you can combine Amazon Athena and AWS Glue. Amazon QuickSight and Amazon EMR can also be used for data analysis, but they serve different use cases.
Amazon Athena is a query service that makes it easy to analyze data in Amazon S3 using standard SQL. It is a serverless service, so you don't need to manage any infrastructure. Athena can handle large datasets, and results are typically available in seconds. You can create databases and tables directly in Athena or use the AWS Glue Data Catalog to store metadata such as table and column names.
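Running a query against Athena from code might look like the sketch below. It assumes the boto3 Athena client and its `start_query_execution`, `get_query_execution`, and `get_query_results` calls; the database name, result location, and polling limits are hypothetical. For brevity it reads only the first page of results.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_athena_query(athena_client, sql, database, output_location,
                     poll_seconds=1.0, max_polls=60):
    """Submit a SQL query to Athena, wait for it, and return rows as lists.

    athena_client is expected to behave like boto3.client("athena").
    For SELECT queries the first returned row holds the column headers.
    """
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    for _ in range(max_polls):
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Query {query_id} did not finish in time")

    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {state}")

    # First page only; larger result sets would need pagination.
    result = athena_client.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

The output location is an S3 path where Athena writes result files, so the caller needs write access to it.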
AWS Glue is a fully managed ETL (Extract, Transform, Load) service that makes it easy to move data between data stores. It provides a managed Apache Spark environment to perform ETL jobs and data transformations. AWS Glue can automatically infer the schema of your data, categorize it, and populate the AWS Glue Data Catalog with table definitions. You can then use Athena or other SQL-based tools to query the data.
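To check what schema Glue inferred, the table definition can be read back from the Data Catalog. This is a small sketch, assuming the boto3 Glue client and its `get_table` call; the function name and the database and table names passed to it are hypothetical.

```python
def describe_table_schema(glue_client, database, table):
    """Return the inferred columns and partition keys of a catalog table.

    glue_client is expected to behave like boto3.client("glue").
    Each entry is a (column_name, column_type) tuple.
    """
    response = glue_client.get_table(DatabaseName=database, Name=table)
    definition = response["Table"]
    columns = definition["StorageDescriptor"]["Columns"]
    # PartitionKeys may be absent for unpartitioned tables.
    partitions = definition.get("PartitionKeys", [])
    return {
        "columns": [(c["Name"], c["Type"]) for c in columns],
        "partition_keys": [(p["Name"], p["Type"]) for p in partitions],
    }
```

Reviewing this output after the first crawler run is a quick way to confirm that the inferred types match what the SQL queries expect.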
Amazon QuickSight is a business intelligence service that connects to a variety of data sources, including Amazon S3, to create visualizations and dashboards. While it can be used to explore data, its purpose is visualization rather than running SQL directly against data in S3, so it is not the best fit for this use case.
Amazon EMR is a managed cluster platform for processing large amounts of data with big data frameworks such as Apache Hadoop and Apache Spark. While it can be used for data analysis, it is overkill for this use case, because you typically still have to size and scale a cluster.
To summarize, the two services that would be most appropriate for analyzing data stored in an S3 bucket using standard SQL and automatically inferring database and table schema are Amazon Athena and AWS Glue.