AWS Certified Big Data - Specialty Exam: FlexiToner Data Lake and Athena Solutions

Building a Data Lake with FlexiToner's S3 Files and Enabling Data Lake as a Service with Athena

Question

FlexiToner uses AWS to query 10 years' worth of historical data and get results, with the flexibility to explore data for deeper insights.

Movable Ink provides real-time personalization of marketing emails based on a wide range of user, device, and contextual data, driving higher response rates and better customer experiences.

FlexiToner also hosts log files captured from web servers running on different EC2 instances. It has many data assets in structured, semi-structured, and unstructured forms, including emails, logs, and structured data exported from databases, stored in CSV, LOG, and JSON formats as well as binary formats such as Parquet and ORC.

FlexiToner wants to build a data lake out of all the files stored on S3 and offer Data Lake as a Service to users from different departments, billed per query run.

FlexiToner understands that Athena provides this facility out of the box. Consider the below structure in S3

When AWS Glue Crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to determine where the root for a table is in the directory structure, and which directories are partitions for the table.

What solutions are possible? Select 2 options.

Answers

A. If the schema for table1 and table2 are similar, and a single data source is set to s3://bucket01/folder1/ in AWS Glue, the crawler may create a single table with two partition columns: one partition column that contains table1 and table2, and a second partition column that contains partition1 through partition5.

B. If the schemas for table1 and table2 are similar, and two different data sources are set to s3://bucket01/folder1/table1 and s3://bucket01/folder1/table2 in AWS Glue, the crawler may create two tables.

C. If the schemas for table1 and table2 are similar, and two different data sources are set to s3://bucket01/folder1/table1 and s3://bucket01/folder1/table2 in AWS Glue, the crawler may create only one table.

D. If the schemas for table1 and table2 are similar, and a single data source is set to s3://bucket01/folder1/ in AWS Glue, the crawler may create a single table with a single partition column: one partition column that contains table1 and table2.

Explanations

Answer: A, B.

Option A is correct - When an AWS Glue Crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to determine where the root for a table is in the directory structure, and which directories are partitions for the table.

In some cases, where the schema detected in two or more directories is similar, the crawler may treat them as partitions instead of separate tables.

One way to help the crawler discover individual tables is to add each table's root directory as a data store for the crawler.

If the schemas for table1 and table2 are similar, and a single data source is set to s3://bucket01/folder1/ in AWS Glue, the crawler may create a single table with two partition columns: one partition column that contains table1 and table2, and a second partition column that contains partition1 through partition5.

https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
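The two-partition-column outcome can be illustrated with a toy sketch. This is not the crawler's actual heuristic, which also weighs schema and file-format similarity; it only counts the directory levels between a chosen crawl root and the files. The S3 layout (which partitions sit under which table, and the file name file.txt) is an assumption reconstructed from the answer options, and the function name partition_levels is hypothetical. The partition_0, partition_1 labels follow the naming Glue typically gives to partition directories that are not in key=value form.

```python
from collections import defaultdict

# Assumed layout, reconstructed from the answer options; file names are hypothetical.
KEYS = [
    "s3://bucket01/folder1/table1/partition1/file.txt",
    "s3://bucket01/folder1/table1/partition2/file.txt",
    "s3://bucket01/folder1/table1/partition3/file.txt",
    "s3://bucket01/folder1/table2/partition4/file.txt",
    "s3://bucket01/folder1/table2/partition5/file.txt",
]


def partition_levels(keys, root):
    """Map each directory level below `root` to the directory names found there.

    Toy model only: every directory level under the root is treated as a
    partition column, which is what may happen when schemas are similar.
    """
    levels = defaultdict(set)
    for key in keys:
        if not key.startswith(root):
            continue  # key lies outside this crawl root
        directories = key[len(root):].split("/")[:-1]  # drop the file name
        for depth, name in enumerate(directories):
            levels[f"partition_{depth}"].add(name)
    return {column: sorted(names) for column, names in levels.items()}


# Single data source at folder1/: two directory levels, so two partition columns.
print(partition_levels(KEYS, "s3://bucket01/folder1/"))

# Data source at table1's own root: only one directory level remains.
print(partition_levels(KEYS, "s3://bucket01/folder1/table1/"))
```

With the root at folder1/, the sketch reports table1 and table2 at the first level and partition1 through partition5 at the second, which matches option A and shows why the single partition column in option D does not fit the layout.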

Option B is correct - One way to help the crawler discover individual tables is to add each table's root directory as a data store for the crawler.

To have the AWS Glue crawler create two separate tables, set the crawler to have two data sources, s3://bucket01/folder1/table1/ and s3://bucket01/folder1/table2/.

https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
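As a minimal sketch of the two-data-source setup, the request below lists each table's root as its own S3 target. The parameter names (Name, Role, DatabaseName, Targets, S3Targets, Path) are those of the AWS Glue CreateCrawler API as exposed by boto3; the crawler name, role ARN, database name, and the helper build_crawler_request are hypothetical placeholders, not values from the scenario. The actual API call is left commented out because it needs boto3 and AWS credentials.

```python
def build_crawler_request(name, role_arn, database, table_roots):
    """Build keyword arguments for Glue's create_crawler, one S3 target per table root."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": path} for path in table_roots]},
    }


request = build_crawler_request(
    name="flexitoner-tables",  # hypothetical crawler name
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    database="flexitoner_lake",  # hypothetical Glue database
    table_roots=[
        "s3://bucket01/folder1/table1/",
        "s3://bucket01/folder1/table2/",
    ],
)
print(request["Targets"])

# To create the crawler for real (requires boto3 and AWS credentials):
# import boto3
# boto3.client("glue").create_crawler(**request)
```

Listing s3://bucket01/folder1/ as the only target instead would reproduce the single-data-source case from option A.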

Option C is incorrect - Adding each table's root directory as a data store is the way to help the crawler discover individual tables. With two data sources, s3://bucket01/folder1/table1/ and s3://bucket01/folder1/table2/, the crawler creates two separate tables rather than one.

https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html

Option D is incorrect - With a single data source set to s3://bucket01/folder1/ and similar schemas, the crawler may treat both directory levels as partitions of one table: one partition column that contains table1 and table2, and a second partition column that contains partition1 through partition5. A single table with only one partition column does not account for the second directory level.

https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html

In this scenario, FlexiToner wants to build a data lake out of all the files stored on S3 and offer Data Lake as a Service to users from different departments, billed per query run. The data assets in S3 are structured, semi-structured, and unstructured, including emails, logs, and structured data exported from databases, in CSV, LOG, and JSON formats as well as binary formats such as Parquet and ORC.

When an AWS Glue Crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to decide where the root of a table is and which directories are partitions of that table. The question asks which outcomes are possible when the schemas for table1 and table2 are similar, depending on whether a single data source or separate data sources are set in AWS Glue.

Option A suggests that if the schema for table1 and table2 are similar, and a single data source is set to s3://bucket01/folder1/ in AWS Glue, the crawler may create a single table with two partition columns. One partition column contains table1 and table2, and a second partition column contains partition1 through partition5.

This outcome is possible because both tables sit under the single crawl root (s3://bucket01/folder1/) and their schemas are similar, so the crawler may treat the table1 and table2 directories as partitions rather than as separate tables. The result is a single table with two partition columns, one for each directory level below the root.

Option B suggests that if the schemas for table1 and table2 are similar, and two different data sources are set to s3://bucket01/folder1/table1 and s3://bucket01/folder1/table2 in AWS Glue, the crawler may create two tables.

This outcome is possible because each table's root directory (s3://bucket01/folder1/table1 and s3://bucket01/folder1/table2) is registered as its own data source. The crawler then discovers each root as an individual table, even though the schemas are similar.

Option C suggests that if the schemas for table1 and table2 are similar, and two different data sources are set to s3://bucket01/folder1/table1 and s3://bucket01/folder1/table2 in AWS Glue, the crawler may create only one table.

This outcome is not possible because table1 and table2 are registered as separate data sources (s3://bucket01/folder1/table1 and s3://bucket01/folder1/table2). Therefore, the AWS Glue Crawler will create two separate tables.

Option D suggests that if the schema for table1 and table2 are similar, and a single data source is set to s3://bucket01/folder1/ in AWS Glue, the crawler may create a single table with a single partition column that contains table1 and table2.

This outcome does not match the directory structure. Below the crawl root (s3://bucket01/folder1/) there are two directory levels, so when the crawler merges the data into one table it produces two partition columns: one that contains table1 and table2, and one that contains partition1 through partition5. A single partition column would leave the second level unaccounted for.

In conclusion, the possible outcomes for the given scenario are A and B. Option A applies when a single data source is set to s3://bucket01/folder1/ and the schemas for both tables are similar. Option B applies when each table's root directory (s3://bucket01/folder1/table1 and s3://bucket01/folder1/table2) is set as its own data source.
