A supermarket chain has a big data analysis system deployed in AWS.
The system stores raw data, such as clickstream data and process logs, in S3.
An m3.large EC2 instance transforms the data into other formats and saves it to another S3 bucket.
The transformed data is then moved to Amazon Redshift.
Correct Answer - A, D.
AWS Glue is a service to discover data, transform it, and make it available for search and querying.
AWS Glue can make all your data in S3 immediately available for analytics without moving the data.
Option A is CORRECT because the crawler is a key component of AWS Glue: it scans data in supported repositories such as S3, classifies it, extracts schema information, and stores the metadata automatically in the AWS Glue Data Catalog.
Option B is incorrect because AWS Glue will generate ETL code in Scala or Python rather than Java.
Option C is incorrect because AWS Glue does not create triggers by default.
Moreover, cron expressions that lead to rates faster than 5 minutes are not supported, so a one-minute schedule is not possible.
Option D is CORRECT because the Glue Data Catalog stores the metadata in the AWS Cloud, where it is readily available for analysis.
The scenario describes a big data analysis system in AWS in which raw data is stored in S3, and an EC2 instance transforms it and saves it to another S3 bucket before it is moved to Amazon Redshift. The question asks which of the following statements are true about AWS Glue, a fully managed extract, transform, and load (ETL) service that makes it easy to move data between data stores.
A. AWS Glue contains a crawler that connects to the S3 bucket and scans the dataset. Then the service creates metadata tables in the data catalog.
This statement is correct. AWS Glue includes a crawler that can connect to various data sources, including Amazon S3, JDBC databases, and DynamoDB tables, and extract metadata such as table definitions, column types, and file formats. The crawler scans the S3 bucket containing the raw data, and then the service creates metadata tables in the Glue Data Catalog, a central metadata repository that stores information about the data assets in your account.
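To make option A concrete, the sketch below assembles the arguments for the boto3 Glue client's create_crawler call, pointing a crawler at an S3 path. The crawler name, IAM role ARN, database name, and bucket path are hypothetical placeholders, and the helper functions are illustrative names rather than part of any AWS library.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Assemble keyword arguments for the boto3 Glue client's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes to read the bucket
        "DatabaseName": database,    # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and run it once (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only.
crawler_request = build_crawler_request(
    name="raw-clickstream-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="supermarket_raw",
    s3_path="s3://example-raw-data-bucket/clickstream/",
)
```

Calling create_and_start_crawler(crawler_request) in an account with suitable permissions would run the crawler once; the tables it discovers would then appear in the supermarket_raw database of the Data Catalog.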
B. AWS Glue automatically generates code in Java to extract data from the source and transform it to match the target schema.
This statement is incorrect. AWS Glue can automatically generate ETL scripts that extract data from the source and transform it to match the target schema, but it generates them in Python or Scala, not Java. It also provides a visual interface for building ETL jobs without writing the code by hand.
C. By default, AWS Glue creates a scheduler to trigger the activated tasks every minute.
This statement is false. AWS Glue provides a scheduler that you can use to trigger ETL jobs on a schedule, but it doesn't create a scheduler by default. You have to configure the scheduler to run your jobs according to your requirements.
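As an illustration of configuring a schedule explicitly, the sketch below builds the arguments for the boto3 Glue client's create_trigger call. The trigger name, job name, and the helper build_scheduled_trigger are hypothetical; the helper rejects intervals shorter than 5 minutes because Glue does not support faster cron rates.

```python
def build_scheduled_trigger(name, job_name, every_minutes):
    """Assemble keyword arguments for the boto3 Glue client's create_trigger call."""
    # Glue does not support schedules that fire more often than every 5 minutes.
    if not 5 <= every_minutes <= 59:
        raise ValueError("interval must be between 5 and 59 minutes")
    return {
        "Name": name,
        "Type": "SCHEDULED",
        # Fires at minute 0 of each hour, then every `every_minutes` within that hour.
        "Schedule": f"cron(0/{every_minutes} * * * ? *)",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def create_trigger(request):
    """Register the trigger with Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without AWS access

    boto3.client("glue").create_trigger(**request)


# Hypothetical names, for illustration only.
trigger_request = build_scheduled_trigger(
    name="transform-clickstream-every-15-min",
    job_name="transform-clickstream",
    every_minutes=15,
)
```

A request for every_minutes=1, which is what option C describes, raises ValueError in this sketch instead of being sent to the service.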
D. AWS Glue has a central metadata repository (Glue Data Catalog). The Glue Data Catalog is available for analysis immediately.
This statement is correct. AWS Glue provides a central metadata repository called the Glue Data Catalog, which stores metadata about data assets such as databases, tables, and partitions. Once a crawler has populated the catalog, its table definitions are available for analysis immediately: services such as Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR can use them to query the underlying data with SQL, and the metadata itself can be read through the Glue API. The catalog is a managed, durable store scoped to one AWS account and Region; other accounts must be granted access explicitly.
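As a small sketch of reading catalog metadata through the Glue API, the code below condenses a response shaped like the one returned by the boto3 Glue client's get_tables call. The sample response is hand-written for illustration, not real output, and the database, table, and helper names are hypothetical.

```python
def summarize_tables(response):
    """Map each table name to its storage location and (column, type) pairs."""
    summary = {}
    for table in response.get("TableList", []):
        descriptor = table.get("StorageDescriptor", {})
        summary[table["Name"]] = {
            "location": descriptor.get("Location"),
            "columns": [(c["Name"], c["Type"]) for c in descriptor.get("Columns", [])],
        }
    return summary


def fetch_table_summary(database):
    """Read the first page of tables from the Data Catalog (needs boto3 and credentials)."""
    import boto3  # imported lazily so summarize_tables works without AWS access

    glue = boto3.client("glue")
    return summarize_tables(glue.get_tables(DatabaseName=database))


# Hand-written sample in the shape of a get_tables response (not real output).
sample_response = {
    "TableList": [
        {
            "Name": "clickstream",
            "StorageDescriptor": {
                "Location": "s3://example-raw-data-bucket/clickstream/",
                "Columns": [
                    {"Name": "user_id", "Type": "string"},
                    {"Name": "event_time", "Type": "timestamp"},
                ],
            },
        }
    ]
}
table_summary = summarize_tables(sample_response)
```

The fetch_table_summary helper reads only the first page of results; a catalog with many tables would need to follow the NextToken value in each response.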
In summary, options A and D are correct statements about AWS Glue, while options B and C are incorrect. AWS Glue can be used to automate the ETL process for big data analysis systems, and it provides a centralized metadata repository that stores information about your data assets.