Fully Managed ETL Service for Data Categorization, Cleaning, and Enrichment | MSP Bank

ETL Service for Data Lake: Simplify Data Categorization, Cleaning, and Enrichment

Question

MSP Bank, Limited is a leading Japanese financial institution that provides a full range of financial products and services to both institutional and individual customers.

It is headquartered in Tokyo.

MSP Bank hosts its existing infrastructure across an on-premises data center (DC) and AWS, maintaining a hybrid environment. Multiple web applications, the CRM, and the ERP run on premises, while storage, compute, the data warehouse (DWH), and AI workloads are moving to AWS.

MSP Bank is also launching new applications that run on AWS.

MSP Bank maintains separate Development, Testing, and Production VPCs, with VPN connectivity between the on-premises DC and AWS. It plans to build a data lake from the log files stored in S3, which are captured from applications running on premises and on AWS, along with selected data sets from the CRM, ERP, and other business applications.

MSP Bank is looking for a fully managed ETL service that makes it simple and cost-effective to categorize its data, clean it, enrich it, and move it reliably between various data stores.

Which tool can help? Select one option.

Answers

A. Amazon Athena
B. AWS Glue
C. Amazon Kinesis Data Streams
D. AWS Glue Data Catalog

Answer: B.

Option A is incorrect - Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.

Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run.

Athena scales automatically, executing queries in parallel, so results are fast, even with large datasets and complex queries.

https://docs.aws.amazon.com/athena/latest/ug/what-is.html

Option B is correct - AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores.

https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
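
To make "categorize, clean, enrich" concrete, here is a minimal plain-Python sketch of the kind of transform an ETL job might apply to application log records. It does not use the AWS Glue API. The record fields (source, level, message), the category rule, and the ENV_BY_SOURCE lookup are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Optional

# Hypothetical lookup used to enrich records; not taken from the scenario.
ENV_BY_SOURCE = {"crm": "on-premises", "erp": "on-premises", "webapp": "aws"}


def clean(record: dict) -> Optional[dict]:
    """Trim string fields, normalize the level, drop records with no message."""
    cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    if not cleaned.get("message"):
        return None
    cleaned["level"] = str(cleaned.get("level", "INFO")).upper()
    return cleaned


def categorize(record: dict) -> dict:
    """Tag each record with a coarse category based on its log level."""
    category = "alert" if record["level"] in {"ERROR", "CRITICAL"} else "routine"
    return {**record, "category": category}


def enrich(record: dict) -> dict:
    """Add the hosting environment looked up from the record's source."""
    env = ENV_BY_SOURCE.get(record.get("source", ""), "unknown")
    return {**record, "environment": env}


def run_etl(records: Iterable[dict]) -> list:
    """Apply clean, then categorize, then enrich, skipping rejected records."""
    out = []
    for raw in records:
        cleaned = clean(raw)
        if cleaned is not None:
            out.append(enrich(categorize(cleaned)))
    return out
```

Calling run_etl on a list of raw log dictionaries returns only the records that survived cleaning, each with added category and environment fields. In an actual Glue job the same steps would typically be written against Spark data frames rather than Python lists.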

Option C is incorrect - Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service.

KDS can continuously capture gigabytes of data per second from hundreds of thousands of sources such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events.

The data collected is available in milliseconds to enable real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more.

https://aws.amazon.com/kinesis/data-streams/

Option D is incorrect - The AWS Glue Data Catalog is a persistent metadata store.

It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore. It provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos, and use that metadata to query and transform the data. It holds metadata only and is a component of AWS Glue rather than an ETL service in its own right.

https://docs.aws.amazon.com/glue/latest/dg/components-

MSP Bank wants to build a data lake from the log files stored in S3, which are captured from applications running on premises and on AWS, along with selected data sets from the CRM, ERP, and other business applications. It is looking for a fully managed ETL service that can categorize data, clean it, enrich it, and move it reliably between various data stores.

Among the options provided, the best tool that can help MSP Bank is B. AWS Glue.

AWS Glue is a fully managed ETL (Extract, Transform, Load) service that makes it easy to move data between data stores. It can automatically discover and catalog data, generate ETL code to transform data, and manage metadata. Glue integrates with a wide variety of AWS data sources and targets, including S3, RDS, DynamoDB, and Redshift, making it an ideal tool for building a data lake.

Glue consists of three components: the Glue Data Catalog, the ETL Engine, and the Job Scheduler. The Glue Data Catalog is a central metadata repository that stores metadata about data sources, targets, transformations, and jobs. The ETL Engine generates Python or Scala code to transform data, and the Job Scheduler runs the ETL jobs on a schedule or in response to events.
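
As a rough illustration of how the three components map onto API requests, the sketch below builds the keyword arguments for the boto3 Glue client calls create_crawler, create_job, and create_trigger. The bucket paths, role ARN, database name, job name, and schedule are hypothetical placeholders. The functions only build dictionaries, so nothing is sent to AWS unless the commented-out calls at the end are enabled with valid credentials.

```python
# All names below (bucket paths, role ARN) are hypothetical placeholders.
LOG_PATH = "s3://example-msp-logs/app-logs/"
SCRIPT_PATH = "s3://example-msp-scripts/clean_logs.py"
ROLE_ARN = "arn:aws:iam::123456789012:role/ExampleGlueRole"


def crawler_request(name: str, database: str) -> dict:
    """Data Catalog: a crawler scans S3 and records table metadata."""
    return {
        "Name": name,
        "Role": ROLE_ARN,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": LOG_PATH}]},
    }


def job_request(name: str) -> dict:
    """ETL engine: a Spark job that runs a Python script stored in S3."""
    return {
        "Name": name,
        "Role": ROLE_ARN,
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": SCRIPT_PATH,
            "PythonVersion": "3",
        },
    }


def trigger_request(name: str, job_name: str, cron: str) -> dict:
    """Scheduler: a trigger that starts the job on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Sending the requests needs boto3 and AWS credentials, so it is left
# commented out in this sketch:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler_request("log-crawler", "msp_logs"))
#   glue.create_job(**job_request("clean-logs"))
#   glue.create_trigger(**trigger_request("nightly", "clean-logs", "0 2 * * ? *"))
```

Keeping the request construction separate from the client calls makes the configuration easy to review and test before anything is created in an AWS account.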

Glue can also handle semi-structured data such as JSON, as well as columnar formats such as Parquet and ORC, making it easier to work with different data formats. It can also perform schema inference to automatically discover the structure of the data.
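
The idea of schema inference can be sketched in a few lines of plain Python. This is not the algorithm a Glue crawler uses; it is a simplified illustration that reads JSON lines and guesses a column type for each key. The type names (bigint, double, and so on) are an assumption chosen to resemble Hive-style names, and the rule of falling back to string on conflicting types is likewise an illustrative choice.

```python
import json


def type_name(value) -> str:
    """Map a value parsed from JSON to a simple column type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "array"
    return "string"


def infer_schema(json_lines) -> dict:
    """Return {column: type}; conflicting types fall back to string."""
    schema = {}
    for line in json_lines:
        for key, value in json.loads(line).items():
            if value is None:
                continue  # nulls carry no type information
            seen = schema.get(key)
            new = type_name(value)
            if seen is None or seen == new:
                schema[key] = new
            elif {seen, new} == {"bigint", "double"}:
                schema[key] = "double"  # widen integers to floating point
            else:
                schema[key] = "string"
    return schema
```

Passing an iterable of JSON strings to infer_schema returns a dictionary of column names and inferred types, which is roughly the kind of metadata a catalog would store for a table.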

Amazon Athena, another option provided in the answer choices, is a query service that allows you to analyze data directly in S3 using SQL. While Athena can be used to query data in a data lake, it is a query engine rather than a managed ETL service, and it is not designed to schedule transformation jobs or move data between data stores. The Glue Data Catalog can be used with Athena to store metadata about data sources and partitions.

Amazon Kinesis Data Streams is a real-time data streaming service that allows you to ingest and process streaming data. While it can be used to capture real-time data, it is not an ETL service and does not categorize, clean, or enrich data on its own.

The AWS Glue Data Catalog, the last option provided, is a component of Glue that serves as a central metadata repository. It is not an ETL service: it stores metadata and does not itself run transformation or loading jobs.

In summary, MSP Bank should choose B. AWS Glue as the best option for their requirements. Glue is a fully managed ETL service that can categorize data, clean it, enrich it, and move it reliably between various data stores, making it an ideal tool for building a data lake.