You work for an online fashion retailer as a machine learning specialist.
You are on a team of machine learning specialists and data scientists who have been given the responsibility of centralizing your company's product, customer, supplier, and materials data in one source.
This new data source will be used for analytics and for making business decisions using KPIs (Key Performance Indicators).
Your company has many different data sources where its product, customer, supplier, and materials data is stored.
These data repositories are also housed on several different database technologies. When you load the various data sources into your new centralized data source, you need to clean and classify the data as well.
What is the most expeditious and efficient way to create this new centralized data source?
A. B. C. D.
Answer: D.
Option A is incorrect.
Using Amazon EMR and its built-in machine learning tools will work to extract, transform, and load your disparate data sources into your S3 data lake.
However, it is not the quickest or simplest of the options given.
Option B is incorrect.
Using AWS Glue and its crawlers will work to extract, transform, and load your disparate data sources into your S3 data lake.
However, it is not the quickest or simplest of the options given.
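To give a rough sense of what Option B involves, the sketch below builds the request for a Glue crawler that covers JDBC and S3 sources. The crawler name, role ARN, database name, connection names, and paths are hypothetical placeholders, not values from the question. `create_crawler` and `start_crawler` are boto3 Glue client calls; the function that invokes them assumes boto3 is installed and AWS credentials are configured.

```python
def build_crawler_request(name, role_arn, database, jdbc_targets, s3_paths=()):
    """Build keyword arguments for the Glue CreateCrawler API.

    jdbc_targets is a list of (connection_name, path) pairs that refer to
    Glue connections assumed to exist already.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "JdbcTargets": [
                {"ConnectionName": conn, "Path": path}
                for conn, path in jdbc_targets
            ],
            "S3Targets": [{"Path": path} for path in s3_paths],
        },
    }


def create_and_start_crawler(request):
    """Send the request to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builder works without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only.
crawler_request = build_crawler_request(
    name="retail-sources-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="retail_catalog",
    jdbc_targets=[("products-mysql", "products/%"), ("suppliers-postgres", "suppliers/%")],
    s3_paths=["s3://example-retail-raw/materials/"],
)
```

A crawler only populates the Glue Data Catalog. The ETL jobs and any cleaning logic still have to be written and scheduled separately, which is part of why this option is slower than Option D.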
Option C is incorrect.
Using Amazon Kinesis Data Firehose and its Lambda integration will work to extract, transform, and load your disparate data sources into your S3 data lake.
However, it is not the quickest or simplest of the options given.
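For comparison, a Firehose transformation Lambda under Option C might look roughly like the sketch below. Firehose passes the function a batch of base64-encoded records and expects each one back with its `recordId`, a `result` status, and re-encoded `data`. The `category` field and the normalization rule are hypothetical, chosen only to illustrate a hand-written cleaning step.

```python
import base64
import json


def lambda_handler(event, context):
    """Clean each Firehose record and return it in the expected response shape."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical cleaning rule: normalize the product category.
            payload["category"] = str(payload.get("category", "")).strip().lower()
            cleaned = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(cleaned).decode("utf-8"),
            })
        except (ValueError, AttributeError):
            # Undecodable or non-object payloads are marked as failed and
            # returned unchanged.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}


# Local usage example with two fabricated records.
sample_event = {
    "records": [
        {"recordId": "1", "data": base64.b64encode(b'{"category": "  Shoes "}').decode("utf-8")},
        {"recordId": "2", "data": base64.b64encode(b"not json").decode("utf-8")},
    ]
}
sample_result = lambda_handler(sample_event, None)
```

Every cleaning and classification rule has to be coded by hand in this approach, which is why it is not the simplest choice for this scenario.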
Option D is correct.
AWS Lake Formation builds on the capabilities of AWS Glue to simplify the creation of an S3 data lake.
Once you define your disparate data sources to AWS Lake Formation, it crawls your data sources and moves the data into your S3 data lake.
It uses machine learning algorithms to clean and classify your data.
This is the simplest and most efficient option listed.
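Much of a Lake Formation setup is typically done in the console, but two of the underlying API calls can be sketched as follows: registering an S3 location as data lake storage, and defining a FindMatches ML transform, the Glue machine learning transform that identifies duplicate or matching records. The bucket, role ARN, database, table, and column names are hypothetical placeholders. The function that sends the requests assumes boto3 is installed and AWS credentials are configured.

```python
def build_register_request(bucket, prefix=""):
    """Arguments for the Lake Formation RegisterResource API."""
    cleaned = prefix.strip("/")
    arn = f"arn:aws:s3:::{bucket}" + (f"/{cleaned}" if cleaned else "")
    return {"ResourceArn": arn, "UseServiceLinkedRole": True}


def build_find_matches_request(name, role_arn, database, table, primary_key):
    """Arguments for the Glue CreateMLTransform API (FindMatches type)."""
    return {
        "Name": name,
        "Role": role_arn,
        "InputRecordTables": [{"DatabaseName": database, "TableName": table}],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {"PrimaryKeyColumnName": primary_key},
        },
    }


def apply_requests(register_request, transform_request):
    """Send both requests to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builders work without AWS access

    boto3.client("lakeformation").register_resource(**register_request)
    boto3.client("glue").create_ml_transform(**transform_request)


# Hypothetical values, for illustration only.
register_request = build_register_request("example-retail-lake", "/raw/")
transform_request = build_find_matches_request(
    name="dedupe-customers",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="retail_catalog",
    table="customers",
    primary_key="customer_id",
)
```

This is only a sketch of the moving parts. Ingestion from the source databases would still be configured separately, for example through Lake Formation blueprints.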
Reference:
Please see the AWS Lake Formation overview page, the Amazon EMR overview page, the AWS Big Data blog titled Build a Data Lake Foundation with AWS Glue and Amazon S3, and the Amazon Kinesis overview page.
The most efficient and expeditious way to create a centralized data source with cleaned and classified data from multiple sources is to use AWS Lake Formation, which builds on AWS Glue. AWS Glue is a fully managed extract, transform, and load (ETL) service that moves data between data stores. It has built-in capabilities for discovering data sources, extracting data, transforming data, and loading data into a variety of data stores, including an S3 data lake. Lake Formation layers the setup of the data lake itself on top of those capabilities.
Option A suggests using Amazon EMR and Apache Spark MLlib to extract, transform, and load data into an S3 data lake. While Apache Spark is a powerful open-source framework for distributed data processing, it requires significant expertise to set up, configure, and manage. Additionally, EMR can be costly, especially for smaller workloads. It works, but it is not the quickest or simplest option.
Option B recommends using AWS Glue crawlers to crawl the data sources and populate the Glue Data Catalog for the S3 data lake. Once the catalog is populated, Glue jobs can extract, transform, and load data into S3. Glue crawlers automatically discover and classify data stored in various data stores, which reduces the effort required to manage ingestion pipelines. This is a viable option, but you still have to build the ETL jobs and the cleaning steps yourself, so it is not the quickest one.
Option C suggests using Amazon Kinesis Data Firehose to send data from different sources to an S3 data lake, with a Lambda function transforming the data as it is loaded into S3. This option is not ideal for cleaning and classifying data because it relies heavily on hand-written Lambda functions, which adds development effort and can increase the overall cost of the solution.
Option D recommends using AWS Lake Formation to collect, catalog, transform, and load data into S3. Lake Formation provides a secure and scalable way to build and manage a data lake. Because it builds on Glue, you only need to define your data sources; Lake Formation then crawls them, moves the data into the S3 data lake, and uses machine learning algorithms to clean and classify it. This makes it the most expeditious option.
In summary, the most expeditious and efficient way to create a new centralized data source with cleaned and classified data from multiple sources is to use AWS Lake Formation (Option D). It simplifies the process of discovering, extracting, transforming, and loading data into S3, and it requires less setup and management effort than assembling the same pipeline from EMR, Glue, or Kinesis Data Firehose directly.