AWS Glue for Building Data Schema for K-means Clustering of Ride Data

Question

You work in the data analytics department of a ride sharing software company.

You need to use the K-means machine learning algorithm to separate your company's optimized ride data into clusters based on ride coordinates.

How would you use AWS Glue in the best way to build the data schema needed to classify the ride data?

Answers

Explanations

A. B. C. D.

Answer: A.

Option A is correct.

The best way to build the schema for your data is to use a Glue crawler that leverages one or more classifiers.

(See the AWS Glue crawler documentation: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html.)

Option B is incorrect because there is no stated need to remove duplicates from the data.

Option C is incorrect because you don't need to automatically generate code: Glue builds the schema for your data from a prioritized list of classifiers without any custom code (see the AWS Glue developer guide: https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html).

Option D is incorrect because there is no stated requirement to flatten the ride data.

Reference:

For an example, please see the AWS Machine Learning blog post titled Serverless unsupervised machine learning with AWS Glue and Amazon Athena: https://aws.amazon.com/blogs/machine-learning/serverless-unsupervised-machine-learning-with-aws-glue-and-amazon-athena/.

As the data analytics department of a ride-sharing software company, you need to cluster ride data based on ride coordinates using the K-means machine learning algorithm. To build the required data schema, you can leverage AWS Glue, a fully managed extract, transform, and load (ETL) service that makes it easy to move data between data stores.

Out of the given options, the most appropriate choice would be to use Glue crawlers to crawl your ride share data (Option A).

Glue crawlers can scan various data stores, including Amazon S3, relational databases, and NoSQL databases, to infer schema information and extract metadata such as table definitions, column names, and data types. Crawlers automatically create and update table definitions in the AWS Glue Data Catalog, which can be used by other AWS services, including Amazon EMR and Amazon Athena, to process and analyze your data.

Here's how you can use Glue crawlers to build the required data schema:

  1. Configure a Glue crawler: First, you need to create a Glue crawler and configure it to point to the location where your ride share data is stored. You can choose from various data sources, including Amazon S3, relational databases, and NoSQL databases.

  2. Run the Glue crawler: Once you've configured the Glue crawler, you can run it to extract metadata and schema information from your ride share data. The Glue crawler will analyze your data and create or update table definitions in the AWS Glue Data Catalog.

  3. Use the AWS Glue Data Catalog: Once the Glue crawler has completed its job, you can use the AWS Glue Data Catalog to access your ride share data and build the required data schema for the K-means algorithm. You can also use other AWS services such as Amazon SageMaker to build and deploy your machine learning model based on the classified ride data.
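The first two steps above can be sketched with boto3 (the AWS SDK for Python). This is a minimal sketch, not a production setup: the crawler name, IAM role ARN, database name, and S3 path below are all hypothetical placeholders you would replace with your own values.

```python
def build_crawler_config(name, role_arn, database, s3_path):
    """Assemble the arguments for glue.create_crawler (step 1).

    All argument values are supplied by the caller; nothing here is
    specific to any real AWS account.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # The crawler will scan this S3 location and infer the schema.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def run_crawler(config):
    """Create and start the crawler (step 2).

    Once the run finishes, Glue writes the inferred table definitions
    to the AWS Glue Data Catalog (step 3). Requires AWS credentials.
    """
    import boto3  # imported here so the config helper stays usable offline

    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])


# Hypothetical usage -- these names are placeholders, not real resources:
cfg = build_crawler_config(
    name="ride-data-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="ride_analytics",
    s3_path="s3://example-rides/raw/",
)
# run_crawler(cfg)  # uncomment with valid credentials and resources
```

Keeping the configuration in a small helper like this makes it easy to create one crawler per data source while reusing the same IAM role and target database.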

In summary, using Glue crawlers is the best option to build the data schema needed to classify the ride data based on coordinates for the K-means algorithm. Glue crawlers help automatically create and update table definitions in the AWS Glue Data Catalog and make it easy to access data from various data stores.
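To make the end goal concrete, here is a minimal, self-contained sketch of K-means (Lloyd's algorithm) applied to ride pickup coordinates, assuming the crawled data has already been exported from the Data Catalog (for example, via an Athena query) into a NumPy array. In practice you would use a managed implementation such as SageMaker's built-in K-means algorithm rather than hand-rolling the loop; the coordinates below are made-up examples.

```python
import numpy as np


def kmeans(points, k, iters=50, seed=0):
    """Cluster (lat, lon) points into k groups; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by sampling k distinct input points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels


# Hypothetical ride coordinates: two pickups in one city, two in another.
rides = np.array(
    [[40.75, -73.99], [40.76, -73.98], [34.05, -118.24], [34.06, -118.25]]
)
centroids, labels = kmeans(rides, k=2)
```

With two well-separated groups like this, the algorithm assigns the two nearby pickups in each city to the same cluster, which is exactly the grouping the K-means step would produce on the real, crawler-catalogued ride data.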