
Data Sanitization and Preparation for Word2Vec Algorithm

Question

You are a machine learning specialist at a company that is exploring conversational user interface application development.

As an experiment, your team is building a natural language processing application.

Your application needs to process the transcribed conversation data from your conversational user interface.

For training, you are starting with a dataset comprising 5 million sentences.

You plan to run a model based on the Word2Vec algorithm to generate word embeddings from the sentences.

This will allow your team to make different types of predictions.

Consider this example sentence: “My funy LARGE MEME went over the audiences head.”

Which operations should your team perform to sanitize and prepare the data in a repeatable manner? (CHOOSE THREE)

Answers

A. Correct the spelling of “funy” to “funny” and “audiences” to “audience's.”

B. Perform normalization by making the sentence lowercase.

C. Use an English stopword dictionary to remove all of the stop words.

D. Use one-hot encoding on the sentence.

E. Use part-of-speech tagging to keep only the action verbs and the nouns.

F. Perform tokenization of the sentence, creating a word vector.

Correct answers: B, C, and F.

Explanations

Option A is incorrect.

In natural language processing, the exact spelling of a word has relatively little bearing on its importance, and hand-correcting individual misspellings across 5 million sentences is neither practical nor repeatable.

Option B is CORRECT.

Using normalization, you change all text so that it is on the same level.

For example, converting all characters to lowercase.

This allows algorithms like Bag Of Words and Word2Vec to perform more accurately.
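As a minimal sketch (assuming Python; the sentence value is taken from the question), lowercase normalization looks like this:

    # Normalize by lowercasing so that "My" and "my" map to the same token.
    sentence = "My funy LARGE MEME went over the audiences head."
    normalized = sentence.lower()
    print(normalized)  # my funy large meme went over the audiences head.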

Option C is CORRECT.

Stop words like “not,” “nor,” and “never” are among the most common words in a given language.

Removing these words allows your algorithm to focus on differentiating words.
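A minimal sketch of stop word removal, assuming Python and NLTK's English stopword list (any fixed stopword dictionary would work the same way):

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")  # one-time download of the stopword lists
    stop_words = set(stopwords.words("english"))

    tokens = "my funy large meme went over the audiences head".split()
    filtered = [t for t in tokens if t not in stop_words]
    print(filtered)  # ['funy', 'large', 'meme', 'went', 'audiences', 'head']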

Option D is incorrect.

One-hot encoding is a technique used to encode categorical data, not free-form text; applied to a large vocabulary, it produces extremely sparse, high-dimensional vectors, and it is not the input format Word2Vec expects.
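For contrast, a minimal Python sketch of how one-hot encoding applies to categorical data (the color values are hypothetical):

    # One-hot encoding maps each category to a binary indicator vector.
    categories = ["red", "green", "blue"]
    index = {c: i for i, c in enumerate(categories)}

    def one_hot(value):
        vec = [0] * len(categories)
        vec[index[value]] = 1
        return vec

    print(one_hot("green"))  # [0, 1, 0]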

Option E is incorrect.

If you used part-of-speech tagging to keep only the action verbs and the nouns, you would strip the conversation data of much of its meaning.

The example sentence would be reduced to something like: “Meme went head.”
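A minimal sketch of this filtering, assuming Python and NLTK's part-of-speech tagger (the resource name and the exact tags assigned vary by NLTK version):

    import nltk
    nltk.download("averaged_perceptron_tagger")  # resource name varies by NLTK version

    tokens = "my funy large meme went over the audiences head".split()
    tagged = nltk.pos_tag(tokens)
    # Keep only nouns (NN*) and verbs (VB*); everything else is discarded.
    kept = [word for word, tag in tagged if tag.startswith(("NN", "VB"))]
    print(kept)  # roughly ['meme', 'went', 'audiences', 'head']; most words are lost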

Option F is CORRECT.

NLP algorithms like Word2Vec work best with tokenized data as their input.
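A minimal sketch, assuming Python with gensim 4.x (the second sentence and all hyperparameter values are illustrative):

    from gensim.models import Word2Vec

    # Word2Vec expects an iterable of tokenized sentences (lists of words).
    corpus = [
        "my funy large meme went over the audiences head".split(),
        "the audience laughed at the large meme".split(),
    ]

    model = Word2Vec(corpus, vector_size=100, window=5, min_count=1)
    print(model.wv["meme"][:5])  # first five dimensions of the learned embedding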

References:

- Natural Language Processing: Text Data Vectorization
- NLP: Extracting the main topics from your dataset using LDA in minutes (Towards Data Science)
- NLP Text Preprocessing: A Practical Guide and Template (Towards Data Science)
- 3 basic approaches in Bag of Words which are better than Word Embeddings (Towards Data Science)
- Treat Negation Stopwords Differently According to Your NLP Task (Towards Data Science)
- Why One-Hot Encode Data in Machine Learning? (Machine Learning Mastery)
- How to get started with Word2Vec - and then how to make it work

To prepare and sanitize the data in a repeatable manner, the team should perform several operations. The three best options for this particular scenario are:

B. Perform normalization by making the sentence lowercase. Normalization is the process of transforming text into a standard form to reduce variation in the data. Without it, "My" and "my" would be treated as two separate words, leading to decreased accuracy. By making the sentence lowercase, the team ensures that each word is standardized and consistent.

C. Use an English stopword dictionary to remove all of the stop words. Stop words are the most common words in a given language and carry little information that distinguishes one sentence from another. Removing them with a standard stopword dictionary is a deterministic, repeatable operation that lets the algorithm focus on the differentiating words.

F. Perform tokenization of the sentence, creating a word vector. Tokenization is the process of breaking down text into smaller units, or tokens, such as words or phrases. Tokenization is important for NLP because it allows the model to understand the structure of the text and identify the relationships between the words. By creating a word vector, the team can convert the text data into a numerical representation that can be used by the Word2Vec algorithm.

Therefore, the three operations that the team should perform to sanitize and prepare the data in a repeatable manner are performing normalization by making the sentence lowercase, using an English stopword dictionary to remove the stop words, and performing tokenization of the sentence to create a word vector. Correcting spelling errors, one-hot encoding the sentence, and using part-of-speech tagging to keep only the action verbs and the nouns can be useful techniques in other settings, but they are not the best options for this particular scenario.
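Putting the three chosen operations together, a minimal repeatable preprocessing sketch in Python (the NLTK stopword list is an assumption; any fixed stopword dictionary would do):

    import string
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))  # assumes nltk.download("stopwords") was run

    def preprocess(sentence):
        # 1. Normalize: lowercase and strip punctuation.
        text = sentence.lower().translate(str.maketrans("", "", string.punctuation))
        # 2. Tokenize: split the sentence into a word vector.
        tokens = text.split()
        # 3. Remove stop words using the English stopword dictionary.
        return [t for t in tokens if t not in stop_words]

    print(preprocess("My funy LARGE MEME went over the audiences head."))
    # ['funy', 'large', 'meme', 'went', 'audiences', 'head']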