AWS Certified Machine Learning - Specialty: Text Feature Engineering Techniques for Document Classification

Question

You work for a language translation software company.

Your company needs to move from traditional translation software to a machine learning model-based approach that produces accurate translations.

One of your first tasks is to take text given in the form of a document and use a histogram to measure the occurrence of individual words in the document for use in document classification. Which of the following text feature engineering techniques is the best solution for this task?

Answers

A. Orthogonal Sparse Bigram (OSB)
B. Term Frequency-Inverse Document Frequency (tf-idf)
C. Bag-of-Words
D. N-Gram

Explanations

Answer: C.

Option A is incorrect.

The Orthogonal Sparse Bigram (OSB) natural language processing algorithm slides a fixed-size window over the text and outputs every pair of words that includes the first word in the window.

You are trying to measure the occurrence of individual words.

Option B is incorrect.

Term Frequency-Inverse Document Frequency (tf-idf) scores how important a word is to a document: the score rises with how often the word appears in that document and falls with how many documents in the corpus contain it.

You are not trying to determine the importance of the words in your document, just the count of the individual words.
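To make the difference between a raw count and a tf-idf weight concrete, here is a minimal sketch. It assumes one common variant of the formula (term count divided by document length, multiplied by the natural log of the number of documents over the number of documents containing the word); libraries often use smoothed variants, and the `tokenize` and `tf_idf` helpers and the two sample documents are illustrative, not part of the question.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Illustrative two-document corpus (an assumption for this sketch).
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog is sleeping.",
]
tokenized = [tokenize(d) for d in documents]

# Document frequency: how many documents contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Unsmoothed tf-idf: (count / document length) * ln(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

raw_counts = [Counter(tokens) for tokens in tokenized]
weights = [tf_idf(tokens) for tokens in tokenized]

# Compare the raw count with the tf-idf weight for two words in document 1.
for word in ("the", "fox"):
    print(word, raw_counts[0][word], weights[0][word])
```

Under this variant, a word that appears in every document gets a weight of zero no matter how often it occurs, which is why tf-idf answers a different question than a word-count histogram.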

Option C is correct.

The Bag-of-Words natural language processing algorithm creates tokens of the input document text and outputs a statistical depiction of the text.

The statistical depiction, such as a histogram, shows the count of each word in the document.

Option D is incorrect.

The N-Gram natural language processing algorithm is used to find multi-word phrases in the text of a document.

You are not trying to find multi-word phrases.

You are just trying to find the count of the individual words.

Reference:

Please see the article titled Introduction to Natural Language Processing for Text.

The best solution for the given task of measuring the occurrence of individual words in a document for use in document classification is "C. Bag-of-Words".

Text feature engineering is a process of extracting features from the raw text data to transform it into a structured format that can be used for machine learning tasks. The Bag-of-Words (BoW) technique is a commonly used text feature engineering method in natural language processing (NLP).

In the BoW approach, a document is represented as a collection of individual words, ignoring the order in which they appear but taking into account their frequency. The BoW technique builds a vocabulary of all the unique words in the corpus and then creates a feature vector for each document by counting the frequency of each word in the vocabulary.

For example, suppose we have the following two sentences:

  • "The quick brown fox jumps over the lazy dog."
  • "The lazy dog is sleeping."

The BoW representation of these two sentences would be as follows:

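            the  quick  brown  fox  jumps  over  lazy  dog  is  sleeping
Sentence 1    2      1      1    1      1     1     1    1   0         0
Sentence 2    1      0      0    0      0     0     1    1   1         1

(Words are lowercased and punctuation is dropped before counting, so "The" and "the" in Sentence 1 share one column with a count of 2.)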

As seen in the above table, each row represents a sentence, and each column represents a unique word in the corpus. The numbers in the table denote the frequency of the corresponding word in the respective sentence.
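The counting step can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production tokenizer: the `tokenize` helper is hypothetical, and it lowercases words, so "The" and "the" in the first sentence are counted together.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog is sleeping.",
]

# Count word occurrences in each sentence.
counts = [Counter(tokenize(s)) for s in sentences]

# Vocabulary: unique words across the corpus, in order of first appearance.
vocabulary = list(dict.fromkeys(word for s in sentences for word in tokenize(s)))

# One feature vector per sentence: the count of each vocabulary word.
vectors = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one entry per vocabulary word, so every document in the corpus maps to a vector of the same length regardless of how long the document is.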

In the given task, the occurrence of individual words in the document is measured using a histogram, which can be constructed by using the BoW technique. The frequency count of each word in the vocabulary can be used to plot a histogram, which shows the distribution of words in the document.
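As a rough sketch of how word counts become a histogram, the snippet below prints a text-based bar for each word instead of using a plotting library. The sample document and the `text_histogram` helper are assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative document (an assumption for this sketch).
document = "The quick brown fox jumps over the lazy dog. The lazy dog is sleeping."

# Lowercase, tokenize, and count each word.
word_counts = Counter(re.findall(r"[a-z']+", document.lower()))

def text_histogram(counts):
    """Return one line per word, most frequent first, with a bar of '#' marks."""
    width = max(len(word) for word in counts)
    return [f"{word:<{width}} {'#' * n} ({n})" for word, n in counts.most_common()]

for line in text_histogram(word_counts):
    print(line)
```

The same `word_counts` mapping could be handed to a plotting library as bar heights; the text version is used here only to keep the example dependency-free.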

Therefore, the correct answer is "C. Bag-of-Words." The other options are also text feature engineering techniques, but they are not suitable for the given task. Orthogonal Sparse Bigram (OSB) pairs the first word in a sliding window with each of the other words in that window, so it produces word pairs rather than single-word counts. Term Frequency-Inverse Document Frequency (tf-idf) weights a word by its frequency in the document and its rarity across the corpus, which measures importance rather than raw occurrence. N-Gram considers sequences of n adjacent words and is used to find multi-word phrases, which is not what the task asks for.
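To see why the pair-based options do not yield single-word counts, the following simplified sketch generates bigrams and OSB-style pairs for a short phrase. The `ngrams` and `osb_pairs` helpers are hypothetical, and the convention of adding one underscore per skipped word is used here only as illustrative notation. Note that OSB pairs may skip over intermediate words inside the window.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def osb_pairs(tokens, window):
    """Pair the first word of each sliding window with every later word in it.

    Each extra underscore marks one skipped word (illustrative notation).
    """
    pairs = []
    for i, first in enumerate(tokens):
        for gap in range(1, window):
            if i + gap < len(tokens):
                pairs.append(first + "_" * gap + tokens[i + gap])
    return pairs

tokens = tokenize("The quick brown fox")

print(ngrams(tokens, 2))      # adjacent word pairs
print(osb_pairs(tokens, 3))   # pairs within a window of 3 words
```

Both functions emit features built from more than one word, whereas the task only needs a count per individual word.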