Phishing Email Detection with Machine Learning Techniques

Text Feature Engineering Techniques

Question

You work for the security department of your firm.

As part of protecting your firm's email from phishing attacks, you need to build a machine learning model that analyzes incoming email text for multi-word phrases such as “you're a winner” or “click here now” in order to flag potential phishing emails.

Which of the following text feature engineering techniques is the best solution for this task?

Answers

Explanations

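A. Orthogonal Sparse Bigram (OSB)
B. Term Frequency-Inverse Document Frequency (TF-IDF)
C. Bag-of-Words
D. N-Gram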

Answer: D.

Option A is incorrect.

The Orthogonal Sparse Bigram natural language processing algorithm slides a window of words over the text and outputs every pair of words that includes the first word in the window.

You are trying to classify an email as a phishing attack by having your model learn from the presence of contiguous multi-word phrases in the email text, not from word pairs that are each anchored on the first word of a window.
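To make the difference concrete, here is a minimal sketch of OSB pair generation in plain Python. The function name `osb_pairs`, the whitespace tokenization, and the underscore notation (one underscore per position between the two words) are assumptions of this sketch, not a reference implementation.

```python
def osb_pairs(text, window=4):
    """Pair the first word of each window with every later word in that window.

    Skipped words are marked with extra underscores, so "click__now" means
    one word was skipped between "click" and "now".
    """
    words = text.split()
    pairs = []
    for i, first in enumerate(words):
        # Words that follow `first` inside the current window.
        for gap, second in enumerate(words[i + 1:i + window]):
            pairs.append(first + "_" * (gap + 1) + second)
    return pairs


pairs = osb_pairs("click here now please", window=3)
print(pairs)
```

Every feature is a pair of words, so the three-word phrase "click here now" never appears as a single feature.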

Option B is incorrect.

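Term Frequency-Inverse Document Frequency determines how important a word is to a document by weighting how often the word appears in that document against how many documents in the corpus contain it. Words that are frequent in one document but rare across the corpus receive the highest weights.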

You are trying to classify an email as a phishing attack by having your model learn based on the presence of multi-word phrases in the email text.

You are not trying to determine the importance of a word or phrase in the email text.
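As an illustration of what TF-IDF measures, the sketch below computes one common variant, tf × log(N / df), in plain Python. The function name `tf_idf`, the sample emails, and the whitespace tokenization are all hypothetical. Libraries typically apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


emails = [
    "click here now to claim your prize",
    "please review the meeting notes here",
    "your invoice is attached here",
]
scores = tf_idf(emails)
print(scores[0])
```

The output is a weight per individual word. A word such as "here" that occurs in every email is weighted down to zero, and nothing in the result records that "click here now" occurred as a phrase.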

Option C is incorrect.

The Bag-of-Words natural language processing algorithm splits the input document text into tokens and outputs a statistical summary of the text.

That summary, such as a histogram, shows the count of each individual word in the document and discards the order in which the words appeared.

You are trying to classify an email as a phishing attack by having your model learn based on the presence of multi-word phrases in the email text, not individual words.
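A short sketch of why word counts alone miss phrases, assuming naive lowercase whitespace tokenization (the helper name `bag_of_words` is hypothetical):

```python
from collections import Counter


def bag_of_words(text):
    """Count each individual word; the order of the words is not kept."""
    return Counter(text.lower().split())


in_order = bag_of_words("click here now")
shuffled = bag_of_words("now click here")
print(in_order == shuffled)
```

The two texts produce identical bags, so a model fed these counts cannot tell whether the words formed the phrase "click here now" or simply appeared somewhere in the email.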

Option D is correct.

The N-Gram natural language processing algorithm slides a window of N words over the text, in this case an email, and outputs each sequence of N contiguous words, which makes multi-word phrases available as features.

This suits your phishing detection task since you are trying to classify an email as a phishing attack by having your model learn based on the presence of multi-word phrases.
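A minimal sketch of N-Gram extraction in plain Python, assuming lowercase whitespace tokenization with no punctuation handling (the helper name `ngrams` and the sample text are hypothetical):

```python
def ngrams(text, n):
    """Return every sequence of n contiguous words in the text."""
    words = text.lower().split()
    # range() is empty when the text has fewer than n words.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


trigrams = ngrams("Congratulations you're a winner click here now", 3)
print(trigrams)
```

With N = 3, both "you're a winner" and "click here now" come out as single tokens that a classifier can use directly, while the same words in a different order would not produce them.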

Reference:

Please see the article titled "Introduction to Natural Language Processing for Text" and the article titled "Document Classification Part 2: Text Processing (N-Gram Model & TF-IDF Model)".

The best text feature engineering technique for identifying word phrases in email text is the N-Gram approach.

The N-Gram technique slides a window of N words across the text and emits each sequence of N contiguous words as a token. Each token is assigned a numerical value based on its frequency of occurrence in the text, and the resulting numerical representation can be used for machine learning tasks such as text classification.

In the context of identifying potential phishing emails, N-Grams with a window of three words produce tokens such as "you're a winner" and "click here now," so the feature vector for each email can record how often those phrases occur.

Once the N-Gram feature vectors are created for each email, a machine learning algorithm can be trained to classify emails as either phishing or non-phishing based on the frequency of occurrence of the targeted word phrases.

The other techniques listed in the answer choices are less suitable for this task:

  • Orthogonal Sparse Bigram (OSB) is an alternative to the bigram (N-Gram with N = 2) that pairs the first word in a window with each later word in that window. It produces word pairs, some with gaps between the words, rather than contiguous multi-word phrases.
  • Term Frequency-Inverse Document Frequency (tf-idf) is a technique that assigns weights to words based on their frequency of occurrence in a document and their rarity in the corpus. While useful for identifying important words in a document, it scores terms rather than extracting multi-word phrases.
  • Bag-of-Words counts individual words and discards word order, so it cannot distinguish the phrase "click here now" from the same three words scattered through an email.
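As a rough sketch of the feature-vector step described above, the following turns each email into a vector of phrase counts. The `PHRASES` list, the helper `phrase_vector`, and the sample emails are hypothetical, and tokenization is naive whitespace splitting. A real pipeline would more likely learn the N-Gram vocabulary from labeled training data than hard-code it.

```python
from collections import Counter

# Hypothetical target phrases; in practice these would come from training data.
PHRASES = ["you're a winner", "click here now"]


def phrase_vector(email, phrases=PHRASES):
    """Return how many times each target phrase occurs contiguously in the email."""
    words = email.lower().split()
    counts = Counter()
    # Generate N-Grams only for the phrase lengths that are actually needed.
    for n in {len(phrase.split()) for phrase in phrases}:
        counts.update(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return [counts[phrase] for phrase in phrases]


suspicious = phrase_vector("You're a winner click here now click here now")
ordinary = phrase_vector("the meeting is here now")
print(suspicious, ordinary)
```

Vectors like these, one per email and paired with phishing or non-phishing labels, are what the classifier would be trained on.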