Spam Detector: Regularized Natural Language Processing

less than 1 minute read

Implementation of an SMS spam detector based on logistic regression. This is the same idea behind sentiment analysis, but instead of predicting positive sentiment vs negative sentiment, you are going to predict whether a SMS text is spam or not.

The initial dataset looked as follows:

Labeling

Encoded the type column to be 1 for spam and 0 for ham and store the result in sms_spam2_df. After the processing, the label column was converted to binary.

TF-IDF and Feature Engineering

Created a pipeline that combines a Tokenizer, CounterVectorizer, and an IDF estimator to compute the tfidf vectors of each SMS. This pipeline was fitted and assigned a variable tfidf_pipeline. The Tokenizer step created a column for words, the CounterVectorizer step created a column tf, and the IDF step created a column tfidf.

Results after model implementation

Logistic regression using PySpark was implemented with various settings of regularization parameter ( $\lambda$ ) and elastic net mixture ( $\alpha$ )

Positive words:

Negative words:

Twitter Facebook Google+ LinkedIn

Advait Ramesh Iyer

Spam Detector: Regularized Natural Language Processing

You May Also Enjoy

Dynamic newsvendor model for Optimistic and Pessimistic policy-based profit forecasting

Primer for Linear Algebra

Character Recognition: Can Machine Learning Identify Human Written Characters?

Analysis of co-purchased products on Amazon