Email Spam Detection with Machine Learning

In this data science project, I will show you how to detect email spam using Python, natural language processing (NLP), and machine learning.

The program will classify each email as spam (1) or not spam (0).

Import the libraries:

In [1]:

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string

Load the data and print the first 5 rows:

In [3]:

df = pd.read_csv("data.csv")
df.head()

Out[3]:

	text	spam
0	Subject: naturally irresistible your corporate...	1
1	Subject: the stock trading gunslinger fanny i...	1
2	Subject: unbelievable new homes made easy im ...	1
3	Subject: 4 color printing special request add...	1
4	Subject: do not have money , get software cds ...	1

Now let’s explore the data and get the number of rows and columns:

In [4]:

df.shape

Out[4]:

(5728, 2)

To get the column names in the data set:

In [5]:

df.columns

Out[5]:

Index(['text', 'spam'], dtype='object')

To check for duplicates and remove them:

In [6]:

df.drop_duplicates(inplace=True)
print(df.shape)

(5695, 2)

To see the number of missing values in each column:

In [7]:

print(df.isnull().sum())

text    0
spam    0
dtype: int64

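Now download the stop words

Stop words, in natural language processing, are very common words (such as "the", "is", and "and") that carry little information for classification and are usually filtered out.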

In [8]:

# download the stopwords package
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/webtunix/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Out[8]:

True

Now create a function to clean the text and return the tokens. The cleaning is done in two steps: first remove punctuation, then remove the stop words.

In [9]:

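# build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def process(text):
    # remove punctuation characters
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    # remove stop words (case-insensitive) and return the remaining tokens
    clean = [word for word in nopunc.split() if word.lower() not in stop_words]
    return clean

# to show the tokenization
df['text'].head().apply(process)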

Out[9]:

0    [Subject, naturally, irresistible, corporate, ...
1    [Subject, stock, trading, gunslinger, fanny, m...
2    [Subject, unbelievable, new, homes, made, easy...
3    [Subject, 4, color, printing, special, request...
4    [Subject, money, get, software, cds, software,...
Name: text, dtype: object
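
To make the two cleaning steps concrete, here is a minimal, self-contained sketch that does not need the NLTK download. The `clean_text` name and the tiny `SAMPLE_STOP_WORDS` set are assumptions made for illustration only; the function above relies on NLTK's full English list instead.

```python
import string

# Hypothetical, tiny stop-word set used only for illustration;
# the article uses NLTK's full English stop-word list.
SAMPLE_STOP_WORDS = {"this", "is", "a", "the", "and", "to"}

def clean_text(text, stop_words=SAMPLE_STOP_WORDS):
    # Step 1: drop every punctuation character.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # Step 2: keep words whose lowercase form is not a stop word.
    return [word for word in no_punct.split() if word.lower() not in stop_words]

tokens = clean_text("Subject: This is a FREE offer, click to win!")
print(tokens)  # ['Subject', 'FREE', 'offer', 'click', 'win']
```

Note that the original casing of the kept words is preserved; only the stop-word comparison is case-insensitive.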

Now convert the text into a matrix of token counts:

In [10]:

from sklearn.feature_extraction.text import CountVectorizer
message = CountVectorizer(analyzer=process).fit_transform(df['text'])
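
As a rough illustration of what a matrix of token counts contains, the sketch below builds one by hand for two toy token lists. It uses plain Python rather than CountVectorizer, and the `docs`, `vocab`, and `matrix` names are hypothetical. The real result is a sparse matrix with one row per email and one column per distinct token.

```python
from collections import Counter

# Two toy "emails", already tokenized (made-up data for illustration).
docs = [["free", "offer", "free"], ["meeting", "offer"]]

# The vocabulary is the sorted set of all distinct tokens; each becomes a column.
vocab = sorted({tok for doc in docs for tok in doc})

# Each row counts how often every vocabulary term occurs in one document.
matrix = [[Counter(doc)[term] for term in vocab] for doc in docs]

print(vocab)   # ['free', 'meeting', 'offer']
print(matrix)  # [[2, 0, 1], [0, 1, 1]]
```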

Now we need to split the data into training and testing sets. We hold out 20% of the rows for testing, so that later we can check whether the predictions match the actual values on emails the classifier was not trained on.

In [11]:

#split the data into 80% training and 20% testing
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(message, df['spam'], test_size=0.20, random_state=0)
# To see the shape of the data
print(message.shape)

(5695, 37229)

Now we need to create and train the Multinomial Naive Bayes classifier, which is suitable for classification with discrete features such as word counts.

In [12]:

# create and train the Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(xtrain, ytrain)
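
For intuition, the following is a simplified sketch of the idea behind multinomial Naive Bayes, written in plain Python. It is not scikit-learn's implementation: the function names, the toy data, and the choice to skip words unseen in training are all assumptions made for illustration. Each class gets a log prior plus Laplace-smoothed log likelihoods for every word, and the class with the highest total score wins.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit class priors and Laplace-smoothed word likelihoods from token lists."""
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        # Denominator: all tokens seen in this class plus alpha per vocabulary word.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
            },
        }
    return model

def predict(model, doc):
    """Return the label with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, params in model.items():
        scores[label] = params["log_prior"] + sum(
            params["log_likelihood"][w] for w in doc if w in params["log_likelihood"]
        )
    return max(scores, key=scores.get)

# Made-up toy data: 1 = spam, 0 = not spam.
train_docs = [["free", "offer", "win"], ["win", "money", "free"],
              ["meeting", "schedule"], ["project", "meeting", "notes"]]
train_labels = [1, 1, 0, 0]
toy_model = train_multinomial_nb(train_docs, train_labels)

print(predict(toy_model, ["free", "money"]))     # 1
print(predict(toy_model, ["meeting", "notes"]))  # 0
```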

To see the classifier's predictions and the actual values on the training data set:

In [13]:

print(classifier.predict(xtrain))
print(ytrain.values)

[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]

Now let’s see how well our model performed on the training data by printing the classification report, confusion matrix, and accuracy score:

In [14]:

# Evaluating the model on the training data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtrain)
print(classification_report(ytrain, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytrain, pred))
print("Accuracy: \n", accuracy_score(ytrain, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3457
           1       0.99      1.00      0.99      1099

    accuracy                           1.00      4556
   macro avg       0.99      1.00      1.00      4556
weighted avg       1.00      1.00      1.00      4556


Confusion Matrix: 
 [[3445   12]
 [   1 1098]]
Accuracy: 
 0.9971466198419666

The model is 99.71% accurate on the training data. Let’s now try it on the test data set (xtest & ytest) by printing the predicted values and the actual values to see if the model can accurately classify email text.

In [15]:

#print the predictions
print(classifier.predict(xtest))
#print the actual values
print(ytest.values)

[1 0 0 ... 0 0 0]
[1 0 0 ... 0 0 0]

Now let’s evaluate the model on the test data set:

In [16]:

# Evaluating the model on the test data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtest)
print(classification_report(ytest, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytest, pred))
print("Accuracy: \n", accuracy_score(ytest, pred))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       870
           1       0.97      1.00      0.98       269

    accuracy                           0.99      1139
   macro avg       0.98      0.99      0.99      1139
weighted avg       0.99      0.99      0.99      1139


Confusion Matrix: 
 [[862   8]
 [  1 268]]
Accuracy: 
 0.9920983318700615
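
The figures in the reports can be checked by hand from the confusion matrices. The sketch below assumes scikit-learn's layout for a binary problem, `[[TN, FP], [FN, TP]]`, with spam (1) as the positive class; `binary_metrics` is a hypothetical helper name, and the two matrices are copied from the outputs above.

```python
def binary_metrics(cm):
    """Accuracy plus spam-class precision and recall from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)  # of the emails flagged as spam, how many were spam
    recall = tp / (tp + fn)     # of the actual spam emails, how many were flagged
    return accuracy, precision, recall

train_cm = [[3445, 12], [1, 1098]]
test_cm = [[862, 8], [1, 268]]

for name, cm in (("train", train_cm), ("test", test_cm)):
    acc, prec, rec = binary_metrics(cm)
    print(f"{name}: accuracy={acc:.4f} precision={prec:.2f} recall={rec:.2f}")
```

On the test set, 8 legitimate emails were flagged as spam and 1 spam email was missed, which is why spam precision (0.97) is lower than spam recall (1.00).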

The classifier identified the email messages as spam or not spam with 99.2% accuracy on the test data.
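
To classify a new email, its text has to be converted with the same fitted vectorizer that produced the training matrix. The walkthrough above does not keep that vectorizer in a variable, so one possible approach, sketched here under assumptions, is to bundle the vectorizer and classifier with scikit-learn's `make_pipeline`. The toy texts and the `spam_model` name are hypothetical, and this sketch uses CountVectorizer's default tokenizer rather than the `process` function.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['text'] and df['spam'].
toy_texts = [
    "win free money now",
    "free offer click now",
    "meeting agenda for monday",
    "project notes from the meeting",
]
toy_labels = [1, 1, 0, 0]

# The pipeline stores the fitted vectorizer together with the classifier,
# so new text is mapped to the same columns the model was trained on.
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(toy_texts, toy_labels)

new_emails = ["free money offer", "monday project meeting"]
predictions = spam_model.predict(new_emails)
print(predictions)  # [1 0]
```

Fitting the pipeline on the training rows only also keeps the test emails out of the vocabulary, which gives a slightly stricter evaluation than vectorizing the whole data set before splitting.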
