Fake News Detection Using Machine Learning

Project Objective: Today it is difficult to tell whether the news we come across is real, because misleading content circulates widely. Machine learning can help judge the authenticity of a given news item and classify it as real or fake. Without such checks, news items containing false or exaggerated claims may be amplified by recommendation algorithms until they go viral, and users may end up in a filter bubble.

Passive Aggressive Classifier:

The Passive Aggressive Classifier belongs to the family of online learning algorithms in machine learning, in which a system is trained incrementally by feeding it instances sequentially, either individually or in small groups called mini-batches. Simply put, it remains passive (leaves the model unchanged) when a prediction is correct and responds aggressively (updates the model) when a prediction is wrong. Now let's see how to use the Passive Aggressive Classifier in Python.
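
Before turning to the library, the passive/aggressive behaviour can be made concrete with a minimal NumPy sketch of the basic update rule for labels in {-1, +1}. This is an illustration under simplifying assumptions, not scikit-learn's implementation: the helper name pa_update and the toy stream are hypothetical, and the regularization parameter C is left out.

```python
import numpy as np

def pa_update(w, x, y):
    """One basic Passive-Aggressive step for a label y in {-1, +1} (hypothetical helper)."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on this single example
    if loss == 0.0:
        return w                              # passive: margin satisfied, weights unchanged
    tau = loss / np.dot(x, x)                 # aggressive: smallest step that restores the margin
    return w + tau * y * x

# Toy stream of (features, label) pairs, fed one instance at a time
stream = [
    (np.array([1.0, 0.0]), 1),
    (np.array([0.0, 1.0]), -1),
    (np.array([1.0, 1.0]), 1),
]

w = np.zeros(2)
for x, y in stream:
    w = pa_update(w, x, y)
    print(w)
```

Each instance is seen once and then discarded, which is what makes the method suitable for online learning. Feeding the last instance a second time leaves the weights untouched, because its margin is already satisfied.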

Importing the libraries

Start this task by importing the necessary Python libraries:

In [1]:
import numpy as np
import pandas as pd
import itertools

Importing the dataset

Now we read the dataset from a CSV file (the code below loads it as news.csv). We will use this dataset to predict whether a given news item is real or fake. The loaded DataFrame has 3 columns, id, title, and label (which tells whether the news is fake or real), and 20800 rows, one per entry.

In [2]:
#Read the data
df=pd.read_csv('news.csv')
#Get shape and head
df.shape
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 3 columns):
id       20800 non-null int64
title    20242 non-null object
label    20800 non-null object
dtypes: int64(1), object(2)
memory usage: 487.6+ KB
Out[2]:
id title label
0 0 House Dem Aide: We Didn’t Even See Comey’s Let... 1
1 1 FLYNN: Hillary Clinton, Big Woman on Campus - ... 0
2 2 Why the Truth Might Get You Fired 1
3 3 15 Civilians Killed In Single US Airstrike Hav... 1
4 4 Iranian woman jailed for fictional unpublished... 1

Get the labels from the DataFrame.

A label value of 1 means the news is real; a value of 0 means it is fake.

In [3]:
#Get the labels
labels=df.label
labels.head()
Out[3]:
0    1
1    0
2    1
3    1
4    1
Name: label, dtype: object

Execute the following code to divide our data into training and test sets

In [4]:
from sklearn.model_selection import train_test_split
#Split the dataset
x_train,x_test,y_train,y_test=train_test_split(df['title'].values.astype('U'), labels, test_size=0.2)
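
The split above is random, so every run produces a different partition and a slightly different accuracy. As a sketch of one way to make it repeatable, the example below fixes the seed with random_state and keeps the class ratio equal in both parts with stratify; both are standard train_test_split parameters. The toy titles and labels are made up for illustration and are not part of the original notebook.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for df['title'] and labels (hypothetical data)
titles = [f"headline {i}" for i in range(10)]
toy_labels = [0, 1] * 5

x_tr, x_te, y_tr, y_te = train_test_split(
    titles, toy_labels,
    test_size=0.2,
    random_state=42,      # fixed seed: the same split on every run
    stratify=toy_labels,  # keep the real/fake ratio equal in both parts
)
print(len(x_tr), len(x_te))
print(sorted(y_te))
```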

Fit and transform the vectorizer on the train set, and transform the test set with the fitted vectorizer.

Initialize a TfidfVectorizer with stop words from the English language and a maximum document frequency of 0.7 (terms with a higher document frequency will be discarded). Stop words are the most common words in a language that are to be filtered out before processing the natural language data. And a TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features.

TfidfVectorizer

TF (Term Frequency): The number of times a word appears in a document is its Term Frequency. A higher value means a term appears more often than others, and so, the document is a good match when the term is part of the search terms.

IDF (Inverse Document Frequency): Words that occur many times in a document, but also occur in many other documents, may be irrelevant. IDF measures how significant a term is across the entire corpus.
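
The two definitions can be worked through by hand on a tiny corpus. The sketch below uses the textbook formulas, tf = count / document length and idf = log(N / document frequency); the corpus and the helper tf_idf are hypothetical. TfidfVectorizer by default applies a smoothed IDF and L2 normalization, so its numbers will differ from these, but the ranking idea is the same.

```python
import math
from collections import Counter

docs = [
    "fake news spreads fast",
    "real news spreads slowly",
    "breaking news today",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF of a term within one tokenized document (hypothetical helper)."""
    tf = tokens.count(term) / len(tokens)      # share of the document taken by the term
    idf = math.log(n_docs / doc_freq[term])    # rarer across the corpus means larger IDF
    return tf * idf

first = tokenized[0]
for term in first:
    print(term, round(tf_idf(term, first), 3))
```

The word "news" occurs in every document, so its IDF and therefore its score is zero, while "fake", which occurs in only one document, scores highest.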

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
#Initialize a TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

#Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train)
tfidf_test=tfidf_vectorizer.transform(x_test)
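
To see what stop_words='english' and max_df=0.7 do in practice, here is a small sketch on a made-up corpus (the headlines are hypothetical). It also shows why the test set is only transformed: the vocabulary is fixed at fit time, and words outside it are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the senate vote shocks the nation",
    "the senate passes the budget",
    "the senate debates the budget",
    "celebrity diet shocks doctors",
]
vec = TfidfVectorizer(stop_words='english', max_df=0.7)
matrix = vec.fit_transform(corpus)

# "the" is an English stop word; "senate" is in 3 of 4 documents (0.75 > 0.7)
print(sorted(vec.vocabulary_))
print(matrix.shape)

# Unseen text is mapped onto the fitted vocabulary; unknown or dropped words vanish
unseen = vec.transform(["senate aliens"])
print(unseen.nnz)
```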

Initialize a PassiveAggressiveClassifier

We’ll fit this on tfidf_train and y_train. Then, we’ll predict on the test set from the TfidfVectorizer and calculate the accuracy with accuracy_score() from sklearn.metrics.

In [6]:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
#Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)
Out[6]:
PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
                            early_stopping=False, fit_intercept=True,
                            loss='hinge', max_iter=50, n_iter_no_change=5,
                            n_jobs=None, random_state=None, shuffle=True,
                            tol=0.001, validation_fraction=0.1, verbose=0,
                            warm_start=False)
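
Once a model is fitted, classifying a new headline requires applying the same vectorizer first. One way to keep the two steps together is a scikit-learn Pipeline; the sketch below is an optional variation on the notebook's approach, trained on four made-up headlines (hypothetical data, using the same 1 = real, 0 = fake convention), so its prediction carries no real meaning.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Tiny made-up training set (hypothetical headlines; 1 = real, 0 = fake)
toy_titles = [
    "senate passes annual budget after long debate",
    "central bank holds rates steady",
    "aliens endorse candidate in secret moon meeting",
    "miracle pill cures every disease overnight",
]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in one object
model = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    PassiveAggressiveClassifier(max_iter=50, random_state=0),
)
model.fit(toy_titles, toy_labels)

# Raw text goes in; the fitted vectorizer is applied automatically
prediction = model.predict(["senate debate on budget continues"])
print(prediction)
```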

Predict the label for news and find accuracy

We got an accuracy of approximately 92% with this model. Finally, let's build a confusion matrix to gain insight into the number of true and false positives and negatives.

In [7]:
#Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print('Accuracy:',score)

from sklearn.metrics import confusion_matrix
# Build confusion matrix
data=confusion_matrix(y_test,y_pred)
Accuracy: 0.9225961538461539

Now we have y_pred, the values predicted by our model, and y_test, the actual values. Let us compare them to see how well the model did. As the table below shows, our basic model did pretty well.

In [8]:
df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
df
Out[8]:
Actual Predicted
19643 0 0
713 1 1
17459 1 1
7145 0 0
19021 1 1
6132 1 1
15590 0 1
549 0 1
2239 0 0
5496 0 0
4576 0 0
19009 1 1
4358 1 1
14958 0 0
2252 1 1
11183 0 0
20275 1 1
19753 0 0
9275 1 1
189 1 1
17820 0 0
7875 0 0
13868 0 0
20623 0 0
11065 0 0
1997 0 0
1629 0 0
10534 1 1
2180 1 1
689 1 1
... ... ...
3943 1 1
9561 0 0
14727 0 0
17368 0 0
4736 1 1
19724 0 0
12463 1 0
17524 1 1
9893 1 0
11086 1 1
2043 1 1
3390 1 1
7176 0 0
3931 0 0
19954 0 0
18146 0 0
8108 0 0
9441 1 1
16158 0 1
19078 0 0
12413 1 1
4224 0 0
9240 0 0
1878 1 1
19288 1 1
17272 1 1
6029 0 0
1103 1 1
4612 0 0
19297 0 0

4160 rows × 2 columns
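
Scanning thousands of rows by eye does not scale, which is where the confusion matrix helps. The sketch below shows how its four cells can be read and how accuracy follows from them; the label lists are made up for illustration and are not the notebook's results.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (1 = real, 0 = fake)
actual    = [0, 1, 1, 0, 1, 0, 0, 1]
predicted = [0, 1, 1, 1, 1, 0, 0, 0]

# For labels [0, 1] the matrix is laid out as [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()

# Accuracy is the share of predictions on the diagonal
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, accuracy)
```

Here a false positive is a fake item predicted as real, and a false negative is a real item predicted as fake.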

Conclusion:

We learned to detect fake news with Python. We took a political dataset, implemented a TfidfVectorizer, initialized a PassiveAggressiveClassifier, and fit our model. We ended up with an accuracy of approximately 92%.