Fake News Detection Using Machine Learning in Python
In this tutorial, we will learn how to build a fake news detector using machine learning in Python. I will discuss the basic steps of this machine learning problem and how to approach it.
For the fake news predictor, we are going to use Natural Language Processing (NLP).
Importing Libraries
First, we import the libraries we need, such as NumPy, Pandas, and Seaborn.
import numpy as np
import pandas as pd
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
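One note before loading the data: NLTK's stopword list is downloaded separately from the library itself, so if it is not already present on your machine, the stopword lookup used later will raise a LookupError. Downloading it once fixes that:

import nltk
nltk.download('stopwords')  #one-time download of the stopword corpus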
The Dataset:
Here are the links to the datasets: test.csv, train.csv
data_train = pd.read_csv("train.csv")
print("Data shape = ", data_train.shape)
data_train.head()
Output:
| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |
Dropping the columns we do not need:
data_train = data_train.drop(['location', 'keyword'], axis=1)
print("location and keyword columns dropped successfully")
data_train = data_train.drop('id', axis=1)
print("id column dropped successfully")
data_train.columns
Output:
location and keyword columns dropped successfully
id column dropped successfully
Index(['text', 'target'], dtype='object')
Creating the corpus, a standard NLP preprocessing step:
corpus = []
pstem = PorterStemmer()
for i in range(data_train['text'].shape[0]):
    #Remove everything except letters
    tweet = re.sub("[^a-zA-Z]", ' ', data_train['text'][i])
    #Transform words to lowercase
    tweet = tweet.lower()
    tweet = tweet.split()
    #Remove stopwords, then stem each remaining word
    tweet = [pstem.stem(word) for word in tweet if not word in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    #Append the cleaned tweet to the corpus
    corpus.append(tweet)
print("Corpus created successfully")
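To see what the cleaning actually does, it helps to print one raw tweet next to its cleaned version. This is just a spot check, and row 3 is an arbitrary choice; the cleaned text should come out lowercase and stemmed, with punctuation, numbers, and stopwords removed:

#Compare a raw tweet with its cleaned, stemmed counterpart
print(data_train['text'][3])
print(corpus[3])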
#Create our dictionary of word frequencies
uniqueWordFrequents = {}
for tweet in corpus:
    for word in tweet.split():
        if word in uniqueWordFrequents.keys():
            uniqueWordFrequents[word] += 1
        else:
            uniqueWordFrequents[word] = 1

#Convert the dictionary to a DataFrame
uniqueWordFrequents = pd.DataFrame.from_dict(uniqueWordFrequents, orient='index', columns=['Word Frequent'])
uniqueWordFrequents.sort_values(by=['Word Frequent'], inplace=True, ascending=False)
uniqueWordFrequents.head(10)
Output:
| | Word Frequent |
|---|---|
| co | 4746 |
| http | 4721 |
| like | 411 |
| fire | 363 |
| amp | 344 |
| get | 311 |
| bomb | 239 |
| new | 228 |
| via | 220 |
| u | 216 |
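As a side note, the counting loop above can be written more compactly with collections.Counter from the Python standard library. This is an equivalent sketch of the same counting step, not the code the tutorial uses:

from collections import Counter

#Count every word across all cleaned tweets in one pass
wordCounts = Counter(word for tweet in corpus for word in tweet.split())
uniqueWordFrequents = pd.DataFrame.from_dict(wordCounts, orient='index', columns=['Word Frequent'])
uniqueWordFrequents.sort_values(by=['Word Frequent'], inplace=True, ascending=False)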
uniqueWordFrequents['Word Frequent'].unique()
Output:
array([4746, 4721, 411, 363, 344, 311, 239, 228, 220, 216, 213, 210, 209, 201, 183, 181, 180, 178, 175, 169, 166, 164, 162, 156, 155, 153, 151, 145, 144, 143, 137, 133, 132, 131, 130, 129, 128, 125, 124, 123, 122, 121, 120, 119, 118, 117, 116, 114, 111, 110, 109, 108, 106, 105, 104, 103, 102, 101, 100, 99, 98, 97, 96, 95, 94, 93, 91, 90, 89, 88, 87, 86, 84, 83, 82, 79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
The unique() call above shows the distinct frequency values, which helps in choosing a cutoff. We keep only the words that occur at least 20 times, so that rare words do not inflate the feature space:

uniqueWordFrequents = uniqueWordFrequents[uniqueWordFrequents['Word Frequent'] >= 20]
print(uniqueWordFrequents.shape)
uniqueWordFrequents
Output: the shape of the filtered DataFrame, followed by the words that occur at least 20 times each.
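The imports at the top already hint at the remaining steps: vectorize the cleaned tweets with CountVectorizer, split the data into training and test sets, fit a MultinomialNB classifier, and score it with f1_score. The walkthrough stops before those steps, so here is a minimal sketch of how they would typically fit together; max_features, test_size, and random_state are assumed values, not ones fixed by the tutorial:

#Turn the cleaned tweets into count vectors (max_features is an assumed value)
counVec = CountVectorizer(max_features=len(uniqueWordFrequents))
X = counVec.fit_transform(corpus).toarray()
y = data_train['target']

#Hold out part of the data for evaluation (test_size and random_state are assumptions)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Fit a Multinomial Naive Bayes model and evaluate it with the F1 score
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("F1 score = ", f1_score(y_test, y_pred))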