News category prediction using Natural language processing [NLP]
In this tutorial, we will work on the news articles dataset and categorize the articles based on the content. So let’s learn how to predict news category using NLP (Natural Language Processing) with Python.
The dataset we will be using is:
News category prediction using NLP Dataset zip file – Download Dataset
News category prediction using NLP in Python using scikit-learn
First, we will start by importing the required libraries:
%matplotlib inline
import re

import joblib  # sklearn.externals.joblib was removed from newer scikit-learn releases
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from textblob import TextBlob

# NLTK: the 'stopwords' and 'wordnet' corpora must be fetched once with nltk.download()
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, LancasterStemmer
from nltk.stem.porter import PorterStemmer

# scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Keras
from keras import utils as np_utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding, Input, LSTM
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.optimizers import RMSprop, Adam
from keras.callbacks import EarlyStopping

# English stop words, built once and reused during cleaning
stop_words = set(stopwords.words('english'))
Next, we import the dataset:
df = pd.read_excel("Data_Train.xlsx")
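Before cleaning, it can help to confirm that the expected columns are present and to see how many articles fall into each category. The sketch below uses a small made-up DataFrame in place of Data_Train.xlsx. The column names STORY and SECTION are the ones used later in this tutorial, while the sample rows and the integer labels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel("Data_Train.xlsx")
sample_df = pd.DataFrame({
    "STORY": [
        "The team won the final match.",
        "Shares fell sharply on Monday.",
        "A new phone was launched today.",
        "The striker scored twice.",
    ],
    "SECTION": [0, 1, 2, 0],  # made-up category labels
})

# Rows and columns, total missing values, and articles per category
shape = sample_df.shape
missing = int(sample_df.isnull().sum().sum())
class_counts = sample_df["SECTION"].value_counts().to_dict()

print(shape)
print(missing)
print(class_counts)
```

On the real data, the same three checks can be run directly on df.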
Articles contain many frequently repeated words such as "a" and "the", along with other prepositions and connectors. Because these words appear in almost every article, they say little about its category, so we remove them. For this, we write a function to clean the articles. Cleaning removes punctuation marks and stop words, expands common contractions, and strips apostrophes. The text is converted to lowercase so that uppercase and lowercase forms of the same word are not treated as different words. Finally, we apply lemmatization, which groups the different inflected forms of a word so that they can be analyzed as a single term.
def clean_text(text):
    text = text.lower()
    # Expand common contractions before punctuation is stripped
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"'scuse", " excuse ", text)  # must run before the generic 's rule
    text = re.sub(r"'s", " ", text)
    text = re.sub(r"'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"'re", " are ", text)
    text = re.sub(r"'d", " would ", text)
    text = re.sub(r"'ll", " will ", text)
    # Replace non-word characters, collapse whitespace, then keep letters only
    text = re.sub(r"\W", " ", text)
    text = re.sub(r"\s+", " ", text)
    text = text.strip(' ')
    text = re.sub(r"[^a-zA-Z]", " ", text)
    words = text.split()
    lemmatizer = WordNetLemmatizer()
    # Reuse the stop_words set built with the imports instead of rebuilding it for every word
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)
We now apply this method to the text in the data frame in order to get the relevant information.
df['STORY'] = df['STORY'].map(clean_text)
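To see what the cleaning step does to a single sentence, here is a minimal, standard-library-only sketch. It is not the function used in this tutorial: the tiny stop-word set is a hand-picked stand-in for NLTK's English list, the function name toy_clean is hypothetical, and lemmatization is left out so the example runs without downloading any corpora.

```python
import re

# Tiny hand-picked stop-word set, a stand-in for NLTK's English list
TOY_STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "not"}

def toy_clean(text):
    """Lowercase, expand a few contractions, drop non-letters and stop words."""
    text = text.lower()
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"'s", " ", text)
    text = re.sub(r"[^a-z]", " ", text)  # keep letters only
    words = [w for w in text.split() if w not in TOY_STOP_WORDS]
    return " ".join(words)

cleaned = toy_clean("The Minister's speech isn't in the news, and markets rallied 3%!")
print(cleaned)  # minister speech news markets rallied
```

The punctuation, the digit, and the stop words are gone, and only the words that carry topical information remain.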
Now, we will split the dataset into training and test sets so that we can train the model and validate it on the test set:
train, test = train_test_split(df, random_state=42, test_size=0.2)
x_train = train.STORY
x_test = test.STORY
y_train = train.SECTION
y_test = test.SECTION
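As a quick check of what test_size=0.2 produces, the sketch below runs the same kind of split on a made-up DataFrame of ten articles. The stratify argument is an optional addition that is not part of the tutorial's code. It asks scikit-learn to keep the category proportions the same in both parts, which may be worth considering if some categories are rare.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 articles, two categories with 5 articles each
toy_df = pd.DataFrame({
    "STORY": [f"article {i}" for i in range(10)],
    "SECTION": [0, 1] * 5,
})

# 80/20 split; stratify keeps the class ratio equal in both parts
toy_train, toy_test = train_test_split(
    toy_df, random_state=42, test_size=0.2, stratify=toy_df["SECTION"]
)

print(len(toy_train), len(toy_test))  # 8 2
print(sorted(toy_test["SECTION"].tolist()))  # [0, 1]
```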
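After splitting, we convert the text into a TF-IDF matrix, in which each row is an article and each column is a word, weighted by how often the word occurs in that article and how rare it is across all articles. The vectorizer is fitted on the training set only and then reused to transform the test set.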
vectorizer = TfidfVectorizer(max_df=0.9, min_df=1, stop_words='english')
train_vectors = vectorizer.fit_transform(x_train)
test_vectors = vectorizer.transform(x_test)
total_vectors = vectorizer.transform(df['STORY'])
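The difference between fit_transform and transform is easier to see on a toy corpus. The sketch below uses three made-up training sentences and one made-up test sentence with default vectorizer settings. The variable names are illustrative and not part of the tutorial's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train_docs = [
    "market shares fall",
    "team wins match",
    "market rally lifts shares",
]
toy_test_docs = ["team shares trophy"]  # "trophy" never appears in training

toy_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training text
toy_train_vectors = toy_vectorizer.fit_transform(toy_train_docs)
# transform reuses them, so unseen words are simply ignored
toy_test_vectors = toy_vectorizer.transform(toy_test_docs)

print(toy_train_vectors.shape)  # (3, 8): 3 documents, 8 distinct words
print(toy_test_vectors.shape)   # (1, 8): same columns as the training matrix
print("trophy" in toy_vectorizer.vocabulary_)  # False
```

Because the test matrix has the same columns as the training matrix, a classifier trained on one can score the other.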
Creating a classifier to categorize the articles:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier()
mlp.fit(train_vectors, y_train)
mlp_prediction = mlp.predict(test_vectors)
accuracy_score(y_test, mlp_prediction)
Output: fitting the model echoes its default hyperparameters, and the last call returns the accuracy of the model on the test set.
Out[8]: MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
        beta_2=0.999, early_stopping=False, epsilon=1e-08, hidden_layer_sizes=(100,),
        learning_rate='constant', learning_rate_init=0.001, max_iter=200, momentum=0.9,
        n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5, random_state=None,
        shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False,
        warm_start=False)
Out[9]: 0.9796854521625163
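A single accuracy figure does not show which categories get confused with each other. One way to look closer is a confusion matrix, sketched below on made-up labels. The toy_true and toy_pred lists are illustrative assumptions and not results from this dataset.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for 8 articles in 3 categories
toy_true = [0, 0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 0, 1, 1, 1, 2, 0, 2]

toy_accuracy = accuracy_score(toy_true, toy_pred)
# Rows are true categories, columns are predicted categories
toy_matrix = confusion_matrix(toy_true, toy_pred)

print(toy_accuracy)  # 0.75
print(toy_matrix)
```

In this tutorial's setting, the equivalent call would be confusion_matrix(y_test, mlp_prediction).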