Natural Language Processing and its implementation in Python

In the previous tutorial on Natural Language Processing, I have discussed the basic introduction and intuition behind  NLP. In this tutorial, I am going to discuss the implementation of Natural Language Processing on basic problem in python.

Problem Statement:

There is a file (.tsv format file) which contains thousands of reviews for a restaurant. We have to classify the review is good or bad according to words present in the review.

This is the link to file that I have used Restaurant_Reviews.tsv.

Natural Language Processing and its implementation :

So, this is a step by step guide to basic application of NLP i.e. text processing in python.

  • First, import the data set on which we have to apply the text processing. Now since I have a .tsv file, I have taken delimiter as “\t”.
       data = pd.read_csv('Restaurant_Reviews.tsv', quoting = 3, delimiter = '\t')

Restaurant Reviews

  • Next step is to clean the data set that we have taken. Data cleaning involves removing stopwords, stemming,  etc. First of all, I have imported all the necessary libraries. Next, we make an array corpus. Then a loop for a thousand reviews is there. Under this for loop, sentence is broken into words. Then for each word in review, we will check if it is stopword or not. For the former case, we will append the word in the review. Basically what we are doing is removing stop words from the review. These words are then put into corpus array.
       import re
       import nltk'stopwords')
       from nltk.corpus import stopwords
       from nltk.stem.porter import PorterStemmer
       corpus = []
       for i in range(0, 1000):
           reviews = re.sub('[^a-zA-Z]', ' ', data['Review'][i])
           reviews = review.lower()
           reviews = review.split()
           porter = PorterStemmer()
           review = [porter.stem(word) for word in reviews if not word in set(stopwords.words('english'))]
           reviews = ' '.join(reviews)

Restaurant Reviews Python

  • Next step is creating a bag of words model. This model basically consists of all unique words present in reviews. This is our complete data set after the preprocessing step.
       from sklearn.feature_extraction.text import CountVectorizer
       cv = CountVectorizer(max_features = 1500)
       X = cv.fit_transform(corpus).toarray()
       y = data.iloc[:, 1].values
  • Next, we have to split our data set into training and testing sets.
       from sklearn.cross_validation import train_test_split
       X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

Restaurant Reviews 3 Python

Restaurant Reviews Python

  •  Apply algorithm of your choice to fit the model. For the sake of example, I took naive bayes algorithm to fit the model.
  •   from sklearn.naive_bayes import GaussianNB
      classifier = GaussianNB(), y_train)
  • Next step is applying the model to the testing set to predict the results.
       y_pred = classifier.predict(X_test)

Restaurant Reviews Python 5

  • Next step is to evaluate the performance of your model i.e. examining how much is it predicting accurately and how much wrong.
       from sklearn.metrics import confusion_matrix
       cm = confusion_matrix(y_test, y_pred)

Restaurant Reviews Python 5

Here, I would like to end this tutorial. If you have any doubts, feel free to post your doubts in the comments section.

If you are not clear with this code, read the intuition behind NLP which is here Introduction to Natural Language Processing- NLP

Also, give a read to Understanding Support vector machine(SVM)


Leave a Reply

Your email address will not be published. Required fields are marked *