Build deep neural network for custom NER with Keras

In this tutorial, we are going to learn to perform Named Entity Recognition (NER). It is the very first step towards information extraction in the world of NLP, and one of the most common NLP tasks: locating and classifying entities in a corpus, such as names of people, organizations, and locations, quantities, percentages, etc.

Today we are going to build a custom NER model using a deep neural network with the Keras Python module. For this problem we are going to use the Bi-LSTM layer that comes predefined in Keras and the CRF layer from the keras_contrib add-on library. The model will then be trained on labeled data and evaluated on test data.

Custom NER using Deep Neural Network with Keras in Python

Named Entity Recognition is a subtask of information extraction that identifies and categorizes the key entities in a text. The entities can be names of people or organizations, places, brands, etc. For example, “Codespeedy” in a text can be classified as a Company, and so on.
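To make this concrete, here is a tiny illustrative example of the kind of output an NER system produces, with each token paired with a tag. The tag names here (such as “B-org” for the beginning of an organization name) follow the common BIO scheme and are assumptions for illustration; the exact tag set depends on the dataset:

[('Codespeedy', 'B-org'), ('was', 'O'), ('founded', 'O'), ('in', 'O'), ('India', 'B-geo')]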

So let’s get to the implementation of NER now…

First, we are going to import some important libraries.

import numpy as np 
import pandas as pd
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from keras.callbacks import ModelCheckpoint
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
import keras as k
from keras_contrib.layers import CRF
from sklearn.metrics import f1_score, classification_report

For this particular problem, I have loaded the dataset from Kaggle.

So let’s read our dataset into a data frame:

# Read the dataset, skipping malformed lines
df = pd.read_csv("ner.csv", encoding="ISO-8859-1", error_bad_lines=False, index_col=0)
df.head()

Next, select only the necessary columns: we will use the “sentence_idx”, “word”, and “tag” columns for this problem.

data = df[['sentence_idx','word','tag']]
data.head()

The data frame has a label for each word of each sentence; therefore, we first need to group the data frame by “sentence_idx” and create a list of lists of (word, tag) tuples.

For example:

[('Thousands', 'O'), ('of', 'O'), ('demonstrators', 'O')]

Therefore we are going to create a class “SentenceGetter” that we will use to get our sentences in this format.

class SentenceGetter(object):
    """Groups the data frame by sentence and hands the sentences back one by one."""

    def __init__(self, dataset):
        self.n_sent = 1
        self.dataset = dataset
        self.empty = False
        # Turn each sentence group into a list of (word, tag) tuples
        agg_func = lambda s: [(w, t) for w, t in zip(s["word"].values.tolist(),
                                                     s["tag"].values.tolist())]
        self.grouped = self.dataset.groupby("sentence_idx").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        try:
            s = self.grouped[self.n_sent]
            self.n_sent += 1
            return s
        except KeyError:
            return None

getter = SentenceGetter(data)
sentences = getter.sentences
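The get_next() method walks through the sentences one at a time; here we simply grab all of them at once via getter.sentences. As a quick check (output matching the example shown above):

print(sentences[0][:3])
# [('Thousands', 'O'), ('of', 'O'), ('demonstrators', 'O')]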

Let’s clean up our tag values and then create the dictionaries word2idx (word to index), tag2idx (tag to index), and idx2tag (index to tag).

# Replace missing (NaN) tag values with a placeholder 'unk' tag
tags = []
for tag in set(data['tag'].values):
    if pd.isna(tag):
        tags.append('unk')
    else:
        tags.append(tag)

words = list(set(data['word']))
num_words = len(words)
num_tags = len(tags)

# Mappings between words/tags and integer indices
word2idx = {w: i for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}
idx2tag = {i: t for t, i in tag2idx.items()}

Next, we will use the above dictionaries for mapping each word and tag in a sentence to a number because our model only understands numeric representation.

But first, we need to define the maximum length of the sequence. After that, we use the pad_sequences method to pad sentences shorter than that length with ‘0’. By default, padding=’pre’, i.e. the zeros are added at the front of the sequence.
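As a quick illustration with toy sequences (not from our dataset), using the pad_sequences we imported earlier:

demo = pad_sequences(sequences=[[5, 3], [7, 2, 9]], maxlen=4)
print(demo)
# [[0 0 5 3]
#  [0 7 2 9]]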

# The longest sentence in the dataset defines the sequence length
maxlen = max([len(s) for s in sentences])

# Encode the words and pad every sentence to the same length
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=maxlen, sequences=X)

# Encode the tags the same way
y = [[tag2idx[w[1]] for w in s] for s in sentences]
y = pad_sequences(maxlen=maxlen, sequences=y)

# One-hot encode the tag indices
y = [to_categorical(i, num_classes=num_tags) for i in y]

In our case, the sequence length is 140. We then one-hot encode the tag values, which form our target variable. Since we need to classify the tags, we use the “to_categorical” method for the encoding; it requires the number of classes, which is our number of tags (num_tags).
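As a toy illustration of to_categorical (the tag indices here are made up), each index becomes a one-hot row:

print(to_categorical([1, 0, 2], num_classes=3))
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]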

Now that our data is ready to be trained, split it into train and test sets using the following code:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
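As a quick sanity check of the shapes (the sentence counts depend on your split; the other dimensions follow from the sequence length and num_tags):

print(X_train.shape)            # (number_of_train_sentences, 140)
print(np.array(y_train).shape)  # (number_of_train_sentences, 140, num_tags)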


Let’s build our Neural Network for NER…

First of all, we will use an embedding layer to get vector representations of all the words. Here we use 150-dimensional representations: the input is a sequence of integers representing words, and the embedding layer transforms each word into a 150-dimensional vector.

# Input layer: each example is a padded sequence of word indices
input_layer = Input(shape=(maxlen,))  # maxlen is 140 for this dataset
word_embedding_size = 150

# Embedding Layer
model = Embedding(input_dim=num_words, output_dim=word_embedding_size, input_length=maxlen)(input_layer)

On top of the embedding layer, we are going to add a Bi-LSTM layer. The Bi-LSTM layer expects a sequence of vectors as input. A plain LSTM layer processes the sequence only in the forward direction, whereas the Bi-LSTM layer reads the input in both the forward and the backward direction, thus improving the predictive capability of our NER model.

# Bidirectional LSTM reads the sequence in both directions
model = Bidirectional(LSTM(units=word_embedding_size,
                           return_sequences=True,
                           dropout=0.5,
                           recurrent_dropout=0.5,
                           kernel_initializer=k.initializers.he_normal()))(model)
# A second, unidirectional LSTM on top of the Bi-LSTM output
model = LSTM(units=word_embedding_size * 2,
             return_sequences=True,
             dropout=0.5,
             recurrent_dropout=0.5,
             kernel_initializer=k.initializers.he_normal())(model)

Now let’s add a TimeDistributed layer to the architecture. It is a wrapper that applies a layer to every temporal slice of the input. Since we specified “return_sequences=True”, the LSTM layers return an output for each timestep rather than a single value, so the TimeDistributed wrapper lets us apply a Dense layer to every hidden-state output.

# TimeDistributed Layer
model = TimeDistributed(Dense(num_tags, activation="relu"))(model)

At last, we will add the CRF layer to the architecture; it models the dependencies between neighbouring tags and helps the network predict consistent label sequences.

# CRF layer over the per-timestep tag scores
crf = CRF(num_tags)

out = crf(model)  # output
model = Model(input_layer, out)

Now that we have designed our architecture, it’s time to compile our model.

#Optimiser 
adam = k.optimizers.Adam(lr=0.0005, beta_1=0.9, beta_2=0.999)

# Compile model
model.compile(optimizer=adam, loss=crf.loss_function, metrics=[crf.accuracy, 'accuracy'])

You can use

model.summary()

to view the architecture of the model.

Let’s fit the model on the training data. We will use a ModelCheckpoint() callback to save the best model as training progresses.

# Saving the best model only
filepath="ner-bi-lstm-td-model-{val_accuracy:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

# Fit the model (the ModelCheckpoint callback keeps the best weights)
history = model.fit(X_train, np.array(y_train), batch_size=256, epochs=20, validation_split=0.1, verbose=1, callbacks=callbacks_list)

After fitting on the training data, let’s predict on the test data and then transform the indices back to their respective tags using the previously defined “idx2tag” dictionary.

test_pred = model.predict(X_test, verbose=1)

# Convert the probability vectors back to tag names
pred_labels = [[idx2tag[np.argmax(i)] for i in p] for p in test_pred]
test_labels = [[idx2tag[np.argmax(i)] for i in p] for p in y_test]

Use the following code to see the accuracy, f1-score, recall, and precision of the custom NER model:

from sklearn_crfsuite.metrics import flat_classification_report
report = flat_classification_report(y_pred=pred_labels, y_true=test_labels)
print(report)

Here we achieved 99% accuracy on both the training and the test data. (Keep in mind that raw accuracy is inflated by the dominant ‘O’ and padding tags, so the per-tag scores in the report above are more informative.)
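As a final usage sketch, here is one way you might tag a new sentence with the trained model. This snippet is illustrative, not part of the original pipeline: it reuses word2idx, idx2tag, maxlen, and model from above, and maps unseen words to index 0, which collides with a real word in our simple mapping, so a production pipeline should reserve dedicated indices for padding and unknown words.

# Illustrative inference sketch (reuses objects defined above)
sentence = "Thousands of demonstrators marched through London".split()
encoded = [word2idx.get(w, 0) for w in sentence]  # 0 as a makeshift 'unknown' index
padded = pad_sequences(sequences=[encoded], maxlen=maxlen)

pred = model.predict(padded)            # shape: (1, maxlen, num_tags)
tag_ids = np.argmax(pred, axis=-1)[0]

# padding='pre' by default, so the real tokens sit at the end of the sequence
for word, tag_id in zip(sentence, tag_ids[-len(sentence):]):
    print(word, "->", idx2tag[tag_id])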

So we have successfully built a deep neural network for custom NER with Keras.
