Named Entity Recognition using spaCy in Python

In this tutorial, we will learn how to perform Named Entity Recognition (NER). NER is often one of the first steps towards information extraction in NLP. It locates and classifies entities in a corpus, such as the names of people, organizations, and locations, as well as quantities, percentages, etc. Today we are going to run NER using spaCy and its pretrained English model.

spaCy is an open-source library that extracts useful information from text with only a few lines of code. Using it is also more efficient than preprocessing the textual data and building a model from scratch. The English models used here are trained on OntoNotes 5 and support a wide range of entity types. spaCy can also be used to build more sophisticated models for various NLP problems.

Named Entity Recognition using spaCy

Let’s install spaCy and import the library into our notebook.

!pip install spacy
!python -m spacy download en_core_web_sm

spaCy supports 48 different languages and also provides a multi-language model.

import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm

Loading a spaCy model creates a Language object, which holds the processing pipeline. This object contains the language-specific vocabulary, the model weights, and the pipeline components, along with language data such as tokenization rules and stop words.

Here we assign this object to the variable nlp.

nlp = en_core_web_sm.load()

For instance, let’s create our own document:

doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
output of print([(z.text, z.label_) for z in doc.ents]):
[('European', 'NORP'), ('Google', 'ORG'), ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]
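Labels such as NORP are not self-explanatory. As a small aside, spaCy ships a glossary that can be queried with spacy.explain. The sketch below looks up the four labels from the output above and needs no model download.

```python
import spacy

# Labels taken from the example output above
labels = ["NORP", "ORG", "MONEY", "DATE"]

for label in labels:
    # spacy.explain returns a short description, or None for unknown labels
    print(label, "->", spacy.explain(label))
```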

spaCy pipeline

spaCy is composed of a pipeline of components such as the tokenizer, tagger, parser, etc. When we process some text with the nlp object, it creates a Doc object, and Token objects represent the word tokens in the document. Thus we can get the tokens by simply indexing the Doc object.

Since NER is an early step in information extraction, spaCy makes it easy to extract the named entities from the text. They are available through doc.ents.
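To make the indexing concrete, here is a minimal sketch. It assumes a blank English pipeline (spacy.blank("en")), which only tokenizes and has no trained components, so it is not the en_core_web_sm model used in this tutorial. The names nlp_blank and doc_blank are chosen for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no model download needed)
nlp_blank = spacy.blank("en")
print(nlp_blank.pipe_names)  # no tagger, parser or ner in a blank pipeline

doc_blank = nlp_blank("European authorities fined Google a record $5.1 billion on Wednesday")

first_token = doc_blank[0]   # indexing a Doc returns a Token
span = doc_blank[0:2]        # slicing a Doc returns a Span
print(first_token.text, "|", span.text)
print([token.text for token in doc_blank])
```

With the full en_core_web_sm pipeline, the same indexing works, and each token additionally carries part-of-speech and entity annotations.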

print([(z.text, z.label_) for z in doc.ents])

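Now let’s look at token-level entity recognition. It is similar to the entity level above, except that each token carries an IOB tag (ent_iob_) describing the entity boundaries: B marks the first token of an entity, I marks a token inside an entity, and O marks a token outside any entity.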

print([(z, z.ent_iob_, z.ent_type_) for z in doc])
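To see how the B/I/O values returned by ent_iob_ encode entity boundaries, the following pure-Python sketch groups token-level tags back into entities. The helper iob_to_spans and the hand-written tags are illustrative assumptions, not part of spaCy or actual model output.

```python
def iob_to_spans(tagged_tokens):
    """Group (text, iob, label) triples into (entity_text, label) pairs."""
    spans = []
    words, label = [], None
    for text, iob, ent_type in tagged_tokens:
        if iob == "B":
            # A new entity starts; close the previous one if it is still open
            if words:
                spans.append((" ".join(words), label))
            words, label = [text], ent_type
        elif iob == "I" and words:
            # Continue the currently open entity
            words.append(text)
        else:
            # "O" (or an empty tag): close any open entity
            if words:
                spans.append((" ".join(words), label))
            words, label = [], None
    if words:
        spans.append((" ".join(words), label))
    return spans


# Hand-written tags for illustration only
tagged = [
    ("fined", "O", ""),
    ("Google", "B", "ORG"),
    ("a", "O", ""),
    ("record", "O", ""),
    ("$", "B", "MONEY"),
    ("5.1", "I", "MONEY"),
    ("billion", "I", "MONEY"),
    ("on", "O", ""),
    ("Wednesday", "B", "DATE"),
]
spans = iob_to_spans(tagged)
print(spans)
```

Because this sketch joins tokens with spaces, the MONEY entity is reassembled as "$ 5.1 billion" rather than the original "$5.1 billion". spaCy's doc.ents keeps the original text.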

Similarly, we will extract named entities from a New York Times article.

For this, I am scraping the article using BeautifulSoup.

from bs4 import BeautifulSoup
import requests
import re

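def url_2_text(url):
    # Download the page and parse it (the 'html5lib' parser needs `pip install html5lib`)
    result = requests.get(url)
    html = result.text
    parser = BeautifulSoup(html, 'html5lib')
    # Drop non-content elements before extracting the text
    for script in parser(["script", "style", 'aside']):
        script.extract()
    # Collapse runs of newlines and tabs into single spaces
    return " ".join(re.split(r'[\n\t]+', parser.get_text()))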
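The cleaning steps inside url_2_text can be tried without a network request. The sketch below applies the same idea to a small inline HTML string. The sample markup is made up for illustration, and it uses Python's built-in "html.parser" instead of html5lib so that no extra parser needs to be installed.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded article page
sample_html = """
<html><head><style>p {color: red;}</style></head>
<body><p>Mr. Strzok was dismissed.</p>
<script>console.log("tracking");</script>
<aside>Related stories</aside>
<p>The F.B.I. declined to comment.</p></body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Remove elements that carry no article text
for tag in soup(["script", "style", "aside"]):
    tag.extract()

# Collapse newlines and tabs into single spaces, then trim the ends
clean_text = " ".join(re.split(r"[\n\t]+", soup.get_text())).strip()
print(clean_text)
```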

ny_news = url_2_text('https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news')
article = nlp(ny_news)
len(article.ents)

Let’s count how many entities of each label are present in the article.

token_labels = [z.label_ for z in article.ents]
Counter(token_labels)

To get the most common entities in the article:

entities = [z.text for z in article.ents]
Counter(entities).most_common(4)
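Both counting steps rely only on collections.Counter, so they can be illustrated without running the model. The sketch below uses a hypothetical list of (text, label) pairs standing in for article.ents. The values are invented for illustration and are not the real counts for the article.

```python
from collections import Counter

# Hypothetical (text, label) pairs standing in for article.ents
sample_ents = [
    ("Strzok", "PERSON"), ("F.B.I.", "ORG"), ("Strzok", "PERSON"),
    ("Trump", "PERSON"), ("Monday", "DATE"), ("F.B.I.", "ORG"),
    ("Strzok", "PERSON"),
]

# Frequency of each entity label
label_counts = Counter(label for _, label in sample_ents)
# Frequency of each entity text
text_counts = Counter(text for text, _ in sample_ents)

print(label_counts)
print(text_counts.most_common(2))
```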

Now it’s time to visualize the named entities present in the article.

sentences = [z for z in article.sents]
print(sentences[25])
displacy.render(nlp(str(sentences[25])), jupyter=True, style='ent')
output:
A spokeswoman for the F.B.I. ORG did not respond to a message seeking comment about why Mr. Strzok PERSON was dismissed rather than demoted.
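Outside a notebook, displacy.render can return the markup as a string instead of displaying it. The sketch below passes jupyter=False and uses displaCy's manual mode with a hand-specified entity, so no model is involved. The sentence, the offsets, and the names manual_doc and html are chosen for illustration.

```python
from spacy import displacy

text = "A spokeswoman for the F.B.I. did not respond to a message."
start = text.index("F.B.I.")

# Hand-specified entity in displaCy's "manual" input format
manual_doc = {
    "text": text,
    "ents": [{"start": start, "end": start + len("F.B.I."), "label": "ORG"}],
    "title": None,
}

# jupyter=False makes render() return the HTML markup as a string
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)
print(html)
```

The returned string can be written to an .html file and opened in a browser.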
