Named Entity Recognition using spaCy in Python
In this tutorial, we will learn about Named Entity Recognition (NER). It is the first step towards information extraction in the world of NLP: it locates and classifies entities in a corpus, such as names of people, organizations, and locations, as well as quantities, percentages, etc. Today we are going to perform NER using spaCy.
spaCy is an open-source library that extracts valuable information from text in just a few lines of code, which is far more efficient than preprocessing the textual data and building a model from scratch. Its English pipeline is trained on OntoNotes 5, supports a wide range of entity types, and is more efficient than earlier techniques. spaCy can also be used to build sophisticated models for a variety of other NLP problems.
Named Entity Recognition using spaCy
Let’s install spaCy and import the library into our notebook.
!pip install spacy
!python -m spacy download en_core_web_sm
spaCy supports 48 different languages and also provides a multi-language model.
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
Loading a spaCy model creates a Language object, also called the pipeline. This object contains the language-specific vocabulary, the model weights, and the processing pipeline: tokenization rules, stop words, part-of-speech rules, etc.
Here we load the model into a variable conventionally named nlp:
nlp = en_core_web_sm.load()
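Since the pipeline is just an ordered list of components, we can inspect it. The sketch below (our own example, assuming spaCy v3, with the variable name `blank_nlp` chosen to avoid clobbering the `nlp` object above) builds a blank English pipeline and adds an `ner` component; a pretrained model such as en_core_web_sm already comes with components like the tagger, parser, and NER attached.

```python
import spacy

# A blank pipeline has a tokenizer but no trained components yet.
blank_nlp = spacy.blank("en")
print(blank_nlp.pipe_names)   # []

# In spaCy v3, components are added to the pipeline by factory name.
blank_nlp.add_pipe("ner")
print(blank_nlp.pipe_names)   # ['ner']
```

Calling `nlp.pipe_names` on the loaded en_core_web_sm model in the same way shows which components it ships with.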
For instance, let’s create our own document
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
Extracting the entities from this document (as we do below) produces the following output:
output: [('European', 'NORP'), ('Google', 'ORG'), ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]
spaCy's pipeline is composed of components such as the tokenizer, tagger, and parser. When we process text with the nlp object, it returns a Doc object, and Token objects represent the individual word tokens in the document. We can therefore access tokens simply by indexing the Doc object.
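Indexing a Doc works even without a trained model, since tokenization is rule-based. A minimal sketch (the variable names are our own) using a blank English pipeline:

```python
import spacy

# A blank pipeline is enough for tokenization; no model download needed.
tok_nlp = spacy.blank("en")
tok_doc = tok_nlp("European authorities fined Google a record $5.1 billion")

print(tok_doc[0].text)    # 'European'
print(tok_doc[3].text)    # 'Google'
print(len(tok_doc))       # 9 -- '$5.1' is split into '$' and '5.1'
```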
Since NER is the first step of information extraction, we can easily extract the named entities from the text using spaCy.
print([(z.text, z.label_) for z in doc.ents])
Now let’s look at token-level entity recognition. It is similar to the entity-level view above, except that each token carries an IOB tag (via ent_iob_) describing the entity boundaries: B for the beginning of an entity, I for a token inside one, and O for a token outside any entity.
print([(z, z.ent_iob_, z.ent_type_) for z in doc])
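To make the IOB scheme concrete, here is a small plain-Python sketch (the function and sample data are our own illustration, not spaCy API) that assigns IOB tags to tokens given entity spans:

```python
def iob_tags(n_tokens, entities):
    """Assign (IOB tag, entity label) to each token position.

    entities: list of (start, end, label) spans in token offsets,
    with end exclusive -- the same convention spaCy uses for spans.
    """
    tags = [("O", "")] * n_tokens
    for start, end, label in entities:
        tags[start] = ("B", label)          # beginning of the entity
        for i in range(start + 1, end):
            tags[i] = ("I", label)          # inside the entity
    return tags

tokens = ["European", "authorities", "fined", "Google", "a", "record",
          "$", "5.1", "billion", "on", "Wednesday"]
spans = [(0, 1, "NORP"), (3, 4, "ORG"), (6, 9, "MONEY"), (10, 11, "DATE")]
print(list(zip(tokens, iob_tags(len(tokens), spans))))
```

Note how the multi-token entity '$5.1 billion' gets a B tag on '$' and I tags on the rest, which is exactly the pattern ent_iob_ reports.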
Similarly, we will extract named entities from a New York Times article.
For this, I am scraping the article using BeautifulSoup.
from bs4 import BeautifulSoup
import requests
import re

def url_2_text(url):
    result = requests.get(url)
    html = result.text
    parser = BeautifulSoup(html, 'html5lib')
    for script in parser(["script", "style", 'aside']):
        script.extract()
    return " ".join(re.split(r'[\n\t]+', parser.get_text()))

ny_news = url_2_text('https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news')
article = nlp(ny_news)
len(article.ents)
Let’s count the entity labels present in the text data.
token_labels = [z.label_ for z in article.ents]
Counter(token_labels)
Next, let’s get the most common entities in the corpus:
entities = [z.text for z in article.ents]
Counter(entities).most_common(4)
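Counter.most_common works on any list of hashable items, so it is easy to try in isolation. A quick self-contained illustration with made-up sample data (the names here are our own, not the article's actual counts):

```python
from collections import Counter

# Hypothetical entity mentions, just to show the API.
sample = ["Strzok", "F.B.I.", "Strzok", "Trump", "Strzok", "F.B.I."]
print(Counter(sample).most_common(2))  # [('Strzok', 3), ('F.B.I.', 2)]
```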
Now it’s time to visualize the named entities present in the corpus.
sentences = [z for z in article.sents]
print(sentences)
displacy.render(nlp(str(sentences)), jupyter=True, style='ent')
output: A spokeswoman for the F.B.I. ORG did not respond to a message seeking comment about why Mr. Strzok PERSON was dismissed rather than demoted.
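Outside a Jupyter notebook, displacy.render returns the markup as a string when jupyter is not set to True. A minimal sketch with hand-built entity spans (the sentence and variable names are our own), so it runs without downloading a model:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("Google was fined on Wednesday")
# Attach entity spans manually: token offsets, end exclusive.
demo_doc.ents = [Span(demo_doc, 0, 1, label="ORG"),
                 Span(demo_doc, 4, 5, label="DATE")]

# jupyter=False makes render return the HTML instead of displaying it.
html = displacy.render(demo_doc, style="ent", jupyter=False)
print("ORG" in html and "DATE" in html)  # True
```

The returned HTML can be saved to a file and opened in any browser, which is handy when working outside a notebook.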