Newspaper article scraping and curation in Python

In this tutorial, we will learn how to scrape and curate newspaper articles in Python. We will use the newspaper3k module, which extracts and parses articles from news websites. Before writing the program, we need to install a few packages. Run the following commands in Command Prompt (Windows) or Terminal (Mac/Linux).

Installing Modules

pip install newspaper3k
pip install nltk
pip install lxml
pip install Pillow

The ‘newspaper3k’ package is for Python 3.x. If you are working in Python 2.x and want to use the same module, replace the command pip install newspaper3k with:

pip install newspaper

Importing the modules

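Now we need to import two of the packages we installed in the previous step. We also download NLTK's ‘punkt’ tokenizer data, which the ‘nlp’ step relies on to split text into sentences: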

import nltk
from newspaper import Article

nltk.download('punkt')

These two modules will help us extract information from the article and curate it. Next, we provide the URL of the news article that we want to extract.

website = "https://www.wsj.com/articles/pickup-trucks-are-getting-huge-got-a-problem-with-that-11596254412"

Now we will create an object of the ‘Article’ class that we imported from the ‘newspaper’ module. We will use the ‘download’ and ‘parse’ methods to download and parse the article, respectively.

We then call the ‘nlp’ method on the object to perform natural language processing on the article; this is the step that fills in the keywords and the summary.

news_article = Article(website)
news_article.download()
news_article.parse()
news_article.nlp()
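
These calls can fail, for example when a site blocks automated requests or sits behind a paywall, so it may help to wrap them in a small function. The sketch below is one possible way to do that. The name fetch_article and its article_factory parameter are illustrative and are not part of newspaper3k, and catching a broad Exception is an assumption made to keep the example short.

```python
def fetch_article(url, article_factory=None):
    """Download, parse, and analyse one article; return None on failure.

    article_factory is a hypothetical hook (not part of newspaper3k) that
    lets you swap in another class; by default newspaper.Article is used.
    """
    if article_factory is None:
        # Imported lazily so the helper can be reused with a stand-in class.
        from newspaper import Article
        article_factory = Article

    article = article_factory(url)
    try:
        article.download()  # fetch the raw HTML
        article.parse()     # extract title, authors, text, and images
        article.nlp()       # compute keywords and summary
    except Exception as error:
        # Assumption: any failure here means the article is unusable.
        print(f"Could not process {url}: {error}")
        return None
    return article
```

With this helper, news_article = fetch_article(website) returns either a processed article or None, so the printing code can be skipped when the download did not work.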

Now we can extract and print the various kinds of data present in the article, such as its authors, publication date, keywords, and more.

Here, I have printed a few of these attributes.

print("The author(s) of this newspaper article:")
print(news_article.authors)

print("Date of Article Publication:")
print(news_article.publish_date)

print("Article Keywords:")
print(news_article.keywords)

print("Article Image:")
print(news_article.top_image)

print("Summary of the Article:")
print(news_article.summary)
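
When curating several articles, it can be handier to collect these fields into one record than to print them one by one. The following is a minimal sketch of that idea. The helpers article_to_record and record_to_json are hypothetical names, and the code assumes only the attributes printed above.

```python
import json


def article_to_record(article):
    """Collect the printed fields into a plain dictionary."""
    published = article.publish_date
    return {
        "authors": list(article.authors),
        # publish_date may be empty when the page does not expose a date.
        "published": published.isoformat() if published else None,
        "keywords": list(article.keywords),
        "top_image": article.top_image,
        "summary": article.summary,
    }


def record_to_json(record):
    """Serialise a record as indented JSON text."""
    return json.dumps(record, indent=2, ensure_ascii=False)
```

For example, print(record_to_json(article_to_record(news_article))) shows all the fields at once, and the same text can be written to a file for later use.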

The output of the above code is:

[Screenshot: output of the program]

Entire Code

import nltk
from newspaper import Article

nltk.download('punkt')

website = "https://www.wsj.com/articles/pickup-trucks-are-getting-huge-got-a-problem-with-that-11596254412"

news_article = Article(website)
news_article.download()
news_article.parse()

news_article.nlp()

print("The author(s) of this newspaper article:")
print(news_article.authors)

print("Date of Article Publication:")
print(news_article.publish_date)

print("Article Keywords:")
print(news_article.keywords)

print("Article Image:")
print(news_article.top_image)

print("Summary of the Article:")
print(news_article.summary)

Here we also extracted the URL of the top image of the article we scraped. Once you have executed the code, if your output console supports it, you can hold ‘Ctrl’ and click the image link to open it.

Image of the Article

[Image: top image of the scraped article]
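
If your console does not make links clickable, you could save the image to disk instead. Here is a small sketch that uses only the standard library. The helpers image_filename and save_top_image are hypothetical names, and the fallback file name top_image.jpg is an arbitrary choice.

```python
import os
import posixpath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def image_filename(image_url, default="top_image.jpg"):
    """Derive a local file name from the last path segment of the URL."""
    name = posixpath.basename(urlparse(image_url).path)
    return name or default


def save_top_image(image_url, folder="."):
    """Download the image into folder and return the saved path."""
    destination = os.path.join(folder, image_filename(image_url))
    urlretrieve(image_url, destination)  # needs network access
    return destination
```

Calling save_top_image(news_article.top_image) would then store the picture in the current folder. Some sites reject requests that lack a browser-like User-Agent header, in which case the download may fail.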

If you want, you can read these related articles on web scraping:

Scrap COVID-19 data using BeautifulSoup in Python

Scraping the data of webpage using xpath in scrapy
