Read meta description and title tag of a web page in Python

You can learn a lot about a website’s content, its content strategy, or its product lines by scraping the titles and meta descriptions of its pages. Whether you’re auditing your own website or studying a rival’s, a few fundamental web scraping techniques are enough to collect this information.

In this project, I’ll demonstrate how to build a basic scraper in Python that uses urllib and Beautiful Soup to read the title and meta description of a web page. The same steps can be repeated for every page of a website.

Scrape a site’s page titles and meta descriptions

Python offers several ways to scrape the web. For bigger projects I’d strongly advise using Scrapy, because it handles many requests concurrently and is more efficient. Requests, urllib, and Beautiful Soup are fine for smaller jobs, such as scraping your own website.

Let’s create a small program to understand this concept.

Step 1: Import necessary modules

import urllib.request
from bs4 import BeautifulSoup

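Step 2: Open the URL and parse the HTML

Use urlopen() from urllib.request to open the URL, then pass the response to BeautifulSoup with the html.parser parser to parse the HTML source code. The URL is hard-coded here; replace it with the address of the page you want to inspect.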

response = urllib.request.urlopen("https://www.satyug.edu.in/")
soup = BeautifulSoup(response,'html.parser',
                         from_encoding=response.info().get_param('charset'))
print(soup)
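
By default, urlopen() identifies itself with a User-Agent of the form Python-urllib/3.x, which some servers reject, and it raises an exception when the request fails. The sketch below is one way to make the download step a little more robust. The names build_request, fetch_html and the USER_AGENT string are my own assumptions for illustration, not part of the program above.

```python
import urllib.request

# Hypothetical User-Agent string; replace it with one that identifies your scraper.
USER_AGENT = "Mozilla/5.0 (compatible; example-meta-scraper/0.1)"


def build_request(url):
    """Create a request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the page cannot be fetched."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError) as exc:
        # URLError, HTTPError and timeouts are OSError subclasses;
        # ValueError is raised for a malformed URL.
        print(f"Could not fetch {url}: {exc}")
        return None


# No network access is needed to inspect the request that would be sent.
request = build_request("https://www.satyug.edu.in/")
print(request.get_header("User-agent"))
```

The bytes returned by fetch_html() can be passed straight to BeautifulSoup(html, 'html.parser'), which detects the encoding on its own.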

Step 3: Parse the metadata

Using Beautiful Soup’s find() method, we’ll look up the first meta element whose name attribute is "description" and read its content attribute. find() returns None when no such element exists, so we check the result before using it.

meta_tag = soup.find("meta", attrs={"name": "description"})
if meta_tag:
    print(meta_tag.get("content"))
else:
    print("No meta description found")
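
Real pages are not always tidy: the attribute may be written as name="Description", or the content attribute may be empty or missing. Here is a minimal sketch of a helper that handles those cases. The function name get_meta_description and the sample HTML are assumptions made for this example.

```python
import re

from bs4 import BeautifulSoup


def get_meta_description(soup):
    """Return the cleaned meta description, or None if it is missing or empty."""
    # Match name="description" regardless of letter case (e.g. "Description").
    pattern = re.compile(r"^description$", re.IGNORECASE)
    tag = soup.find("meta", attrs={"name": pattern})
    if tag is None:
        return None
    # A tag without a content attribute yields None, so fall back to "".
    content = (tag.get("content") or "").strip()
    return content or None


# Hypothetical sample page used only for illustration.
sample_html = """
<html><head>
<meta name="Description" content="  A short summary of the page.  ">
</head><body></body></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")
print(get_meta_description(soup))
```

Returning None instead of printing "error" lets the calling code decide what to do with a page that has no description.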

Step 4: Fetch the site title

The title element can be extracted from the same soup object, so there is no need to download or parse the page again. Its string attribute returns the text of the page title.

title_tag = soup.find("title")
if title_tag:
    print(title_tag.string)
else:
    print("No title found")
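
Titles often contain line breaks or extra spaces, and the string attribute returns None when the title element is empty. The sketch below shows one way to get a clean single-line title. The helper name get_title and the sample HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def get_title(soup):
    """Return the page title with whitespace collapsed, or None if there is none."""
    tag = soup.find("title")
    if tag is None:
        return None
    # get_text() always returns a string; split/join collapses runs of whitespace.
    text = " ".join(tag.get_text().split())
    return text or None


# Hypothetical sample page used only for illustration.
sample_html = "<html><head><title>\n   My   Page \n</title></head></html>"
print(get_title(BeautifulSoup(sample_html, "html.parser")))
```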

So our final code will be:

import urllib.request
from bs4 import BeautifulSoup

# Download the page and parse it, using the charset from the HTTP headers if present
response = urllib.request.urlopen("https://www.satyug.edu.in/")
soup = BeautifulSoup(response, 'html.parser',
                     from_encoding=response.info().get_param('charset'))

# Meta description: find() returns None when the tag is missing
meta_tag = soup.find("meta", attrs={"name": "description"})
if meta_tag:
    print("Meta Data is :")
    print(meta_tag.get("content"))
else:
    print("No meta description found")

# Page title
title_tag = soup.find("title")
if title_tag:
    print("Title is: ")
    print(title_tag.string)
else:
    print("No title found")

The output will be:

Meta Data is :
The best engineering and management college in Faridabad and Delhi, NCR. 
Highly qualified faculty is dedicated towards the cause of honing the next 
generation of engineers and managers.
Title is: 
SDIET

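
To collect the same data for several pages, the steps can be wrapped in functions and run in a loop. The following is only a sketch under my own assumptions: the names fetch, extract_metadata and scrape_pages are hypothetical, and the demo uses made-up in-memory pages instead of real downloads so that it runs without network access.

```python
import urllib.request

from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def extract_metadata(html):
    """Return the title and meta description found in an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title")
    meta = soup.find("meta", attrs={"name": "description"})
    return {
        "title": title.get_text(strip=True) if title else "",
        "description": (meta.get("content") or "").strip() if meta else "",
    }


def scrape_pages(urls, fetch_page=fetch):
    """Collect one row of metadata per URL, recording download errors."""
    rows = []
    for url in urls:
        try:
            row = {**extract_metadata(fetch_page(url)), "error": ""}
        except OSError as exc:  # URLError and timeouts are OSError subclasses
            row = {"title": "", "description": "", "error": str(exc)}
        rows.append({"url": url, **row})
    return rows


# Made-up pages standing in for real downloads.
fake_pages = {
    "https://example.com/": b'<html><head><title>Home</title>'
                            b'<meta name="description" content="Welcome"></head></html>',
    "https://example.com/about": b"<html><head><title>About</title></head></html>",
}
rows = scrape_pages(fake_pages, fetch_page=fake_pages.__getitem__)
for row in rows:
    print(row)
```

To scrape a real site, call scrape_pages() with a list of its page URLs and leave fetch_page at its default.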

I hope you like this article.

Thanks

Amandeep Singh
