Read meta description and title tag of a web page in Python
You may learn a lot about a website’s content, the underlying content strategy, or product lines by scraping the titles and meta descriptions from all of its pages. It’s helpful to understand some fundamental web scraping techniques to collect this valuable information, whether you’re looking at your own website or your rivals’ websites.
In this project, I’ll demonstrate how to build a basic scraper in Python that uses urllib and Beautiful Soup to read the title and meta description of a web page. The same steps can be repeated for each page of a site.
Scrape a site’s page titles and meta descriptions
Python offers a variety of tools for web scraping. I’d strongly advise using Scrapy for bigger projects because it handles many requests concurrently and is more efficient. Requests, urllib, and BeautifulSoup are fine for smaller jobs, such as scraping your own website.
Let’s create a small program to see how this works.
Step 1: Import necessary modules
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup
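Step 2: Open the URL with urlopen() and parse the HTML
Use urlopen() from urllib.request to open the URL, then pass the response to Beautiful Soup with the html.parser parser to parse the HTML source code. The URL is hard-coded here; replace it with the page you want to inspect.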
response = urllib.request.urlopen("https://www.satyug.edu.in/")
soup = BeautifulSoup(response, 'html.parser', from_encoding=response.info().get_param('charset'))
print(soup)
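Some servers reject requests that carry urllib's default User-Agent, and a request without a timeout can hang. The following is a minimal sketch of one way to handle both, not part of the original program. The helper names build_request and fetch_html, the User-Agent string, and the 10-second timeout are all assumptions made for illustration.

```python
import urllib.request

# Hypothetical User-Agent string; use whatever identifies your own scraper.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request object that sends a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    request = build_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The string returned by fetch_html(url) can be passed to BeautifulSoup(html, 'html.parser') in place of the response object used above.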
Step 3: Parse the metadata
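Using Beautiful Soup’s findAll() method (the older name for find_all()), we check whether the page contains a meta element with name="description". If it does, find() returns the first match and get("content") reads its content attribute.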
if soup.findAll("meta", attrs={"name": "description"}):
    print(soup.find("meta", attrs={"name": "description"}).get("content"))
else:
    print("error")
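The check above can also be wrapped in a small function that returns None when the description is missing, which makes it easier to reuse. This is only a sketch: the function name get_meta_description is my own, and it takes the HTML as a string so it can be tried without a network connection.

```python
from bs4 import BeautifulSoup


def get_meta_description(html):
    """Return the meta description text, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"name": "description"})
    if tag is None:
        return None
    # The tag may exist without a content attribute, or with an empty one.
    content = tag.get("content")
    return content.strip() if content else None
```

Calling find() once and testing the result for None avoids searching the document twice, as the findAll() plus find() combination does.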
Step 4: Fetch the site title
Next, we extract the title element from the same soup object. If the page has a title element, find("title").string returns its text.
if soup.findAll("title"):
    print(soup.find("title").string)
else:
    print("error")
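One caveat: .string is None when the title element is empty, and page titles often carry stray whitespace. A possible variant, sketched here under the assumed name get_title, uses get_text(strip=True) instead and returns None when there is no usable title.

```python
from bs4 import BeautifulSoup


def get_title(html):
    """Return the page title with surrounding whitespace removed, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("title")
    if tag is None:
        return None
    # get_text() returns an empty string for an empty element, never None.
    text = tag.get_text(strip=True)
    return text or None
```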
So our final code will be:
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup

response = urllib.request.urlopen("https://www.satyug.edu.in/")
soup = BeautifulSoup(response, 'html.parser', from_encoding=response.info().get_param('charset'))
#print(soup)

if soup.findAll("meta", attrs={"name": "description"}):
    print("Meta Data is :")
    print(soup.find("meta", attrs={"name": "description"}).get("content"))
else:
    print("error")

if soup.findAll("title"):
    print("Title is: ")
    print(soup.find("title").string)
else:
    print("error")
The output will be:
Meta Data is :
The best engineering and management college in Faridabad and Delhi, NCR. Highly qualified faculty is dedicated towards the cause of honing the next generation of engineers and managers.
Title is: 
SDIET

Process finished with exit code 0
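The program above reads a single page. To collect the same data from several pages, one option is to move the logic into functions and loop over a list of URLs. The sketch below is an illustration, not a complete crawler: the names fetch, extract_metadata, and scrape_pages are hypothetical, and the list of URLs has to be supplied by you (for example, from a sitemap).

```python
import urllib.request
from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def extract_metadata(html):
    """Return the title and meta description found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    meta_tag = soup.find("meta", attrs={"name": "description"})
    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "description": meta_tag.get("content") if meta_tag else None,
    }


def scrape_pages(urls, fetch_page=fetch):
    """Collect metadata for each URL; record an error if a page fails."""
    results = []
    for url in urls:
        try:
            html = fetch_page(url)
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            results.append({"url": url, "error": str(exc)})
            continue
        record = {"url": url}
        record.update(extract_metadata(html))
        results.append(record)
    return results
```

Passing the download function as the fetch_page argument keeps the parsing code independent of the network, so it can be tried on saved HTML before pointing it at a live site.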
I hope you like this article.
Thanks
Amandeep Singh