Extracting the Title and Headers of a Web Page using BeautifulSoup in Python

Hello Coder! In this article, we are going to learn how to extract the title and headers (the h1 to h6 heading tags) of a web page using BeautifulSoup in Python.

What is BeautifulSoup?

BeautifulSoup is a Python library for extracting data from HTML and XML documents. It takes the HTML or XML content as a string (or bytes), parses it, and builds a tree of objects that we can navigate and search.

BeautifulSoup is not part of the Python standard library, so we need to install it first. This can be done by running the following command in your console.

pip install beautifulsoup4
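
Once the library is installed, the parsing step described above can be tried in isolation. The snippet below is a minimal sketch: the HTML string and the variable names are made up for illustration, and it uses Python's built-in html.parser so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration.
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Welcome</h1>
    <p class="intro">Hello, <b>world</b>!</p>
  </body>
</html>
"""

# Parse the string into a tree of Tag objects.
soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.text      # text inside <title>
first_heading = soup.h1.text      # text inside the first <h1>
# Text of the paragraph with the inner <b> tag stripped away.
intro_text = soup.find("p", class_="intro").get_text()

print(page_title)
print(first_heading)
print(intro_text)
```

Each value is read from the parsed tree rather than from the raw string, which is what makes the later title and header lookups possible.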

How to extract the title and headers of a web page using BeautifulSoup in Python

We need to import the urllib.request module to use the urlopen() function, which opens the URL and returns a file-like object from which we can read the website's data.

We also need to import BeautifulSoup from bs4, along with the re module.
We need the re module to match all the header tags of a web page.

Then we store the URL of the website in a variable called url. In this article, I'm using the URL of our website.

We now open the URL using urlopen() and store the returned file object in a variable called fileObj.

Once we have opened the URL, we read fileObj using the read() method and store the result, a bytes object, in the html variable.
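
The open-and-read steps can also be wrapped in a small helper. This is only a sketch: fetch_html is a hypothetical name chosen for this example, and the 10-second timeout is an assumed value rather than anything the article requires.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Open `url` and return the raw response body as bytes."""
    # The with-block closes the connection even if read() raises.
    with urlopen(url, timeout=timeout) as file_obj:
        return file_obj.read()


# Usage (needs network access):
# html = fetch_html("https://www.codespeedy.com/")
# print(type(html))   # <class 'bytes'>
```

Using a with-block is a design choice here: it makes sure the connection is closed once the data has been read, which the step-by-step version leaves to the interpreter.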

After reading the page as bytes, we parse it with BeautifulSoup, using Python's built-in 'html.parser', and store the parsed HTML in a variable called soup.
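
BeautifulSoup accepts bytes directly, so there is no need to decode the data ourselves before parsing. As a rough illustration, the sketch below encodes a made-up page to bytes, the same type read() returns, and parses it; the page content is an assumption for the example.

```python
from bs4 import BeautifulSoup

# Made-up page content, encoded to bytes the way read() would return it.
html = (
    "<html><head><meta charset='utf-8'>"
    "<title>Café Menu</title></head></html>"
).encode("utf-8")

# Bytes go in; BeautifulSoup works out the encoding and builds the tree.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.text
print(type(html).__name__)
print(title_text)
```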

We can now get the title of the web page as soup.title. This gives us the title tag together with its HTML markup. To get just the readable text, use soup.title.text.
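
One thing to keep in mind is that soup.title is None when a page has no title tag, so soup.title.text would raise an AttributeError. The sketch below shows the difference between the tag and its text, and one possible way to guard against a missing title. The page_title helper and the two sample pages are hypothetical and exist only for this example.

```python
from bs4 import BeautifulSoup

with_title = BeautifulSoup(
    "<html><head><title>My Page</title></head></html>", "html.parser"
)
without_title = BeautifulSoup(
    "<html><body><p>No title here</p></body></html>", "html.parser"
)


def page_title(soup, default=""):
    """Return the stripped <title> text, or `default` if there is no <title>."""
    if soup.title is None:
        return default
    return soup.title.get_text(strip=True)


print(with_title.title)          # the Tag itself, markup included
print(page_title(with_title))    # only the text
print(page_title(without_title, default="(untitled)"))
```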

Now let us extract the headers of the website using regular expressions.

BeautifulSoup objects provide the find_all() method, which returns a list of the matching tags in the HTML.

We can pass re.compile('^h[1-6]') to find_all() to get a list of all the header tags (h1 to h6) in the HTML.
Using a for loop to iterate over the header tags, we can print the text of every header.
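
To see what the pattern selects, here is a small sketch that runs on a made-up HTML fragment instead of a live page. When find_all() is given a compiled regular expression, Beautiful Soup tests it against each tag name with search(), so the pattern should be anchored. This sketch uses the slightly stricter '^h[1-6]$', which is my own variation on the pattern above, and also shows a list of tag names as an alternative filter.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment for illustration only.
sample_html = """
<body>
  <header><h1>Main Title</h1></header>
  <h2>Section A</h2>
  <p>Some text</p>
  <h3>Subsection A.1</h3>
  <hr>
  <h2>Section B</h2>
</body>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# The digit class excludes <header> and <hr>; ^ and $ make the match exact.
heading_pattern = re.compile(r"^h[1-6]$")

# Collect (tag name, text) pairs in document order.
headings = [
    (tag.name, tag.get_text(strip=True))
    for tag in soup.find_all(heading_pattern)
]

for name, text in headings:
    print(name, ":", text)

# An equivalent filter without regular expressions: a list of tag names.
same_headings = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
```

Keeping the tag name next to the text makes it easier to tell a top-level heading from a subsection when the list is printed.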

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "https://www.codespeedy.com/"

# Open the URL and read the response body as bytes
fileObj = urlopen(url)
html = fileObj.read()

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(html, 'html.parser')
print("Title :", soup.title.text)

# Find every h1 to h6 tag and print its text
headerlist = soup.find_all(re.compile('^h[1-6]'))
print("Headers : ")
for header in headerlist:
    print(header.text)

Output :

Title : Programming Blog and Software Development Company - CodeSpeedy
Headers :
CodeSpeedy - Coding Solution & Software Development
Some of Our Programming Blog Categories
Java
PHP
Python
JavaScript
WordPress
CSS
jQuery
Bootstrap
Services We Provide
Web Design & Development
Software Development
Mobile App Development
Artificial Intelligence
Python
PHP
JavaScript
Recent Blog Posts from CodeSpeedy
About CodeSpeedy Technology Private Limited

Hurrah! We have learned how to extract the title and headers from a web page.

Thanks for reading this article. I hope it helped. Also, do check out our other related articles below:

Web Scraping using lxml in

Scrap COVID-19 data using BeautifulSoup in Python
