Extracting Title and Headers of a Web Page using BeautifulSoup in Python
Hello Coder! In this article, we are going to learn how to extract the title and headers of a web page using BeautifulSoup in Python.
What is BeautifulSoup?
BeautifulSoup is a Python library for extracting data from HTML and XML documents. It takes the HTML or XML content as a string, parses it, and builds a corresponding tree-like data structure that can be searched and navigated.
BeautifulSoup is not part of the Python standard library, so we need to install it first. This can be done by running the following command in your console.
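As a small illustration of the idea above, here is a minimal sketch that parses a string and reads a value back from the resulting structure. The sample_html string is made up for this example, and 'html.parser' is the parser bundled with Python.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration.
sample_html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 'html.parser' is the parser that ships with Python's standard library.
soup = BeautifulSoup(sample_html, "html.parser")

# Tags in the parsed structure can be reached as attributes of the soup object.
print(soup.title.text)  # Demo
print(soup.p.text)      # Hello
```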
pip install beautifulsoup4
How to extract title and headers of a Web Page using BeautifulSoup in Python
We need to import the urlopen() function from the urllib.request module. It opens the URL and returns a file-like object from which we can read the data of the website.
We also need to import BeautifulSoup from bs4 and the re module.
The re module is required here to match all the header tags of a web page.
Then we store the URL of the website in a variable called url. In this article I’m using the URL of our website.
We now open the URL using urlopen() and store the returned file object in a variable called fileObj.
Once we have opened the URL, we read fileObj using the read() method and store the result in the html variable as bytes.
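The open-and-read steps can also be wrapped in a small helper. This is only a sketch: fetch_html is a hypothetical name, not part of any library, and the demo uses a data: URL (which urlopen also accepts) as a stand-in for a real address so that it runs without a network connection.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url):
    """Open the URL and return the raw response body as bytes."""
    # The with-block closes the response object once reading is done.
    with urlopen(url) as file_obj:
        return file_obj.read()


# Stand-in for a real address: a data: URL carrying a small HTML snippet,
# so this demo does not depend on the network.
demo_url = "data:text/html," + quote("<title>Demo</title>")
html = fetch_html(demo_url)

print(type(html).__name__)  # bytes
print(html)                 # b'<title>Demo</title>'
```

With a real website, we would pass its address to fetch_html() instead of demo_url.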
After reading the page content as bytes, we parse it with BeautifulSoup using the 'html.parser' parser and store the parsed HTML in a variable called soup.
We can now access the title of the web page as soup.title. This gives us the title tag together with its HTML markup. To get only the readable text, we use soup.title.text.
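To see the difference between the tag and its text, here is a short sketch on a made-up HTML string. It also shows one possible guard for pages without a title, since soup.title is None in that case.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
sample_html = "<html><head><title>Demo Page</title></head><body></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")

print(soup.title)       # <title>Demo Page</title>
print(soup.title.text)  # Demo Page

# soup.title is None when the page has no <title> tag,
# so we check it before reading .text.
no_title = BeautifulSoup("<p>No title here</p>", "html.parser")
title_text = no_title.title.text if no_title.title else ""
print(repr(title_text))  # ''
```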
Now let us extract the headers of the website using regular expressions.
BeautifulSoup objects provide the find_all() method, which returns a list of the matching tags in the HTML code.
We can pass re.compile('^h[1-6]') to find_all() to get a list of all the header tags (h1 to h6) in the HTML code.
Using a for loop to iterate over the header tags, we can print the text of every header.
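The header matching can be tried out on its own with a small, made-up HTML snippet. In this sketch the names sample_html and headers are only for illustration; the pattern is the same one described above.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
sample_html = """
<h1>Main heading</h1>
<p>Some text</p>
<h2>Sub heading</h2>
<h3>Detail</h3>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# The compiled pattern is tested against each tag name, so only h1 to h6 match.
header_pattern = re.compile("^h[1-6]")
headers = [(tag.name, tag.text) for tag in soup.find_all(header_pattern)]

print(headers)
# [('h1', 'Main heading'), ('h2', 'Sub heading'), ('h3', 'Detail')]
```

The p tag is skipped because its name does not match the pattern, and the headers come back in the order they appear in the document.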
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "https://www.codespeedy.com/"
fileObj = urlopen(url)
html = fileObj.read()
soup = BeautifulSoup(html, 'html.parser')
print("Title : ", soup.title.text)

headerlist = soup.find_all(re.compile('^h[1-6]'))
print("Headers : ")
for header in headerlist:
    print(header.text)
Hurrah! We have learned how to extract the title and headers from a web page.
Thanks for reading this article. I hope it helped.