How to parse HTML in Python
In this post, we will learn how to parse HTML (HyperText Markup Language) in Python. Parsing means analyzing the markup of a web page, which is made up of tags, attributes, and text, and turning it into a structure a program can work with.
To parse the HTML content of a web page in Python, we will use a module known as BeautifulSoup, together with the requests module to download the page. Before we begin the tutorial, we have to install the prerequisites.
- pip install requests
- pip install beautifulsoup4
Parse HTML in Python
Beautiful Soup is a Python library for extracting data from web pages. It parses HTML and XML content into a tree of objects that you can search and navigate.
First of all, import the requests module and the BeautifulSoup class from bs4, then download the page, as shown below.
    import requests
    from bs4 import BeautifulSoup

    # URL of the website
    url = "https://www.codespeedy.com"
    rawdata = requests.get(url)
    html = rawdata.content
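The download step above does not check whether the request succeeded. A minimal sketch of a more defensive version might look like the following. Note that the helper name fetch_html and the 10-second timeout are assumptions made for illustration and are not part of the original tutorial.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    # The timeout (in seconds) is an assumed value; without one,
    # requests may wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx and 5xx responses
    # instead of silently returning an error page.
    response.raise_for_status()
    return response.content
```

With this helper, the download step could be written as html = fetch_html("https://www.codespeedy.com"), and a failed request would raise an exception rather than handing an error page to the parser.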
Now we will use html.parser, the parser built into Python, to parse the HTML content with BeautifulSoup, and then print it in an indented, readable form with prettify().

    # Parsing HTML content with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
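To try the parsing step without downloading anything, the same calls can be run on a small HTML string. This is only a sketch: the sample_html string is made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used in place of a downloaded page
sample_html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Parse the string with Python's built-in HTML parser
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the document as a string with each tag
# on its own line, indented according to its nesting depth
pretty = soup.prettify()
print(pretty)
```

Printing the soup object directly gives the markup on a single line for this input, while prettify() spreads it over several indented lines, which is easier to read when exploring a page.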
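Once the content is parsed, we can use different Beautiful Soup methods to get the relevant data from the website. For example, soup.title returns the title tag of the page, and find_all('p') returns a list of all paragraph tags.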
    print(soup.title)
    paragraphs = soup.find_all('p')
    print(paragraphs)
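The calls above print whole tags, markup included. If only the text or the link targets are needed, a sketch along these lines may help. The sample_html string and the example.com URL are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up page content used for illustration
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph with a <a href="https://example.com/docs">link</a>.</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# .string gives the text inside the <title> tag
title_text = soup.title.string

# get_text() returns the text of a tag and its children, without the markup
paragraph_texts = [p.get_text() for p in soup.find_all("p")]

# href=True keeps only <a> tags that actually have an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]

print(title_text)
print(paragraph_texts)
print(links)
```

Filtering with href=True avoids a KeyError on anchor tags that have no href attribute, which are fairly common on real pages.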
Here is the complete code in one place.

    import requests
    from bs4 import BeautifulSoup

    # URL of the website
    url = "https://www.codespeedy.com"
    rawdata = requests.get(url)
    html = rawdata.content

    # Parsing HTML content with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.title)
    paragraphs = soup.find_all('p')
    print(paragraphs)
If you have any questions about this post, feel free to ask in the comment section. If you would like a post on another Python topic, comment below with the topic name.