Web Scraping using lxml in Python

In this tutorial, we will perform web scraping using lxml in Python. Web scraping is the process of retrieving information or data from websites. Most websites share a basic structure of HTML elements and also use CSS (Cascading Style Sheets) for styling.

Web Scraping using lxml

Steps to perform web scraping using lxml:

  1. Send a request to the URL you want to scrape and get the response.
  2. Get the page content from the response object as a byte string.
  3. Pass the byte string to the fromstring function in the lxml.html module.
  4. Use XPath to reach the specific data on the page.
  5. Use the scraped data as needed.
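
Steps 3 to 5 can be tried without any network access. The sketch below is only an illustration: the byte string stands in for the content of a response, and its markup (the "books" id and "title" class) is invented for this example rather than taken from a real site.

```python
import lxml.html

# Stand-in for the response content: a small HTML page as a byte string.
# This markup is made up for illustration only.
page_bytes = b"""
<html>
  <body>
    <div id="books">
      <div class="title">First Book</div>
      <div class="title">Second Book</div>
    </div>
  </body>
</html>
"""

# Step 3: parse the byte string into an element tree.
document = lxml.html.fromstring(page_bytes)

# Step 4: use XPath to reach the data of interest.
titles = document.xpath('//div[@id="books"]/div[@class="title"]/text()')

# Step 5: use the scraped data.
print(titles)
```

The same pattern applies to a real page: only the source of the byte string and the XPath expression change.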

Importing the Modules for web scraping

import requests
import lxml.html

If you do not have the requests and lxml modules installed, type the command below in your Command Prompt (Windows) or Terminal (Mac or Linux).

pip install requests lxml

We will use the requests module to fetch the web page from which we want to extract data, and then parse its content with lxml.

website = requests.get('https://store.steampowered.com/explore/new/')
document = lxml.html.fromstring(website.content)
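
A request can fail or return an error page, and the two lines above would then parse whatever came back. As a minimal sketch, the helper below adds a timeout and a status check before parsing. The name fetch_document and the 10-second default are assumptions made for this example; they are not part of requests or lxml.

```python
import requests
import lxml.html


def fetch_document(url, timeout=10):
    """Download `url` and return the parsed lxml document.

    `fetch_document` and its default timeout are illustrative choices,
    not functions or settings provided by requests or lxml.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return lxml.html.fromstring(response.content)


# Example usage (performs a real network request):
# document = fetch_document('https://store.steampowered.com/explore/new/')
```

With this helper, a network problem or a bad status code surfaces as an exception at the request step rather than as an empty XPath result later on.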

Now we will write the code for the division which contains the ‘Popular New Releases’ tab. We will write an XPath expression to select this division.

releases_new = document.xpath('//div[@id="tab_newreleases_content"]')[0]

The xpath method returns a list of all the divisions in the HTML page which have an id of tab_newreleases_content. Because an id should be unique on a page, we take the first element of that list with [0]. We now have the division which contains the new releases.
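
Because xpath returns a list, indexing with [0] raises an IndexError when nothing matches, for example if the page layout changes. The sketch below shows this on a small made-up page; the helper name first_or_none is hypothetical and only serves the illustration.

```python
import lxml.html

# Made-up markup that reuses the id from the tutorial.
html = (b'<html><body><div id="tab_newreleases_content">'
        b'<div class="tab_item_name">Sample Game</div>'
        b'</div></body></html>')
document = lxml.html.fromstring(html)

# xpath() returns a list, even when only one element matches.
matches = document.xpath('//div[@id="tab_newreleases_content"]')
print(type(matches).__name__, len(matches))

# A query that matches nothing returns an empty list.
missing = document.xpath('//div[@id="no_such_id"]')
print(missing)


def first_or_none(elements):
    """Return the first element of a list, or None if it is empty."""
    return elements[0] if elements else None


releases_new = first_or_none(matches)
nothing = first_or_none(missing)
```

Checking for None before continuing makes a layout change easier to diagnose than an IndexError in the middle of the script.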

We will now extract the titles and the prices of the new releases with the following block of code:

game_titles = releases_new.xpath('.//div[@class="tab_item_name"]/text()')
game_prices = releases_new.xpath('.//div[@class="discount_final_price"]/text()')

If you want, you can also extract the tags and the platforms of these games in the same way.
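
As a rough sketch of how tags could be extracted, the example below runs on a small inline sample. The class names tab_item_top_tags and top_tag are assumptions made for this example and are not confirmed by this tutorial, so inspect the live page and adjust the XPath before relying on it.

```python
import lxml.html

# Inline sample markup. The "tab_item_top_tags" and "top_tag" class names
# are assumed for illustration; verify them against the real page.
sample = b"""
<html><body>
<div id="tab_newreleases_content">
  <div class="tab_item_top_tags"><span class="top_tag">Strategy</span><span class="top_tag">, RPG</span></div>
  <div class="tab_item_top_tags"><span class="top_tag">Action</span><span class="top_tag">, VR</span></div>
</div>
</body></html>
"""

document = lxml.html.fromstring(sample)
releases_new = document.xpath('//div[@id="tab_newreleases_content"]')[0]

# Select the element itself (no /text()) so that text_content() can
# join the text of all nested spans.
tag_divs = releases_new.xpath('.//div[@class="tab_item_top_tags"]')

game_tags = []
for div in tag_divs:
    # "Strategy, RPG" -> ['Strategy', 'RPG']
    tags = [tag.strip() for tag in div.text_content().split(',')]
    game_tags.append(tags)

print(game_tags)
```

The text_content method is used here because the tag names sit in nested elements, where a plain /text() on the outer division would return nothing.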

We will now create an empty list that will hold the title and the price of each game. Each title and price pair will be stored in a dictionary.

result = []
# Pair each title with the price at the same position
for title, price in zip(game_titles, game_prices):
    out = {}
    out['game_titles'] = title
    out['game_prices'] = price
    result.append(out)

print(result[0:3])

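The output of the above code will look similar to the following. The actual titles and prices depend on when and from which region you run it: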

[{'game_titles': 'Fae Tactics', 'game_prices': '₹ 529'}, {'game_titles': 'Karnage Chronicles', 'game_prices': '₹ 557'}, {'game_titles': 'Hellpoint', 'game_prices': '₹ 759'}]
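
One caveat of pairing two separate lists with zip is that a game without a price element shifts every later price onto the wrong title. The sketch below shows the problem and one possible alternative on made-up sample markup: it assumes each game sits in its own wrapper element, here a division with the hypothetical class tab_item, and extracts the title and the price per game.

```python
import lxml.html

# Made-up sample: "Game B" has no price element. The "tab_item" wrapper
# class is an assumption for this sketch; check the real page structure.
sample = b"""
<html><body>
<div id="tab_newreleases_content">
  <div class="tab_item">
    <div class="discount_final_price">9.99</div>
    <div class="tab_item_name">Game A</div>
  </div>
  <div class="tab_item">
    <div class="tab_item_name">Game B</div>
  </div>
  <div class="tab_item">
    <div class="discount_final_price">4.99</div>
    <div class="tab_item_name">Game C</div>
  </div>
</div>
</body></html>
"""

document = lxml.html.fromstring(sample)
releases_new = document.xpath('//div[@id="tab_newreleases_content"]')[0]

# Flat lists paired with zip: three titles but only two prices, so
# "Game B" receives the price of "Game C" and "Game C" is dropped.
titles = releases_new.xpath('.//div[@class="tab_item_name"]/text()')
prices = releases_new.xpath('.//div[@class="discount_final_price"]/text()')
zipped = [{'game_titles': t, 'game_prices': p} for t, p in zip(titles, prices)]

# Per-game extraction: query inside each wrapper, so a missing price
# becomes None instead of shifting the remaining prices.
result = []
for item in releases_new.xpath('.//div[@class="tab_item"]'):
    title = item.xpath('.//div[@class="tab_item_name"]/text()')
    price = item.xpath('.//div[@class="discount_final_price"]/text()')
    result.append({
        'game_titles': title[0] if title else None,
        'game_prices': price[0] if price else None,
    })

print(zipped)
print(result)
```

Whether this matters in practice depends on the page: if every listed game has both elements, the zip approach and the per-game approach give the same result.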

Complete code for web scraping using lxml

import requests
import lxml.html

# Fetch the page and parse its content into an element tree
website = requests.get('https://store.steampowered.com/explore/new/')
document = lxml.html.fromstring(website.content)

# Select the division that holds the 'Popular New Releases' tab
releases_new = document.xpath('//div[@id="tab_newreleases_content"]')[0]

# Extract the text of every title and price inside that division
game_titles = releases_new.xpath('.//div[@class="tab_item_name"]/text()')
game_prices = releases_new.xpath('.//div[@class="discount_final_price"]/text()')

# Store each title and price pair in a dictionary
result = []
for title, price in zip(game_titles, game_prices):
    out = {}
    out['game_titles'] = title
    out['game_prices'] = price
    result.append(out)

# Show the first three games
print(result[0:3])

