Scrape HTML Table from a Webpage or URL in Python
In this tutorial, we will learn how to scrape HTML tables from websites and extract the relevant information. Doing this with BeautifulSoup can be tedious, so we will use an alternative approach.
We will cover:
- Scrape HTML table from URL.
- Scrape HTML table from the local HTML file.
Scraping is the process of fetching data from a website so that it can be used for further purposes such as analysis. It can be a tedious task with the Beautiful Soup library, so we will use another method here.
Scrape HTML Table from URL
Choose a website whose tables need to be scraped.
Start by installing the library:
pip install html-table-parser-python3
After that, import the necessary libraries: urllib.request, pprint, html_table_parser.parser, and pandas. The scrape_tables function takes the URL as its parameter, opens it, and returns the raw page content, which we decode as UTF-8 and store in xhtml. That string is then passed to the parser with the feed function, and each table row is converted into a list of cell values. We use pprint (pretty print) to present the result in a formatted manner, and a pandas DataFrame to work with the data further.
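import urllib.request
from pprint import pprint
from html_table_parser.parser import HTMLTableParser
import pandas as pd


def scrape_tables(url):
    # Open the URL and return the raw bytes of the page
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()


# Fetch the page and decode the bytes as UTF-8
xhtml = scrape_tables('https://trends.builtwith.com/websitelist/Responsive-Tables').decode('utf-8')

# Feed the HTML to the parser; the parsed tables end up in p.tables
p = HTMLTableParser()
p.feed(xhtml)

# Print the first table as nested lists, then as a DataFrame
pprint(p.tables[0])
print(pd.DataFrame(p.tables[0]))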
[['', 'Website', 'Location', 'Sales Revenue', 'Tech Spend', 'Social', 'Employees', 'Traffic', ''],
 ...
 ['', 'wesleyan.edu', 'United States', '', '$5000+', '10,000+', '1,000+', 'Very High', ''],
 ...
We have scraped the table from the website, and the output contains the relevant information.
(The full output is not shown here because it is too long.)
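To make the row-to-list conversion described above more concrete, here is a minimal sketch built only on Python's standard html.parser module. It is not the actual implementation of HTMLTableParser; the class name SimpleTableParser and the sample HTML string are assumptions made for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Hypothetical, simplified table parser (illustration only)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table>
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Collect text only while inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # A finished cell becomes one string in the current row
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # A finished row becomes one list in the current table
            self.tables[-1].append(self._row)
            self._row = None


# Assumed sample input, not taken from the website above
sample_html = (
    "<table><tr><th>Name</th><th>Age</th></tr>"
    "<tr><td>Tina</td><td>18</td></tr></table>"
)
parser = SimpleTableParser()
parser.feed(sample_html)
print(parser.tables[0])  # [['Name', 'Age'], ['Tina', '18']]
```

The sketch follows the same usage pattern as the tutorial code: create the parser, call feed with the HTML string, then read the rows from tables.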
Scrape a local HTML file
So far, we have seen how to scrape a table from a website. But what if we have an HTML file on our PC and want to scrape a table from it?
We start by creating an HTML file:
<html>
<style>
table, th, td {
  border:2px solid black;
}
</style>
<body>
<h2>A basic HTML table</h2>
<table style="width:90%">
  <tr>
    <th>Name</th>
    <th>Age</th>
    <th>Gender</th>
  </tr>
  <tr>
    <td>Khushi</td>
    <td>20</td>
    <td>F</td>
  </tr>
  <tr>
    <td>Tina</td>
    <td>18</td>
    <td>F</td>
  </tr>
  <tr>
    <td>Atharv</td>
    <td>10</td>
    <td>M</td>
  </tr>
</table>
</body>
</html>
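You can create the HTML file in Notepad by saving it with the .html extension. Open it in a browser to check that it renders correctly.
The browser should show the heading "A basic HTML table" above a bordered table with the columns Name, Age, and Gender and three rows of data.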
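The script for the local file needs a file:// URL instead of a web address. As a minimal sketch, the file can also be written from Python and its URL derived with pathlib rather than typed by hand. The temporary directory, the shortened HTML string, and the variable names are assumptions for illustration; in practice you would point html_path at your own file.

```python
import tempfile
import urllib.request
from pathlib import Path

# Assumed, shortened stand-in for the HTML file shown above
html_text = (
    "<html><body><table>"
    "<tr><th>Name</th></tr><tr><td>Tina</td></tr>"
    "</table></body></html>"
)

# Hypothetical location: a temporary directory stands in for your own folder
folder = Path(tempfile.mkdtemp())
html_path = folder / "Table.html"
html_path.write_text(html_text, encoding="utf-8")

# as_uri() needs an absolute path and returns a file:// URL
file_url = html_path.resolve().as_uri()
print(file_url)

# urllib can open file:// URLs in the same way as web URLs
with urllib.request.urlopen(file_url) as response:
    round_trip = response.read().decode("utf-8")
```

Building the URL this way avoids mistakes with slashes and drive letters when the path is written manually.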
After that, we do the same thing with the local HTML file:
import urllib.request
from pprint import pprint
from html_table_parser.parser import HTMLTableParser
import pandas as pd


def scrape_tables(url):
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()


xhtml = scrape_tables('file:///C:/Users/KHUSHI/Documents/Codespeedy/Table.html').decode('utf-8')

p = HTMLTableParser()
p.feed(xhtml)

pprint(p.tables[0])
print(pd.DataFrame(p.tables[0]))
[['Name', 'Age', 'Gender'],
 ['Khushi', '20', 'F'],
 ['Tina', '18', 'F'],
 ['Atharv', '10', 'M']]
        0    1       2
0    Name  Age  Gender
1  Khushi   20       F
2    Tina   18       F
3  Atharv   10       M
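In the DataFrame above, the header row ends up as data row 0 and the columns are numbered 0, 1, 2. A minimal sketch of one way to promote the first row to column names follows. The rows list is copied from the output above rather than parsed again, and converting Age to numbers is an optional extra step added for illustration.

```python
import pandas as pd

# Rows as printed above; the first row holds the column names
rows = [
    ['Name', 'Age', 'Gender'],
    ['Khushi', '20', 'F'],
    ['Tina', '18', 'F'],
    ['Atharv', '10', 'M'],
]

# Use the first row as the header and the remaining rows as data
df = pd.DataFrame(rows[1:], columns=rows[0])

# Parsed cells are strings; convert Age so it can be used in calculations
df['Age'] = pd.to_numeric(df['Age'])

print(df)
```

With named columns, the data can be selected as df['Name'] or filtered on df['Age'] directly.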
We have scraped the table from the website and from the local HTML file, and in both cases the output contains the relevant information. Thanks for reading this tutorial. I hope you found it helpful.