Convert an HTML table to a pandas DataFrame

Let us see how to convert an HTML table to a pandas DataFrame using Python. HTML provides the <table> tag for presenting data in tabular form, and the pandas library provides the read_html() function to import such tables as DataFrames.

read_html() function

  • This function reads the tables in an HTML document and returns them as a list of pandas DataFrames, one per table.
  • It can read a local file as well as a page on the internet, given its URL.
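
To make the return value concrete before we move on to files and URLs, here is a minimal sketch that parses a small table held in a string. The variable names and the use of io.StringIO are our own choices for illustration, and the sketch assumes an HTML parser such as lxml is already installed.

```python
from io import StringIO

import pandas as pd

# A small HTML table held in a string (illustrative data).
html = """
<table>
  <thead>
    <tr><th>Full Name</th><th>Position</th><th>Salary</th></tr>
  </thead>
  <tbody>
    <tr><td>Bill Gates</td><td>Founder Microsoft</td><td>$1000</td></tr>
    <tr><td>Steve Jobs</td><td>Founder Apple</td><td>$1200</td></tr>
  </tbody>
</table>
"""

# Wrapping the string in StringIO lets read_html() treat it like an open file.
tables = pd.read_html(StringIO(html))

# read_html() always returns a list, even when only one table is found.
print(type(tables))
print(len(tables))
print(tables[0].columns.tolist())
```

Because the result is a list, we always pick a table by position (for example tables[0]) before working with it as a DataFrame.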

Reading tables from a file

Consider an HTML file called ‘table.html’ that contains the following table:

<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="UTF-8">
  <title>Table Data</title>
</head>

<body>
  <table>
    <thead>
      <tr>
        <th>Full Name</th>
        <th>Position</th>
        <th>Salary</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Bill Gates</td>
        <td>Founder MIcrosoft</td>
        <td>$1000</td>
      </tr>
      <tr>
        <td>Steve Jobs</td>
        <td>Founder Apple</td>
        <td>$1200</td>
      </tr>
      <tr>
        <td>Mark Zuckerberg</td>
        <td>Founder Facebook</td>
        <td>$1300</td>
      </tr>
    </tbody>
  </table>
</body>
</html>
  • By default, pandas relies on another library called ‘lxml’ to parse HTML and XML files. So, install ‘lxml’ by running this command:
pip install lxml
  • Now we are ready to use read_html(). It returns a list of DataFrames, one for each table found in the file, so we select a particular table by indexing into that list.

The following Python code shows how to use the function:

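import pandas as pd

# read_html() returns a list with one DataFrame per <table> in the file
tables = pd.read_html('table.html')

# The file contains a single table, so take the first element of the list
df = tables[0]
print("Display table")
print(df)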

Output:

Display table
         Full Name           Position Salary
0       Bill Gates  Founder MIcrosoft  $1000
1       Steve Jobs      Founder Apple  $1200
2  Mark Zuckerberg   Founder Facebook  $1300
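
When a document holds more than one table, indexing by position works, but it depends on the order of the tables in the page. As a hedged sketch, the example below builds a string with two small tables (the rows are taken from the examples in this article) and picks one either by index or with the match parameter of read_html(), which keeps only the tables containing the given text. The variable names are our own.

```python
from io import StringIO

import pandas as pd

# Two tables in one document (illustrative, shortened data).
html = """
<table>
  <tr><th>Full Name</th><th>Salary</th></tr>
  <tr><td>Bill Gates</td><td>$1000</td></tr>
</table>
<table>
  <tr><th>Company</th><th>Country</th></tr>
  <tr><td>Alfreds Futterkiste</td><td>Germany</td></tr>
</table>
"""

# Without a filter, every table is returned, in document order.
all_tables = pd.read_html(StringIO(html))
second_table = all_tables[1]

# With match, only tables whose text matches the string (or regex) are kept.
company_tables = pd.read_html(StringIO(html), match="Company")

print(len(all_tables))
print(second_table)
print(len(company_tables))
```

Selecting by text is usually less fragile than selecting by position when the page layout may change.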

Reading tables from a URL

Just as with a local HTML file, this function can read tables from a web page. In this case, we pass the URL of the page instead of a file name.

For example:

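import pandas as pd

# Passing a URL makes pandas download the page and parse every table in it
tables = pd.read_html('https://www.w3schools.com/html/html_tables.asp')
print('Tables found:', len(tables))

# Show the first five rows of the first table on the page
df1 = tables[0]
print('First Table')
print(df1.head())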

Output:

Tables found: 2
First Table
                        Company          Contact  Country
0           Alfreds Futterkiste     Maria Anders  Germany
1    Centro comercial Moctezuma  Francisco Chang   Mexico
2                  Ernst Handel    Roland Mendel  Austria
3                Island Trading    Helen Bennett       UK
4  Laughing Bacchus Winecellars  Yoshi Tannamuri   Canada
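
Some websites refuse requests that do not look like they come from a browser, in which case passing the URL straight to read_html() may fail. One possible workaround, sketched below, is to download the page ourselves with a User-Agent header and then hand the HTML text to read_html(). The helper name read_tables_from_url, the header value and the timeout are assumptions made for this illustration, not part of pandas.

```python
from io import StringIO
from urllib.request import Request, urlopen

import pandas as pd


def read_tables_from_url(url, user_agent="Mozilla/5.0", timeout=30):
    """Download a page and return its tables as a list of DataFrames.

    Hypothetical helper for illustration; the default header value
    and timeout are arbitrary choices.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset announced by the server, or fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    # read_html() raises ValueError if the page contains no tables.
    return pd.read_html(StringIO(html))
```

We could then call read_tables_from_url('https://www.w3schools.com/html/html_tables.asp') and index into the returned list exactly as in the example above.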
