Extract Tables from PDF in Python

In this tutorial, we will learn how to extract tables from a PDF in Python. Programs often need to work with tabular data, but when that data is stored in a PDF, we have to extract it first.

We will discuss two easy ways to extract tables from PDF in Python. For the first, we will use ‘tabula-py’ together with ‘tabulate’, and for the second, we will use ‘Camelot’.

How to extract tables from PDF in Python

Python makes this task easy because third-party packages do most of the work.

We will show two methods here, each built on such a package.

Assume that the PDF contains the table given below:

Sl.  Name  RollNo.  Dept
1    Ana    011     CSE
2    Ram    012     CSE
3    Joe    014     EE
4    Ken    024     ME
5    Ben    035     CE

This PDF is saved as ‘CodeSpeedy.pdf’. It contains a table of the students’ serial numbers, names, roll numbers, and departments.

There are many ways to extract this table in Python. We will discuss two of them.

 

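Using tabula-py and tabulate: Extract tables from PDF

First, we need to install tabula-py, which extracts the tables, and tabulate, which prints them neatly. Note that tabula-py is a wrapper around tabula-java, so a Java runtime must also be installed.

You can use the commands given below: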

pip install tabula-py
pip install tabulate

Then you can use the code below:

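from tabula import read_pdf
from tabulate import tabulate

# read_pdf returns a list of pandas DataFrames, one per detected table
tables = read_pdf("CodeSpeedy.pdf", pages="all")

# Print each table separately, using the column names as headers
for table in tables:
    print(tabulate(table, headers="keys", showindex=False))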

First, we import the necessary packages. Then we read the PDF and extract the tables from it.

Here, read_pdf extracts the tables from the PDF as pandas DataFrames, and tabulate formats the data as a plain-text table for printing.

 

Using Camelot

We need to install camelot-py, which provides the camelot module, to extract tables from PDF in Python.

You can use the command below:

pip install camelot-py

Then you can use the code below:

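import camelot

# By default, read_pdf parses only the first page (pages="1")
tables = camelot.read_pdf("CodeSpeedy.pdf")

# Each extracted table exposes its data as a pandas DataFrame via .df
print(tables[0].df)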

First, we import the camelot package. Then we read the PDF file and extract the tables from it.

Here, read_pdf extracts the tables from the PDF, and tables[ind].df returns the table at index ind as a pandas DataFrame.
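
Camelot returns every cell as a string and labels the columns 0, 1, 2 and so on, so the header text from the PDF usually ends up in the first data row. Below is a minimal sketch of promoting that row to column names. The hand-built DataFrame raw is a stand-in for tables[0].df (an assumption, so the snippet runs without a PDF), and the helper name promote_header is hypothetical.

```python
import pandas as pd

# Stand-in for tables[0].df (hypothetical; no PDF is read here).
# Integer column labels and string cells mimic Camelot's output.
raw = pd.DataFrame(
    [
        ["Sl.", "Name", "RollNo.", "Dept"],
        ["1", "Ana", "011", "CSE"],
        ["2", "Ram", "012", "CSE"],
    ]
)


def promote_header(df: pd.DataFrame) -> pd.DataFrame:
    """Use the first row as column names and drop it from the data."""
    result = df.iloc[1:].reset_index(drop=True)
    result.columns = df.iloc[0].tolist()
    return result


students = promote_header(raw)
print(students)
```

Because the cells stay strings, roll numbers such as 011 keep their leading zeros.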

 

These are two popular methods to extract tables from PDF in Python.

I hope it will be useful.

Thank you!

