Extracting Text from a PDF File in Python

In this article, I will show you how to extract text from a PDF file in Python.

Before diving into the topic, a few things need to be installed and configured.

The pdftotext module is used as the main component to extract text.

Steps to install the required modules:

  • Open the command line or the terminal, depending on your operating system.
  • Install the pdftotext Python library with pip using the command below:
    pip install pdftotext
  • If the installation fails with an error, follow the steps below.
  • On Debian or Ubuntu, reopen the terminal and run:
    sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python3-dev
  • Now repeat the second step to install pdftotext itself.

 

I hope the above steps are clear and that you have installed everything.

To check whether the module is installed, try importing it:

import pdftotext

Run this line of code. If every step was followed correctly, it should finish without an error.
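
If you would rather get a clear message than a traceback, the check can also be done without importing the module. The snippet below is only a minimal sketch: the helper name is_module_installed is my own, and it relies on importlib.util.find_spec from the standard library.

```python
import importlib.util


def is_module_installed(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # is_module_installed is a hypothetical helper, not part of pdftotext.
    if is_module_installed("pdftotext"):
        print("pdftotext is installed")
    else:
        print("pdftotext is missing; run: pip install pdftotext")
```

Note that this only tells you the module can be located. A broken build could still fail on the actual import.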

Extract Text from a PDF file in Python:

The PDF file is first opened in "rb" mode, which means the file is read in binary mode.

import pdftotext

pdf_file = open("/home/gvj861/Downloads/Vth.pdf", "rb")  # open a PDF file stored on the system

After that, pdftotext extracts the text from the PDF. The resulting object holds the text of each page.

import pdftotext

pdf_file = open("/home/gvj861/Downloads/Vth.pdf", "rb")

gvj_pdf = pdftotext.PDF(pdf_file)  # extract the text using the imported module
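
Opening the file by hand means you have to remember to close it. As a possible alternative, here is a small sketch that uses a with block, so the file is closed automatically. The function name read_pages is my own choice for illustration, and copying the pages into a list is just one way to keep the text after the file is closed.

```python
def read_pages(path):
    """Return the text of every page in the PDF at `path` as a list of strings."""
    import pdftotext  # imported lazily: only needed when the function is called

    # The with block closes the file even if pdftotext raises an error.
    with open(path, "rb") as pdf_file:
        pdf = pdftotext.PDF(pdf_file)
        # Copy the page texts while the file is still open.
        return list(pdf)
```

It could then be called as read_pages("/home/gvj861/Downloads/Vth.pdf"), which gives back an ordinary list with one string per page.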

There are now different ways to read the data.

Process – 1 :

Iterate page by page and get the data with a for loop.

import pdftotext

pdf_file = open("/home/gvj861/Downloads/Vth.pdf" , "rb")

gvj_pdf = pdftotext.PDF(pdf_file)


for page in gvj_pdf:  # iterate over every page in the PDF
    print(page)

pdf_file.close()

In this way, the whole PDF is read as text.
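
If you need the whole document as one string instead of printing page by page, the pages can be joined. This is a rough sketch: join_pages is a name I made up, and the list in the demo only stands in for a pdftotext.PDF object, which can be passed to str.join in the same way.

```python
def join_pages(pages, separator="\n\n"):
    """Combine the text of all pages into one string."""
    return separator.join(pages)


if __name__ == "__main__":
    # Stand-in data; in practice pass the object returned by pdftotext.PDF(...).
    sample_pages = ["Text of page one", "Text of page two"]
    print(join_pages(sample_pages))
```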

Process – 2 :

A single page of the PDF can be read by its index. Indexing is zero-based, so index 0 is the first page.

import pdftotext

pdf_file = open("/home/gvj861/Downloads/Vth.pdf" , "rb")

gvj_pdf = pdftotext.PDF(pdf_file)

page_number = 4  # zero-based index (the fifth page); can be supplied by the user

print(gvj_pdf[page_number])

pdf_file.close()
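
When the page number comes from a user, it is usually given starting from 1 and may be out of range. The sketch below shows one way to handle that, assuming the pages object supports len() and indexing, as a pdftotext.PDF object does. The helper get_page is hypothetical, and the list in the demo is stand-in data.

```python
def get_page(pages, page_number):
    """Return page `page_number` (counted from 1) from a sequence of page texts."""
    if not 1 <= page_number <= len(pages):
        raise ValueError(f"page_number must be between 1 and {len(pages)}")
    # Convert the 1-based page number to a 0-based index.
    return pages[page_number - 1]


if __name__ == "__main__":
    # Stand-in data; in practice pass the object returned by pdftotext.PDF(...).
    sample_pages = ["one", "two", "three"]
    print(get_page(sample_pages, 2))
```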

This is how text is extracted from a PDF file.

In conclusion, even more can be done: for example, the extracted text can be written to a text file.
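
Writing the extracted text to a file could look something like the sketch below. The names save_text and demo, the file name output.txt, and the blank line between pages are all my own assumptions. The demo uses a plain list in place of a pdftotext.PDF object so it runs without a real PDF.

```python
import os
import tempfile


def save_text(pages, out_path):
    """Write the page texts to `out_path`, separated by blank lines."""
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write("\n\n".join(pages))


def demo():
    """Save two stand-in pages to a temporary file and read them back."""
    pages = ["first page", "second page"]  # stand-in for a pdftotext.PDF object
    with tempfile.TemporaryDirectory() as folder:
        out_path = os.path.join(folder, "output.txt")
        save_text(pages, out_path)
        with open(out_path, encoding="utf-8") as in_file:
            return in_file.read()


if __name__ == "__main__":
    print(demo())
```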
