Extracting Text from a PDF File in Python
In this article, I will show you how to extract text from a PDF file in Python.
Before diving into the topic, a few things need to be installed.
The pdftotext module is used as the main component to extract text.
Steps to install the required modules:
- Open the command line or the terminal based on your operating system.
- Install the pdftotext Python library with pip using the command below:
pip install pdftotext
- If the installation raises an error, follow the steps below.
- Reopen the terminal and run: sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python3-dev
- Then run the pip command from the second step again.
To check whether the library is installed, run the following import:
import pdftotext
Run this line. If every step was followed correctly, it should complete without an error.
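If you would rather check for the module from inside a script than wait for an import error, the standard library can look it up without importing it. The following is a minimal sketch; the helper name is_installed is made up for this illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("pdftotext"):
    print("pdftotext is available")
else:
    print("pdftotext is missing; repeat the installation steps")
```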
Extract Text from a PDF file in Python:
The PDF file is first opened in "rb" mode, which means the file is read in binary mode.
import pdftotext

pdf_file = open("/home/gvj861/Downloads/Vth.pdf", "rb")  # opening a PDF file stored on the system
After that, the text is extracted from the PDF by passing the open file to pdftotext.PDF.
import pdftotext

pdf_file = open("/home/gvj861/Downloads/Vth.pdf", "rb")
gvj_pdf = pdftotext.PDF(pdf_file)  # using the module imported above
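Opening the file with a with statement closes it automatically, even if the extraction fails. The sketch below is one way to wrap the two steps above. The helper name load_pages is hypothetical, and the code assumes pdftotext is installed.

```python
def load_pages(path):
    """Open a PDF in binary mode and return its pages as a list of strings."""
    import pdftotext  # imported here so a missing library only fails when the helper is called

    with open(path, "rb") as pdf_file:
        # list() collects one string per page while the file is still open
        return list(pdftotext.PDF(pdf_file))


# Example usage with the path from this article:
# pages = load_pages("/home/gvj861/Downloads/Vth.pdf")
# print(len(pages))  # number of pages
```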
There are now different ways to read the data.
Process 1:
Iterate page by page and get the data with a for loop.
import pdftotext

pdf_file = open("/home/gvj861/Downloads/Vth.pdf", "rb")
gvj_pdf = pdftotext.PDF(pdf_file)
for i in gvj_pdf:  # iterating every page in pdf
    print(i)
pdf_file.close()
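When the whole document is printed at once, it can help to mark where each page begins. Here is a small sketch that assumes the pages behave like any iterable of strings, as the loop above suggests. The name format_pages and the sample data are made up for illustration.

```python
def format_pages(pages):
    """Return one string with a numbered header before each page's text."""
    parts = []
    for number, text in enumerate(pages, start=1):
        parts.append(f"--- Page {number} ---\n{text}")
    return "\n".join(parts)


# Stand-in data; with pdftotext, pass the gvj_pdf object instead.
sample_pages = ["first page text", "second page text"]
print(format_pages(sample_pages))
```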
This way, the whole PDF is read as text.
Process 2:
A single page can be read by indexing with its page number. Indexing starts at 0, so page_number = 4 returns the fifth page.
import pdftotext

pdf_file = open("/home/gvj861/Downloads/Vth.pdf", "rb")
gvj_pdf = pdftotext.PDF(pdf_file)
page_number = 4  # can be dynamically given by user
print(gvj_pdf[page_number])
pdf_file.close()
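When the page number comes from a user, it is worth treating it as a number counted from 1 and checking it against the page count before indexing. A minimal sketch, assuming the pages object supports len() and indexing like a list; get_page is a hypothetical helper name.

```python
def get_page(pages, page_number):
    """Return the text of a 1-based page number, or raise ValueError if out of range."""
    if not 1 <= page_number <= len(pages):
        raise ValueError(f"page {page_number} is out of range (1 to {len(pages)})")
    return pages[page_number - 1]  # convert to the 0-based index used for lookup


# Stand-in data; with pdftotext, pass the gvj_pdf object instead.
sample_pages = ["page one", "page two", "page three"]
print(get_page(sample_pages, 2))
```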
This is how text is extracted from a PDF file.
In conclusion, more can be done: the extracted text can also be written to a text file.
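Writing the text to a file could look like the sketch below. It assumes the pages are available as an iterable of strings; the helper name save_text, the output file name, and the blank-line separator are choices made for this illustration.

```python
def save_text(pages, out_path):
    """Write all pages to a UTF-8 text file, separated by blank lines."""
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write("\n\n".join(pages))


# Example usage, with gvj_pdf created as shown earlier in this article:
# save_text(gvj_pdf, "extracted.txt")
```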