How to Read a PDF File in Python Line by Line?
You have probably seen various examples of text file handling, in which text is written to a file or read from it either as a whole (using the ‘read()’ function) or line by line (using the ‘readline()’ or ‘readlines()’ function). For that, we do not need to import any external library; the functionality is built into every version of Python.
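As a quick reminder of the built-in approach, here is a minimal sketch of the three reading styles mentioned above. The file name ‘notes.txt’ and its contents are made up for illustration.

```python
# Recap of built-in text file handling; no external library is required.
# 'notes.txt' is a hypothetical file created here just for the demo.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")

with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

# read() returns the whole file as one string
with open(path) as f:
    whole_text = f.read()

# readline() returns one line per call, including its trailing newline
with open(path) as f:
    first_line = f.readline()

# readlines() returns a list with every line of the file
with open(path) as f:
    all_lines = f.readlines()

print(repr(first_line))
print(len(all_lines))
```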
Working with PDF files is a bit different. We may need to read PDF files to perform various Natural Language Processing tasks or for other purposes, but Python does not ship with a built-in library for reading and writing PDF files. Therefore, we need an external library known as ‘PyPDF’. Several variants exist (for example the PyPDF4 fork), but we will be using PyPDF2.
PyPDF is written in pure Python, so it runs wherever Python runs without depending on compiled external libraries. It is capable of extracting document information, splitting documents, merging documents, cropping pages, encrypting and decrypting PDFs, and more.
Reading a PDF File Line by Line
Before we get into the code, one important point: here we are dealing with text-based PDFs (PDFs generated by a word processor). Image-based PDFs need to be handled with a different library known as ‘pytesseract’. That does not mean they cannot be handled with PyPDF at all, but the drawback is that we would need to change the encoding and convert the file into a text-based PDF, which would result in loss of data. Hence, it is not advisable to do so. We will cover image-based PDFs in a separate article.
So, let’s get started. Our first task is to install the PyPDF2 library.
$ pip3 install PyPDF2
Now it is time for the actual code. One important thing to understand is that the PyPDF library has no direct method for reading a PDF file line by line; it always returns the text of a page as a whole (using the ‘extractText()’ function, which is named ‘extract_text()’ in newer PyPDF2 releases). The good thing to know is that it always returns a string as its output.
So we need to find a pattern that separates each line from the next throughout the PDF document. Here I have used a sample PDF file (mypdf.pdf) in which each line is separated by a run of blank spaces, so I split the lines (using the ‘split()’ function) with two blank spaces as the argument. In other PDF files the lines may be separated by ‘\n’, in which case you can pass ‘\n’ to ‘split()’ instead.
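The separator idea above can be tried out without any PDF at all. The sketch below is only an illustration: the helper name ‘split_into_lines’ and the sample strings are assumptions, and the strings merely mimic extracted text in which lines are separated by two blank spaces or by ‘\n’.

```python
def split_into_lines(text, separator="  "):
    """Split extracted text on the separator and drop empty pieces."""
    pieces = text.split(separator)
    # Runs longer than the separator produce empty strings; strip and skip them
    return [piece.strip() for piece in pieces if piece.strip()]


# Hypothetical extracted text: lines separated by two (or more) blank spaces
sample = "First line of the page  Second line    Third line"
lines = split_into_lines(sample)

# The same helper works for newline-separated text
newline_lines = split_into_lines("First line\nSecond line\n", separator="\n")

print(lines)
print(newline_lines)
```

Filtering out the empty pieces matters because a gap of four spaces, split on two, leaves an empty string between the two neighbouring lines.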
Below is our Python program to read the PDF file line by line:
# Importing the required module
import PyPDF2

# Opening the PDF in binary mode; the with-block closes the file automatically
with open('mypdf.pdf', 'rb') as pdf_file:
    # Creating a pdf reader object
    # (PdfReader, pages and extract_text() are the names used by recent PyPDF2
    # releases; PyPDF2 1.x called them PdfFileReader, numPages/getPage()
    # and extractText())
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Loop for reading all the pages (page numbers are counted from 0)
    for page_no, page in enumerate(pdf_reader.pages):
        # Printing the page number
        print("Page No: ", page_no)

        # Extracting the text of the page and splitting it into a list of lines;
        # the sample file separates lines with two blank spaces
        lines = page.extract_text().split("  ")

        # Iterating over the list and printing each line,
        # with an empty line after it
        for line in lines:
            print(line, end="\n\n")

        # Separating the pages
        print()
As you can see, the content of each page is printed to the console.
I hope this article is useful to you. ‘Keep Learning, Keep Coding’.
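If the lines are needed for further processing rather than console output, they can be collected instead of printed. The following is only a sketch under stated assumptions: ‘iter_page_lines’ and ‘FakePage’ are hypothetical names, and the function assumes each page object exposes an ‘extract_text()’ method, as page objects in recent PyPDF2 releases do (older releases call it ‘extractText()’). The stand-in pages let the sketch run without a PDF file.

```python
def iter_page_lines(pages, separator="  "):
    """Yield (page_number, line) pairs for every non-empty line."""
    for page_number, page in enumerate(pages):
        # extract_text() returns the text of the page as a single string
        for piece in page.extract_text().split(separator):
            line = piece.strip()
            if line:
                yield page_number, line


class FakePage:
    """Hypothetical stand-in for a PDF page, used only for this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo_pages = [FakePage("Title  First line"), FakePage("Second page line")]
collected = list(iter_page_lines(demo_pages))
print(collected)
```

With a real file and a recent PyPDF2 release, the same function could be given the ‘pages’ attribute of a reader object while the file is still open.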