Read a Particular Page from a PDF File in Python
After reading this tutorial, you will be able to read a particular page from a PDF file in Python. We use the PyPDF2 module for this. PyPDF2 is not part of the Python standard library, so install it first by running the following command in your Command Prompt (cmd).
C:\Users\...\Python\Scripts> pip install PyPDF2
This installs the PyPDF2 package. PyPDF2 provides several classes, but we only need the PdfFileReader class to read a PDF file. (In PyPDF2 3.0 and later this class is named PdfReader, and using the old name raises a deprecation error.) It can be imported as follows:
from PyPDF2 import PdfFileReader as R
How to Read a Particular Page from a PDF File in Python
Here, the PdfFileReader class is imported as R (i.e. R = PdfFileReader). We cannot read data from a file without opening it, so let's first look at opening a PDF file.
Opening a File:
f=open("Path_to_your_PDF_File","rb")
Here, f is a file object for the PDF file located at the specified path (i.e. Path_to_your_PDF_File). open() is a built-in function that opens the specified file in the specified mode (here, "rb"). "rb" combines read mode and binary mode, so f gives read-only access to the raw bytes of the PDF file.
To know more about file modes, see: Introduction to file handling of python
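As a small illustration of what binary read mode gives you, the sketch below opens a file with "rb" and checks whether its first bytes are the usual PDF signature (%PDF-). The helper name looks_like_pdf and the temporary file are assumptions made for this example only; they are not part of PyPDF2. The with statement is used so that the file is closed automatically.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature."""
    # "rb" = read + binary: the bytes come back exactly as stored on disk.
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demonstration with a throwaway file, so no real PDF is needed.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
    sample_path = tmp.name

is_pdf = looks_like_pdf(sample_path)
os.remove(sample_path)
print(is_pdf)
```

A check like this is only a quick sanity test; a file can start with the signature and still be damaged further on.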
Next, create an object of the PdfFileReader class (i.e. R) as follows:
pdf=R(f)
Here, pdf is the PdfFileReader object that reads the PDF file. It exposes a list-like attribute, pages, which holds one Page Object per page.
i.e. pdf.pages=[ PO1, PO2, PO3, … , POn]
where PO1 to POn are the Page Objects of the n pages of the given PDF file. pdf.pages[0] returns the Page Object of page 1 (PO1), pdf.pages[1] returns the Page Object of page 2 (PO2), and so on.
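Because the pages are counted from 0 while readers usually think of pages starting at 1, it can help to convert explicitly. The following is a minimal sketch; page_index is a hypothetical helper name, and a plain Python list stands in for pdf.pages.

```python
def page_index(page_number, page_count):
    """Convert a 1-based page number into a 0-based index, with a range check."""
    if not 1 <= page_number <= page_count:
        raise IndexError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1


# A plain list stands in for pdf.pages in this illustration.
pages = ["PO1", "PO2", "PO3"]
third = pages[page_index(3, len(pages))]
print(third)  # prints PO3
```

The range check turns an out-of-range request into a clear error message instead of silently picking a page from the end, which is what a negative list index would do.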
Each Page Object has various methods, but we only need the extractText() method to extract the text from that page (PyPDF2 3.0 and later call it extract_text()). Let's have a look at the following code to read a particular page from a PDF file in Python.
Example:
from PyPDF2 import PdfFileReader as R
f = open("Path_to_your_PDF_File", "rb")
pdf = R(f)
page_no = 2  # I have selected the 3rd page to display its contents
P_O = pdf.pages[page_no]  # Since pages are counted from 0
print(P_O.extractText())
f.close()
In the above Python script:
- f is the file object
- pdf is the PdfFileReader object
- page_no is the zero-based index of an existing page in the PDF file (2 selects the third page)
- P_O is the corresponding Page Object for the given page index
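The same steps can also be wrapped in a function that works with both the older names used in this tutorial (PdfFileReader, extractText) and the newer ones (PdfReader, extract_text). This is only a sketch under that assumption: page_text and read_page are hypothetical helper names, and the path in the commented usage line is the same placeholder as above.

```python
def page_text(reader, page_index):
    """Return the text of the page at a 0-based index of a PyPDF2-style reader."""
    page = reader.pages[page_index]
    # Newer PyPDF2 releases name the method extract_text();
    # older releases only have extractText().
    extract = getattr(page, "extract_text", None) or page.extractText
    return extract()


def read_page(path, page_index):
    """Open the PDF at `path` and return the text of one page."""
    # Imported here so that page_text() stays usable without PyPDF2 installed.
    import PyPDF2

    # Prefer the newer class name and fall back to the older one.
    reader_cls = getattr(PyPDF2, "PdfReader", None) or PyPDF2.PdfFileReader
    with open(path, "rb") as f:  # closed automatically, even if an error occurs
        return page_text(reader_cls(f), page_index)


# Usage (needs PyPDF2 and a real file, so it is left commented out):
# print(read_page("Path_to_your_PDF_File", 2))
```

Reading the text inside the with block matters, because the reader pulls page data from the open file when the page is accessed.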
Input:
A Sample PDF File -> PDF_sample.pdf
Output:
The above code prints the text extracted from the third page of the sample PDF file.
In this way, we can read a Particular Page from the given PDF File using Python.
For further reference, see Watermark on PDF