Count the number of words in a PDF file in Python
Hello programmers, in this tutorial, we will learn how to count the number of words in a PDF file in Python.
To count the words in a PDF, we use the PyPDF2 library for Python, which is a fork of the earlier pyPdf package.
- First, we have to install this library on our system.
# Installation of the PyPDF2 library
pip install PyPDF2
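- Next, we import the library.
- Then we open the PDF file in binary read mode ("rb") and pass the open file object to the PdfFileReader class of PyPDF2. Note that PdfFileReader, numPages, getPage and extractText are the names used by PyPDF2 releases before 3.0.0; later releases renamed them.
- To get the number of pages, we read the numPages property of the reader.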
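import PyPDF2

# Open the PDF in binary read mode
file = open("C:\\Users\\sumit\\..files\\2.pdf", 'rb')
# Pass the open file object to the reader
ReadPDF = PyPDF2.PdfFileReader(file)
# numPages holds the number of pages in the document
pages = ReadPDF.numPages
print(pages)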
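For readers on PyPDF2 3.x, where the older names were removed, the same step might look like the sketch below. It assumes the newer PdfReader class and its pages sequence; the helper names count_pages and count_pages_in_file are hypothetical and only used for illustration.

```python
def count_pages(reader):
    """Return the page count of a reader object that exposes `.pages`."""
    # In PyPDF2 3.x, `reader.pages` is a sequence of page objects,
    # so len() takes the place of the older `numPages` property.
    return len(reader.pages)


def count_pages_in_file(pdf_path):
    """Open the PDF at `pdf_path` and return its page count (hypothetical helper)."""
    # Imported here so count_pages() stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 3.x is installed

    # The with-block closes the file even if reading fails.
    with open(pdf_path, "rb") as pdf_file:
        return count_pages(PdfReader(pdf_file))
```

Calling count_pages_in_file with the path of a PDF should give the same number that numPages reports in the older API.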
- Now, to count the words, we create a variable set to zero that will hold the running total.
- Next, we loop over the pages and extract the text of each one with the extractText method.
- Finally, we count the words on each page, add them to the total, and print the result with the print function.
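# Running total of words across all pages
TWords = 0
for i in range(pages):
    pageObj = ReadPDF.getPage(i)
    text = pageObj.extractText()
    # split() with no arguments splits on any run of whitespace
    TWords += len(text.split())
# Close the file once all pages have been read
file.close()
print(TWords)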
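The counting step itself does not depend on PyPDF2, so it can be tried on plain strings first. The sketch below separates it into a small function; count_words and count_words_in_pdf are hypothetical names, and the second function assumes the PyPDF2 3.x names PdfReader, pages and extract_text.

```python
def count_words(page_texts):
    """Return the total number of whitespace-separated words in the given texts."""
    total = 0
    for text in page_texts:
        # Treat a missing text (None) as empty; this is a defensive assumption.
        total += len((text or "").split())
    return total


def count_words_in_pdf(pdf_path):
    """Count the words in the PDF at `pdf_path` (hypothetical helper)."""
    from PyPDF2 import PdfReader  # assumption: PyPDF2 3.x is installed

    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # The texts are extracted and counted while the file is still open.
        return count_words(page.extract_text() for page in reader.pages)


# Sample "pages" standing in for extracted text.
sample_pages = ["Hello   world", "one\ntwo\tthree", ""]
print(count_words(sample_pages))  # prints 5
```

Because split() only looks at whitespace, the total depends on how cleanly the text was extracted from each page.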
Hopefully, you have learned how to count the number of words in a PDF file in Python.