Count the number of words in a PDF file in Python

Hello programmers, in this tutorial, we will learn how to count the number of words in a PDF file in Python.

To count the words in a PDF, we will use PyPDF2, a Python library that is an extended version of the older pyPdf module.

Let's start.

  • First, we have to install this library on our system.
# Installation of PyPDF2 library
pip install PyPDF2
  • Now we have to import this library.
  • Then we open the PDF file in binary read mode ("rb") and pass the open file object to the PdfFileReader class of PyPDF2.
  • To count the number of pages, we use the numPages property of the reader.
import PyPDF2

# Open the PDF in binary read mode ('rb')
file = open("C:\\Users\\sumit\\..files\\2.pdf", 'rb')
# Pass the open file object to the reader
ReadPDF = PyPDF2.PdfFileReader(file)
# numPages holds the number of pages in the PDF
pages = ReadPDF.numPages
print(pages)
output:2
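
Note that PyPDF2 3.0 and later removed PdfFileReader and numPages in favor of PdfReader and len(reader.pages). Here is a minimal sketch of the same page count with those newer names, assuming PyPDF2 3.x is installed. The count_pages function and its pdf_path parameter are only illustrative names, not part of the library.

```python
def count_pages(pdf_path):
    """Return the number of pages in the PDF at pdf_path (illustrative helper)."""
    # Imported inside the function so PyPDF2 is only needed when it is called.
    # Assumption: PyPDF2 3.x, where PdfReader replaces PdfFileReader.
    from PyPDF2 import PdfReader

    # The with-block closes the file automatically once we are done.
    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # reader.pages behaves like a list of pages, so len() gives the count.
        return len(reader.pages)
```

Calling count_pages with the path to your own PDF should give the same number that numPages gives in the older API.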
  • Now, to count the number of words, we create a variable and set it to zero. Later, we will store the number of words in it.
  • After that, we create a for loop to extract the text from each page of the PDF. For this, we use the extractText method of the page object.
  • Finally, we count the words on each page, add them to the variable we defined earlier, and print the total using the print function.
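# Running total of words across all pages
TWords = 0
for i in range(pages):
    # Get page i and extract its text
    pageObj = ReadPDF.getPage(i)
    text = pageObj.extractText()
    # split() breaks the text on whitespace, so len() gives the word count
    TWords += len(text.split())

print(TWords)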
output:83
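
The loop above can also be wrapped in a reusable function. The sketch below is one possible way to do it, assuming PyPDF2 3.x, where getPage(i) becomes reader.pages[i] and extractText() becomes extract_text(). The names count_words_in_text and count_words_in_pdf are hypothetical helpers made up for this example.

```python
def count_words_in_text(text):
    """Count whitespace-separated words in a string."""
    # split() with no arguments splits on any run of spaces, tabs, or newlines.
    return len(text.split())


def count_words_in_pdf(pdf_path):
    """Return the total word count over all pages of the PDF at pdf_path."""
    # Assumption: PyPDF2 3.x, which provides PdfReader and extract_text().
    from PyPDF2 import PdfReader

    total = 0
    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        for page in reader.pages:
            # Guard against a page that yields no text at all.
            text = page.extract_text() or ""
            total += count_words_in_text(text)
    return total
```

For example, count_words_in_pdf("2.pdf") would return the total for a file named 2.pdf in the current directory. Text extraction from PDFs is not perfect, so the result may differ slightly from the count shown by a word processor.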

Hopefully, you have learned how to count the number of words in a PDF file in Python.
