How to fetch text from a PDF and store in a .txt file in Java

In this tutorial, we will look at how to fetch text from a PDF file and store it in a text file using Java. To do this, we will use PDFBox, an open-source Java library from Apache.

Steps to get started with PDFBox:

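  1. Download the latest version of the PDFBox JAR from this link: https://pdfbox.apache.org/download.html. The standalone pdfbox-app JAR bundles the library's dependencies, so adding that single JAR is enough.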
  2. Launch Eclipse and create a new Java project.
  3. Right-click on the project name and then click on the Build Path option.
  4. Next, select Configure Build Path.
  5. Go to the Libraries tab and click on Add External JARs.
  6. Select the downloaded PDFBox JAR file.
  7. Click on Apply and Close.
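
Once the JAR is on the build path, a quick check can confirm that the project actually sees it. The following is a minimal sketch that uses only the standard library. The class name ClasspathCheck and the helper isOnClasspath are hypothetical names chosen for illustration; org.apache.pdfbox.Loader is the PDFBox class imported by the sample program in this tutorial.

```java
public class ClasspathCheck {
    // Hypothetical helper: returns true if the named class can be found on the classpath.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        boolean found = isOnClasspath("org.apache.pdfbox.Loader");
        System.out.println(found
                ? "PDFBox is on the classpath."
                : "PDFBox was not found; re-check the build path settings.");
    }
}
```

If the check reports that PDFBox was not found, the most likely cause is that the JAR was not added under Libraries in the build path dialog.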

Fetching text from a PDF file

To fetch text from a PDF file, follow these steps:

  1. Loading the PDF file
    We can use the loadPDF() method of the Loader class, which is available in PDFBox 3.x. This method takes a File object as its parameter, so we first create a File object by passing the PDF file path to its constructor.

    //Loading the pdf file into PDDocument
    File MyFile = new File("path/to/input.pdf"); //placeholder path, replace with your PDF file
    PDDocument MyPDF = Loader.loadPDF(MyFile);
  2. Initializing the PDFTextStripper class
    To extract text from a PDF file, we need an instance of the PDFTextStripper class. It can be created in the following way:

    //Initializing the PDFTextStripper class
    PDFTextStripper TextStripper = new PDFTextStripper();
  3. Extracting the text
    To extract text from a PDF file we use the getText() method from the PDFTextStripper class. We can use this method in the following way:

    //Fetching the text from the pdf
    String text = TextStripper.getText(MyPDF);
  4. Closing the PDF file
    After we have extracted the text, we call the close() method on the PDDocument object to release the resources it holds.

    //Closing the PDF file
    MyPDF.close();
  5. Initializing the FileWriter class
    Next, to write the extracted text to a text file, we are going to use the FileWriter class. The constructor of this class accepts the text file path as a string. If the file does not exist it is created, and if it already exists its contents are overwritten.

    //Using FileWriter to open the text file for writing
    FileWriter textfile = new FileWriter("path/to/output.txt"); //placeholder path, replace with your text file
  6. Writing text to the text file
    To write the text to the text file, we use the write() method of the FileWriter class. It accepts a string and writes it to the file.

    //Writing text to the text file
    textfile.write(text);
  7. Closing the text file
    We use the close() method of the FileWriter class to close the text file. Closing also flushes any text that is still buffered, so this step should not be skipped.

    //Closing the text file
    textfile.close();
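
Steps 5 to 7 can also be written so that the file is closed automatically and the text encoding is explicit. The sketch below is one possible variant and not part of the original steps: it swaps FileWriter for Files.newBufferedWriter with UTF-8, and the class TextFileWriter, the method writeText, and the file name output.txt are hypothetical names used only for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TextFileWriter {
    // Writes the given text to the target file as UTF-8, replacing any existing content.
    static void writeText(Path target, String text) throws IOException {
        // try-with-resources closes (and flushes) the writer even if write() throws
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the string returned by PDFTextStripper.getText()
        String text = "Text extracted from the PDF";
        Path target = Paths.get("output.txt"); // placeholder file name
        writeText(target, text);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

Fixing the encoding matters when the PDF contains characters outside ASCII, since a writer that relies on the platform default may produce different bytes on different machines.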

 

Now, combining all the steps above, here is a sample program that fetches text from a PDF file and stores it in a text file.

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFtotxt {
  public static void main(String[] args) {
    //Placeholder paths, replace them with your own files
    File MyFile = new File("path/to/input.pdf");
    String textFilePath = "path/to/output.txt";
    //Loading the pdf file into PDDocument and opening the text file with FileWriter;
    //try-with-resources closes both automatically, even if an exception is thrown
    try (PDDocument MyPDF = Loader.loadPDF(MyFile);
         FileWriter textfile = new FileWriter(textFilePath)) {
      //Initializing the PDFTextStripper class
      PDFTextStripper TextStripper = new PDFTextStripper();
      //Fetching the text from the pdf
      String text = TextStripper.getText(MyPDF);
      //Writing text to the text file
      textfile.write(text);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
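
The sample program works with fixed file paths. As one possible refinement, the name of the text file can be derived from the name of the PDF, so that report.pdf is written to report.txt in the same folder. The sketch below uses only the standard library; the class OutputPathHelper and the method toTxtPath are hypothetical names, not part of PDFBox, and the helper assumes the path has a file name (it is not a bare root such as "/").

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class OutputPathHelper {
    // Hypothetical helper: returns a sibling path with the same base name and a .txt extension.
    static Path toTxtPath(Path pdfPath) {
        String name = pdfPath.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Keep the whole name when there is no extension (or only a leading dot)
        String base = dot > 0 ? name.substring(0, dot) : name;
        return pdfPath.resolveSibling(base + ".txt");
    }

    public static void main(String[] args) {
        // Use the first command-line argument if given, else a placeholder name
        Path pdf = Paths.get(args.length > 0 ? args[0] : "input.pdf");
        System.out.println(pdf + " -> " + toTxtPath(pdf));
    }
}
```

The resulting path can be converted with toString() and passed to the FileWriter constructor in place of a hard-coded string.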

Sample input PDF file:

[Image: sample input PDF file]

Output:

The output text file will look like this:

[Image: output text file]

Here is a related tutorial to learn more about PDFBox: How to generate PDF invoice using Java
