How to fetch text from a PDF and store in a .txt file in Java
In this tutorial, we will look at how to fetch text from a PDF file and store it in a text file using Java. To do this, we will use PDFBox, an open-source Java library from the Apache project.
Steps to get started with PDFBox:
- Download the latest version of the PDFBox JAR from https://pdfbox.apache.org/download.html. The code in this tutorial uses the Loader class, which is available from PDFBox 3.0 onward.
- Launch Eclipse and create a new Java project.
- Right-click the project name and choose the Build Path option.
- Next, select Configure Build Path.
- Go to the Libraries tab and click Add External JARs.
- Select the downloaded PDFBox JAR file.
- Click Apply and Close.
Fetching text from a PDF file
To fetch text from a PDF file and save it to a text file, we can follow these steps:
- Loading the PDF file
We can use the loadPDF() method from the Loader class. This method takes a File object as its parameter, so we first create a File object by passing the PDF file path to its constructor.
    // Loading the PDF file into a PDDocument
    // pdfFilePath is a String holding the path to the PDF file
    File MyFile = new File(pdfFilePath);
    PDDocument MyPDF = Loader.loadPDF(MyFile);
- Initializing the PDFTextStripper class
To extract text from a PDF file, we need an instance of the PDFTextStripper class. It can be created as follows:
    // Initializing the PDFTextStripper class
    PDFTextStripper TextStripper = new PDFTextStripper();
- Extracting the text
To extract the text, we call the getText() method of the PDFTextStripper class and pass it the loaded PDDocument. The method returns the document's text as a String:
    // Fetching the text from the PDF
    String text = TextStripper.getText(MyPDF);
- Closing the PDF file
After extracting the text, we call the close() method on the PDDocument object to release the PDF file:
    // Closing the PDF file
    MyPDF.close();
- Initializing the FileWriter class
Next, to write the extracted text to a text file, we use the FileWriter class. Its constructor accepts the text file path as a String; the file is created if it does not exist, and existing contents are overwritten.
    // Using FileWriter to open the text file
    // textFilePath is a String holding the path of the output text file
    FileWriter textfile = new FileWriter(textFilePath);
- Writing text to the text file
To write the text, we use the write() method of the FileWriter class. It accepts a String and writes it to the text file.
    // Writing text to the text file
    textfile.write(text);
- Closing the text file
Finally, we call the close() method of the FileWriter class, which flushes any buffered text and closes the file.
    // Closing the text file
    textfile.close();
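The last two steps call close() explicitly, so the file stays open if write() throws an exception. Here is a minimal sketch of an alternative, assuming Java 11 or later: try-with-resources closes the writer automatically, and passing a charset avoids depending on the platform's default encoding. The class name TextFileSaver and the method save are illustrative names only; they are not part of PDFBox or of the sample program.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class TextFileSaver {

    // Hypothetical helper: writes the given text to outputPath as UTF-8.
    static void save(String text, String outputPath) throws IOException {
        // try-with-resources closes the writer even if write() throws
        try (FileWriter writer = new FileWriter(outputPath, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }
}
```

A call such as TextFileSaver.save(text, textFilePath) would then stand in for the separate FileWriter, write() and close() steps.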
Now, combining all the steps above, here is a sample program that fetches text from a PDF file and stores it in a text file.
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    import org.apache.pdfbox.Loader;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PDFtotxt {
        public static void main(String[] args) {
            // Placeholder paths: replace these with your own files
            String pdfFilePath = "input.pdf";
            String textFilePath = "output.txt";
            try {
                // Loading the PDF file into a PDDocument
                File MyFile = new File(pdfFilePath);
                PDDocument MyPDF = Loader.loadPDF(MyFile);
                // Initializing the PDFTextStripper class
                PDFTextStripper TextStripper = new PDFTextStripper();
                // Fetching the text from the PDF
                String text = TextStripper.getText(MyPDF);
                // Using FileWriter to open the text file
                FileWriter textfile = new FileWriter(textFilePath);
                // Writing text to the text file
                textfile.write(text);
                // Closing the text file and the PDF file
                textfile.close();
                MyPDF.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
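The sample program needs two file paths. As a small sketch of one way to supply only the PDF path, the helper below derives the text file's path from it by swapping the extension, using only java.nio.file (Java 11 or later is assumed in the usage line after the block). The class OutputPaths and the method txtPathFor are hypothetical names introduced for illustration.

```java
import java.nio.file.Path;
import java.util.Locale;

class OutputPaths {

    // Hypothetical helper: "docs/report.pdf" becomes "docs/report.txt".
    // A name without a ".pdf" suffix simply gets ".txt" appended.
    static Path txtPathFor(Path pdfPath) {
        String name = pdfPath.getFileName().toString();
        String base = name.toLowerCase(Locale.ROOT).endsWith(".pdf")
                ? name.substring(0, name.length() - 4)
                : name;
        // resolveSibling keeps the result in the same directory as the PDF
        return pdfPath.resolveSibling(base + ".txt");
    }
}
```

The result of OutputPaths.txtPathFor(Path.of("docs", "report.pdf")) can be converted with toString() and passed to the FileWriter constructor.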
[Figure: sample input PDF file]
[Figure: the output text file]
Here is a related tutorial to learn more about PDFBox: How to generate PDF invoice using Java