How to extract text from a PDF file in PHP

Greetings programmers, in this tutorial we will see how to extract text from a PDF file in PHP.

Programmers often need to extract text from various kinds of data so that they can process the meaningful parts. In this tutorial, we will extract the text from a PDF file and write it into a text file.

Before running the program, we need a local web server, so install XAMPP on your computer. Once that is done, go to the XAMPP installation folder, enter the ‘htdocs’ directory, and create a new PHP file for the code. In this tutorial the location is ‘C:\Program Files\XAMPP\htdocs’; it may differ depending on where XAMPP was installed.

First, we need to understand the HTTP headers that must be sent so that the PDF file can be viewed in the browser before its text is written to the text file.

We send the following two headers so that the browser can display the PDF file.

header('Content-Type: application/pdf');

The line above tells the browser that the content being sent is a PDF file. Its only job is to inform the browser about the file type.

header('Content-Disposition: inline; filename="' . $pdfFile . '"');

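The line above asks the browser to display the file in the page itself. The value ‘inline’ means that the content is shown as part of the web page rather than offered as a download, and ‘filename’ suggests a name to use if the user saves the file.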
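To make the difference between displaying and downloading concrete, here is a minimal sketch that builds both header lines. The helper name pdfHeaderLines, its $inline parameter, and the file name report.pdf are assumptions made for illustration; they are not part of the tutorial's code.

```php
<?php
// Hypothetical helper (not from the tutorial): builds the two header lines
// for a PDF response without sending them.
function pdfHeaderLines(string $fileName, bool $inline = true): array
{
    // 'inline' asks the browser to display the file,
    // 'attachment' asks it to offer a download instead.
    $disposition = $inline ? 'inline' : 'attachment';

    // Keep only the file name and drop double quotes so that the
    // quoted filename parameter stays well-formed.
    $safeName = str_replace('"', '', basename($fileName));

    return [
        'Content-Type: application/pdf',
        'Content-Disposition: ' . $disposition . '; filename="' . $safeName . '"',
    ];
}

// Usage (report.pdf is a placeholder name): send each line with header()
// before any other output is written.
// foreach (pdfHeaderLines('report.pdf') as $line) { header($line); }
```

Passing false as the second argument switches the disposition to ‘attachment’, which should make most browsers download the file instead of displaying it.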

The content of the PDF file is given below:

Hello from Codespeedy Technologies!!!!

PHP code to extract text from a PDF

Given below is the code for displaying the file in the browser and writing the text in another file.

<?php
    $pdfFile = '<pdfFile>.pdf'; # enter the PDF file name

    header('Content-Type: application/pdf'); # tell the browser that the content is a PDF file
    header('Content-Disposition: inline; filename="' . $pdfFile . '"'); # display the PDF in the browser
    @readfile($pdfFile); # send the PDF file to the browser

    # feof(), fgets() and fclose() need a file handle, not a file name, so open the PDF first
    $pdfHandle = fopen($pdfFile, "rb") or die("Unable to open PDF file!");
    $outFile = fopen("<outputFile>.txt", "a") or die("Unable to open file!"); # open (or create) the output file in append mode

    while (($txt = fgets($pdfHandle)) !== false) # read one line at a time until the end of the file
    {
        fwrite($outFile, $txt); # write the line to the output file
    }

    fclose($pdfHandle); # close the PDF file
    fclose($outFile); # close the output file
?>

Output

The PDF file is opened in the browser, and a file containing the text extracted from the PDF is created in the same directory.

The contents of the output file are shown below.

Hello from Codespeedy Technologies!!!!

Explanation
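We send the headers discussed above, keep the PDF file in the same directory as the script, and run the code. On execution, the PDF file is displayed in the browser. The script then opens the PDF for reading, opens the output file in append mode (creating it if it does not exist), and copies the PDF's contents into it line by line. Once the copy is finished, we close both files. Note that the loop copies the file's raw contents, so readable text appears in the output only where the PDF stores it uncompressed; PDFs with compressed content need a dedicated PDF parsing library.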
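When a PDF stores its text uncompressed, the readable strings can be picked out of the raw data instead of copying every line. The following is a minimal sketch under that assumption: it only looks for literal strings drawn with the PDF ‘Tj’ operator, written as ‘(some text) Tj’, and ignores everything else. The function name extractTjStrings is hypothetical and not part of the tutorial's code.

```php
<?php
// Hypothetical helper (not from the tutorial): collects the literal strings
// shown by "(...) Tj" operators in uncompressed PDF data.
// Assumption: the text is not compressed and not hex-encoded.
function extractTjStrings(string $raw): string
{
    // Match a parenthesised PDF string followed by the Tj operator.
    // Inside the parentheses we accept escaped characters (backslash + any
    // character) or any character that is not a backslash or a parenthesis.
    preg_match_all('/\(((?:\\\\.|[^\\\\()])*)\)\s*Tj\b/s', $raw, $matches);

    $lines = [];
    foreach ($matches[1] as $pdfString) {
        // Undo the backslash escapes for parentheses and backslashes.
        $lines[] = preg_replace('/\\\\([()\\\\])/', '$1', $pdfString);
    }

    // One extracted string per line.
    return implode("\n", $lines);
}
```

One possible way to use it is to read the whole file with file_get_contents($pdfFile), pass the result to extractTjStrings(), and write the returned string to the output file with fwrite(). For compressed or more complex PDFs this sketch will find nothing, and a full PDF parser is the safer choice.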
