Extracting images from a PDF using Python

Hey there! In this tutorial, we will learn how to extract the images contained within a PDF file using Python.


Step 1

Open PyCharm and create a project titled PDF_Images. Save the desired PDF within this project. Then open the terminal and type the commands listed below to install the required libraries:

pip install PyMuPDF
pip install Pillow
  • PyMuPDF: A Python binding for MuPDF, a lightweight PDF viewer. It is imported under the name fitz.
  • Pillow: A maintained fork of the Python Imaging Library (PIL) that supports image processing capabilities such as opening, manipulating, and saving images of various formats. It is imported under the name PIL.
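
To confirm that both installs succeeded before moving on, a quick check along the following lines can help. This is only a sketch: it relies on PyMuPDF being importable as fitz and Pillow as PIL, and the helper name check_libraries is made up for this tutorial.

```python
from importlib.util import find_spec


def check_libraries(modules=("fitz", "PIL")):
    """Return a dict mapping each import name to True if Python can find it."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    # Print one line per library so a missing install is easy to spot.
    for name, found in check_libraries().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

If either library is reported as missing, re-run the matching pip command in the same interpreter that PyCharm uses for the project.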

Step 2

Within the main.py file in this project, type the code below. Refer to the code’s comments for an explanation of each step.

# Import necessary libraries:
import fitz
import io
from PIL import Image

# Open the desired PDF file:
pdf = fitz.open("demo.pdf")
# Determine number of pages in the PDF file:
pages = len(pdf)

# Iterate over each of the PDF pages:
# Index of 1st page -> 0
for i in range(pages):
    # Access the page at index 'i':
    page = pdf[i]
    # Access all image objects present in this page
    # (get_images() was named getImageList() in older PyMuPDF releases):
    image_list = page.get_images()
    # Iterate through these image objects:
    for image_count, img in enumerate(image_list, start=1):
        # Access XREF of the image:
        xref = img[0]
        # Extract image information
        # (extract_image() was named extractImage() in older releases):
        img_info = pdf.extract_image(xref)
        # Extract image bytes:
        image_bytes = img_info["image"]
        # Access image extension:
        image_ext = img_info["ext"]
        # Load this image to PIL:
        image = Image.open(io.BytesIO(image_bytes))
        # Save this image; passing the file name lets Pillow open and close the file itself:
        image.save(f"page{i+1}_image{image_count}.{image_ext}")

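This code extracts every image contained within the PDF. If you wish to extract images from a particular range of pages only, pass that range to the outer for-loop (for i in range(pages), line #13 in the above code). For example, range(0, 3) covers the first three pages, since the index of the 1st page is 0.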
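
Since readers usually think in page numbers that start at 1 while the loop works with indices that start at 0, a small helper can do the conversion. The sketch below is one possible way to do it; page_indices and its parameters are hypothetical names, not part of PyMuPDF.

```python
def page_indices(first_page, last_page, page_count):
    """Convert a 1-based, inclusive page range into 0-based indices.

    Values outside the document are clamped, so asking for pages 3 to 99
    of a 5-page PDF yields the indices for pages 3, 4 and 5.
    """
    start = max(first_page, 1) - 1
    stop = min(last_page, page_count)
    return range(start, stop)


# Pages 2 to 4 of a 10-page document map to indices 1, 2 and 3:
print(list(page_indices(2, 4, 10)))  # [1, 2, 3]

# In the tutorial's code, the outer loop would then read:
# for i in page_indices(2, 4, len(pdf)):
```

Clamping keeps the loop from indexing past the last page when the requested range is longer than the document.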


Click here to view the PDF used for demonstration purposes.

The below-attached image shows that all the images extracted from this PDF are named appropriately and stored within this project.
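
The same naming scheme can also be applied without going through Pillow. The sketch below writes the extracted bytes straight to disk, which keeps each image in its original encoding. Here image_bytes and ext stand for the "image" and "ext" values returned for each image in the code above, while save_image_bytes and out_dir are hypothetical names chosen for illustration.

```python
from pathlib import Path


def save_image_bytes(image_bytes, ext, page_number, image_count, out_dir="."):
    """Write raw image bytes to page<N>_image<M>.<ext> and return the path."""
    out_path = Path(out_dir) / f"page{page_number}_image{image_count}.{ext}"
    out_path.write_bytes(image_bytes)
    return out_path
```

Inside the inner loop, a call such as save_image_bytes(image_bytes, image_ext, i + 1, image_count) would produce the same file names as the Pillow-based version.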

