Extracting images from a PDF using Python
Hey there! In this tutorial, we will learn how to extract the images contained within a PDF file using Python.
Implementation
Step 1
Open PyCharm and create a project titled PDF_Images. Save the desired PDF within this project. Then, open the terminal and run the commands below to install the required libraries:
pip install PyMuPDF
pip install Pillow
- PyMuPDF: A Python binding for MuPDF, a lightweight PDF viewer. It is imported in code under the name fitz.
- Pillow: A maintained fork of the Python Imaging Library (PIL) that supports opening, manipulating, and saving images in various formats.
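Before moving on, it can help to confirm that both installs succeeded. The snippet below is a minimal sketch of such a check, assuming Python 3.8 or newer (for importlib.metadata); the helper name installed_version is made up for this illustration and is not part of either library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Use the names given to pip, not the import names (fitz, PIL).
    for name in ("PyMuPDF", "Pillow"):
        found = installed_version(name)
        print(f"{name}: {found if found else 'not installed'}")
```

If either line reports "not installed", re-run the matching pip command in the same interpreter that PyCharm uses for this project.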
Step 2
Within the main.py file in this project, type the code below. The comments in the code explain each step.
# Import the necessary libraries:
import fitz  # PyMuPDF
import io
from PIL import Image

# Open the desired PDF file:
pdf = fitz.open("demo.pdf")

# Determine the number of pages in the PDF file:
pages = len(pdf)

# Iterate over each of the PDF pages (the index of the 1st page is 0):
for i in range(pages):
    # Access the page at index 'i':
    page = pdf[i]
    # Access all image objects present in this page
    # (get_images() was named getImageList() in older PyMuPDF releases):
    image_list = page.get_images()
    # Iterate through these image objects:
    for image_count, img in enumerate(image_list, start=1):
        # Access the XREF of the image:
        xref = img[0]
        # Extract the image information
        # (extract_image() was named extractImage() in older PyMuPDF releases):
        img_info = pdf.extract_image(xref)
        # Extract the image bytes:
        image_bytes = img_info["image"]
        # Access the image extension:
        image_ext = img_info["ext"]
        # Load this image into PIL:
        image = Image.open(io.BytesIO(image_bytes))
        # Save this image, named after its page and its position on that page:
        image.save(f"page{i+1}_image{image_count}.{image_ext}")
The code above extracts every image contained within the PDF. To extract images from a particular range of pages only, pass that range to the outer for-loop (for i in range(pages)). For example, range(2, 5) covers the 3rd through 5th pages, since page indices start at 0.
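The page-range idea can be wrapped in a small function. The sketch below is one possible way to do it, not the only one: the names page_indices and extract_images_from_range, and the out_dir parameter, are made up for this illustration. It assumes a recent PyMuPDF release (with get_images and extract_image) and, unlike the script above, writes the extracted bytes directly instead of passing them through Pillow.

```python
from pathlib import Path


def page_indices(first_page, last_page, page_count):
    """Convert a 1-based, inclusive page range into 0-based page indices.

    The range is clamped to the pages that actually exist.
    """
    start = max(first_page, 1)
    stop = min(last_page, page_count)
    return range(start - 1, stop)


def extract_images_from_range(pdf_path, first_page, last_page, out_dir="."):
    """Save the images found on pages first_page..last_page; return the saved paths."""
    import fitz  # PyMuPDF; imported here so page_indices() can be used without it

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with fitz.open(pdf_path) as pdf:
        for i in page_indices(first_page, last_page, len(pdf)):
            for image_count, img in enumerate(pdf[i].get_images(), start=1):
                img_info = pdf.extract_image(img[0])  # img[0] is the XREF
                target = out_dir / f"page{i + 1}_image{image_count}.{img_info['ext']}"
                # Assumption: the bytes are written as-is, without re-encoding via Pillow.
                target.write_bytes(img_info["image"])
                saved.append(target)
    return saved
```

With this in place, calling extract_images_from_range("demo.pdf", 3, 5) would process only the 3rd, 4th, and 5th pages. Page numbers here are 1-based to match the file names, and a range that runs past the last page is simply cut short.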
Output
After the script runs, every image extracted from the PDF is stored within this project, named after its page and its position on that page: page1_image1, page1_image2, and so on, each followed by the image's own extension.
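To check the result without browsing the project folder by hand, the saved files can be listed in page order. This is a small sketch that relies only on the page{N}_image{M} naming used by the script; the function name list_extracted_images and the folder parameter are made up for this illustration.

```python
import re
from pathlib import Path

# Matches names such as "page2_image1.png" produced by the script's naming scheme.
NAME_PATTERN = re.compile(r"page(\d+)_image(\d+)\.\w+")


def list_extracted_images(folder="."):
    """Return (page, image, path) tuples for extracted images, in page order."""
    found = []
    for path in Path(folder).iterdir():
        match = NAME_PATTERN.fullmatch(path.name)
        if match and path.is_file():
            found.append((int(match.group(1)), int(match.group(2)), path))
    # Sort numerically so that page 10 comes after page 2.
    return sorted(found, key=lambda item: item[:2])


if __name__ == "__main__":
    for page, number, path in list_extracted_images():
        print(f"page {page}, image {number}: {path.name}")
```

Sorting on the parsed numbers rather than on the file names keeps the listing in reading order even when the PDF has ten or more pages.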