Overview
PyMuPDF4LLM can extract images and graphics from documents in two ways: writing them as files to disk, or embedding them as base64-encoded data in the Markdown or JSON output. When images are written to disk, their paths are referenced inline in the Markdown output using standard image syntax. Image extraction is disabled by default. To enable it, pass write_images=True to to_markdown().
Writing Images to Disk
When write_images=True is set, each image found in the document is saved as an individual file. The path to each image is embedded in the Markdown output:
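A minimal sketch of this, assuming a local input file named document.pdf (a hypothetical name). The image_paths helper is not part of PyMuPDF4LLM; it is an illustrative function that collects the paths referenced with standard Markdown image syntax.

```python
import re

# Matches standard Markdown image syntax: ![alt text](path)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def image_paths(markdown: str) -> list[str]:
    """Return the image paths referenced in a Markdown string (illustrative helper)."""
    return IMAGE_REF.findall(markdown)


def extract_with_images(pdf_path: str) -> str:
    """Convert a PDF to Markdown, saving each image as a separate file."""
    import pymupdf4llm  # imported here so image_paths() works without the library

    return pymupdf4llm.to_markdown(pdf_path, write_images=True)


# Example usage (hypothetical input file):
# md_text = extract_with_images("document.pdf")
# for path in image_paths(md_text):
#     print(path)
```

Listing the references this way is a quick check that every image written to disk is actually linked from the Markdown.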
Use image_path to specify a different output directory:
PyMuPDF4LLM will create the output directory automatically.
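As a sketch of the image_path option, the function below sends extracted images to a chosen directory and then lists what was written. The directory name "images" and the list_images helper are assumptions made for illustration; only write_images and image_path come from the text above.

```python
from pathlib import Path


def list_images(out_dir, suffixes=(".png", ".jpg", ".jpeg")):
    """Return the sorted names of image files found in out_dir (illustrative helper)."""
    return sorted(
        p.name
        for p in Path(out_dir).iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )


def extract_to_dir(pdf_path, out_dir="images"):
    """Convert a PDF to Markdown and save its images under out_dir."""
    import pymupdf4llm  # imported here so list_images() works without the library

    md_text = pymupdf4llm.to_markdown(
        pdf_path,
        write_images=True,
        image_path=out_dir,  # no need to create this directory beforehand
    )
    return md_text, list_images(out_dir)


# Example usage (hypothetical input file):
# md_text, saved = extract_to_dir("document.pdf", out_dir="images")
# print(saved)
```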
Image Format
Use the image_format parameter to control the file format of extracted images. Supported formats are 'png', 'pnm', 'pgm', 'ppm', 'pbm', 'pam', 'psd', 'ps', 'jpg', 'jpeg':
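A short sketch of passing image_format, with a guard that rejects anything outside the list above. The checked_format helper and the lower-casing it applies are illustrative choices, not library behaviour.

```python
# Formats listed in this section as supported.
SUPPORTED_FORMATS = {"png", "pnm", "pgm", "ppm", "pbm", "pam", "psd", "ps", "jpg", "jpeg"}


def checked_format(image_format: str) -> str:
    """Normalise a format name and reject unsupported ones (illustrative helper)."""
    fmt = image_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported image format: {image_format!r}")
    return fmt


def extract_as(pdf_path: str, image_format: str = "png") -> str:
    """Convert a PDF to Markdown, writing images in the requested format."""
    import pymupdf4llm  # imported here so checked_format() works without the library

    return pymupdf4llm.to_markdown(
        pdf_path,
        write_images=True,
        image_format=checked_format(image_format),
    )


# Example usage (hypothetical input file), choosing a lossy format for scans:
# md_text = extract_as("scanned.pdf", image_format="jpg")
```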
| Format | Best For | Notes |
|---|---|---|
| "png" | Diagrams, screenshots, charts | Lossless. Larger file size. Default. |
| "jpeg" | Photographs, scanned pages | Lossy. Smaller file size. |
DPI and Resolution
The dpi parameter controls the resolution at which raster images are rendered. The default is 150 DPI, which is a good balance between file size and clarity.
| DPI | Use Case |
|---|---|
| 72 | Low-quality preview thumbnails |
| 150 | Standard extraction (default) |
| 300 | Print-quality or OCR pre-processing |
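The effect of dpi on output size can be estimated with simple arithmetic. PDF coordinates are measured in points (72 per inch), so the sketch below assumes a region is rendered at a scale of dpi / 72. The pixel_size helper is hypothetical and only illustrates that arithmetic.

```python
def pixel_size(width_pt: float, height_pt: float, dpi: int = 150) -> tuple[int, int]:
    """Estimate rendered pixel dimensions for a region measured in PDF points."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def extract_at_dpi(pdf_path: str, dpi: int = 300) -> str:
    """Convert a PDF to Markdown, rendering images at the given resolution."""
    import pymupdf4llm  # imported here so pixel_size() works without the library

    return pymupdf4llm.to_markdown(pdf_path, write_images=True, dpi=dpi)


# A full US Letter page is 612 x 792 points:
print(pixel_size(612, 792, dpi=72))   # (612, 792)
print(pixel_size(612, 792, dpi=150))  # (1275, 1650)
print(pixel_size(612, 792, dpi=300))  # (2550, 3300)
```

Doubling the DPI quadruples the pixel count, which is why 300 DPI output is noticeably larger on disk than the 150 DPI default.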
Embedded vs. File Images
File Images (Markdown)
When using to_markdown() with write_images=True, images are written to disk and referenced by path in the Markdown:
Embedded Images
When using to_markdown() or to_json(), set the embed_images parameter to True to include images directly in the output as base64-encoded strings instead of writing files to disk:
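A minimal sketch of the Markdown case. It assumes embedded images appear as data URIs of the form data:image/FORMAT;base64,DATA inside standard image syntax; that layout is an assumption here, so inspect your own output before relying on it. The decode_embedded_images helper is illustrative and not part of the library.

```python
import base64
import re

# Assumed layout: ![alt](data:image/<format>;base64,<payload>)
DATA_URI = re.compile(
    r"!\[[^\]]*\]\(data:image/([A-Za-z0-9.+-]+);base64,([A-Za-z0-9+/=\s]+)\)"
)


def decode_embedded_images(markdown: str) -> list[tuple[str, bytes]]:
    """Return (format, raw bytes) for each base64 image found (illustrative helper)."""
    return [
        (fmt, base64.b64decode(payload))
        for fmt, payload in DATA_URI.findall(markdown)
    ]


def extract_embedded(pdf_path: str) -> str:
    """Convert a PDF to Markdown with images embedded rather than written to disk."""
    import pymupdf4llm  # imported here so the decoder works without the library

    return pymupdf4llm.to_markdown(pdf_path, embed_images=True)


# Example usage (hypothetical input file):
# md_text = extract_embedded("document.pdf")
# for fmt, data in decode_embedded_images(md_text):
#     print(fmt, len(data), "bytes")
```

Embedding keeps the output self-contained, at the cost of a much larger text payload than path references.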
Vector Graphics
PyMuPDF4LLM detects vector drawings (lines, shapes, filled regions) and can rasterise them to image files by default, but their bounding boxes are preserved so you can identify and handle them in your pipeline.
Image File Naming
Extracted image files are named automatically using the pattern:
For example, an image from document.pdf would be saved as:
Full Example
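The following is a hedged end-to-end sketch rather than the library's reference example. It combines the write_images, image_path, image_format, and dpi parameters described above and saves the resulting Markdown with pathlib. The function name, the default paths, and the converter argument are assumptions made for illustration; converter exists only so the function can be exercised without a real PDF.

```python
from pathlib import Path


def convert_with_images(
    pdf_path,
    md_path,
    image_dir="images",
    image_format="png",
    dpi=150,
    converter=None,
):
    """Convert a PDF to Markdown, write images to image_dir, and save the Markdown."""
    if converter is None:
        import pymupdf4llm  # imported lazily; only needed for real conversions

        converter = pymupdf4llm.to_markdown

    md_text = converter(
        str(pdf_path),
        write_images=True,          # image extraction is off by default
        image_path=str(image_dir),  # where image files are written
        image_format=image_format,  # "png" is lossless, "jpg" is smaller
        dpi=dpi,                    # rendering resolution
    )

    md_file = Path(md_path)
    md_file.parent.mkdir(parents=True, exist_ok=True)
    md_file.write_text(md_text, encoding="utf-8")
    return md_file


# Example usage (hypothetical file names):
# convert_with_images("document.pdf", "document.md", image_dir="images", dpi=300)
```

Check that the image paths embedded in the Markdown resolve from wherever the Markdown file is saved. If they do not, save the Markdown in the directory from which those paths are valid.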
For the full API signature, see the to_markdown() API reference and the to_json() API reference.
Next Steps
Extract Markdown
Full walkthrough of to_markdown() with all common options.
Extract JSON
Access embedded image data via the JSON output.
Tables
Table extraction explained.
Saving Output
Write Markdown and image files together with pathlib.