Overview
PDF4LLM can extract images and graphics from documents in two ways: writing them as files to disk, or embedding them as Base64-encoded data URIs directly in the Markdown output. When images are written to disk, their paths are referenced inline using standard Markdown image syntax. Image extraction is disabled by default. To enable it, pass writeImages: true to ToMarkdown().
Writing images to disk
When writeImages: true is set, each image found in the document is saved as an individual file. The path to each image is embedded in the Markdown output:
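Because the references use standard Markdown image syntax, they can be collected from the output with a regular expression. The following is a minimal sketch, not part of the PDF4LLM API: the MarkdownImages helper and the sample Markdown (including the file name) are hypothetical and stand in for real ToMarkdown() output.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical Markdown; the file name is invented for illustration
// and does not reflect the library's actual naming pattern.
string markdown = "# Report\n\n![](figure.png)\n\nSome text.";

List<string> targets = MarkdownImages.FindTargets(markdown);
foreach (string target in targets)
    Console.WriteLine(target);

// Hypothetical helper, not part of PDF4LLM.
static class MarkdownImages
{
    // Matches standard Markdown image syntax: ![alt](target "optional title")
    private static readonly Regex ImageRef =
        new Regex(@"!\[[^\]]*\]\(([^)\s]+)[^)]*\)", RegexOptions.Compiled);

    // Returns the target (path or URI) of every image reference, in order.
    public static List<string> FindTargets(string markdown)
    {
        var result = new List<string>();
        foreach (Match match in ImageRef.Matches(markdown))
            result.Add(match.Groups[1].Value);
        return result;
    }
}
```

The returned list can be used to verify that each referenced file exists, or to copy the images alongside the Markdown when it is moved.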
Use imagePath to specify a different output directory:
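A minimal sketch of preparing a custom output directory with the standard .NET file APIs. The directory name is only an example, and the ToMarkdown() call is shown as a comment because its receiver and remaining arguments depend on your setup. PDF4LLM for .NET does not create the directory itself, so the sketch creates it first.

```csharp
using System;
using System.IO;

// Example output directory; any writable path works.
string imageDir = Path.Combine("assets", "images");

// Directory.CreateDirectory creates missing parent directories too
// and does nothing if the directory already exists.
Directory.CreateDirectory(imageDir);

Console.WriteLine(Directory.Exists(imageDir));

// With the directory in place, pass it to ToMarkdown(), for example:
//   ToMarkdown(..., writeImages: true, imagePath: imageDir)
```

Calling Directory.CreateDirectory unconditionally is simpler than checking Directory.Exists first, since it is safe to call on an existing directory.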
Unlike the Python library, PDF4LLM for .NET does not create the output directory automatically. Create it before calling ToMarkdown() or you will get a DirectoryNotFoundException:
Image format
Use the imageFormat parameter to control the file format of extracted images. Pass the format as a lowercase file extension string:
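If the format string comes from user input or configuration, it can help to normalize it before passing it on. The sketch below assumes a hypothetical ImageFormats helper, which is not part of PDF4LLM; it accepts only the formats listed in this section and does not describe how the library itself validates the value.

```csharp
using System;
using System.Collections.Generic;

Console.WriteLine(ImageFormats.Normalize(" PNG "));
Console.WriteLine(ImageFormats.Normalize(".Jpg"));

// Hypothetical helper, not part of PDF4LLM.
static class ImageFormats
{
    // The formats listed in this section.
    private static readonly HashSet<string> Documented = new HashSet<string>
    {
        "png", "jpg", "webp", "tiff", "bmp", "pnm"
    };

    // Trims whitespace, strips a leading dot, lowercases, then validates.
    public static string Normalize(string format)
    {
        string value = format.Trim().TrimStart('.').ToLowerInvariant();
        if (!Documented.Contains(value))
            throw new ArgumentException(
                $"Unsupported image format: '{format}'", nameof(format));
        return value;
    }
}
```

The normalized value can then be passed as imageFormat to ToMarkdown().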
| Format | Best for | Notes |
|---|---|---|
| "png" | Diagrams, screenshots, charts | Lossless. Larger file size. Default. |
| "jpg" | Photographs, scanned pages | Lossy. Smaller file size. |
| "webp" | Web delivery | Good compression, broad browser support. |
| "tiff" | Archival, OCR pre-processing | Lossless. Large file size. |
| "bmp" | Maximum compatibility | Uncompressed. Very large file size. |
| "pnm" | OCR pre-processing pipelines | Portable anymap (Netpbm) format. |
Embedded vs. file images
File images (write to disk)
When using ToMarkdown() with writeImages: true, images are saved to disk and referenced by path in the Markdown output:
Embedded images (inline Base64)
Set embedImages: true to encode images as Base64 data URIs and embed them directly in the Markdown. No files are written to disk:
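To illustrate what a Base64 data URI looks like, the sketch below builds one by hand with standard .NET calls. It does not call PDF4LLM; the bytes are just the 8-byte PNG file signature rather than a real image, and the exact Markdown the library emits (alt text, MIME type) is an assumption.

```csharp
using System;

// Placeholder bytes: the PNG file signature only, not a complete image.
byte[] imageBytes = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

// Encode the bytes and wrap them in a data URI with the image MIME type.
string base64 = Convert.ToBase64String(imageBytes);
string dataUri = $"data:image/png;base64,{base64}";

// Standard Markdown image syntax with the data URI as the target.
string markdownImage = $"![]({dataUri})";

Console.WriteLine(markdownImage);
// Prints: ![](data:image/png;base64,iVBORw0KGgo=)
```

Base64 encoding grows the payload by roughly a third, so embedded output is larger than Markdown that references files on disk, but it is fully self-contained.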
writeImages and embedImages are mutually exclusive. If both are set to true, embedImages takes precedence and no files are written to disk.
Vector graphics
PDF4LLM detects vector drawings (lines, shapes, and filled regions) and includes their bounding boxes in the layout analysis. Vector graphic regions are represented as "image" type blocks in ToJson() output, which gives you their position on the page so you can identify and handle them in your pipeline.
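One way to pick these blocks out of the JSON is to walk the whole document and collect every object whose "type" property equals "image". The sketch below uses System.Text.Json and makes no claim about the real ToJson() schema: the sample JSON, including the "pages", "blocks", and "bbox" names, is hypothetical, and the JsonBlocks helper is not part of PDF4LLM.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical JSON, NOT actual ToJson() output; property names are assumed.
string json = @"{ ""pages"": [ { ""blocks"": [
    { ""type"": ""text"",  ""bbox"": [0, 0, 10, 10] },
    { ""type"": ""image"", ""bbox"": [50, 60, 200, 180] }
] } ] }";

using JsonDocument doc = JsonDocument.Parse(json);
var images = new List<JsonElement>(JsonBlocks.FindByType(doc.RootElement, "image"));

Console.WriteLine(images.Count);
foreach (JsonElement block in images)
    Console.WriteLine(block.GetRawText());

// Hypothetical helper, not part of PDF4LLM.
static class JsonBlocks
{
    // Recursively yields every JSON object whose "type" property equals `type`.
    public static IEnumerable<JsonElement> FindByType(JsonElement node, string type)
    {
        if (node.ValueKind == JsonValueKind.Object)
        {
            if (node.TryGetProperty("type", out JsonElement t)
                && t.ValueKind == JsonValueKind.String
                && t.GetString() == type)
                yield return node;

            foreach (JsonProperty prop in node.EnumerateObject())
                foreach (JsonElement hit in FindByType(prop.Value, type))
                    yield return hit;
        }
        else if (node.ValueKind == JsonValueKind.Array)
        {
            foreach (JsonElement item in node.EnumerateArray())
                foreach (JsonElement hit in FindByType(item, type))
                    yield return hit;
        }
    }
}
```

A recursive walk avoids hard-coding the nesting of pages and blocks, so it keeps working if the surrounding structure differs from the sample.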
Image file naming
Extracted image files are named automatically using the pattern:
For example, for document.pdf, saved as PNG to assets/images/:
Full example
For the full API signature, see the ToMarkdown() API reference.
Next steps
Extract Markdown
Full walkthrough of ToMarkdown() with all common options.
Extract JSON
Access image bounding boxes via the JSON output.
Tables
Table extraction explained.