Input Formats
PyMuPDF4LLM can open and extract content from the following document types:

| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | All versions, including encrypted and scanned |
| XPS | .xps | Microsoft XML Paper Specification |
| eBooks | .epub, .mobi, .fb2 | Reflowable content is linearised per chapter |
| Comic Books | .cbz | Image-based pages; OCR recommended |
| Office Documents | .doc, .docx, .ppt, .pptx, .xls, .xlsx, .hwp, .hwpx | PyMuPDF Pro only — see below |
Standard PyMuPDF4LLM supports PDF, XPS, eBooks, and CBZ out of the box. Office format support requires a PyMuPDF Pro licence.
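As a rough illustration of the split described above, the sketch below classifies a file path by its extension before any extraction is attempted. The extension sets are copied from the table; the function name `support_tier` and the tier labels are hypothetical and are not part of PyMuPDF4LLM.

```python
from pathlib import Path

# Extensions copied from the table above. The tier labels are
# illustrative names chosen for this sketch, not library constants.
STANDARD_EXTENSIONS = {".pdf", ".xps", ".epub", ".mobi", ".fb2", ".cbz"}
PRO_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".hwp", ".hwpx"}


def support_tier(path: str) -> str:
    """Return "standard", "pro", or "unsupported" for a file path."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    if suffix in STANDARD_EXTENSIONS:
        return "standard"
    if suffix in PRO_EXTENSIONS:
        return "pro"
    return "unsupported"


for name in ("report.PDF", "slides.pptx", "notes.txt"):
    print(name, "->", support_tier(name))
```

A check like this only looks at the file name. It says nothing about whether a PyMuPDF Pro licence is actually active.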
Office Documents (Pro Only)
Processing Office files requires PyMuPDF Pro, which converts documents to PDF internally before extraction. This means all standard extraction options (layout analysis, OCR, page chunks) work identically on Office files.
PyMuPDF Pro
Learn how to install and activate PyMuPDF Pro for Office document support.
Output Formats
PyMuPDF4LLM can produce output in four formats depending on your use case:

| Format | Function | Best For |
|---|---|---|
| Markdown | to_markdown() | LLM ingestion, RAG pipelines, readable docs |
| JSON | to_json() | Custom pipelines needing bounding boxes and layout data |
| Plain Text | to_text() | Simple text extraction, search indexing |
| Images | to_markdown(write_images=True) | Preserving figures, charts, and diagrams |
Markdown
The default and most commonly used output format. Text is extracted in reading order with headings, lists, tables, and inline formatting preserved where detectable.
JSON
Returns structured data including bounding boxes, font information, and layout metadata for every block on the page. Useful for building custom post-processing pipelines.
Plain Text
Strips all formatting and returns raw text content. Ideal when downstream tools do not need Markdown syntax.
Images
When write_images=True is passed to to_markdown(), embedded images and graphics are extracted and saved to disk. Image paths are referenced inline in the Markdown output.
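The four output modes can be sketched as a small dispatcher. The function names `to_markdown()`, `to_json()`, and `to_text()` and the `write_images=True` flag come from the table above. The helpers `extract` and `image_references` are hypothetical, and passing the file path as the first argument to `to_json()` and `to_text()` is an assumption of this sketch. The regular expression assumes images are referenced with standard Markdown image syntax.

```python
import re


def extract(path, fmt="markdown", backend=None, **options):
    """Call the extraction function that matches `fmt` (hypothetical helper)."""
    if backend is None:
        # Imported lazily so the sketch can be loaded without the package.
        import pymupdf4llm as backend

    if fmt == "markdown":
        return backend.to_markdown(path, **options)
    if fmt == "images":
        # Same function as Markdown, with image extraction switched on.
        return backend.to_markdown(path, **{**options, "write_images": True})
    if fmt == "json":
        # Assumption: to_json() takes the path as its first argument.
        return backend.to_json(path, **options)
    if fmt == "text":
        # Assumption: to_text() takes the path as its first argument.
        return backend.to_text(path, **options)
    raise ValueError(f"unknown output format: {fmt!r}")


def image_references(markdown: str) -> list[str]:
    """List image targets written as ![alt](target) in a Markdown string."""
    return re.findall(r"!\[[^\]]*\]\(([^)\s]+)\)", markdown)
```

Under these assumptions, `extract("report.pdf", fmt="images")` would return Markdown text, and `image_references()` applied to that text would list the saved image paths it mentions.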
Next Steps
Extract Markdown
Full walkthrough of to_markdown() with common options.
Images & Graphics
Controlling image extraction, DPI, and output path.
PyMuPDF Pro
Unlock Office document support with PyMuPDF Pro.