Supported Formats

Input Formats

PyMuPDF4LLM can open and extract content from the following document types:

Format	Extensions	Notes
PDF	`.pdf`	All versions, including encrypted and scanned
XPS	`.xps`	Microsoft XML Paper Specification
eBooks	`.epub`, `.mobi`, `.fb2`	Reflowable content is linearised per chapter
Comic Books	`.cbz`	Image-based pages; OCR recommended
Office Documents	`.doc`, `.docx`, `.ppt`, `.pptx`, `.xls`, `.xlsx`, `.hwp`, `.hwpx`	PyMuPDF Pro only — see below

Standard PyMuPDF4LLM supports PDF, XPS, eBooks, and CBZ out of the box. Office format support requires a PyMuPDF Pro licence.
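
The split between standard and Pro-only inputs can be checked before a file is opened. The following is a minimal sketch, not part of PyMuPDF4LLM: the helper name `classify_input` and the two set names are hypothetical, and the extension lists are copied from the table above.

```python
from pathlib import Path

# Extension groups copied from the input format table above.
STANDARD_EXTENSIONS = {".pdf", ".xps", ".epub", ".mobi", ".fb2", ".cbz"}
PRO_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".hwp", ".hwpx"}


def classify_input(path: str) -> str:
    """Return 'standard', 'pro', or 'unsupported' based on the file extension.

    Hypothetical helper for illustration; it only inspects the file name
    and does not open the document.
    """
    suffix = Path(path).suffix.lower()
    if suffix in STANDARD_EXTENSIONS:
        return "standard"
    if suffix in PRO_EXTENSIONS:
        return "pro"
    return "unsupported"
```

A caller could use the result to skip unsupported files or to report that a PyMuPDF Pro licence is needed before attempting extraction.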

Office Documents (Pro Only)

Processing Office files requires PyMuPDF Pro, which converts documents to PDF internally before extraction. As a result, the standard extraction options (layout analysis, OCR, page chunks) work the same way on Office files as on PDFs.

import pymupdf4llm

# Requires PyMuPDF Pro licence
md_text = pymupdf4llm.to_markdown("report.doc")

PyMuPDF Pro

Learn how to install and activate PyMuPDF Pro for Office document support.

Output Formats

PyMuPDF4LLM can produce output in four formats depending on your use case:

Format	Function	Best For
Markdown	`to_markdown()`	LLM ingestion, RAG pipelines, readable docs
JSON	`to_json()`	Custom pipelines needing bounding boxes and layout data
Plain Text	`to_text()`	Simple text extraction, search indexing
Images	`to_markdown(write_images=True)`	Preserving figures, charts, and diagrams
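
When the output format is chosen at runtime, the mapping in the table above can be expressed as a small dispatcher. This is a sketch under stated assumptions: `extract` and `OUTPUT_FUNCTIONS` are hypothetical names, and the module is passed in as an argument so the helper does not depend on how PyMuPDF4LLM is installed. Only the function names listed in the table are used.

```python
# Function names taken from the output format table above.
OUTPUT_FUNCTIONS = {
    "markdown": "to_markdown",
    "json": "to_json",
    "text": "to_text",
}


def extract(module, path: str, fmt: str = "markdown", **kwargs):
    """Call the extraction function on `module` that matches `fmt`.

    `module` is expected to be the imported pymupdf4llm module (or any
    object exposing the same function names). Extra keyword arguments
    are forwarded unchanged.
    """
    try:
        func_name = OUTPUT_FUNCTIONS[fmt]
    except KeyError:
        raise ValueError(f"unknown output format: {fmt!r}") from None
    return getattr(module, func_name)(path, **kwargs)
```

With the library imported, a call might look like `extract(pymupdf4llm, "document.pdf", "json")`. Image extraction stays an option of the Markdown path, for example `extract(pymupdf4llm, "document.pdf", write_images=True)`.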

Markdown

The default and most commonly used output format. Text is extracted in reading order with headings, lists, tables, and inline formatting preserved where detectable.

md_text = pymupdf4llm.to_markdown("document.pdf")
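
The returned Markdown is usually written to a file or passed to a later pipeline stage. A minimal sketch of the file case follows, assuming `md_text` is an ordinary Python string; the helper name `save_markdown` is hypothetical.

```python
from pathlib import Path


def save_markdown(md_text: str, out_path: str) -> int:
    """Write Markdown text as UTF-8 and return the number of bytes written."""
    return Path(out_path).write_bytes(md_text.encode("utf-8"))
```

Encoding explicitly as UTF-8 avoids depending on the platform's default encoding when the document contains non-ASCII characters.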

JSON

Returns structured data including bounding boxes, font information, and layout metadata for every block on the page. Useful for building custom post-processing pipelines.

json_output = pymupdf4llm.to_json("document.pdf")
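
The exact JSON schema is not described in this section, so a cautious first step is to inspect the top level of the result before writing any post-processing code. The sketch below assumes `to_json()` returns a JSON string; the helper name `describe_top_level` is hypothetical and makes no assumption about field names.

```python
import json


def describe_top_level(json_output: str) -> dict:
    """Parse a JSON string and report its top-level type and keys (or length)."""
    data = json.loads(json_output)
    summary = {"type": type(data).__name__}
    if isinstance(data, dict):
        # Sorted key names give a stable overview of the available fields.
        summary["keys"] = sorted(data)
    elif isinstance(data, list):
        summary["length"] = len(data)
    return summary
```

Calling `describe_top_level(json_output)` on a real result shows which fields are available to build on.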

Plain Text

Strips all formatting and returns raw text content. Ideal when downstream tools do not need Markdown syntax.

text = pymupdf4llm.to_text("document.pdf")

Images

When `write_images=True` is passed to `to_markdown()`, embedded images and graphics are extracted and saved to disk. Image paths are referenced inline in the Markdown output.

md_text = pymupdf4llm.to_markdown("document.pdf", write_images=True, image_path="images/")
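
Because the image paths appear inline in the Markdown, they can be collected afterwards, for example to verify that every referenced file exists. This is a minimal sketch assuming the references use standard Markdown image syntax (`![alt](path)`); the names `IMAGE_REF` and `image_targets` are hypothetical.

```python
import re

# Standard Markdown image syntax: ![alt text](target "optional title")
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def image_targets(md_text: str) -> list[str]:
    """Return the link targets of all Markdown image references, in order."""
    return IMAGE_REF.findall(md_text)
```

The resulting list can be passed to `pathlib.Path(...).exists()` checks or used to copy the images alongside the Markdown file.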

Next Steps

Extract Markdown

Full walkthrough of `to_markdown()` with common options.

Images & Graphics

Controlling image extraction, DPI, and output path.

PyMuPDF Pro

Unlock Office document support with PyMuPDF Pro.
