
Input Formats

PyMuPDF4LLM can open and extract content from the following document types:
Format | Extensions | Notes
PDF | .pdf | All versions, including encrypted and scanned
XPS | .xps | Microsoft XML Paper Specification
eBooks | .epub, .mobi, .fb2 | Reflowable content is linearised per chapter
Comic Books | .cbz | Image-based pages; OCR recommended
Office Documents | .doc, .docx, .ppt, .pptx, .xls, .xlsx, .hwp, .hwpx | PyMuPDF Pro only; see below
Standard PyMuPDF4LLM supports PDF, XPS, eBooks, and CBZ out of the box. Office format support requires a PyMuPDF Pro licence.
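The support matrix above can be turned into a small pre-flight check before a file is handed to the library. The sketch below is plain Python and not part of PyMuPDF4LLM: the helper name `required_tier` and the two extension sets are hypothetical, built only from the extensions listed in the table.

```python
from pathlib import Path

# Extensions copied from the table above. These sets are illustrative,
# not constants exported by PyMuPDF4LLM.
STANDARD_EXTENSIONS = {".pdf", ".xps", ".epub", ".mobi", ".fb2", ".cbz"}
PRO_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".hwp", ".hwpx"}


def required_tier(filename: str) -> str:
    """Return 'standard', 'pro', or 'unlisted' based on the file extension."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    if suffix in STANDARD_EXTENSIONS:
        return "standard"
    if suffix in PRO_EXTENSIONS:
        return "pro"
    return "unlisted"  # not in the table above


if __name__ == "__main__":
    for name in ("report.doc", "document.pdf", "notes.txt"):
        print(name, "->", required_tier(name))
```

A check like this only inspects the file name; it says nothing about whether a PyMuPDF Pro licence is actually active.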

Office Documents (Pro Only)

Processing Office files requires PyMuPDF Pro, which converts documents to PDF internally before extraction. This means all standard extraction options — layout analysis, OCR, page chunks — work identically on Office files.
import pymupdf4llm

# Requires PyMuPDF Pro licence
md_text = pymupdf4llm.to_markdown("report.doc")

PyMuPDF Pro

Learn how to install and activate PyMuPDF Pro for Office document support.

Output Formats

PyMuPDF4LLM can produce output in four formats depending on your use case:
Format | Function | Best For
Markdown | to_markdown() | LLM ingestion, RAG pipelines, readable docs
JSON | to_json() | Custom pipelines needing bounding boxes and layout data
Plain Text | to_text() | Simple text extraction, search indexing
Images | to_markdown(write_images=True) | Preserving figures, charts, and diagrams

Markdown

The default and most commonly used output format. Text is extracted in reading order with headings, lists, tables, and inline formatting preserved where detectable.
md_text = pymupdf4llm.to_markdown("document.pdf")
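A common next step is to write the extracted Markdown to disk so it can be inspected or fed to another tool. The following is a minimal sketch, assuming `to_markdown()` returns a single string in its default configuration; the helper `save_markdown` is a hypothetical name, not a PyMuPDF4LLM function.

```python
from pathlib import Path


def save_markdown(md_text: str, out_path: str) -> int:
    """Write Markdown text to out_path as UTF-8 and return the byte count."""
    data = md_text.encode("utf-8")
    Path(out_path).write_bytes(data)
    return len(data)


# Usage with PyMuPDF4LLM (requires the library and a real input file):
#   import pymupdf4llm
#   md_text = pymupdf4llm.to_markdown("document.pdf")
#   save_markdown(md_text, "document.md")
```

Encoding explicitly as UTF-8 keeps non-ASCII characters from the source document intact regardless of the platform's default encoding.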

JSON

Returns structured data including bounding boxes, font information, and layout metadata for every block on the page. Useful for building custom post-processing pipelines.
json_output = pymupdf4llm.to_json("document.pdf")
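Before building a post-processing pipeline, it can help to look at the top-level shape of the result rather than assume particular key names. The sketch below assumes `to_json()` returns a JSON-encoded string; if it returns an already-parsed object in your version, the helper passes it through unchanged. `describe_json` is a hypothetical helper, and no specific schema is assumed.

```python
import json


def describe_json(json_output):
    """Report the top-level shape of to_json()-style output."""
    # Assumption: the output is a JSON string; parsed objects pass through.
    data = json.loads(json_output) if isinstance(json_output, str) else json_output
    if isinstance(data, dict):
        return {"type": "object", "keys": sorted(data)}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    return {"type": type(data).__name__}


# Usage with PyMuPDF4LLM (requires the library and a real input file):
#   import pymupdf4llm
#   print(describe_json(pymupdf4llm.to_json("document.pdf")))
```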

Plain Text

Strips all formatting and returns raw text content. Ideal when downstream tools do not need Markdown syntax.
text = pymupdf4llm.to_text("document.pdf")

Images

When write_images=True is passed to to_markdown(), embedded images and graphics are extracted and saved to disk in the directory given by image_path. The saved image paths are referenced inline in the Markdown output.
md_text = pymupdf4llm.to_markdown("document.pdf", write_images=True, image_path="images/")
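Because the image paths appear inline in the Markdown, they can be collected from the returned text, for example to check that every referenced file exists. This is a minimal sketch, assuming the references use standard Markdown image syntax (`![alt](path)`); `image_paths` is a hypothetical helper, not part of PyMuPDF4LLM.

```python
import re

# Standard Markdown image syntax: ![alt text](path "optional title").
# The capture stops at the first whitespace or ")", so paths containing
# spaces would be truncated by this simple pattern.
IMAGE_REF = re.compile(r'!\[[^\]]*\]\(\s*([^)\s]+)')


def image_paths(md_text: str) -> list[str]:
    """Return image paths referenced in Markdown text, in order of appearance."""
    return IMAGE_REF.findall(md_text)


# Usage with PyMuPDF4LLM (requires the library and a real input file):
#   import pymupdf4llm
#   md_text = pymupdf4llm.to_markdown(
#       "document.pdf", write_images=True, image_path="images/"
#   )
#   for path in image_paths(md_text):
#       print(path)
```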

Next Steps

Extract Markdown

Full walkthrough of to_markdown() with common options.

Images & Graphics

Controlling image extraction, DPI, and output path.

PyMuPDF Pro

Unlock Office document support with PyMuPDF Pro.