
PyMuPDF4LLM

Document extraction that doesn’t get in your way

Your LLM is only as good as the content you feed it. Most PDF libraries hand you a wall of unstructured text and leave you to figure out the rest. PyMuPDF4LLM gives you something you can actually use — clean Markdown, structured JSON, or plain text, with reading order preserved, tables intact, and images handled — in a single function call. Built on PyMuPDF, the fastest PDF processing engine available in Python, it is trusted by developers building production RAG pipelines, document intelligence systems, and LLM-powered applications worldwide.

Get started in minutes

One pip install. One function call. Clean output.

See it in action

Convert your first PDF to Markdown in five lines.
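
A minimal sketch of that first conversion, assuming the package has been installed with `pip install pymupdf4llm`. The wrapper name `pdf_to_markdown_file` and the file names in the usage note are placeholders for illustration; the call that does the work is `pymupdf4llm.to_markdown`.

```python
from pathlib import Path


def pdf_to_markdown_file(pdf_path: str, md_path: str) -> str:
    """Convert a PDF to Markdown, save it, and return the Markdown text."""
    # Imported lazily so this module loads even before the package is installed.
    import pymupdf4llm

    # With default arguments the whole document comes back as one string.
    md_text = pymupdf4llm.to_markdown(pdf_path)
    Path(md_path).write_text(md_text, encoding="utf-8")
    return md_text
```

Calling `pdf_to_markdown_file("input.pdf", "output.md")` would write the Markdown next to your script and hand the same text back for further processing.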

Everything your pipeline needs. Nothing it doesn’t.

Reading order that makes sense

Multi-column layouts, sidebars, and complex designs are reconstructed in the correct sequence — so your LLM reads the document the way a human would.

Tables that stay intact

Detected automatically and rendered as structured Markdown. No more table data scrambled across disconnected lines.

OCR without the friction

Scanned and image-based pages are detected and processed automatically. No configuration. No manual triggers.

RAG-ready from the start

Per-page chunk dictionaries carry everything your downstream steps need: text, metadata, TOC entries, table positions, and word-level coordinates.
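
As a rough sketch of how those chunks might feed an index: passing `page_chunks=True` to `to_markdown` returns one dictionary per page instead of a single string. The helper names and the record layout below are hypothetical, and the metadata key that holds the page number is treated as an assumption and read defensively.

```python
from typing import Any


def chunks_to_records(chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flatten per-page chunk dictionaries into simple records for indexing."""
    records = []
    for index, chunk in enumerate(chunks, start=1):
        meta = chunk.get("metadata", {})
        text = chunk.get("text", "").strip()
        if not text:
            continue  # skip pages with no extractable text
        records.append(
            {
                "source": meta.get("file_path", ""),
                # Assumed key names; fall back to the position in the list.
                "page": meta.get("page", meta.get("page_number", index)),
                "text": text,
                "table_count": len(chunk.get("tables", [])),
            }
        )
    return records


def load_page_chunks(pdf_path: str) -> list[dict[str, Any]]:
    """Extract one chunk dictionary per page from a PDF."""
    import pymupdf4llm  # lazy import; requires `pip install pymupdf4llm`

    return pymupdf4llm.to_markdown(pdf_path, page_chunks=True)
```

`chunks_to_records(load_page_chunks("input.pdf"))` would then give a list that is ready for embedding. Keeping the flattening step separate from extraction makes it easy to test against hand-written chunks.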

AI-powered layout analysis

Optional PyMuPDF-Layout integration brings state-of-the-art AI region detection for the most complex and demanding documents.

Plugs into your stack

Native loaders for LlamaIndex and LangChain. Drop into your existing pipeline with zero glue code.
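
For LlamaIndex, a minimal sketch might look like the following. It assumes `llama-index` is installed alongside `pymupdf4llm`; the function name `load_llama_documents` is hypothetical, and LangChain usage is not shown here.

```python
def load_llama_documents(pdf_path: str) -> list:
    """Load a PDF as a list of LlamaIndex documents."""
    import pymupdf4llm  # lazy import; also needs llama-index installed

    reader = pymupdf4llm.LlamaMarkdownReader()
    return reader.load_data(pdf_path)
```

The returned documents can be passed to whatever index or ingestion step your LlamaIndex pipeline already uses.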

Three output formats. One consistent API.

Whether you’re building a RAG pipeline, a custom document intelligence system, or a data extraction workflow, PyMuPDF4LLM produces the format you need.

| Format | Best for |
| --- | --- |
| Markdown | LLM ingestion, RAG pipelines, and human-readable output with structure preserved |
| JSON | Custom pipelines that need bounding boxes, font data, and per-block layout metadata |
| Plain Text | Search indexing, NLP preprocessing, and tools that don’t need formatting |
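
One way to switch between the three formats from a single entry point is sketched below. It assumes the installed release exposes `to_json` and `to_text` next to `to_markdown`; older releases may only provide `to_markdown`, so the helper checks before calling. The name `extract` and the format keys are hypothetical.

```python
# Map each output format to the function assumed to produce it.
FORMAT_FUNCTIONS = {
    "markdown": "to_markdown",
    "json": "to_json",  # assumed to exist in the installed release
    "text": "to_text",  # assumed to exist in the installed release
}


def extract(pdf_path: str, fmt: str = "markdown"):
    """Extract a document in the requested format (hypothetical helper)."""
    if fmt not in FORMAT_FUNCTIONS:
        raise ValueError(
            f"unknown format {fmt!r}; choose from {sorted(FORMAT_FUNCTIONS)}"
        )
    import pymupdf4llm  # lazy import; requires `pip install pymupdf4llm`

    func = getattr(pymupdf4llm, FORMAT_FUNCTIONS[fmt], None)
    if func is None:
        raise RuntimeError(
            f"this pymupdf4llm release has no {FORMAT_FUNCTIONS[fmt]}()"
        )
    return func(pdf_path)
```

With this in place, `extract("input.pdf", "json")` and `extract("input.pdf", "text")` differ only in the format argument, which mirrors the one-API idea described above.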

Works with the documents you already have

Standard formats

PDF, XPS, EPUB, MOBI, and more — supported out of the box, no extra setup required.

Office formats with Pro

Unlock DOCX, PPTX, XLSX, and more with PyMuPDF Pro. The same clean API. The same consistent output.

Trusted at every scale

PyMuPDF4LLM is built for production. It handles everything from single-page invoices to thousands of pages of legal, financial, or technical documentation — with predictable performance and output quality you can rely on.

Performance

Built on PyMuPDF, the fastest PDF engine available in Python — benchmarked faster than every major alternative.

Accuracy

AI-powered layout analysis and best-in-class table detection mean fewer pipeline errors and less manual correction.

Flexibility

Swap OCR engines, customise layout detection, apply page margins, or tune every extraction parameter to your needs.
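
A sketch of what such tuning might look like in practice. `pages`, `margins`, and `table_strategy` are keyword arguments of `to_markdown`; the wrapper name `extract_body_text` and the chosen values are illustrative assumptions, and OCR engine selection is not shown. Check the API reference for your installed version before relying on any particular option.

```python
def extract_body_text(pdf_path: str, first_page: int, last_page: int,
                      margin_pts: float = 36.0) -> str:
    """Convert a page range to Markdown, ignoring content near the page edges."""
    if first_page < 0 or last_page < first_page:
        raise ValueError("expected 0 <= first_page <= last_page")
    import pymupdf4llm  # lazy import; requires `pip install pymupdf4llm`

    return pymupdf4llm.to_markdown(
        pdf_path,
        pages=list(range(first_page, last_page + 1)),  # 0-based page numbers
        margins=margin_pts,             # border width to exclude, in points
        table_strategy="lines_strict",  # table detection mode
    )
```

For example, `extract_body_text("report.pdf", 2, 4)` would convert the third through fifth pages while leaving out anything within half an inch of the page edge, which is often where running headers and footers sit.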