PyMuPDF4LLM
Document extraction that doesn’t get in your way
Your LLM is only as good as the content you feed it. Most PDF libraries hand you a wall of unstructured text and leave you to figure out the rest. PyMuPDF4LLM gives you something you can actually use: clean Markdown, structured JSON, or plain text, with reading order preserved, tables intact, and images handled, all in a single function call. Built on PyMuPDF, the fastest PDF processing engine available in Python, it is trusted by developers building production RAG pipelines, document intelligence systems, and LLM-powered applications worldwide.
Get started in minutes
One pip install. One function call. Clean output.
See it in action
Convert your first PDF to Markdown in five lines.
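A minimal sketch of that first conversion, assuming `pymupdf4llm` is installed and that `to_markdown` accepts a file path and returns a Markdown string. The helper name `pdf_to_markdown_file` and its `convert` parameter are illustrative, not part of the library; `convert` exists only so the helper can be tried without a real PDF.

```python
import pathlib


def pdf_to_markdown_file(pdf_path, out_path, convert=None):
    """Convert one PDF to Markdown and save the result as UTF-8 text."""
    if convert is None:
        # Imported lazily so this helper can be exercised without the package.
        import pymupdf4llm  # pip install pymupdf4llm

        convert = pymupdf4llm.to_markdown
    md_text = convert(pdf_path)
    pathlib.Path(out_path).write_bytes(md_text.encode("utf-8"))
    return md_text
```

Calling `pdf_to_markdown_file("input.pdf", "output.md")` would write the Markdown next to your script; both file names are placeholders.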
Everything your pipeline needs. Nothing it doesn’t.
Reading order that makes sense
Multi-column layouts, sidebars, and complex designs are reconstructed in the correct sequence — so your LLM reads the document the way a human would.
Tables that stay intact
Detected automatically and rendered as structured Markdown. No more table data scrambled across disconnected lines.
OCR without the friction
Scanned and image-based pages are detected and processed automatically. No configuration. No manual triggers.
RAG-ready from the start
Per-page chunk dictionaries carry everything downstream needs — text, metadata, TOC entries, table positions, and word-level coordinates.
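As a rough illustration of how per-page chunks might feed a retrieval index, the sketch below flattens them into simple records. It assumes each chunk is a dictionary with a `"text"` key and, optionally, `"tables"` and `"toc_items"` keys (with TOC items shaped like `[level, title, page]`), and that `to_markdown(..., page_chunks=True)` returns one such dictionary per page in order. Check the key names against the version you have installed; `chunks_to_records` and the record layout are hypothetical.

```python
def chunks_to_records(page_chunks, source):
    """Turn per-page chunk dictionaries into flat records for indexing."""
    records = []
    for page_no, chunk in enumerate(page_chunks, start=1):
        text = chunk.get("text", "").strip()
        if not text:
            continue  # skip pages that produced no text
        records.append(
            {
                "id": f"{source}#page={page_no}",
                "text": text,
                "table_count": len(chunk.get("tables", [])),
                # Assumed TOC item shape: [level, title, page]
                "toc_titles": [item[1] for item in chunk.get("toc_items", [])],
            }
        )
    return records


def load_page_chunks(pdf_path):
    """Fetch per-page chunks from the library (needs pymupdf4llm installed)."""
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path, page_chunks=True)
```

Page numbers here come from list position, so the sketch only holds when the whole document is converted rather than a subset of pages.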
AI-powered layout analysis
Optional PyMuPDF-Layout integration brings state-of-the-art AI region detection for the most complex and demanding documents.
Plugs into your stack
Native loaders for LlamaIndex and LangChain. Drop into your existing pipeline with zero glue code.
Three output formats. One consistent API.
Whether you’re building a RAG pipeline, a custom document intelligence system, or a data extraction workflow, PyMuPDF4LLM produces the format you need.

| Format | Best for |
|---|---|
| Markdown | LLM ingestion, RAG pipelines, and human-readable output with structure preserved |
| JSON | Custom pipelines that need bounding boxes, font data, and per-block layout metadata |
| Plain Text | Search indexing, NLP preprocessing, and tools that don’t need formatting |
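
One way to pick between the three formats at runtime is a small dispatcher. This sketch assumes a recent PyMuPDF4LLM release that exposes `to_json` and `to_text` alongside `to_markdown`; if your installed version lacks one of them, the helper raises a clear error instead of guessing. The `extract` function and its `backend` parameter are illustrative only.

```python
_FORMAT_FUNCS = {"markdown": "to_markdown", "json": "to_json", "text": "to_text"}


def extract(pdf_path, fmt="markdown", backend=None):
    """Run the extraction function that matches the requested output format."""
    if fmt not in _FORMAT_FUNCS:
        raise ValueError(f"unknown format {fmt!r}; expected one of {sorted(_FORMAT_FUNCS)}")
    if backend is None:
        # Imported lazily; requires `pip install pymupdf4llm`.
        import pymupdf4llm as backend
    func = getattr(backend, _FORMAT_FUNCS[fmt], None)
    if func is None:
        # Assumption: older releases may not provide every function.
        raise RuntimeError(f"{_FORMAT_FUNCS[fmt]} is not available in this installation")
    return func(pdf_path)
```

For example, `extract("report.pdf", "json")` would return the JSON output for a placeholder file named `report.pdf`.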
Works with the documents you already have
Standard formats
PDF, XPS, EPUB, MOBI, and more — supported out of the box, no extra setup required.
Office formats with Pro
Unlock DOCX, PPTX, XLSX, and more with PyMuPDF Pro. The same clean API. The same consistent output.
Trusted at every scale
PyMuPDF4LLM is built for production. It handles everything from single-page invoices to thousands of pages of legal, financial, or technical documentation, with predictable performance and output quality you can rely on.
Performance
Built on PyMuPDF, the fastest PDF engine available in Python — benchmarked faster than every major alternative.
Accuracy
AI-powered layout analysis and best-in-class table detection mean fewer pipeline errors and less manual correction.
Flexibility
Swap OCR engines, customise layout detection, apply page margins, or tune every extraction parameter to your needs.
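A hedged sketch of what tuning might look like for a few of these settings. It assumes `to_markdown` accepts the keyword arguments `pages` (0-based page numbers), `margins`, and `table_strategy`; verify the names and accepted values against the API reference for your version, since some options may be ignored when layout analysis is active. `build_options` and `convert_with_options` are hypothetical helpers, and OCR engine selection is not covered here.

```python
def build_options(pages=None, margins=None, table_strategy=None):
    """Collect only the extraction options the caller actually set."""
    options = {}
    if pages is not None:
        options["pages"] = list(pages)  # assumed to be 0-based page numbers
    if margins is not None:
        options["margins"] = margins  # assumed: one number or a sequence of numbers
    if table_strategy is not None:
        options["table_strategy"] = table_strategy  # e.g. "lines_strict"
    return options


def convert_with_options(pdf_path, **settings):
    """Convert a PDF using the selected options (needs pymupdf4llm installed)."""
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path, **build_options(**settings))
```

Leaving unset options out of the call keeps the library defaults in effect, which makes it easier to change one parameter at a time.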