Extraction Functions
The three primary extraction functions share a common interface: they all accept a document path or a pymupdf.Document instance, support the pages parameter for partial extraction, and handle OCR automatically.
to_markdown()
Extract content as a Markdown string or per-page chunk dictionaries. The primary function for LLM ingestion and RAG pipelines.
to_json()
Extract content as structured JSON with bounding boxes, font metadata, and layout data for every block on the page.
to_text()
Extract content as plain text, stripped of all Markdown syntax.
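As a rough illustration of the shared interface, the sketch below calls to_markdown() with the pages and page_chunks parameters listed in the Quick Reference and flattens the per-page chunk dictionaries into simple records for a RAG index. The helper names chunks_to_records and extract_records are hypothetical. The "text" and "metadata"/"page" chunk keys and the 0-based page numbering are assumptions about the chunk layout, not something this page specifies, so check them against the to_markdown() reference before relying on them.

```python
from __future__ import annotations

from typing import Any, Optional


def chunks_to_records(chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flatten per-page chunk dictionaries into {"page", "text"} records.

    Assumes each chunk carries its Markdown under "text" and a page number
    under chunk["metadata"]["page"]; both key names are assumptions.
    """
    records = []
    for index, chunk in enumerate(chunks):
        text = chunk.get("text", "").strip()
        if not text:
            continue  # skip pages that produced no text
        metadata = chunk.get("metadata", {})
        # Fall back to the chunk's position when no page number is present.
        records.append({"page": metadata.get("page", index), "text": text})
    return records


def extract_records(path: str, pages: Optional[list[int]] = None) -> list[dict[str, Any]]:
    """Extract selected pages as per-page chunks and flatten them."""
    # Imported lazily so chunks_to_records() stays usable without the library.
    import pymupdf4llm

    # page_chunks=True requests per-page dictionaries instead of one string.
    chunks = pymupdf4llm.to_markdown(path, pages=pages, page_chunks=True)
    return chunks_to_records(chunks)
```

A call such as extract_records("report.pdf", pages=[0, 1]) would then process only the first two pages, assuming page numbers are 0-based; "report.pdf" is a placeholder path.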
Analysis Functions
use_layout()
Analyse the visual layout of a document and return detected regions — columns, headers, figures, sidebars — with reading order and bounding boxes.
get_key_values()
Extract every word in the document as an individual dictionary with its bounding box and positional indices. Used for redaction, search, and ML pipelines.
Classes
LlamaMarkdownReader
A LlamaIndex BaseReader implementation. Loads documents as Document objects for use in LlamaIndex pipelines and vector stores.
IdentifyHeaders
Detects repeating page headers and footers. Returns bounding boxes and a get_margins() helper for passing directly to extraction functions.
TocHeaders
Extracts heading hierarchy from an embedded table of contents or infers it from font sizes. Returns a structured list of heading entries with levels and page numbers.
Utilities
version
Returns the version string for PyMuPDF4LLM.
Quick Reference
| Function / Class | Returns | Key Parameters |
|---|---|---|
| to_markdown() | str or list[dict] | pages, page_chunks, use_layout, write_images |
| to_json() | list[dict] | pages, margins |
| to_text() | str or list[dict] | pages, page_chunks, page_separator |
| use_layout() | list[dict] | pages, margins |
| get_key_values() | list[dict] | pages, force_ocr |
| LlamaMarkdownReader.load_data() | list[Document] | file, pages, extra_info |
| IdentifyHeaders.get_margins() | tuple | body_limit |
| TocHeaders.headers | list[dict] | body_limit |
| version | str | (none) |