API

Extraction Functions

The three primary extraction functions share a common interface: each accepts a document path or a pymupdf.Document instance, supports the pages parameter for partial extraction, and handles OCR automatically.
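As a minimal sketch of that shared interface, the helper below forwards the same `pages` argument to whichever extraction function it is given. The names `extract_pages` and `demo` are hypothetical, and the sketch assumes `pages` takes a list of 0-based page numbers.

```python
from typing import Any, Callable, Iterable


def extract_pages(extractor: Callable[..., Any], source: Any, pages: Iterable[int]) -> Any:
    """Run one of the extraction functions on a subset of pages.

    `extractor` is expected to be to_markdown, to_json, or to_text.
    `source` is a document path or an already opened document.
    Assumption: `pages` is a list of 0-based page numbers.
    """
    return extractor(source, pages=list(pages))


def demo(path: str) -> str:
    """Extract the first three pages of `path` as Markdown."""
    import pymupdf4llm  # imported lazily so the helper has no hard dependency

    return extract_pages(pymupdf4llm.to_markdown, path, range(3))
```

Because the extractor is passed in, the same call works unchanged if `pymupdf4llm.to_text` or `pymupdf4llm.to_json` is substituted for `to_markdown`.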

to_markdown()

Extract content as a Markdown string or per-page chunk dictionaries. The primary function for LLM ingestion and RAG pipelines.
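The per-page chunk output can be flattened into records for an embedding or indexing step. The sketch below is illustrative only: `chunks_to_records` and `load_records` are hypothetical helpers, and it assumes each chunk dictionary carries a `"text"` key and a `"metadata"` dictionary with a `"page"` entry.

```python
from typing import Any, Dict, List


def chunks_to_records(chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten per-page chunk dictionaries into simple page/text records.

    Assumption: each chunk has a "text" key and a "metadata" dictionary
    with a "page" entry. Missing keys fall back to defaults.
    """
    records = []
    for index, chunk in enumerate(chunks):
        text = chunk.get("text", "").strip()
        if not text:
            continue  # skip pages with no extractable text
        page = chunk.get("metadata", {}).get("page", index)
        records.append({"page": page, "text": text})
    return records


def load_records(path: str) -> List[Dict[str, Any]]:
    """Extract `path` page by page and return the flattened records."""
    import pymupdf4llm  # imported lazily; only needed for real documents

    chunks = pymupdf4llm.to_markdown(path, page_chunks=True)
    return chunks_to_records(chunks)
```

Keeping the page number next to the text lets a retrieval step cite the page a passage came from.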

to_json()

Extract content as structured JSON with bounding boxes, font metadata, and layout data for every block on the page.

to_text()

Extract content as plain text, stripped of all Markdown syntax.

Analysis Functions

use_layout()

Analyse the visual layout of a document and return detected regions — columns, headers, figures, sidebars — with reading order and bounding boxes.

get_key_values()

Extract every word in the document as an individual dictionary with its bounding box and positional indices. Used for redaction, search, and ML pipelines.

Classes

LlamaMarkdownReader

A LlamaIndex BaseReader implementation. Loads documents as Document objects for use in LlamaIndex pipelines and vector stores.
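A hedged sketch of how the reader might be used: the helper takes any object exposing `load_data()` and drops documents with no text before they reach a vector store. `load_for_index` and `demo` are hypothetical names, and the sketch assumes each returned Document has a `text` attribute.

```python
from typing import Any, List, Optional


def load_for_index(reader: Any, file: str, extra_info: Optional[dict] = None) -> List[Any]:
    """Load a file through a reader and keep only non-empty documents.

    Assumption: each returned Document exposes its content as `text`.
    """
    docs = reader.load_data(file, extra_info=extra_info or {})
    return [doc for doc in docs if getattr(doc, "text", "").strip()]


def demo(path: str) -> List[Any]:
    """Load `path` with LlamaMarkdownReader (needs llama-index installed)."""
    import pymupdf4llm  # imported lazily so the helper stays dependency-free

    reader = pymupdf4llm.LlamaMarkdownReader()
    return load_for_index(reader, path, extra_info={"source": "demo"})
```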

IdentifyHeaders

Detects repeating page headers and footers. Returns bounding boxes and a get_margins() helper for passing directly to extraction functions.

TocHeaders

Extracts heading hierarchy from an embedded table of contents or infers it from font sizes. Returns a structured list of heading entries with levels and page numbers.

Utilities

version

Returns the version string for PyMuPDF4LLM.
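One use for the version string is guarding code that depends on newer features. The sketch below is an assumption-laden example rather than library behaviour: `version_at_least` is a hypothetical helper, and it assumes the version is a dotted string whose leading components are plain integers.

```python
from typing import Tuple


def version_at_least(version: str, minimum: Tuple[int, ...]) -> bool:
    """Compare a dotted version string against a minimum version tuple.

    Assumption: leading components are plain integers, e.g. "0.2.1".
    Parsing stops at the first component that is not purely numeric.
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts) >= minimum


def demo() -> bool:
    """Check the installed PyMuPDF4LLM against a placeholder minimum of 0.2."""
    import pymupdf4llm  # imported lazily; only needed for the real check

    return version_at_least(pymupdf4llm.version, (0, 2))
```

The comparison is a plain tuple comparison, so a minimum with more components than the version string (for example `(0, 2, 0)` against `"0.2"`) compares as newer; pass the shortest tuple that expresses the requirement.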

Quick Reference

| Function / Class | Returns | Key Parameters |
| --- | --- | --- |
| `to_markdown()` | `str` or `list[dict]` | `pages`, `page_chunks`, `use_layout`, `write_images` |
| `to_json()` | `list[dict]` | `pages`, `margins` |
| `to_text()` | `str` or `list[dict]` | `pages`, `page_chunks`, `page_separator` |
| `use_layout()` | `list[dict]` | `pages`, `margins` |
| `get_key_values()` | `list[dict]` | `pages`, `force_ocr` |
| `LlamaMarkdownReader.load_data()` | `list[Document]` | `file`, `pages`, `extra_info` |
| `IdentifyHeaders.get_margins()` | `tuple` | `body_limit` |
| `TocHeaders.headers` | `list[dict]` | `body_limit` |
| `version` | `str` | (none) |
