Extraction Functions

The three primary extraction functions share a common interface: each accepts a document path or a pymupdf.Document instance, supports the pages parameter for partial extraction, and handles OCR automatically.
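Because the three functions take the same first argument and the same pages parameter, a single dispatcher can sit in front of them. The following is a minimal sketch of that idea, not part of the library: the names extract, fmt, backend, and EXTRACTORS are hypothetical, and the only library behaviour it assumes is what the paragraph above states.

```python
# Maps a short format name to the extraction function of the same purpose.
EXTRACTORS = {"markdown": "to_markdown", "json": "to_json", "text": "to_text"}


def extract(source, fmt="markdown", pages=None, backend=None):
    """Run one of the three extraction functions on `source`.

    `source` is a document path or an already opened document.
    `backend` is a hypothetical hook that defaults to the pymupdf4llm module.
    """
    if fmt not in EXTRACTORS:
        raise ValueError(f"unknown format: {fmt!r}")
    if backend is None:
        # Imported lazily so the helper can be exercised without the package.
        import pymupdf4llm as backend
    func = getattr(backend, EXTRACTORS[fmt])
    # Pass `pages` only when the caller asked for a subset of the document.
    kwargs = {} if pages is None else {"pages": pages}
    return func(source, **kwargs)
```

A call such as extract("report.pdf", fmt="text", pages=[0, 1]) would then forward to to_text() for the first two pages, assuming 0-based page numbers; report.pdf is a placeholder file name.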

to_markdown()

Extract content as a single Markdown string or, when page_chunks is enabled, as a list of per-page chunk dictionaries. The primary function for LLM ingestion and RAG pipelines.
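For a RAG pipeline, the per-page chunk dictionaries usually need to be reduced to small records before embedding. The sketch below shows one way to do that in plain Python. It assumes each chunk carries a "text" key and a "metadata" dictionary with a "page" entry; those key names, and the helper chunks_to_records itself, are assumptions for illustration and should be checked against the actual output.

```python
def chunks_to_records(chunks, min_chars=1):
    """Turn per-page chunk dictionaries into {"id", "text"} records.

    Assumed chunk shape: {"text": str, "metadata": {"page": int, ...}}.
    Chunks whose stripped text is shorter than `min_chars` are skipped.
    """
    records = []
    for index, chunk in enumerate(chunks):
        text = chunk.get("text", "").strip()
        if len(text) < min_chars:
            continue  # Skip blank or near-empty pages.
        page = chunk.get("metadata", {}).get("page")
        # Fall back to the list position when no page number is present.
        record_id = f"page-{page}" if page is not None else f"chunk-{index}"
        records.append({"id": record_id, "text": text})
    return records
```

The input would typically come from a call such as to_markdown(path, page_chunks=True); the records can then be handed to whichever embedding or indexing step the pipeline uses.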

to_json()

Extract content as structured JSON with bounding boxes, font metadata, and layout data for every block on the page.

to_text()

Extract content as plain text, stripped of all Markdown syntax.

Analysis Functions

use_layout()

Analyse the visual layout of a document and return detected regions — columns, headers, figures, sidebars — with reading order and bounding boxes.

get_key_values()

Extract every word in the document as an individual dictionary with its bounding box and positional indices. Used for redaction, search, and ML pipelines.

Classes

LlamaMarkdownReader

A LlamaIndex BaseReader implementation. Loads documents as Document objects for use in LlamaIndex pipelines and vector stores.

IdentifyHeaders

Detects repeating page headers and footers. Returns bounding boxes and a get_margins() helper for passing directly to extraction functions.

TocHeaders

Extracts heading hierarchy from an embedded table of contents or infers it from font sizes. Returns a structured list of heading entries with levels and page numbers.

Utilities

version

Returns the version string for PyMuPDF4LLM.

Quick Reference

| Function / Class | Returns | Key Parameters |
| --- | --- | --- |
| to_markdown() | str or list[dict] | pages, page_chunks, use_layout, write_images |
| to_json() | list[dict] | pages, margins |
| to_text() | str or list[dict] | pages, page_chunks, page_separator |
| use_layout() | list[dict] | pages, margins |
| get_key_values() | list[dict] | pages, force_ocr |
| LlamaMarkdownReader.load_data() | list[Document] | file, pages, extra_info |
| IdentifyHeaders.get_margins() | tuple | body_limit |
| TocHeaders.headers | list[dict] | body_limit |
| version | str | |