
Signature

pymupdf4llm.to_json(
    doc: str | pymupdf.Document,
    **kwargs
) -> list[dict]

Parameters

doc
str | pymupdf.Document
required
Path to the document file, or an already-opened pymupdf.Document instance. Supports PDF, XPS, eBooks, and — with PyMuPDF Pro — Office formats.
**kwargs
various
All other parameters are shared with to_markdown() and apply to every extraction function. See the to_markdown() API reference for details.

Returns

list[dict]
list
A list of page objects, one per extracted page. See JSON Schema for the full field reference. See Extract JSON for detailed block structure examples.

Raises

Exception            Condition
FileNotFoundError    doc is a path string that does not exist
ValueError           An index in pages is out of range for the document
ImportError          ocr=True or force_ocr=True, but the OCR dependency is not installed
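
The three exceptions listed above can be handled in one place. The following is a minimal sketch, not part of pymupdf4llm: extract_or_none is a hypothetical helper name, and the extraction function is passed in as an argument so the snippet runs without the library installed.

```python
import logging

logger = logging.getLogger(__name__)


def extract_or_none(extract, source, **kwargs):
    """Call extract(source, **kwargs) and return None on a documented failure.

    `extract` is assumed to be pymupdf4llm.to_json (or a stand-in with the
    same call shape). This helper is hypothetical, not a library function.
    """
    try:
        return extract(source, **kwargs)
    except FileNotFoundError:
        # doc was a path string that does not exist
        logger.error("No such file: %s", source)
    except ValueError as exc:
        # an index in pages was out of range for the document
        logger.error("Bad page selection for %s: %s", source, exc)
    except ImportError as exc:
        # OCR was requested but its dependency is not installed
        logger.error("OCR dependency missing: %s", exc)
    return None
```

With the library installed, a call would look like extract_or_none(pymupdf4llm.to_json, "document.pdf"). Returning None keeps batch jobs running past a bad file; re-raise instead if a failure should stop the pipeline.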

Examples

Minimal

import pymupdf4llm

data = pymupdf4llm.to_json("document.pdf")

Iterate over blocks

# to_json() is documented above as returning a list of page dicts,
# so iterate over the return value directly.
for page_num, page in enumerate(data):
    for block in page.get("boxes", []):
        for line in block.get("textlines", []):
            for span in line.get("spans", []):
                bbox = span.get("bbox", [])   # bounding box for this text span
                text = span.get("text", "")   # text content of the span
                flags = span.get("flags", 0)  # font style flags (bitmask)
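
The nested loop above can be wrapped in small helpers that operate on a single page object. This is a sketch under stated assumptions: page_lines and bold_spans are hypothetical names, the sample page is hand-written rather than real to_json() output, and the bold bit (value 16) is assumed to follow PyMuPDF's span flag convention. Check the JSON Schema before relying on it.

```python
BOLD = 1 << 4  # assumption: bit 4 marks bold, as in PyMuPDF span flags


def page_lines(page):
    """Return one string per text line found on a page dict."""
    lines = []
    for block in page.get("boxes", []):
        # "or []" also covers boxes whose textlines value is None
        for line in block.get("textlines") or []:
            spans = line.get("spans") or []
            lines.append("".join(span.get("text", "") for span in spans))
    return lines


def bold_spans(page):
    """Return the text of every span whose flags have the bold bit set."""
    return [
        span.get("text", "")
        for block in page.get("boxes", [])
        for line in block.get("textlines") or []
        for span in line.get("spans") or []
        if span.get("flags", 0) & BOLD
    ]


# Hand-written sample page for illustration only.
sample_page = {
    "boxes": [
        {
            "textlines": [
                {
                    "spans": [
                        {"text": "Hello ", "flags": 16, "bbox": [72, 72, 110, 84]},
                        {"text": "world", "flags": 0, "bbox": [110, 72, 150, 84]},
                    ]
                },
                {
                    "spans": [
                        {"text": "second line", "flags": 0, "bbox": [72, 90, 150, 102]}
                    ]
                },
            ]
        },
        {"textlines": None},  # a box that carries no text lines
    ]
}

print(page_lines(sample_page))
print(bold_spans(sample_page))
```

Because the helpers take one page object at a time, they do not depend on how the pages are wrapped at the top level of the output.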

See Also

Extract JSON Guide

Full walkthrough with bounding boxes, span flags, and pipeline examples.

JSON Schema

Complete field reference for every object in the JSON output.

to_markdown()

Markdown output for LLM ingestion and readable docs.

Tables Guide

Working with table blocks in the JSON output.