Signature

pymupdf4llm.to_text(
    doc: str | pymupdf.Document,
    **kwargs
) -> str | list[dict]

Parameters

doc
str | pymupdf.Document
required
Path to the document file, or an already-opened pymupdf.Document instance. Supports PDF, XPS, and eBook formats, plus Office formats when PyMuPDF Pro is available.
**kwargs
various
Additional keyword arguments are shared with to_markdown(). See the to_markdown() API reference, which applies to all extraction functions, for details.

Returns

str
string
Returned when page_chunks=False (the default): a single plain-text string containing all extracted pages.
list[dict]
list
Returned when page_chunks=True: a list of dictionaries, one per extracted page, each with the following keys:
Key       Type  Description
text      str   Plain text content of the page
metadata  dict  Page metadata
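
Because the return type depends on page_chunks, code that receives a to_text() result from elsewhere may want to normalize it to one string. The following is a minimal sketch that relies only on the text key listed above. The helper name as_plain_text and the blank-line separator between pages are illustrative assumptions, not part of pymupdf4llm.

```python
def as_plain_text(result, separator="\n\n"):
    """Collapse a to_text() result into a single string.

    Hypothetical helper, not part of pymupdf4llm. Accepts either the str
    returned when page_chunks=False or the list of page dictionaries
    returned when page_chunks=True.
    """
    if isinstance(result, str):
        # page_chunks=False: the result is already one string.
        return result
    # page_chunks=True: join the "text" value of each page dictionary.
    return separator.join(chunk["text"] for chunk in result)
```

Checking with isinstance keeps the caller independent of how page_chunks was set at the call site.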

Raises

Exception          Condition
FileNotFoundError  doc is a path to a file that does not exist
ValueError         An index in pages is out of range for the document
ImportError        ocr=True or force_ocr=True but the OCR dependency is not installed
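
One way to handle the three exceptions listed above is a small wrapper around the call. This is a sketch under stated assumptions: extract_or_none is a hypothetical helper, the extraction function is passed in as an argument so the wrapper works with any of the extraction functions, and the printed messages are placeholders for whatever reporting the application uses.

```python
def extract_or_none(extract, doc, **kwargs):
    """Call an extraction function and return None on a documented error.

    Hypothetical helper, not part of pymupdf4llm. `extract` is expected to
    be a callable such as pymupdf4llm.to_text.
    """
    try:
        return extract(doc, **kwargs)
    except FileNotFoundError:
        # doc was a path to a file that does not exist.
        print(f"No such file: {doc}")
    except ValueError as exc:
        # Documented cause: a page index outside the document's range.
        print(f"Invalid argument: {exc}")
    except ImportError as exc:
        # OCR was requested but its dependency is not installed.
        print(f"OCR support is not installed: {exc}")
    return None


# Possible usage, assuming pymupdf4llm is installed and document.pdf exists:
#   import pymupdf4llm
#   text = extract_or_none(pymupdf4llm.to_text, "document.pdf", pages=[0, 1])
```

Any other exception still propagates, so unexpected failures are not silently swallowed.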

Examples

Minimal

import pymupdf4llm

text = pymupdf4llm.to_text("document.pdf")

Page chunks

import pymupdf4llm

chunks = pymupdf4llm.to_text("document.pdf", page_chunks=True)
for chunk in chunks:
    print(f"Page {chunk['metadata']['page']}: {len(chunk['text'])} chars")
    print(chunk['text'])

Save to file

from pathlib import Path

import pymupdf4llm

text = pymupdf4llm.to_text("document.pdf")
Path("output.txt").write_text(text, encoding="utf-8")
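
When page_chunks=True, each page can instead be written to its own file. The sketch below assumes only that every chunk has a text key, as described under Returns. The function name write_pages and the page_0001.txt naming scheme are illustrative choices. Files are numbered by position in the list, which may differ from the original page numbers if only a subset of pages was extracted.

```python
from pathlib import Path


def write_pages(chunks, out_dir):
    """Write each page chunk's text to out_dir/page_NNNN.txt.

    Hypothetical helper, not part of pymupdf4llm. Returns the list of
    paths written, in page order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    paths = []
    for number, chunk in enumerate(chunks, start=1):
        path = out / f"page_{number:04d}.txt"
        path.write_text(chunk["text"], encoding="utf-8")
        paths.append(path)
    return paths


# Possible usage, assuming pymupdf4llm is installed and document.pdf exists:
#   import pymupdf4llm
#   chunks = pymupdf4llm.to_text("document.pdf", page_chunks=True)
#   write_pages(chunks, "output_pages")
```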

See Also

Extract Text Guide

Full guided overview.

to_markdown()

Markdown output preserving document structure.

to_json()

Structured output with bounding boxes and layout data.