Overview

to_text() extracts the content of a document as a plain text string — no Markdown syntax, no bounding boxes, no metadata. It’s the simplest output format and the right choice when your downstream tool doesn’t need formatting or structure, just the words.
import pymupdf4llm

text = pymupdf4llm.to_text("document.pdf")
print(text)

When to Use Plain Text

Use Case | Recommended Format
Search indexing | ✅ Plain text
Keyword extraction / NLP | ✅ Plain text
LLM summarisation (simple) | ✅ Plain text
RAG pipelines with chunking | ⚠️ Consider Markdown or page chunks
Preserving document structure | ❌ Use Markdown
Custom layout pipelines | ❌ Use JSON
If you’re feeding content into an LLM and document structure matters — headings, lists, tables — use to_markdown() instead. LLMs handle Markdown well and the added structure improves output quality.

Page Selection

Extract only the pages you need:
text = pymupdf4llm.to_text("document.pdf", pages=[0, 1, 2])
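People usually think in printed page numbers, which start at 1, while the example above suggests that pages takes 0-based indexes. The sketch below converts an inclusive 1-based range into such a list. page_indexes and extract_range are hypothetical helper names, and the 0-based convention is an assumption drawn from the [0, 1, 2] example, not a confirmed part of the API.

```python
def page_indexes(first: int, last: int) -> list[int]:
    """Convert an inclusive 1-based page range into 0-based page indexes."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, last))


def extract_range(path: str, first: int, last: int) -> str:
    """Extract printed pages first..last (inclusive) as plain text.

    Assumes the pages argument expects 0-based indexes.
    """
    import pymupdf4llm  # imported lazily so page_indexes has no dependencies

    return pymupdf4llm.to_text(path, pages=page_indexes(first, last))
```

With this helper, extract_range("document.pdf", 1, 3) would request the same pages as pages=[0, 1, 2].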

Page Chunks

As with to_markdown(), you can return a list of per-page dictionaries using page_chunks=True:
chunks = pymupdf4llm.to_text("document.pdf", page_chunks=True)

for chunk in chunks:
    print(f"Page {chunk['metadata']}: {len(chunk['text'])} chars")
Each chunk is a dictionary with a text key holding the plain text for that page and a metadata dictionary with the page number and document information.
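Once you have per-page chunks, a common next step is to drop blank pages and stitch the rest back together. The sketch below relies only on the text key described above and uses each chunk's position in the list, so it makes no assumption about the key names inside metadata. nonempty_pages and join_pages are hypothetical helper names.

```python
def nonempty_pages(chunks: list[dict]) -> list[tuple[int, str]]:
    """Return (list position, stripped text) for every chunk that has text."""
    pages = []
    for position, chunk in enumerate(chunks):
        text = chunk.get("text", "").strip()
        if text:
            pages.append((position, text))
    return pages


def join_pages(chunks: list[dict], separator: str = "\n\n") -> str:
    """Join the non-empty page texts into one string."""
    return separator.join(text for _, text in nonempty_pages(chunks))
```

Passing the chunks list from the example above to join_pages gives a single string with blank pages skipped.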

Saving to a File

Write the output to a .txt file using pathlib:
import pymupdf4llm
from pathlib import Path

text = pymupdf4llm.to_text("document.pdf")
Path("output.txt").write_text(text, encoding="utf-8")
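If the output goes into a folder that may not exist yet, write_text fails with FileNotFoundError. A minimal sketch of a helper that creates the parent folders first is shown below. save_text is a hypothetical name, not part of pymupdf4llm.

```python
from pathlib import Path


def save_text(text: str, destination: Path) -> Path:
    """Write text as UTF-8, creating missing parent folders first."""
    target = Path(destination)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

For example, save_text(text, Path("out/reports/document.txt")) creates out/reports if needed and returns the path it wrote.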

OCR Behaviour

Like to_markdown(), to_text() triggers OCR automatically on pages with no selectable text. To turn automatic OCR on or off, set use_ocr:
# Enable OCR on all pages
text = pymupdf4llm.to_text("document.pdf", use_ocr=True)

# Disable OCR entirely
text = pymupdf4llm.to_text("document.pdf", use_ocr=False)
See OCR for a full walkthrough of OCR options and adaptors.
For the full API signature, see the to_text() API reference.
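To see which pages depend on OCR, one option is to extract with OCR disabled and list the pages that come back nearly empty. The sketch below assumes that page_chunks=True and use_ocr=False can be combined in a single call, which this page does not confirm. The helper names and the 20-character threshold are illustrative choices.

```python
def pages_without_text(chunks: list[dict], min_chars: int = 20) -> list[int]:
    """Return list positions of chunks whose stripped text is shorter than min_chars."""
    return [
        position
        for position, chunk in enumerate(chunks)
        if len(chunk.get("text", "").strip()) < min_chars
    ]


def find_ocr_candidates(path: str) -> list[int]:
    """List pages that yield almost no text when OCR is disabled.

    Assumes to_text() accepts page_chunks and use_ocr together.
    """
    import pymupdf4llm  # imported lazily so pages_without_text has no dependencies

    chunks = pymupdf4llm.to_text(path, page_chunks=True, use_ocr=False)
    return pages_without_text(chunks)
```

The positions returned are indexes into the chunks list, so they can be checked against a second run with OCR enabled.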

Next Steps

Extract Markdown: Preserve structure and formatting for LLM pipelines.
Extract JSON: Access bounding boxes and layout data for custom pipelines.
Saving Output: Write .md, .json, and .txt files with pathlib.
OCR: Control automatic OCR behaviour and adaptors.