Extract Text

Overview

to_text() extracts the content of a document as a plain text string — no Markdown syntax, no bounding boxes, no metadata. It’s the simplest output format and the right choice when your downstream tool doesn’t need formatting or structure, just the words.

import pymupdf4llm

text = pymupdf4llm.to_text("document.pdf")
print(text)

When to Use Plain Text

Use Case	Recommended Format
Search indexing	✅ Plain text
Keyword extraction / NLP	✅ Plain text
LLM summarisation (simple)	✅ Plain text
RAG pipelines with chunking	⚠️ Consider Markdown or page chunks
Preserving document structure	❌ Use Markdown
Custom layout pipelines	❌ Use JSON

If you’re feeding content into an LLM and document structure matters (headings, lists, tables), use to_markdown() instead. LLMs generally handle Markdown well, and the added structure tends to improve output quality.
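As a rough illustration of the keyword extraction use case listed in the table, the sketch below counts word frequencies in a plain text string. It uses only the standard library; `sample_text` is a made-up stand-in for the string returned by to_text(), and `top_keywords` and `STOP_WORDS` are hypothetical names introduced here, not part of pymupdf4llm.

```python
import re
from collections import Counter

# Stand-in for the string returned by pymupdf4llm.to_text(); no PDF is read here.
sample_text = (
    "Invoices are due in 30 days. Late invoices incur a fee. "
    "Fees are listed per invoice."
)

# Tiny illustrative stop-word list (an assumption; real pipelines use a fuller one).
STOP_WORDS = {"a", "are", "in", "is", "per", "the"}


def top_keywords(text: str, n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent lowercase words, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


keywords = top_keywords(sample_text)
print(keywords)
```

Because plain text carries no markup, a simple tokenizer like this does not need to strip Markdown syntax before counting.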

Page Selection

Extract only the pages you need by passing a list of 0-based page numbers:

text = pymupdf4llm.to_text("document.pdf", pages=[0, 1, 2])
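If page ranges come from users in 1-based form ("pages 1 to 3"), a small conversion step may help. The sketch below assumes that pages takes 0-based page numbers, following the usual PyMuPDF convention; `to_page_indices` is a hypothetical helper, not part of pymupdf4llm.

```python
def to_page_indices(first: int, last: int) -> list[int]:
    """Convert an inclusive 1-based page range to 0-based page numbers."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, last))


pages = to_page_indices(1, 3)
print(pages)  # [0, 1, 2]

# The result could then be passed along, for example:
# text = pymupdf4llm.to_text("document.pdf", pages=pages)
```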

Page Chunks

As with to_markdown(), you can return a list of per-page dictionaries using page_chunks=True:

chunks = pymupdf4llm.to_text("document.pdf", page_chunks=True)

for chunk in chunks:
    # chunk["metadata"] is a dictionary, so the whole dictionary is printed here
    print(f"{chunk['metadata']}: {len(chunk['text'])} chars")

Each chunk is a dictionary with a text key holding the plain text for that page and a metadata key holding a dictionary with the page number and document information.
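Once you have per-page chunks, a common next step is to drop empty pages and recombine the rest. The following is a minimal sketch that relies only on the text key described above; `sample_chunks` is made-up data standing in for the returned list, its metadata contents are placeholders, and `join_nonempty_pages` is a hypothetical helper.

```python
# Stand-in for the list returned with page_chunks=True. Only the "text" and
# "metadata" keys are assumed; the metadata contents here are placeholders.
sample_chunks = [
    {"text": "First page text.\n", "metadata": {"note": "illustrative"}},
    {"text": "   \n", "metadata": {"note": "illustrative"}},
    {"text": "Third page text.", "metadata": {"note": "illustrative"}},
]


def join_nonempty_pages(chunks: list[dict], separator: str = "\n\n") -> str:
    """Join the stripped text of every chunk that has non-whitespace text."""
    pages = [chunk["text"].strip() for chunk in chunks]
    return separator.join(page for page in pages if page)


combined = join_nonempty_pages(sample_chunks)
print(combined)
```

Keeping the separator configurable makes it easy to switch between a blank line and a form feed or other page marker.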

Saving to a File

Write the output to a .txt file using pathlib:

import pymupdf4llm
from pathlib import Path

text = pymupdf4llm.to_text("document.pdf")
Path("output.txt").write_text(text, encoding="utf-8")
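The same pathlib approach extends to writing one file per page. The sketch below is one possible layout, not a convention of the library: `save_pages` and the page_0001.txt naming scheme are assumptions, and the page texts are made-up stand-ins for values such as chunk["text"].

```python
import tempfile
from pathlib import Path


def save_pages(page_texts: list[str], out_dir: Path) -> list[Path]:
    """Write each page's text to out_dir/page_0001.txt, page_0002.txt, ..."""
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    paths = []
    for number, text in enumerate(page_texts, start=1):
        path = out_dir / f"page_{number:04d}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


# Stand-in page texts written to a temporary directory for demonstration.
with tempfile.TemporaryDirectory() as tmp:
    written = save_pages(["Page one.", "Page two."], Path(tmp) / "pages")
    names = [p.name for p in written]
    first = written[0].read_text(encoding="utf-8")

print(names)
```

Zero-padded file names keep the pages in order when the directory is sorted alphabetically.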

OCR Behaviour

Like to_markdown(), to_text() triggers OCR automatically on pages with no selectable text. To turn automatic OCR on or off explicitly, pass use_ocr:

# Enable automatic OCR
text = pymupdf4llm.to_text("document.pdf", use_ocr=True)

# Disable OCR entirely
text = pymupdf4llm.to_text("document.pdf", use_ocr=False)

See OCR for a full walkthrough of OCR options and adaptors.

For the full API signature, see the to_text() API reference.

Next Steps

Extract Markdown

Preserve structure and formatting for LLM pipelines.

Extract JSON

Access bounding boxes and layout data for custom pipelines.

Saving Output

Write .md, .json, and .txt files with pathlib.

OCR

Control automatic OCR behaviour and adaptors.
