Overview
to_text() extracts the content of a document as a plain text string — no Markdown syntax, no bounding boxes, no metadata. It’s the simplest output format and the right choice when your downstream tool doesn’t need formatting or structure, just the words.
When to Use Plain Text
| Use Case | Recommended Format |
|---|---|
| Search indexing | ✅ Plain text |
| Keyword extraction / NLP | ✅ Plain text |
| LLM summarisation (simple) | ✅ Plain text |
| RAG pipelines with chunking | ⚠️ Consider Markdown or page chunks |
| Preserving document structure | ❌ Use Markdown |
| Custom layout pipelines | ❌ Use JSON |
Page Selection
Extract only the pages you need:
Page Chunks
As with to_markdown(), you can return a list of per-page dictionaries using page_chunks=True:
Each dictionary contains a text object with the plain text for that page and a metadata dictionary with the page number and document information.
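As a minimal sketch of consuming that per-page list, the helper below joins the text of every non-empty page. It assumes only that each item is a dictionary with a text key and a metadata key, as described above. The join_pages helper, the sample data, and the page_number metadata key are hypothetical stand-ins, so check the actual key names in your own output.

```python
from typing import Any, Dict, List


def join_pages(chunks: List[Dict[str, Any]], separator: str = "\n\n") -> str:
    """Concatenate the plain text of every non-empty page chunk."""
    texts = [chunk["text"].strip() for chunk in chunks]
    # Skip pages whose text is empty or whitespace only.
    return separator.join(t for t in texts if t)


# Hypothetical sample shaped like a page_chunks=True result.
# The "page_number" key is an assumption made for illustration.
sample_chunks = [
    {"text": "First page text.\n", "metadata": {"page_number": 1}},
    {"text": "   ", "metadata": {"page_number": 2}},
    {"text": "Third page text.", "metadata": {"page_number": 3}},
]

combined = join_pages(sample_chunks)
```

Keeping the chunks separate until the last moment lets you filter or reorder pages by their metadata before joining.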
Saving to a File
Write the output to a .txt file using pathlib:
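One way this could look is sketched below, using only the standard library. The save_text helper and the example path are hypothetical. The string passed in stands for whatever to_text() returned.

```python
from pathlib import Path
from typing import Union


def save_text(text: str, destination: Union[str, Path]) -> Path:
    """Write text to destination as UTF-8, creating parent folders if needed."""
    path = Path(destination)
    path.parent.mkdir(parents=True, exist_ok=True)
    # An explicit encoding avoids platform-dependent defaults.
    path.write_text(text, encoding="utf-8")
    return path


# Example call (hypothetical path):
# save_text(extracted_text, "output/document.txt")
```

Passing the encoding explicitly matters for documents with non-ASCII characters, since the default encoding of write_text varies by platform.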
OCR Behaviour
Like to_markdown(), to_text() triggers OCR automatically on pages with no selectable text. To enable or disable auto-OCR capabilities:
For the full API signature, see the to_text() API reference.
Next Steps
Extract Markdown
Preserve structure and formatting for LLM pipelines.
Extract JSON
Access bounding boxes and layout data for custom pipelines.
Saving Output
Write .md, .json, and .txt files with pathlib.
OCR
Control automatic OCR behaviour and adaptors.