ToText() extracts the content of a document as a plain text string — no Markdown syntax, no bounding boxes, no metadata. It’s the simplest output format and the right choice when your downstream tool doesn’t need formatting or structure, just the words.
using PDF4LLM;

string text = PdfExtractor.ToText("document.pdf");
Console.WriteLine(text);
If you’re feeding content into an LLM and document structure matters — headings, lists, tables — use ToMarkdown() instead. LLMs handle Markdown well and the added structure improves output quality.
Use LlamaMarkdownReader to return one document object per page instead of a single concatenated string. Each chunk includes the page’s text as Markdown and a metadata dictionary with the page number and source file path:
using PDF4LLM;

var reader = PdfExtractor.LlamaMarkdownReader();
var chunks = reader.LoadData("document.pdf");

foreach (var chunk in chunks)
{
    // Page number comes from the chunk's metadata dictionary
    int page = (int)chunk.ExtraInfo["page"];
    string text = chunk.Text;
    Console.WriteLine($"Page {page}: {text.Length} chars");
}
Each chunk’s Text property contains the Markdown for that page. To get plain text instead, either strip the Markdown syntax after loading or call ToText once per page with the pages parameter:
using System.Collections.Generic;
using MuPDF.NET;
using PDF4LLM;

Document doc = new Document("document.pdf");
var chunks = new List<(int Page, string Text)>();

// Extract one page at a time; page indices passed to ToText start at 0
for (int i = 0; i < doc.PageCount; i++)
{
    string pageText = PdfExtractor.ToText(doc, pages: new List<int> { i });
    chunks.Add((i, pageText));
}
doc.Close();

foreach (var chunk in chunks)
    Console.WriteLine($"Page {chunk.Page}: {chunk.Text.Length} chars");
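The other option mentioned above, stripping Markdown syntax after loading, can be approximated with a few regular expressions. The sketch below is only a heuristic: `MarkdownStripper` and `StripMarkdown` are hypothetical names, not part of PDF4LLM, and the rules cover headings, blockquotes, bullet markers, links, images, inline code, and asterisk emphasis only. Tables, numbered lists, and underscore emphasis are left untouched (the latter so that identifiers such as snake_case_name are not mangled).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDF4LLM: removes common Markdown markers.
public static class MarkdownStripper
{
    public static string StripMarkdown(string markdown)
    {
        string text = markdown;

        // Images ![alt](url) -> alt (must run before the link rule)
        text = Regex.Replace(text, @"!\[([^\]]*)\]\([^)]*\)", "$1");
        // Links [label](url) -> label
        text = Regex.Replace(text, @"\[([^\]]*)\]\([^)]*\)", "$1");
        // Heading markers at the start of a line: "## Title" -> "Title"
        text = Regex.Replace(text, @"^[ \t]{0,3}#{1,6}[ \t]+", "", RegexOptions.Multiline);
        // Blockquote markers at the start of a line
        text = Regex.Replace(text, @"^[ \t]{0,3}>[ \t]?", "", RegexOptions.Multiline);
        // Bullet markers (-, *, +) at the start of a line, keeping indentation
        text = Regex.Replace(text, @"^([ \t]*)[-*+][ \t]+", "$1", RegexOptions.Multiline);
        // Bold (**text**) and inline code (`text`)
        text = Regex.Replace(text, @"(\*\*|`)(?=\S)(.+?)(?<=\S)\1", "$2");
        // Italic (*text*); the \S checks leave "2 * 3 * 4" alone
        text = Regex.Replace(text, @"\*(?=\S)(.+?)(?<=\S)\*", "$1");

        return text;
    }
}
```

Inside the LlamaMarkdownReader loop this would be called as `MarkdownStripper.StripMarkdown(chunk.Text)`. When exact plain text matters, calling ToText per page is likely the more reliable route, since it avoids guessing which characters are Markdown syntax.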
To write the extracted text to a single file:
using System.IO;
using PDF4LLM;

string text = PdfExtractor.ToText("document.pdf");
File.WriteAllText("output.txt", text, System.Text.Encoding.UTF8);
To save each page as a separate file:
using System.Collections.Generic;
using System.IO;
using MuPDF.NET;
using PDF4LLM;

Document doc = new Document("document.pdf");
Directory.CreateDirectory("output");

// Write each page's text to output/page-<index>.txt (index starts at 0)
for (int i = 0; i < doc.PageCount; i++)
{
    string pageText = PdfExtractor.ToText(doc, pages: new List<int> { i });
    File.WriteAllText($"output/page-{i}.txt", pageText, System.Text.Encoding.UTF8);
}
doc.Close();
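File names such as page-2.txt and page-10.txt do not sort in page order when listed alphabetically. If that matters for your pipeline, one possible variation is to zero-pad the page number. The helper below is a sketch under that assumption; `PageFileNames` and `For` are hypothetical names, and the switch to 1-based numbering is a choice made here, not something the library requires.

```csharp
using System;
using System.Globalization;
using System.IO;

// Hypothetical helper: builds zero-padded, 1-based per-page file paths.
public static class PageFileNames
{
    public static string For(string directory, int pageIndex, int pageCount)
    {
        // Pad to the number of digits in the highest page number
        int width = pageCount.ToString(CultureInfo.InvariantCulture).Length;
        string number = (pageIndex + 1).ToString("D" + width, CultureInfo.InvariantCulture);
        return Path.Combine(directory, $"page-{number}.txt");
    }
}
```

In the per-page loop, the path argument to File.WriteAllText would then become `PageFileNames.For("output", i, doc.PageCount)`.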
Like ToMarkdown(), ToText() can invoke Tesseract OCR on pages that contain no selectable text. Pass useOcr: true to enable it:
// Enable OCR for pages without selectable text
string text = PdfExtractor.ToText("document.pdf", useOcr: true);

// Enable OCR with a specific language
string frenchText = PdfExtractor.ToText("document.pdf", useOcr: true, ocrLanguage: "fra");
See OCR for a full walkthrough of Tesseract installation, language codes, and patterns for mixed documents.