Extraction methods
The three primary extraction methods share a common interface: each accepts a file path string or an open MuPDF.NET.Document, supports the pages parameter for partial extraction, and returns a string you can write directly to disk or pass downstream.
ToMarkdown()
Extract content as a GitHub-compatible Markdown string. The primary method for LLM ingestion and RAG pipelines. Supports image extraction, OCR, and per-page output via LlamaMarkdownReader.
ToJson()
Extract content as structured JSON with bounding boxes and layout data for every block on the page. Use for custom pipelines, positional filtering, and debugging extraction output.
ToText()
Extract content as plain text, stripped of all Markdown syntax. Use for search indexing, NLP pipelines, and systems that render Markdown literally.
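Because each of the three methods takes a path and returns a string, they can be treated as interchangeable `Func<string, string>` delegates. The following is a minimal sketch of that idea, not part of the library: `ExtractionExport` and `ExportAll` are hypothetical names, and the helper uses only the .NET base class library so that it compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of the library): runs several extractors
// on one PDF and writes one output file per format.
public static class ExtractionExport
{
    // Keys are file extensions ("md", "json", "txt").
    // Values map a PDF path to the extracted string.
    public static List<string> ExportAll(
        string pdfPath,
        string outputDir,
        IReadOnlyDictionary<string, Func<string, string>> extractors)
    {
        Directory.CreateDirectory(outputDir);
        string stem = Path.GetFileNameWithoutExtension(pdfPath);
        var written = new List<string>();

        foreach (var pair in extractors)
        {
            // pair.Key is the extension, pair.Value is the extractor delegate.
            string target = Path.Combine(outputDir, stem + "." + pair.Key);
            File.WriteAllText(target, pair.Value(pdfPath));
            written.Add(target);
        }

        return written;
    }
}
```

To use it with the methods above, pass a dictionary such as `["md"] = path => PdfExtractor.ToMarkdown(path)`, `["json"] = path => PdfExtractor.ToJson(path)`, and `["txt"] = path => PdfExtractor.ToText(path)`. This assumes the remaining parameters are optional and that the namespace defining `PdfExtractor`, which this section does not name, has been imported.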
Layout and structure methods
ParseDocument()
Analyse the visual layout of a document and return a typed ParsedDocument object (pages, text blocks, tables, and image regions) with bounding boxes and reading order. The in-process equivalent of ToJson().
GetKeyValues()
Extract all interactive AcroForm field names, values, and page locations from a PDF. Use for structured data extraction from filled-in forms.
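Once form fields have been extracted, a common next step is to keep only the filled-in ones and group them by page. The sketch below illustrates that step under stated assumptions. `FormFieldRow` is a local stand-in that mirrors the name, value, and page information described above; the real type comes from the library, and the property types used here (`string`, `string`, `int`) are guesses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Local stand-in for the library's form field type (assumed property types).
public sealed record FormFieldRow(string Name, string Value, int Page);

// Hypothetical helper (not part of the library).
public static class FormFieldSummary
{
    // Returns the fields that have a non-empty value, grouped by page number
    // and ordered by page.
    public static SortedDictionary<int, List<FormFieldRow>> FilledByPage(
        IEnumerable<FormFieldRow> fields)
    {
        var result = new SortedDictionary<int, List<FormFieldRow>>();

        foreach (var field in fields.Where(f => !string.IsNullOrWhiteSpace(f.Value)))
        {
            if (!result.TryGetValue(field.Page, out var bucket))
            {
                bucket = new List<FormFieldRow>();
                result[field.Page] = bucket;
            }
            bucket.Add(field);
        }

        return result;
    }
}
```

With the real library, the list returned by `PdfExtractor.GetKeyValues()` would first be projected into this shape, or the helper would be rewritten against the library's own field type.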
Reader types
PDFMarkdownReader
A LlamaIndex-compatible document reader. Created via PdfExtractor.LlamaMarkdownReader(). Loads a PDF and returns one LlamaDocument per page, each with Markdown text and metadata including page number and source file path.
Return types
ParsedDocument
Typed .NET object returned by ParseDocument(). Contains a list of ParsedPage objects, each with its blocks, tables, images, and dimensions.
FormField
Represents a single AcroForm field returned by GetKeyValues(). Exposes Name, Value, and Page properties.
Quick reference
| Method / Type | Returns | Key parameters |
|---|---|---|
| PdfExtractor.ToMarkdown() | string | pages, writeImages, embedImages, useOcr, ocrLanguage, forceText |
| PdfExtractor.ToJson() | string (JSON) | pages, showProgress |
| PdfExtractor.ToText() | string | pages, useOcr, ocrLanguage, forceText |
| PdfExtractor.ParseDocument() | ParsedDocument | pages, useOcr, ocrLanguage |
| PdfExtractor.GetKeyValues() | List<FormField> | doc only; no pages parameter |
| PdfExtractor.LlamaMarkdownReader() | PDFMarkdownReader | none |
| PDFMarkdownReader.LoadData() | List<LlamaDocument> | filePath, extraInfo |