API

Extraction methods

The three primary extraction methods share a common interface — they all accept a file path string or an open MuPDF.NET.Document, support the pages parameter for partial extraction, and return a string you can write directly to disk or pass downstream.
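As a rough illustration of that shared shape, the sketch below calls two of the methods with a file path and writes the results to disk. The `MuPDF.NET4LLM` namespace and the `report.pdf` file name are assumptions made for the example, and the optional parameters are left at their defaults because their types are not specified here. It needs the library referenced in your project to compile.

```csharp
using System.IO;
using MuPDF.NET4LLM; // ASSUMPTION: namespace that exposes PdfExtractor; adjust to your package

// Hypothetical input file, used only for illustration.
const string pdfPath = "report.pdf";

// Each extraction method accepts a file path and returns a string.
string markdown = PdfExtractor.ToMarkdown(pdfPath);
string text = PdfExtractor.ToText(pdfPath);

// The returned strings can be written straight to disk.
File.WriteAllText("report.md", markdown);
File.WriteAllText("report.txt", text);
```

`ToJson()` is described as following the same pattern, so it could be swapped in the same way.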

ToMarkdown()

Extract content as a GitHub-compatible Markdown string. The primary method for LLM ingestion and RAG pipelines. Supports image extraction and OCR. For per-page output, use the reader returned by LlamaMarkdownReader().

ToJson()

Extract content as structured JSON with bounding boxes and layout data for every block on the page. Use for custom pipelines, positional filtering, and debugging extraction output.

ToText()

Extract content as plain text, stripped of all Markdown syntax. Use for search indexing, NLP pipelines, and systems that would display Markdown syntax as literal characters instead of rendering it.

Layout and structure methods

ParseDocument()

Analyse the visual layout of a document and return a typed ParsedDocument object — pages, text blocks, tables, and image regions — with bounding boxes and reading order. The in-process equivalent of ToJson().

GetKeyValues()

Extract all interactive AcroForm field names, values, and page locations from a PDF. Use for structured data extraction from filled-in forms.

Reader types

PDFMarkdownReader

A LlamaIndex-compatible document reader. Created via PdfExtractor.LlamaMarkdownReader(). Loads a PDF and returns one LlamaDocument per page, each with Markdown text and metadata including page number and source file path.
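A minimal sketch of the reader flow described above, using only the member names listed on this page. The `MuPDF.NET4LLM` namespace and the `report.pdf` file name are assumptions, as is the idea that `extraInfo` can be omitted. It needs the library referenced in your project to compile.

```csharp
using System;
using MuPDF.NET4LLM; // ASSUMPTION: namespace that exposes PdfExtractor; adjust to your package

// Create the reader through the factory method.
var reader = PdfExtractor.LlamaMarkdownReader();

// Load a PDF (hypothetical path); extraInfo is assumed to be optional and is omitted.
var documents = reader.LoadData(filePath: "report.pdf");

// The reader returns one document per page, so the count matches the page count.
Console.WriteLine($"Loaded {documents.Count} page documents.");
```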

Return types

ParsedDocument

Typed .NET object returned by ParseDocument(). Contains a list of ParsedPage objects, each with its blocks, tables, images, and dimensions.

FormField

Represents a single AcroForm field returned by GetKeyValues(). Exposes Name, Value, and Page properties.
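One way to work with these fields downstream is sketched below. To keep the snippet runnable without the library, a named tuple stands in for FormField. The tuple mirrors the Name, Value, and Page properties, but the property types (string, string, int) and the sample data are assumptions. In real code the list would come from GetKeyValues().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the List<FormField> returned by GetKeyValues().
// ASSUMPTION: Name and Value are strings and Page is an int; the data is made up.
var fields = new List<(string Name, string Value, int Page)>
{
    ("FirstName", "Ada", 0),
    ("LastName", "Lovelace", 0),
    ("Signature", "", 1),
};

// Name -> value lookup (assumes field names are unique within the document).
Dictionary<string, string> valuesByName =
    fields.ToDictionary(f => f.Name, f => f.Value);

// Names of fields that were left empty, grouped by page.
Dictionary<int, List<string>> emptyByPage = fields
    .Where(f => string.IsNullOrEmpty(f.Value))
    .GroupBy(f => f.Page)
    .ToDictionary(g => g.Key, g => g.Select(f => f.Name).ToList());

Console.WriteLine(valuesByName["FirstName"]); // prints "Ada"
foreach (var (page, names) in emptyByPage)
{
    Console.WriteLine($"Page {page}: {string.Join(", ", names)}"); // prints "Page 1: Signature"
}
```

The same LINQ queries should work unchanged on real FormField objects, provided the property names match those listed above.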

Quick reference

| Method / Type | Returns | Key parameters |
| --- | --- | --- |
| `PdfExtractor.ToMarkdown()` | `string` | `pages`, `writeImages`, `embedImages`, `useOcr`, `ocrLanguage`, `forceText` |
| `PdfExtractor.ToJson()` | `string` (JSON) | `pages`, `showProgress` |
| `PdfExtractor.ToText()` | `string` | `pages`, `useOcr`, `ocrLanguage`, `forceText` |
| `PdfExtractor.ParseDocument()` | `ParsedDocument` | `pages`, `useOcr`, `ocrLanguage` |
| `PdfExtractor.GetKeyValues()` | `List<FormField>` | `doc` only; no `pages` parameter |
| `PdfExtractor.LlamaMarkdownReader()` | `PDFMarkdownReader` | none |
| `PDFMarkdownReader.LoadData()` | `List<LlamaDocument>` | `filePath`, `extraInfo` |
