Extraction methods

The three primary extraction methods share a common interface — they all accept a file path string or an open MuPDF.NET.Document, support the pages parameter for partial extraction, and return a string you can write directly to disk or pass downstream.

ToMarkdown()

Extract content as a GitHub-compatible Markdown string. The primary method for LLM ingestion and RAG pipelines. Supports image extraction, OCR, and per-page output via LlamaMarkdownReader.

ToJson()

Extract content as structured JSON with bounding boxes and layout data for every block on the page. Use for custom pipelines, positional filtering, and debugging extraction output.

ToText()

Extract content as plain text, stripped of all Markdown syntax. Use for search indexing, NLP pipelines, and systems that would otherwise display Markdown syntax as literal characters.

Layout and structure methods

ParseDocument()

Analyse the visual layout of a document and return a typed ParsedDocument object containing pages, text blocks, tables, and image regions, each with bounding boxes and reading order. The in-process equivalent of ToJson(): the same layout data as typed objects rather than a serialized JSON string.

GetKeyValues()

Extract all interactive AcroForm field names, values, and page locations from a PDF. Use for structured data extraction from filled-in forms.

Reader types

PDFMarkdownReader

A LlamaIndex-compatible document reader. Created via PdfExtractor.LlamaMarkdownReader(). Loads a PDF and returns one LlamaDocument per page, each with Markdown text and metadata including page number and source file path.

Return types

ParsedDocument

Typed .NET object returned by ParseDocument(). Contains a list of ParsedPage objects, each with its blocks, tables, images, and dimensions.

FormField

Represents a single AcroForm field returned by GetKeyValues(). Exposes Name, Value, and Page properties.
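
As an illustration of what can be done with a list of form fields, the sketch below keeps only the filled-in fields and groups them by page. It does not call the library. The FormField record is a hypothetical stand-in declared locally, and the string type for Value, the int type for Page, and the sample data are all assumptions, since only the three property names are documented here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample data standing in for the list returned by GetKeyValues().
// Field names and values are invented for illustration.
var fields = new List<FormField>
{
    new FormField("applicant_name", "Ada Lovelace", 1),
    new FormField("applicant_email", "", 1),
    new FormField("signature_date", "2024-01-15", 2),
};

// Drop empty fields, then group the rest by the page they appear on.
var filledByPage = fields
    .Where(f => !string.IsNullOrWhiteSpace(f.Value))
    .GroupBy(f => f.Page)
    .OrderBy(g => g.Key);

foreach (var page in filledByPage)
{
    Console.WriteLine($"Page {page.Key}:");
    foreach (var field in page)
    {
        Console.WriteLine($"  {field.Name} = {field.Value}");
    }
}

// Hypothetical stand-in for the library's FormField type.
// Only the property names (Name, Value, Page) come from the documentation;
// the property types are assumed.
record FormField(string Name, string Value, int Page);
```

With the real type, the same LINQ query should apply unchanged provided Value is a string and Page is comparable. Adjust the filter if the actual property types differ.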

Quick reference

| Method / Type | Returns | Key parameters |
| --- | --- | --- |
| PdfExtractor.ToMarkdown() | string | pages, writeImages, embedImages, useOcr, ocrLanguage, forceText |
| PdfExtractor.ToJson() | string (JSON) | pages, showProgress |
| PdfExtractor.ToText() | string | pages, useOcr, ocrLanguage, forceText |
| PdfExtractor.ParseDocument() | ParsedDocument | pages, useOcr, ocrLanguage |
| PdfExtractor.GetKeyValues() | List<FormField> | doc only (no pages parameter) |
| PdfExtractor.LlamaMarkdownReader() | PDFMarkdownReader | |
| PDFMarkdownReader.LoadData() | List<LlamaDocument> | filePath, extraInfo |