Overview

PdfExtractor is the static entry point for all PDF4LLM operations. Import the namespace and call methods directly — no instantiation required.

using PDF4LLM;

string markdown = PdfExtractor.ToMarkdown("document.pdf");

All extraction methods are available in two overloads: one that accepts an open MuPDF.NET.Document, and one that accepts a file path string. The file path overload opens and disposes the document internally. Use the Document overload when calling multiple methods on the same file to avoid re-parsing.
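
As a rough sketch of the Document overload, the following opens a file once and runs two extractions against it. It assumes Document lives in the MuPDF.NET namespace (as the type names in this reference suggest) and that Close() releases the file, as in the GetKeyValues example further down; the file name is a placeholder.

```csharp
using System;
using MuPDF.NET;   // assumed namespace for Document
using PDF4LLM;

// Open once; both calls below reuse the same open document.
Document doc = new Document("document.pdf"); // placeholder path
try
{
    string markdown = PdfExtractor.ToMarkdown(doc);
    string text     = PdfExtractor.ToText(doc);
    Console.WriteLine($"Markdown: {markdown.Length} chars, text: {text.Length} chars");
}
finally
{
    // With the Document overloads, closing the document is the caller's job.
    doc.Close();
}
```

For a single extraction, the file path overload is shorter and handles opening and disposal itself.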

Properties

Version

public static string Version { get; }

Returns the package version as a dotted string.

Console.WriteLine(PdfExtractor.Version); // e.g. "1.27.2.3"

VersionTuple

public static (int major, int minor, int patch) VersionTuple { get; }

Returns the version as a named tuple of integers, useful for programmatic version checks.

var (major, minor, patch) = PdfExtractor.VersionTuple;
if (major < 2)
    Console.WriteLine("Consider upgrading PDF4LLM.");

Methods

ToMarkdown

Converts a PDF to a GitHub-compatible Markdown string. Headings, tables, bold/italic formatting, multi-column layouts, and images are all detected and rendered.

Signatures

public static string ToMarkdown(
    Document          doc,
    bool              header       = true,
    bool              footer       = true,
    List<int>?        pages        = null,
    bool              writeImages  = false,
    bool              embedImages  = false,
    string            imagePath    = "",
    string            imageFormat  = "png",
    string            filename     = "",
    bool              forceText    = true,
    bool              pageChunks   = false,
    bool              pageSeparators = false,
    int               dpi          = 150,
    int               ocrDpi       = 300,
    float             pageWidth    = 612,
    float?            pageHeight   = null,
    bool              ignoreCode   = false,
    bool              showProgress = false,
    bool              useOcr       = true,
    string            ocrLanguage  = "eng",
    bool              forceOcr     = false,
    OcrPageFunction?  ocrFunction  = null
);

public static string ToMarkdown(
    string            path,
    bool              header       = true,
    bool              footer       = true,
    List<int>?        pages        = null,
    bool              writeImages  = false,
    bool              embedImages  = false,
    string            imagePath    = "",
    string            imageFormat  = "png",
    string            filename     = "",
    bool              forceText    = true,
    bool              pageChunks   = false,
    bool              pageSeparators = false,
    int               dpi          = 150,
    int               ocrDpi       = 300,
    float             pageWidth    = 612,
    float?            pageHeight   = null,
    bool              ignoreCode   = false,
    bool              showProgress = false,
    bool              useOcr       = true,
    string            ocrLanguage  = "eng",
    bool              forceOcr     = false,
    OcrPageFunction?  ocrFunction  = null
);

Parameters

doc

MuPDF.NET.Document

required

An already-opened MuPDF.NET Document. Use this overload when calling multiple methods on the same file.

path

string

required

Path to a PDF file. Opens and disposes the document internally. Throws ArgumentException if null or whitespace.

header

bool

default:"true"

When true, repeating page header content is included in the output. Set to false to suppress headers that do not add extraction value — such as document titles repeated on every page.

footer

bool

default:"true"

When true, repeating page footer content is included in the output. Set to false to suppress footers such as page numbers and confidentiality notices.

pages

List<int>?

default:"null"

Zero-based page numbers to process. Pages are processed in the order supplied. null processes all pages in document order.

writeImages

bool

default:"false"

When true, images and vector graphic regions are extracted and saved to disk at imagePath. Markdown image references are inserted inline at the corresponding positions. Mutually exclusive with embedImages: passing true for both throws ArgumentException.

If your document contains text rendered over full-page background images, leave writeImages as false to ensure the text is extracted rather than captured as part of the image.

embedImages

bool

default:"false"

When true, images are encoded as Base64 data URIs and embedded directly in the Markdown output. No files are written to disk. Mutually exclusive with writeImages — passing both true throws ArgumentException. May significantly increase the size of the output string for image-heavy documents.

imagePath

string

default:"\"\""

Directory in which to save extracted images when writeImages is true. Defaults to the process working directory. The directory must exist — it will not be created automatically.
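
Because the image directory is not created automatically, one way to avoid a DirectoryNotFoundException is to create it before the call. The sketch below assumes a hypothetical folder name ("extracted-images") and placeholder file names; only the parameter names come from the signature above.

```csharp
using System.IO;
using PDF4LLM;

// Hypothetical output folder under the current working directory.
string imageDir = Path.Combine(Directory.GetCurrentDirectory(), "extracted-images");

// CreateDirectory does nothing if the folder already exists.
Directory.CreateDirectory(imageDir);

string markdown = PdfExtractor.ToMarkdown(
    "document.pdf",        // placeholder path
    writeImages: true,
    imagePath: imageDir,
    imageFormat: "png");

// Save the Markdown; its image references point at files in imageDir.
File.WriteAllText("document.md", markdown);
```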

imageFormat

string

default:"\"png\""

File format for extracted images, as a lowercase extension string. Common values: "png" (lossless, default), "jpg" (lossy, smaller), "webp", "tiff". All MuPDF-supported formats are accepted. Only relevant when writeImages or embedImages is true.

filename

string

default:"\"\""

Overrides the source filename used when naming extracted image files. Useful when the document is opened from a stream or memory buffer with no inherent filename. Defaults to doc.Name when empty.

forceText

bool

default:"true"

When true, extracts text from regions the layout engine classifies as pictures or image backgrounds, appending it after the image reference in the output. Useful for PDFs exported from presentation tools where text is layered over slide backgrounds.

pageChunks

bool

default:"false"

When true, the return value is a JSON string containing an array of per-page objects rather than a single concatenated Markdown string. Each object contains the page’s Markdown text and metadata. See Returns below.

pageSeparators

bool

default:"false"

When true, inserts a separator string between pages in the output. Intended for debugging — use pageChunks for production per-page output.

dpi

int

default:"150"

Resolution in dots per inch for rasterising images written to disk or embedded as Base64. Higher values increase image quality and file size. Only relevant when writeImages or embedImages is true.

ocrDpi

int

default:"300"

Resolution used when rendering a page to an intermediate image for OCR. Higher values may improve OCR accuracy but increase memory usage and processing time. The default of 300 is sufficient for most documents. Only relevant when useOcr or forceOcr is true.

pageWidth

float

default:"612"

Desired page width in points. Ignored for PDFs, which have fixed page dimensions. For reflowable documents (eBooks, plain text), this sets the virtual page width. Default assumes US Letter width (612 pt = 8.5 in).

pageHeight

float?

default:"null"

Desired page height in points for reflowable documents. When null, the document is treated as a single large page with no page separators.

ignoreCode

bool

default:"false"

When true, monospaced text is not given special formatting and no code blocks are generated. Useful for documents where monospaced fonts are used decoratively rather than for code.

showProgress

bool

default:"false"

When true, writes a per-page progress indicator to the console during processing.

useOcr

bool

default:"true"

When true, applies Tesseract OCR to pages that the layout engine determines would benefit from it — typically pages with little or no selectable text. Tesseract must be installed and on the PATH.

ocrLanguage

string

default:"\"eng\""

Tesseract language code. Use + to combine multiple languages: "eng+fra". The corresponding language data files must be installed. Only relevant when useOcr or forceOcr is true. See Tesseract Language Packs.

forceOcr

bool

default:"false"

When true, applies OCR to every page unconditionally, regardless of whether the page contains selectable text. Use for documents known to be entirely image-based. When false, OCR is applied only to pages the layout engine identifies as candidates.

ocrFunction

OcrPageFunction?

default:"null"

A custom OCR delegate to use in place of the built-in Tesseract engine. When null, Tesseract is used. See the OCR customisation documentation for the expected delegate signature.
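
To illustrate how the OCR parameters combine, here is a hedged sketch of two common configurations. The file names are placeholders, and the first call assumes Tesseract and both language packs are installed.

```csharp
using PDF4LLM;

// Scanned document with English and French text: OCR every page.
string scanned = PdfExtractor.ToMarkdown(
    "scanned.pdf",            // placeholder path
    forceOcr: true,
    ocrLanguage: "eng+fra",   // both language data files must be installed
    ocrDpi: 300);

// Document known to contain selectable text: leave OCR off.
string digital = PdfExtractor.ToMarkdown(
    "report.pdf",             // placeholder path
    useOcr: false);
```

With the defaults (useOcr true, forceOcr false), the layout engine decides per page whether OCR is applied.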

Returns

When pageChunks is false (default): a single string of GitHub-compatible Markdown covering all processed pages, with pages separated by a horizontal rule (---). When pageChunks is true: a JSON string containing an array of per-page objects. Each object has the following shape:

{
  "page":     0,
  "text":     "# Page heading\n\nBody text...",
  "metadata": {
    "file_path":   "document.pdf",
    "page_count":  12,
    "page_number": 1
  }
}
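
The per-page JSON can be read with System.Text.Json. The sketch below parses a hand-written sample in the shape shown above rather than real extractor output, so it runs without a PDF; in practice the string would come from ToMarkdown with pageChunks set to true. It uses a raw string literal, which needs C# 11 or later.

```csharp
using System;
using System.Text.Json;

// Hand-written sample in the documented shape (an array of per-page objects).
string json = """
[
  { "page": 0, "text": "# Title\n\nIntro", "metadata": { "file_path": "document.pdf", "page_count": 2, "page_number": 1 } },
  { "page": 1, "text": "More text", "metadata": { "file_path": "document.pdf", "page_count": 2, "page_number": 2 } }
]
""";

using JsonDocument chunks = JsonDocument.Parse(json);

foreach (JsonElement page in chunks.RootElement.EnumerateArray())
{
    // "page" is the zero-based index; metadata.page_number is one-based in the sample.
    int    index      = page.GetProperty("page").GetInt32();
    int    pageNumber = page.GetProperty("metadata").GetProperty("page_number").GetInt32();
    string text       = page.GetProperty("text").GetString() ?? "";

    Console.WriteLine($"index {index}, page {pageNumber}: {text.Length} characters");
}
```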

Exceptions

Exception	Condition
`ArgumentException`	`path` is null or whitespace, or both `writeImages` and `embedImages` are `true`.
`FileNotFoundException`	`path` does not exist.
`DirectoryNotFoundException`	`imagePath` directory does not exist when `writeImages` is `true`.
`MuPdfException`	The file is not a valid PDF, is password-protected, or is corrupted.
`TesseractNotFoundException`	`useOcr` or `forceOcr` is `true` but Tesseract is not on the `PATH`.
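
A minimal sketch of handling the standard .NET exceptions from the table follows. MuPdfException and TesseractNotFoundException are left out because this reference does not show their namespaces; they could be caught in the same way once the correct using directive is known. File and folder names are placeholders.

```csharp
using System;
using System.IO;
using PDF4LLM;

try
{
    string markdown = PdfExtractor.ToMarkdown(
        "document.pdf",       // placeholder path
        writeImages: true,
        imagePath: "images"); // must already exist
    Console.WriteLine(markdown);
}
catch (FileNotFoundException ex)
{
    Console.Error.WriteLine($"Input file not found: {ex.FileName}");
}
catch (DirectoryNotFoundException)
{
    Console.Error.WriteLine("Create the image directory before calling ToMarkdown.");
}
catch (ArgumentException ex)
{
    // Empty path, or writeImages and embedImages both true.
    Console.Error.WriteLine($"Invalid arguments: {ex.Message}");
}
```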

ToJson

Exports the complete layout structure of a PDF as a JSON string. Every block on every processed page — text, tables, and images — is included with its type, content, and bounding box.

Signatures

public static string ToJson(
    Document          doc,
    int               imageDpi     = 150,
    string            imageFormat  = "png",
    string            imagePath    = "",
    List<int>?        pages        = null,
    int               ocrDpi       = 300,
    bool              writeImages  = false,
    bool              embedImages  = false,
    bool              showProgress = false,
    bool              forceText    = true,
    bool              useOcr       = true,
    string            ocrLanguage  = "eng",
    bool              forceOcr     = false,
    OcrPageFunction?  ocrFunction  = null
);

public static string ToJson(
    string            path,
    int               imageDpi     = 150,
    string            imageFormat  = "png",
    string            imagePath    = "",
    List<int>?        pages        = null,
    int               ocrDpi       = 300,
    bool              writeImages  = false,
    bool              embedImages  = false,
    bool              showProgress = false,
    bool              forceText    = true,
    bool              useOcr       = true,
    string            ocrLanguage  = "eng",
    bool              forceOcr     = false,
    OcrPageFunction?  ocrFunction  = null
);

Parameters

doc / path

MuPDF.NET.Document | string

required

An open Document or a path to a PDF file. The path overload opens and disposes the document internally.

imageDpi

int

default:"150"

Resolution for rasterised images written to disk or embedded as Base64. Only relevant when writeImages or embedImages is true.

imageFormat

string

default:"\"png\""

Format for extracted images. Accepts any MuPDF-supported extension: "png", "jpg", "webp", "tiff". Only relevant when writeImages or embedImages is true.

imagePath

string

default:"\"\""

Directory for saved images when writeImages is true. Must exist before calling ToJson.

pages

List<int>?

default:"null"

Zero-based page numbers to process. null processes all pages in document order.

ocrDpi

int

default:"300"

Resolution for OCR page rendering. Only relevant when useOcr or forceOcr is true.

writeImages

bool

default:"false"

When true, saves image and graphic regions to disk at imagePath.

embedImages

bool

default:"false"

When true, encodes image regions as Base64 and includes them in the JSON output.

showProgress

bool

default:"false"

When true, writes a per-page progress indicator to the console.

forceText

bool

default:"true"

When true, extracts text from regions classified as picture areas by the layout engine.

useOcr

bool

default:"true"

When true, applies OCR to pages the layout engine identifies as candidates.

ocrLanguage

string

default:"\"eng\""

Tesseract language code for OCR. Use + to combine multiple languages.

forceOcr

bool

default:"false"

When true, applies OCR to every page unconditionally.

ocrFunction

OcrPageFunction?

default:"null"

Custom OCR delegate. When null, Tesseract is used.

Returns

string — A JSON array, one object per processed page. See JSON schema for the full schema.
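
Since the result is a JSON array with one element per processed page, a quick sanity check is to count its elements. This sketch relies only on that documented fact and does not assume any field names from the full schema; the file name is a placeholder.

```csharp
using System;
using System.Text.Json;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf"); // placeholder path

using JsonDocument layout = JsonDocument.Parse(json);

// One array element per processed page.
Console.WriteLine($"Pages in output: {layout.RootElement.GetArrayLength()}");
```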

Exceptions

Exception	Condition
`ArgumentException`	`path` is null or whitespace.
`NotSupportedException`	`PdfExtractor.UseLayout` is `false`. `ToJson` requires layout mode.
`FileNotFoundException`	`path` does not exist.
`MuPdfException`	The file is not a valid or readable PDF.
`TesseractNotFoundException`	`useOcr` or `forceOcr` is `true` but Tesseract is not on the `PATH`.

ToText

Converts a PDF to plain text using the same layout analysis pipeline as ToMarkdown, without any Markdown syntax. Tables can be rendered in multiple formats.

Signatures

public static string ToText(
    Document          doc,
    string            filename         = "",
    bool              header           = true,
    bool              footer           = true,
    List<int>?        pages            = null,
    bool              ignoreCode       = false,
    bool              showProgress     = false,
    bool              forceText        = true,
    int               ocrDpi           = 300,
    bool              useOcr           = true,
    string            ocrLanguage      = "eng",
    string            tableFormat      = "grid",
    bool              pageChunks       = false,
    int               tableMaxWidth    = 100,
    int               tableMinColWidth = 10,
    bool              forceOcr         = false,
    OcrPageFunction?  ocrFunction      = null
);

public static string ToText(
    string            path,
    string            filename         = "",
    bool              header           = true,
    bool              footer           = true,
    List<int>?        pages            = null,
    bool              ignoreCode       = false,
    bool              showProgress     = false,
    bool              forceText        = true,
    int               ocrDpi           = 300,
    bool              useOcr           = true,
    string            ocrLanguage      = "eng",
    string            tableFormat      = "grid",
    bool              pageChunks       = false,
    int               tableMaxWidth    = 100,
    int               tableMinColWidth = 10,
    bool              forceOcr         = false,
    OcrPageFunction?  ocrFunction      = null
);

Parameters

doc / path

MuPDF.NET.Document | string

required

An open Document or a path to a PDF file. The path overload opens and disposes the document internally.

filename

string

default:"\"\""

Overrides the source filename. Useful when the document has no inherent filename.

header

bool

default:"true"

When true, includes repeating page header content. Set to false to suppress headers.

footer

bool

default:"true"

When true, includes repeating page footer content. Set to false to suppress footers.

pages

List<int>?

default:"null"

Zero-based page numbers to process. null processes all pages.

ignoreCode

bool

default:"false"

When true, monospaced text is not formatted as code blocks.

showProgress

bool

default:"false"

When true, writes a per-page progress indicator to the console.

forceText

bool

default:"true"

When true, extracts text from regions classified as picture areas by the layout engine.

ocrDpi

int

default:"300"

Resolution for OCR page rendering. Only relevant when useOcr or forceOcr is true.

useOcr

bool

default:"true"

When true, applies OCR to pages the layout engine identifies as candidates.

ocrLanguage

string

default:"\"eng\""

Tesseract language code for OCR. Use + to combine multiple languages.

tableFormat

string

default:"\"grid\""

Controls how detected tables are rendered in plain text output.

Value	Description
`"grid"`	ASCII grid with borders on all sides. Default.
`"plain"`	Whitespace-aligned columns, no borders.
`"pipe"`	Markdown-style pipe table (same as `ToMarkdown` output).

pageChunks

bool

default:"false"

When true, returns a JSON string containing an array of per-page objects rather than a single concatenated string.

tableMaxWidth

int

default:"100"

Maximum character width of the rendered table. Columns are wrapped or truncated to fit within this limit. Only relevant when tableFormat is "grid" or "plain".

tableMinColWidth

int

default:"10"

Minimum character width of any single table column. Prevents columns from collapsing to unusably narrow widths.
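
The three table parameters are typically set together. As an illustrative sketch (placeholder file name, arbitrary widths), this renders tables as borderless aligned columns that fit an 80-character terminal:

```csharp
using System;
using PDF4LLM;

string text = PdfExtractor.ToText(
    "invoice.pdf",          // placeholder path
    tableFormat: "plain",   // whitespace-aligned columns, no borders
    tableMaxWidth: 80,      // keep each table within 80 characters
    tableMinColWidth: 8);   // but never squeeze a column below 8

Console.WriteLine(text);
```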

forceOcr

bool

default:"false"

When true, applies OCR to every page unconditionally.

ocrFunction

OcrPageFunction?

default:"null"

Custom OCR delegate. When null, Tesseract is used.

Returns

When pageChunks is false (default): a single string of plain text covering all processed pages. When pageChunks is true: a JSON string containing an array of per-page plain text objects.

Exceptions

Exception	Condition
`ArgumentException`	`path` is null or whitespace.
`NotSupportedException`	`PdfExtractor.UseLayout` is `false`. `ToText` requires layout mode.
`FileNotFoundException`	`path` does not exist.
`MuPdfException`	The file is not a valid or readable PDF.
`TesseractNotFoundException`	`useOcr` or `forceOcr` is `true` but Tesseract is not on the `PATH`.

ParseDocument

Parses a PDF and returns a typed ParsedDocument object — the full layout structure as a .NET object graph rather than a serialised string. The in-process equivalent of ToJson.

Signatures

public static ParsedDocument ParseDocument(
    Document          doc,
    string            filename     = "",
    int               imageDpi     = 150,
    string            imageFormat  = "png",
    string            imagePath    = "",
    int               ocrDpi       = 300,
    List<int>?        pages        = null,
    bool              writeImages  = false,
    bool              embedImages  = false,
    bool              showProgress = false,
    bool              forceText    = true,
    bool              useOcr       = true,
    string            ocrLanguage  = "eng",
    bool              forceOcr     = false,
    bool              keepOcrText  = false,
    OcrPageFunction?  ocrFunction  = null
);

ParseDocument does not have a file path overload — it always requires an open Document. Open the document with new Document(path) and close it when done.
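
A hedged sketch of that open, parse, close pattern is shown below. It assumes Document is in the MuPDF.NET namespace and uses var for the result because the namespace of ParsedDocument is not stated here; the file name and page list are placeholders.

```csharp
using System.Collections.Generic;
using MuPDF.NET;   // assumed namespace for Document
using PDF4LLM;

Document doc = new Document("document.pdf"); // placeholder path
try
{
    // Parse only the first two pages (zero-based).
    var parsed = PdfExtractor.ParseDocument(doc, pages: new List<int> { 0, 1 });

    // Work with parsed here, while the document is still open.
}
finally
{
    // Runs even if ParseDocument throws.
    doc.Close();
}
```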

Parameters

All parameters are identical to ToJson, with two additions: filename, which overrides the source filename as described under ToMarkdown, and keepOcrText:

keepOcrText

bool

default:"false"

When true, preserves the raw Tesseract OCR output text alongside the layout analysis results in the returned ParsedDocument. Useful for debugging OCR quality on specific pages.

All other parameters (imageDpi, imageFormat, imagePath, ocrDpi, pages, writeImages, embedImages, showProgress, forceText, useOcr, ocrLanguage, forceOcr, ocrFunction) behave identically to their counterparts in ToJson.

Returns

ParsedDocument — A typed .NET object containing one ParsedPage per processed page.

Exceptions

Exception	Condition
`NotSupportedException`	`PdfExtractor.UseLayout` is `false`. `ParseDocument` requires layout mode.
`MuPdfException`	The document is not a valid or readable PDF.
`TesseractNotFoundException`	`useOcr` or `forceOcr` is `true` but Tesseract is not on the `PATH`.

LlamaMarkdownReader

Returns a PDFMarkdownReader instance that provides a LlamaIndex-compatible document loading interface. Each page of a loaded PDF is returned as a separate LlamaDocument with Markdown text and a metadata dictionary.

Signature

public static PDFMarkdownReader LlamaMarkdownReader(
    Func<Dictionary<string, object>, Dictionary<string, object>>? metaFilter = null
);

Parameters

metaFilter

Func<Dictionary<string, object>, Dictionary<string, object>>?

default:"null"

An optional delegate that receives each page’s metadata dictionary before it is attached to the LlamaDocument and returns a (potentially modified) replacement. Use to add, rename, or remove metadata fields for all pages in a single place.

var reader = PdfExtractor.LlamaMarkdownReader(metaFilter: meta =>
{
    meta["source"] = "annual-report-2024";
    meta.Remove("file_path"); // strip internal path
    return meta;
});
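
Because metaFilter is an ordinary Func, it can be written and checked on its own before being handed to the reader. The sketch below applies a filter to a hand-made dictionary; the sample keys and values are illustrative, not real reader output. Unlike the example above, this variant returns a copy instead of mutating its input, which is a design choice rather than a requirement of the API.

```csharp
using System;
using System.Collections.Generic;

Func<Dictionary<string, object>, Dictionary<string, object>> metaFilter = meta =>
{
    // Copy first so the dictionary passed in is left untouched.
    var result = new Dictionary<string, object>(meta);
    result["source"] = "annual-report-2024";
    result.Remove("file_path"); // strip internal path
    return result;
};

// Illustrative metadata; real keys depend on what the reader attaches.
var sample = new Dictionary<string, object>
{
    ["file_path"]   = "/tmp/report.pdf",
    ["page_number"] = 1
};

var filtered = metaFilter(sample);
Console.WriteLine(string.Join(", ", filtered.Keys));
```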

Returns

PDF4LLM.Llama.PDFMarkdownReader — Call .LoadData() on the result to load a PDF.

GetKeyValues

Extracts all interactive AcroForm field names, values, and page locations from a PDF. Returns an empty dictionary for PDFs that do not contain form fields.

Signatures

public static Dictionary<string, Dictionary<string, object>> GetKeyValues(
    Document doc,
    bool     xrefs = false
);

public static Dictionary<string, Dictionary<string, object>> GetKeyValues(
    Document                    doc,
    bool                        xrefs,
    IReadOnlyCollection<string> ignoredKeywordArgumentNames
);

Parameters

doc

MuPDF.NET.Document

required

An open MuPDF.NET Document. GetKeyValues does not have a file path overload.

xrefs

bool

default:"false"

When true, includes PDF cross-reference numbers (xref) for each field in the returned metadata. Useful for low-level PDF manipulation workflows that need to locate field objects by their internal identifier.

ignoredKeywordArgumentNames

IReadOnlyCollection<string>?

default:"null"

When provided, any names in this collection are logged to the console as a warning. This overload exists for compatibility with callers that pass unsupported keyword arguments — it delegates to the two-parameter overload after logging the warning.

Returns

Dictionary<string, Dictionary<string, object>> — A dictionary keyed by field name. Each value is a nested dictionary of field properties. Returns an empty dictionary if the document contains no AcroForm fields (doc.IsFormPDF == 0). The nested dictionary for each field typically contains:

Key	Type	Description
`"value"`	`string`	The field’s current value. Empty string if unfilled.
`"page"`	`int`	Zero-based page number where the field appears.
`"xref"`	`int`	PDF cross-reference number. Only present when `xrefs: true`.

Exceptions

Exception	Condition
`MuPdfException`	The document is not a valid or readable PDF.

Example

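using System;
using MuPDF.NET;
using PDF4LLM;

Document doc = new Document("application.pdf");
var fields = PdfExtractor.GetKeyValues(doc);
doc.Close(); // the returned dictionary is used after the document is closed

foreach (var (name, props) in fields)
{
    string value = props["value"]?.ToString() ?? "";
    int    page  = Convert.ToInt32(props["page"]); // also works if the boxed value is not exactly int
    Console.WriteLine($"[Page {page}] {name}: {value}");
}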

Parameters quick reference

Parameter	`ToMarkdown`	`ToJson`	`ToText`	`ParseDocument`
`doc` / `path`	✓	✓	✓	`doc` only
`pages`	✓	✓	✓	✓
`header`	✓	✗	✓	✗
`footer`	✓	✗	✓	✗
`writeImages`	✓	✓	✗	✓
`embedImages`	✓	✓	✗	✓
`imagePath`	✓	✓	✗	✓
`imageFormat`	✓	✓	✗	✓
`dpi` / `imageDpi`	✓	✓	✗	✓
`ocrDpi`	✓	✓	✓	✓
`useOcr`	✓	✓	✓	✓
`forceOcr`	✓	✓	✓	✓
`ocrLanguage`	✓	✓	✓	✓
`ocrFunction`	✓	✓	✓	✓
`forceText`	✓	✓	✓	✓
`showProgress`	✓	✓	✓	✓
`filename`	✓	✗	✓	✓
`pageChunks`	✓	✗	✓	✗
`pageSeparators`	✓	✗	✗	✗
`ignoreCode`	✓	✗	✓	✗
`pageWidth`	✓	✗	✗	✗
`pageHeight`	✓	✗	✗	✗
`tableFormat`	✗	✗	✓	✗
`tableMaxWidth`	✗	✗	✓	✗
`tableMinColWidth`	✗	✗	✓	✗
`keepOcrText`	✗	✗	✗	✓
