Overview
PdfExtractor is the static entry point for all PDF4LLM operations. Import the namespace and call methods directly — no instantiation required.
Most methods have two overloads: one that accepts an open MuPDF.NET.Document, and one that accepts a file path string. The file path overload opens and disposes the document internally. Use the Document overload when calling multiple methods on the same file to avoid re-parsing.
Properties
Version
VersionTuple
Methods
ToMarkdown
Converts a PDF to a GitHub-compatible Markdown string. Headings, tables, bold/italic formatting, multi-column layouts, and images are all detected and rendered.
Signatures
Parameters
doc: An already-opened MuPDF.NET Document. Use this overload when calling multiple methods on the same file.
path: Path to a PDF file. Opens and disposes the document internally. Throws ArgumentException if null or whitespace.
header: When true, repeating page header content is included in the output. Set to false to suppress headers that do not add extraction value, such as document titles repeated on every page.
footer: When true, repeating page footer content is included in the output. Set to false to suppress footers such as page numbers and confidentiality notices.
pages: Zero-based page numbers to process. Pages are processed in the order supplied. null processes all pages in document order.
writeImages: When true, images and vector graphic regions are extracted and saved to disk at imagePath. Markdown image references are inserted inline at the corresponding positions. Mutually exclusive with embedImages: passing both as true throws ArgumentException.
embedImages: When true, images are encoded as Base64 data URIs and embedded directly in the Markdown output. No files are written to disk. Mutually exclusive with writeImages: passing both as true throws ArgumentException. May significantly increase the size of the output string for image-heavy documents.
imagePath: Directory in which to save extracted images when writeImages is true. Defaults to the process working directory. The directory must exist; it will not be created automatically.
imageFormat: File format for extracted images, as a lowercase extension string. Common values: "png" (lossless, default), "jpg" (lossy, smaller), "webp", "tiff". All MuPDF-supported formats are accepted. Only relevant when writeImages or embedImages is true.
filename: Overrides the source filename used when naming extracted image files. Useful when the document is opened from a stream or memory buffer with no inherent filename. Defaults to doc.Name when empty.
forceText: When true, extracts text from regions the layout engine classifies as pictures or image backgrounds, appending it after the image reference in the output. Useful for PDFs exported from presentation tools where text is layered over slide backgrounds.
pageChunks: When true, the return value is a JSON string containing an array of per-page objects rather than a single concatenated Markdown string. Each object contains the page's Markdown text and metadata. See Returns below.
pageSeparators: When true, inserts a separator string between pages in the output. Intended for debugging; use pageChunks for production per-page output.
dpi: Resolution in dots per inch for rasterising images written to disk or embedded as Base64. Higher values increase image quality and file size. Only relevant when writeImages or embedImages is true.
ocrDpi: Resolution used when rendering a page to an intermediate image for OCR. Higher values may improve OCR accuracy but increase memory usage and processing time. The default of 300 is sufficient for most documents. Only relevant when useOcr or forceOcr is true.
pageWidth: Desired page width in points. Ignored for PDFs, which have fixed page dimensions. For reflowable documents (eBooks, plain text), this sets the virtual page width. Default assumes US Letter width (612 pt = 8.5 in).
pageHeight: Desired page height in points for reflowable documents. When null, the document is treated as a single large page with no page separators.
ignoreCode: When true, monospaced text is not given special formatting and no code blocks are generated. Useful for documents where monospaced fonts are used decoratively rather than for code.
showProgress: When true, writes a per-page progress indicator to the console during processing.
useOcr: When true, applies Tesseract OCR to pages that the layout engine determines would benefit from it, typically pages with little or no selectable text. Tesseract must be installed and on the PATH.
ocrLanguage: Tesseract language code. Use + to combine multiple languages: "eng+fra". The corresponding language data files must be installed. Only relevant when useOcr or forceOcr is true. See Tesseract Language Packs.
forceOcr: When true, applies OCR to every page unconditionally, regardless of whether the page contains selectable text. Use for documents known to be entirely image-based. When false, OCR is applied only to pages the layout engine identifies as candidates.
ocrFunction: A custom OCR delegate to use in place of the built-in Tesseract engine. When null, Tesseract is used. See the OCR customisation documentation for the expected delegate signature.
Returns
When pageChunks is false (default): a single string of GitHub-compatible Markdown covering all processed pages, with pages separated by a horizontal rule (---).
When pageChunks is true: a JSON string containing an array of per-page objects. Each object has the following shape:
Exceptions
| Exception | Condition |
|---|---|
| ArgumentException | path is null or whitespace, or both writeImages and embedImages are true. |
| FileNotFoundException | path does not exist. |
| DirectoryNotFoundException | imagePath directory does not exist when writeImages is true. |
| MuPdfException | The file is not a valid PDF, is password-protected, or is corrupted. |
| TesseractNotFoundException | useOcr or forceOcr is true but Tesseract is not on the PATH. |
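
The image-related preconditions listed for ToMarkdown (writeImages and embedImages cannot both be true, and the imagePath directory must already exist) can be checked before a conversion is attempted. The following is a minimal sketch only: ImageOptionGuard and Validate are hypothetical helper names, not part of PDF4LLM, and the code does not call the library.

```csharp
using System;
using System.IO;

// Hypothetical pre-flight helper; not part of PDF4LLM.
public static class ImageOptionGuard
{
    // Mirrors the documented argument rules for writeImages, embedImages and imagePath.
    public static void Validate(bool writeImages, bool embedImages, string imagePath = "")
    {
        if (writeImages && embedImages)
            throw new ArgumentException("writeImages and embedImages are mutually exclusive.");

        if (writeImages)
        {
            // The documented default for imagePath is the process working directory.
            string dir = string.IsNullOrEmpty(imagePath)
                ? Directory.GetCurrentDirectory()
                : imagePath;

            // The directory is not created automatically, so fail early if it is missing.
            if (!Directory.Exists(dir))
                throw new DirectoryNotFoundException($"Image directory does not exist: {dir}");
        }
    }
}
```

Calling ImageOptionGuard.Validate(writeImages: true, embedImages: false, imagePath: "images") before the conversion surfaces a missing directory up front, without opening the PDF.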
ToJson
Exports the complete layout structure of a PDF as a JSON string. Every block on every processed page (text, tables, and images) is included with its type, content, and bounding box.
Signatures
Parameters
doc / path: An open Document or a path to a PDF file. The path overload opens and disposes the document internally.
imageDpi: Resolution for rasterised images written to disk or embedded as Base64. Only relevant when writeImages or embedImages is true.
imageFormat: Format for extracted images. Accepts any MuPDF-supported extension, such as "png", "jpg", "webp", "tiff". Only relevant when writeImages or embedImages is true.
imagePath: Directory for saved images when writeImages is true. Must exist before calling ToJson.
pages: Zero-based page numbers to process. null processes all pages in document order.
ocrDpi: Resolution for OCR page rendering. Only relevant when useOcr or forceOcr is true.
writeImages: When true, saves image and graphic regions to disk at imagePath.
embedImages: When true, encodes image regions as Base64 and includes them in the JSON output.
showProgress: When true, writes a per-page progress indicator to the console.
forceText: When true, extracts text from regions classified as picture areas by the layout engine.
useOcr: When true, applies OCR to pages the layout engine identifies as candidates.
ocrLanguage: Tesseract language code for OCR. Use + to combine multiple languages.
forceOcr: When true, applies OCR to every page unconditionally.
ocrFunction: Custom OCR delegate. When null, Tesseract is used.
Returns
string — A JSON array, one object per processed page. See JSON schema for the full schema.
Exceptions
| Exception | Condition |
|---|---|
| ArgumentException | path is null or whitespace. |
| NotSupportedException | PdfExtractor.UseLayout is false. ToJson requires layout mode. |
| FileNotFoundException | path does not exist. |
| MuPdfException | The file is not a valid or readable PDF. |
| TesseractNotFoundException | useOcr or forceOcr is true but Tesseract is not on the PATH. |
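
Because ToJson returns a JSON array with one object per processed page, the result can be inspected with System.Text.Json without relying on the full schema. The sketch below is illustrative: LayoutJson is a hypothetical helper name, and it should be fed a real ToJson result in practice; nothing here assumes particular field names inside the page objects.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper for inspecting ToJson-style output; not part of PDF4LLM.
public static class LayoutJson
{
    // Returns the number of page objects in a JSON array such as the one ToJson is documented to return.
    public static int CountPages(string json)
    {
        using JsonDocument parsed = JsonDocument.Parse(json);

        if (parsed.RootElement.ValueKind != JsonValueKind.Array)
            throw new FormatException("Expected a JSON array of page objects.");

        return parsed.RootElement.GetArrayLength();
    }
}
```

Comparing the count against the number of entries passed in pages is a cheap sanity check that every requested page was processed.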
ToText
Converts a PDF to plain text using the same layout analysis pipeline as ToMarkdown, without any Markdown syntax. Tables can be rendered in multiple formats.
Signatures
Parameters
doc / path: An open Document or a path to a PDF file. The path overload opens and disposes the document internally.
filename: Overrides the source filename. Useful when the document has no inherent filename.
header: When true, includes repeating page header content. Set to false to suppress headers.
footer: When true, includes repeating page footer content. Set to false to suppress footers.
pages: Zero-based page numbers to process. null processes all pages.
ignoreCode: When true, monospaced text is not formatted as code blocks.
showProgress: When true, writes a per-page progress indicator to the console.
forceText: When true, extracts text from regions classified as picture areas by the layout engine.
ocrDpi: Resolution for OCR page rendering. Only relevant when useOcr or forceOcr is true.
useOcr: When true, applies OCR to pages the layout engine identifies as candidates.
ocrLanguage: Tesseract language code for OCR. Use + to combine multiple languages.
tableFormat: Controls how detected tables are rendered in plain text output.
| Value | Description |
|---|---|
| "grid" | ASCII grid with borders on all sides. Default. |
| "plain" | Whitespace-aligned columns, no borders. |
| "pipe" | Markdown-style pipe table (same as ToMarkdown output). |
pageChunks: When true, returns a JSON string containing an array of per-page objects rather than a single concatenated string.
tableMaxWidth: Maximum character width of the rendered table. Columns are wrapped or truncated to fit within this limit. Only relevant when tableFormat is "grid" or "plain".
tableMinColWidth: Minimum character width of any single table column. Prevents columns from collapsing to unusably narrow widths.
forceOcr: When true, applies OCR to every page unconditionally.
ocrFunction: Custom OCR delegate. When null, Tesseract is used.
Returns
When pageChunks is false (default): a single string of plain text covering all processed pages.
When pageChunks is true: a JSON string containing an array of per-page plain text objects.
Exceptions
| Exception | Condition |
|---|---|
| ArgumentException | path is null or whitespace. |
| NotSupportedException | PdfExtractor.UseLayout is false. ToText requires layout mode. |
| FileNotFoundException | path does not exist. |
| MuPdfException | The file is not a valid or readable PDF. |
| TesseractNotFoundException | useOcr or forceOcr is true but Tesseract is not on the PATH. |
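
The OCR language value shared by ToMarkdown, ToJson, ToText, and ParseDocument is a set of Tesseract codes joined with +. Where the languages come from configuration, a small helper can build that string. This is a sketch under stated assumptions: OcrLanguages.Combine is a hypothetical name, not a PDF4LLM API, and it only formats the string; it does not check that the language data files are installed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; not part of PDF4LLM.
public static class OcrLanguages
{
    // Joins Tesseract language codes with '+', e.g. for an ocrLanguage argument.
    public static string Combine(IEnumerable<string> codes)
    {
        // Trim whitespace and drop empty entries before joining.
        List<string> cleaned = codes
            .Select(code => code.Trim())
            .Where(code => code.Length > 0)
            .ToList();

        if (cleaned.Count == 0)
            throw new ArgumentException("At least one language code is required.", nameof(codes));

        // A '+' inside a single code would silently change the meaning of the result.
        if (cleaned.Any(code => code.Contains('+')))
            throw new ArgumentException("Individual language codes must not contain '+'.", nameof(codes));

        return string.Join("+", cleaned);
    }
}
```

For instance, OcrLanguages.Combine(new[] { "eng", "fra" }) produces "eng+fra", the form shown in the ToMarkdown parameter description.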
ParseDocument
Parses a PDF and returns a typed ParsedDocument object: the full layout structure as a .NET object graph rather than a serialised string. It is the in-process equivalent of ToJson.
Signatures
Parameters
All parameters are identical to ToJson, with two additions:
filename: Overrides the source filename, as in ToText.
keepOcrText: When true, preserves the raw Tesseract OCR output text alongside the layout analysis results in the returned ParsedDocument. Useful for debugging OCR quality on specific pages.
All other parameters (imageDpi, imageFormat, imagePath, ocrDpi, pages, writeImages, embedImages, showProgress, forceText, useOcr, ocrLanguage, forceOcr, ocrFunction) behave identically to their counterparts in ToJson.
Returns
ParsedDocument — A typed .NET object containing one ParsedPage per processed page.
Exceptions
| Exception | Condition |
|---|---|
| NotSupportedException | PdfExtractor.UseLayout is false. ParseDocument requires layout mode. |
| MuPdfException | The document is not a valid or readable PDF. |
| TesseractNotFoundException | useOcr or forceOcr is true but Tesseract is not on the PATH. |
LlamaMarkdownReader
Returns a PDFMarkdownReader instance that provides a LlamaIndex-compatible document loading interface. Each page of a loaded PDF is returned as a separate LlamaDocument with Markdown text and a metadata dictionary.
Signature
Parameters
An optional delegate that receives each page's metadata dictionary before it is attached to the LlamaDocument and returns a (potentially modified) replacement. Use to add, rename, or remove metadata fields for all pages in a single place.
Returns
PDF4LLM.Llama.PDFMarkdownReader — Call .LoadData() on the result to load a PDF.
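
A metadata delegate of the kind described under Parameters might look like the sketch below. The delegate type Func<Dictionary<string, object>, Dictionary<string, object>> is an assumption, since the actual parameter type is not shown here, and the "source" and "internal_id" keys are purely illustrative. MetadataFilters is a hypothetical container class.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical example filters; the delegate shape is assumed, not taken from PDF4LLM.
public static class MetadataFilters
{
    // Receives one page's metadata and returns the dictionary to attach instead.
    public static readonly Func<Dictionary<string, object>, Dictionary<string, object>> TagSource =
        metadata =>
        {
            // Work on a copy so the dictionary passed in is left untouched.
            var result = new Dictionary<string, object>(metadata);

            result["source"] = "pdf4llm";   // add (or overwrite) a field
            result.Remove("internal_id");   // drop a field if it is present

            return result;
        };
}
```

Returning a copy rather than mutating the input keeps the filter free of side effects, which matters if the same dictionary instance is reused across pages.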
GetKeyValues
Extracts all interactive AcroForm field names, values, and page locations from a PDF. Returns an empty dictionary for PDFs that do not contain form fields.
Signatures
Parameters
doc: An open MuPDF.NET Document. GetKeyValues does not have a file path overload.
xrefs: When true, includes PDF cross-reference numbers (xref) for each field in the returned metadata. Useful for low-level PDF manipulation workflows that need to locate field objects by their internal identifier.
When a collection of unsupported argument names is provided, any names in it are logged to the console as a warning. This overload exists for compatibility with callers that pass unsupported keyword arguments; it delegates to the two-parameter overload after logging the warning.
Returns
Dictionary<string, Dictionary<string, object>> — A dictionary keyed by field name. Each value is a nested dictionary of field properties.
Returns an empty dictionary if the document contains no AcroForm fields (doc.IsFormPDF == 0).
The nested dictionary for each field typically contains:
| Key | Type | Description |
|---|---|---|
| "value" | string | The field's current value. Empty string if unfilled. |
| "page" | int | Zero-based page number where the field appears. |
| "xref" | int | PDF cross-reference number. Only present when xrefs: true. |
Exceptions
| Exception | Condition |
|---|---|
| MuPdfException | The document is not a valid or readable PDF. |
Example
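
A minimal sketch of reading a result in the documented shape, Dictionary<string, Dictionary<string, object>>. It takes the dictionary as an argument rather than calling GetKeyValues itself, so it runs without a PDF; FormFieldReport is a hypothetical helper name, and only the "value", "page", and "xref" keys described above are read.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for summarising a GetKeyValues-style result; not part of PDF4LLM.
public static class FormFieldReport
{
    // Produces one human-readable line per form field.
    public static List<string> Describe(Dictionary<string, Dictionary<string, object>> fields)
    {
        var lines = new List<string>();

        foreach (KeyValuePair<string, Dictionary<string, object>> entry in fields)
        {
            Dictionary<string, object> props = entry.Value;

            // "value" is documented as an empty string when the field is unfilled.
            string value = props.TryGetValue("value", out var rawValue)
                ? Convert.ToString(rawValue) ?? ""
                : "";

            string location = props.TryGetValue("page", out var rawPage)
                ? $"page {rawPage}"
                : "page unknown";

            // "xref" is only present when cross-reference numbers were requested.
            if (props.TryGetValue("xref", out var rawXref))
                location += $", xref {rawXref}";

            lines.Add($"{entry.Key} = \"{value}\" ({location})");
        }

        return lines;
    }
}
```

An empty input dictionary, which is what the method returns for PDFs without AcroForm fields, yields an empty list.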
Parameters quick reference
| Parameter | ToMarkdown | ToJson | ToText | ParseDocument |
|---|---|---|---|---|
| doc / path | ✓ | ✓ | ✓ | doc only |
| pages | ✓ | ✓ | ✓ | ✓ |
| header | ✓ | ✗ | ✓ | ✗ |
| footer | ✓ | ✗ | ✓ | ✗ |
| writeImages | ✓ | ✓ | ✗ | ✓ |
| embedImages | ✓ | ✓ | ✗ | ✓ |
| imagePath | ✓ | ✓ | ✗ | ✓ |
| imageFormat | ✓ | ✓ | ✗ | ✓ |
| dpi / imageDpi | ✓ | ✓ | ✗ | ✓ |
| ocrDpi | ✓ | ✓ | ✓ | ✓ |
| useOcr | ✓ | ✓ | ✓ | ✓ |
| forceOcr | ✓ | ✓ | ✓ | ✓ |
| ocrLanguage | ✓ | ✓ | ✓ | ✓ |
| ocrFunction | ✓ | ✓ | ✓ | ✓ |
| forceText | ✓ | ✓ | ✓ | ✓ |
| showProgress | ✓ | ✓ | ✓ | ✓ |
| filename | ✓ | ✗ | ✓ | ✓ |
| pageChunks | ✓ | ✗ | ✓ | ✗ |
| pageSeparators | ✓ | ✗ | ✗ | ✗ |
| ignoreCode | ✓ | ✗ | ✓ | ✗ |
| pageWidth | ✓ | ✗ | ✗ | ✗ |
| pageHeight | ✓ | ✗ | ✗ | ✗ |
| tableFormat | ✗ | ✗ | ✓ | ✗ |
| tableMaxWidth | ✗ | ✗ | ✓ | ✗ |
| tableMinColWidth | ✗ | ✗ | ✓ | ✗ |
| keepOcrText | ✗ | ✗ | ✗ | ✓ |