How do I install PDF4LLM for .NET?
Install the package via the .NET CLI with `dotnet add package PDF4LLM`, or add a PackageReference directly to your .csproj.
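For the .csproj route, an entry like the following works; the version number is illustrative, so pin whichever release you need:

```xml
<ItemGroup>
  <!-- Version shown is a placeholder; use the latest published release. -->
  <PackageReference Include="PDF4LLM" Version="1.0.0" />
</ItemGroup>
```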
MuPDF.NET is installed automatically as a dependency — you do not need to add it separately.
Which .NET targets are supported?
PDF4LLM targets .NET Standard 2.0, making it compatible with any framework that implements that standard:

| Target framework | Supported |
|---|---|
| .NET 8.0 | ✓ |
| .NET 7.0 | ✓ |
| .NET 6.0 | ✓ |
| .NET 5.0 | ✓ |
| .NET Standard 2.0 | ✓ |
| .NET Framework 4.8 | ✓ |
| .NET Framework 4.7.2 | ✓ |
| .NET Framework 4.6.1 | ✓ |
How do I verify my installation?
Add a `using` directive and call `ToMarkdown` on any PDF:
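A minimal check, assuming the `Document` constructor takes a file path (as used elsewhere in this guide) and a local `sample.pdf` exists:

```csharp
using System;
using MuPDF.NET;
using PDF4LLM;

// Open any local PDF; a non-empty result means PDF4LLM and its
// MuPDF.NET dependency are wired up correctly.
var doc = new Document("sample.pdf");
string markdown = PdfExtractor.ToMarkdown(doc);
Console.WriteLine(markdown.Length > 0 ? "OK" : "Empty output");
```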
I’m seeing an “assembly with the same simple name” conflict. How do I fix it?
Your installed version of MuPDF.NET already bundles PDF4LLM internally. Having both packages referenced simultaneously causes the conflict.
Remove the explicit PDF4LLM package reference and rely on the bundled version. Run `dotnet remove package PDF4LLM`, or delete the `PackageReference` line from your .csproj manually.
Your existing code is unaffected: `using PDF4LLM;` and the `PdfExtractor.*` API work regardless of which package supplies the assembly.
A future release of MuPDF.NET will stop bundling PDF4LLM, allowing both packages to coexist without conflict.
How do I convert a PDF to Markdown?
Open a `Document` and pass it to `PdfExtractor.ToMarkdown()`:
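For example, converting a PDF and saving the result next to it (file names are illustrative):

```csharp
using System;
using System.IO;
using MuPDF.NET;
using PDF4LLM;

var doc = new Document("report.pdf");
string markdown = PdfExtractor.ToMarkdown(doc);

// Save the Markdown alongside the source PDF.
File.WriteAllText("report.md", markdown);
Console.WriteLine($"Wrote {markdown.Length} characters to report.md");
```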
What output formats are supported?
There are three extraction methods on `PdfExtractor`, all sharing a consistent interface:
| Method | Returns | Best for |
|---|---|---|
| `ToMarkdown()` | `string` | LLM ingestion and RAG pipelines |
| `ToJson()` | `string` (JSON) | Custom pipelines needing bounding boxes and layout metadata |
| `ToText()` | `string` | Search indexing and NLP preprocessing |
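All three take the same `Document` and return a string, so switching formats is a one-line change; a side-by-side sketch:

```csharp
using System;
using MuPDF.NET;
using PDF4LLM;

var doc = new Document("paper.pdf");

string markdown = PdfExtractor.ToMarkdown(doc); // headings, lists, tables as Markdown
string json     = PdfExtractor.ToJson(doc);     // layout metadata and bounding boxes
string text     = PdfExtractor.ToText(doc);     // plain text only

Console.WriteLine(markdown.Length);
```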
How do I extract only specific pages?
Pass a zero-based list of page indices to the `pages` parameter. This works across all three extraction methods. Page 1 is index 0, page 2 is index 1, and so on.
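A sketch, assuming `pages` accepts a list of zero-based indices (the exact collection type is an assumption):

```csharp
using System.Collections.Generic;
using MuPDF.NET;
using PDF4LLM;

var doc = new Document("report.pdf");

// Pages 1 and 3 of the document are zero-based indices 0 and 2.
string md = PdfExtractor.ToMarkdown(doc, pages: new List<int> { 0, 2 });
```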
What document formats are supported as input?
Standard formats are supported out of the box: PDF, XPS, EPUB, MOBI, and more. See the Supported Formats guide for a full list of supported input and output formats.
How do I analyse visual layout regions (columns, figures, sidebars)?
Use `PdfExtractor.ParseDocument()`. It analyses the document and returns a typed `ParsedDocument` object containing a list of `ParsedPage` objects, each with its detected blocks, tables, images, and bounding boxes in reading order.
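A sketch of walking the result; property names beyond the types listed above (`Pages`, `Blocks`, `Tables`, `Images`, `BBox`) are assumptions:

```csharp
using System;
using MuPDF.NET;
using PDF4LLM;

var doc = new Document("layout.pdf");
ParsedDocument parsed = PdfExtractor.ParseDocument(doc);

foreach (ParsedPage page in parsed.Pages)        // one entry per page
{
    Console.WriteLine($"tables: {page.Tables.Count}, images: {page.Images.Count}");
    foreach (var block in page.Blocks)
        Console.WriteLine(block.BBox);           // bounding box, in reading order
}
```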
How do I extract AcroForm field values from a filled PDF?
Use `PdfExtractor.GetKeyValues()`. It returns a `List<FormField>`, each with `Name`, `Value`, and `Page` properties:
Note: `GetKeyValues()` does not accept a `pages` parameter; it always processes the full document.
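For example, dumping every field (a sketch built on the three properties listed above):

```csharp
using System;
using System.Collections.Generic;
using MuPDF.NET;
using PDF4LLM;

var doc = new Document("filled-form.pdf");
List<FormField> fields = PdfExtractor.GetKeyValues(doc);

foreach (FormField field in fields)
    Console.WriteLine($"page {field.Page}: {field.Name} = {field.Value}");
```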
Does it handle scanned or image-based PDFs?
Yes, via Tesseract OCR. Unlike the Python library, OCR is not triggered automatically; you must opt in with `useOcr: true`:
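A minimal sketch; OCR runs only when explicitly requested:

```csharp
using System;
using MuPDF.NET;
using PDF4LLM;

var doc = new Document("scanned.pdf");

// Without useOcr: true, image-only pages yield little or no text.
string md = PdfExtractor.ToMarkdown(doc, useOcr: true);
Console.WriteLine(md);
```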
How do I install Tesseract for OCR?
Tesseract must be installed on the host system and available on the PATH. PDF4LLM does not bundle it.
Windows: download the installer from the UB Mannheim Tesseract builds and add the install directory (e.g. C:\Program Files\Tesseract-OCR) to your PATH.
macOS: install via Homebrew with `brew install tesseract`.
Linux: install from your distribution's package manager, e.g. `sudo apt-get install tesseract-ocr`.
If the tesseract binary cannot be found on the PATH, you will get a TesseractNotFoundException at runtime.
How do I use OCR with a non-English language?
Pass a Tesseract language code to `ocrLanguage`. The default is `"eng"`. Combine multiple languages with a `+`:
Common codes: `eng` (English), `fra` (French), `deu` (German), `spa` (Spanish), `jpn` (Japanese), `chi_sim` (Simplified Chinese), `chi_tra` (Traditional Chinese).
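For example, OCR-ing a document that mixes English and German (the German language pack must be installed in Tesseract):

```csharp
using System;
using MuPDF.NET;
using PDF4LLM;

var doc = new Document("bilingual-scan.pdf");

// "eng+deu" runs OCR with both the English and German models.
string md = PdfExtractor.ToMarkdown(doc, useOcr: true, ocrLanguage: "eng+deu");
Console.WriteLine(md);
```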
How do I handle documents that mix scanned and native text pages?
Use a per-page native probe to identify which pages need OCR, then extract each set separately. Tune the character-count threshold (e.g. `< 50`) to suit your documents: pages with only a page number or short heading will score low.
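A sketch of that probe, assuming a `PageCount` property on `Document` (the property name is an assumption) and the per-page `pages` parameter described earlier:

```csharp
using System.Collections.Generic;
using MuPDF.NET;
using PDF4LLM;

var doc = new Document("mixed.pdf");
var nativePages = new List<int>();
var scannedPages = new List<int>();

for (int i = 0; i < doc.PageCount; i++)   // PageCount is an assumed property name
{
    // Probe each page's native text layer without OCR.
    string probe = PdfExtractor.ToText(doc, pages: new List<int> { i });
    if (probe.Trim().Length < 50)          // threshold from the text above
        scannedPages.Add(i);
    else
        nativePages.Add(i);
}

string nativeMd  = PdfExtractor.ToMarkdown(doc, pages: nativePages);
string scannedMd = PdfExtractor.ToMarkdown(doc, pages: scannedPages, useOcr: true);
```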
How do I run OCR in a Docker container?
Add Tesseract to your Dockerfile, verify that the tesseract binary is on the PATH inside the built image, and, if language data is not found at runtime, set TESSDATA_PREFIX explicitly:
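A sketch for a Debian-based .NET image; the base image tag and the tessdata path are assumptions (the path varies by distribution and Tesseract version):

```dockerfile
FROM mcr.microsoft.com/dotnet/aspnet:8.0

# Install Tesseract and its default (English) language data.
RUN apt-get update \
    && apt-get install -y --no-install-recommends tesseract-ocr \
    && rm -rf /var/lib/apt/lists/*

# Only needed if language data is not found automatically at runtime.
ENV TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata
```

After building, `docker run --rm <image> tesseract --version` confirms the binary is on the PATH.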
Does it integrate with LlamaIndex?
Yes. Use `PdfExtractor.LlamaMarkdownReader()` to get a `PDFMarkdownReader` instance, then call `LoadData()` to get one `LlamaDocument` per page with Markdown text and metadata:
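A sketch; the `LoadData()` argument and the per-document `Text` property are assumptions beyond what the text above states:

```csharp
using System;
using PDF4LLM;

var reader = PdfExtractor.LlamaMarkdownReader();
var pages = reader.LoadData("guide.pdf");   // one LlamaDocument per page (argument assumed)

foreach (var page in pages)
    Console.WriteLine(page.Text.Length);    // Markdown text for that page (property assumed)
```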
How do I use PDF4LLM with Azure OpenAI?
Install the Azure OpenAI SDK alongside PDF4LLM with `dotnet add package Azure.AI.OpenAI`. A typical pattern: split the PDF into per-page chunks with `LlamaMarkdownReader`, embed them with `text-embedding-3-small`, then retrieve by cosine similarity before passing context to a chat model. See the Azure OpenAI integration guide for full patterns including batched embedding, parallel OCR, multimodal image input, and Managed Identity auth.
How do I use Managed Identity instead of an API key for Azure OpenAI?
Replace `AzureKeyCredential` with `DefaultAzureCredential`:
`DefaultAzureCredential` works in Azure App Service, Azure Functions, AKS, and other managed environments. For local development, run `az login` first, or use `AzureCliCredential` explicitly.
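With the current Azure.AI.OpenAI SDK the swap looks like this; the endpoint URL is a placeholder:

```csharp
using System;
using Azure.AI.OpenAI;
using Azure.Identity;

// DefaultAzureCredential resolves Managed Identity when running in Azure,
// or the local `az login` session during development.
var client = new AzureOpenAIClient(
    new Uri("https://your-resource.openai.azure.com/"),   // placeholder endpoint
    new DefaultAzureCredential());
```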
Assign the Cognitive Services OpenAI User role to the managed identity in the Azure Portal or via the CLI:
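The CLI form is sketched below; the principal ID and scope segments are placeholders you must fill in for your subscription:

```shell
az role assignment create \
  --assignee <managed-identity-principal-id> \
  --role "Cognitive Services OpenAI User" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<openai-resource>"
```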