Input Formats
MuPDF.NET can open and extract content from the following document types:

| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | All versions, including encrypted and scanned |
| XPS | .xps | Microsoft XML Paper Specification |
| eBooks | .epub, .mobi, .fb2 | Reflowable content is linearised per chapter |
| Comic Books | .cbz | Image-based pages; OCR recommended |
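
As a rough illustration of how the table above might be used, the sketch below checks a file's extension against the listed formats before the file is handed to the library. `SupportedInputs` and `GetFormat` are hypothetical names introduced here, not part of MuPDF.NET, and the file names are made up.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Usage: classify a few made-up file names before trying to open them.
foreach (var name in new[] { "report.PDF", "novel.epub", "notes.docx" })
{
    string? format = SupportedInputs.GetFormat(name);
    Console.WriteLine(format is null
        ? $"{name}: not in the supported list"
        : $"{name}: {format}");
}

// Hypothetical helper (not a MuPDF.NET API): maps the extensions from the
// input-format table to their format family.
static class SupportedInputs
{
    // Case-insensitive keys so ".PDF" and ".pdf" are treated the same.
    private static readonly Dictionary<string, string> Formats =
        new(StringComparer.OrdinalIgnoreCase)
        {
            [".pdf"] = "PDF",
            [".xps"] = "XPS",
            [".epub"] = "eBooks",
            [".mobi"] = "eBooks",
            [".fb2"] = "eBooks",
            [".cbz"] = "Comic Books",
        };

    // Returns the format family, or null when the extension is not listed.
    public static string? GetFormat(string path)
    {
        // Path.GetExtension yields "" for names without an extension,
        // which is never a key in the dictionary.
        string extension = Path.GetExtension(path) ?? "";
        return Formats.TryGetValue(extension, out var format) ? format : null;
    }
}
```

A lookup like this only filters by file name; it does not inspect the file's contents, so a mislabelled file would still need to be handled when it is opened.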
Output Formats
MuPDF.NET can produce output in four formats depending on your use case:

| Format | Function | Best For |
|---|---|---|
| Markdown | ToMarkdown() | LLM ingestion, RAG pipelines, readable docs |
| JSON | ToJson() | Custom pipelines needing bounding boxes and layout data |
| Plain Text | ToText() | Simple text extraction, search indexing |
| Images | ToMarkdown(writeImages: true) | Preserving figures, charts, and diagrams |
Markdown
The default and most commonly used output format. Text is extracted in reading order, with headings, lists, tables, and inline formatting preserved where detectable.
JSON
Returns structured data, including bounding boxes, font information, and layout metadata, for every block on the page. Useful for building custom post-processing pipelines.
Plain Text
Strips all formatting and returns raw text content. Ideal when downstream tools do not need Markdown syntax.
Images
When writeImages: true is passed to ToMarkdown(), embedded images and graphics are extracted and saved to disk. Image paths are referenced inline in the Markdown output.
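
Because the image paths appear inline in the Markdown, a downstream step can collect them with ordinary string processing. The sketch below assumes the references use standard Markdown image syntax, `![alt](path)`; this page does not spell out the exact form, so treat that as an assumption. `MarkdownImages` and `FindPaths` are hypothetical names, and the sample string is a stand-in for real ToMarkdown() output.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Stand-in for text returned by ToMarkdown(writeImages: true);
// the image file names are made up for illustration.
string markdown =
    "# Report\n\n![](images/page1-fig1.png)\n\nSome text.\n\n![Chart](images/page2-fig1.png)\n";

foreach (string path in MarkdownImages.FindPaths(markdown))
{
    Console.WriteLine(path);
}

// Hypothetical helper (not a MuPDF.NET API).
static class MarkdownImages
{
    // Matches ![alt](path "optional title") and captures the path,
    // i.e. everything after '(' up to the first whitespace or ')'.
    private static readonly Regex ImageRef =
        new(@"!\[[^\]]*\]\(([^)\s]+)[^)]*\)", RegexOptions.Compiled);

    // Returns the image paths in the order they appear in the Markdown.
    public static List<string> FindPaths(string markdown)
    {
        var paths = new List<string>();
        foreach (Match match in ImageRef.Matches(markdown))
        {
            paths.Add(match.Groups[1].Value);
        }
        return paths;
    }
}
```

A regular expression is enough for a quick inventory of extracted files; paths containing spaces or parentheses would need a proper Markdown parser.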
Next Steps
Extract Markdown
Full walkthrough of ToMarkdown() with common options.
Images & Graphics
Controlling image extraction, DPI, and output path.