Input Formats

MuPDF.NET can open and extract content from the following document types:
| Format | Extensions | Notes |
| --- | --- | --- |
| PDF | .pdf | All versions, including encrypted and scanned |
| XPS | .xps | Microsoft XML Paper Specification |
| eBooks | .epub, .mobi, .fb2 | Reflowable content is linearised per chapter |
| Comic Books | .cbz | Image-based pages; OCR recommended |
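
A quick check on the file extension can route unsupported files away before extraction is attempted. The sketch below is plain C# and does not call MuPDF.NET: `InputFormats` and `IsSupported` are hypothetical names, and the extension list is simply copied from the table above. It inspects only the file name, not the file contents.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of MuPDF.NET: checks a path against the
// extensions listed in the input-format table.
public static class InputFormats
{
    // Case-insensitive so that "REPORT.PDF" and "report.pdf" are treated alike.
    private static readonly HashSet<string> Supported =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".pdf", ".xps", ".epub", ".mobi", ".fb2", ".cbz"
        };

    // Returns true when the file extension is one of the documented input types.
    public static bool IsSupported(string path) =>
        Supported.Contains(Path.GetExtension(path));
}
```

Calling `InputFormats.IsSupported("scan.cbz")` before extraction lets a batch job skip files such as `.docx` that are not in the list.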

Output Formats

MuPDF.NET can produce output in four formats depending on your use case:
| Format | Function | Best For |
| --- | --- | --- |
| Markdown | ToMarkdown() | LLM ingestion, RAG pipelines, readable docs |
| JSON | ToJson() | Custom pipelines needing bounding boxes and layout data |
| Plain Text | ToText() | Simple text extraction, search indexing |
| Images | ToMarkdown(writeImages: true) | Preserving figures, charts, and diagrams |

Markdown

Markdown is the default output format. Text is extracted in reading order, and headings, lists, tables, and inline formatting are preserved wherever they can be detected.
string mdText = PdfExtractor.ToMarkdown("document.pdf");

JSON

Returns structured data including bounding boxes, font information, and layout metadata for every block on the page. Useful for building custom post-processing pipelines.
string json = PdfExtractor.ToJson("document.pdf");
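
The JSON schema is not described on this page, so the following sketch makes no assumption about field names. It parses a JSON string with System.Text.Json and summarises only the top level, which can be a useful first look before writing a post-processing pipeline. `JsonInspector` and `DescribeTopLevel` are hypothetical names, not part of MuPDF.NET; the input would be the string returned by ToJson().

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: summarises the top level of any JSON document
// without assuming a particular schema.
public static class JsonInspector
{
    public static List<string> DescribeTopLevel(string json)
    {
        var summary = new List<string>();
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;

        switch (root.ValueKind)
        {
            case JsonValueKind.Object:
                // One entry per property: its name and the kind of value it holds.
                foreach (JsonProperty prop in root.EnumerateObject())
                    summary.Add($"{prop.Name}: {prop.Value.ValueKind}");
                break;
            case JsonValueKind.Array:
                summary.Add($"array of {root.GetArrayLength()} elements");
                break;
            default:
                // A bare string, number, boolean, or null at the root.
                summary.Add(root.ValueKind.ToString());
                break;
        }
        return summary;
    }
}
```

Once the actual property names are known from this summary, they can be read with `JsonElement.GetProperty` or mapped onto typed classes with `JsonSerializer.Deserialize`.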

Plain Text

Strips all formatting and returns raw text content. Ideal when downstream tools do not need Markdown syntax.
string text = PdfExtractor.ToText("document.pdf");

Images

When writeImages: true is passed to ToMarkdown(), embedded images and graphics are extracted and saved to disk; the example below also passes imagePath to name the output directory. Image paths are referenced inline in the Markdown output.
string mdText = PdfExtractor.ToMarkdown("document.pdf", writeImages: true, imagePath: "images/");
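
Because the image paths appear inline in the Markdown, they can be collected afterwards, for example to verify that each file exists or to upload the files alongside the text. The sketch below assumes the references use the standard Markdown image syntax `![alt](path)`, which this page does not confirm; `MarkdownImages` and `FindPaths` are hypothetical names, and the input would be the string returned by ToMarkdown().

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects image paths from Markdown text,
// assuming the standard ![alt](path) image syntax.
public static class MarkdownImages
{
    // Captures the path after "](" up to the first ')' or whitespace,
    // so an optional quoted title after the path is ignored.
    private static readonly Regex ImageRef =
        new Regex(@"!\[[^\]]*\]\(([^)\s]+)", RegexOptions.Compiled);

    public static List<string> FindPaths(string markdown)
    {
        var paths = new List<string>();
        foreach (Match m in ImageRef.Matches(markdown))
            paths.Add(m.Groups[1].Value);
        return paths;
    }
}
```

Ordinary links of the form `[text](url)` are not matched, since the pattern requires the leading `!`.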

Next Steps

Extract Markdown

Full walkthrough of ToMarkdown() with common options.

Images & Graphics

Controlling image extraction, DPI, and output path.