ToMarkdown() is the primary extraction method in PDF4LLM. It reads a document and returns its content as a Markdown string, preserving headings, lists, tables, code blocks, images, and reading order as closely as possible.
using MuPDF.NET;
using PDF4LLM;

string mdText = PdfExtractor.ToMarkdown("document.pdf");
Use LlamaMarkdownReader to return one document object per page instead of a single concatenated string. Each chunk includes the page’s Markdown text and associated metadata:
var reader = PdfExtractor.LlamaMarkdownReader();
var chunks = reader.LoadData("document.pdf");

foreach (var chunk in chunks)
{
    int page = (int)chunk.ExtraInfo["page"];
    string text = chunk.Text;
    Console.WriteLine($"Page {page}");
    Console.WriteLine(text);
}
PDF4LLM uses bounding box position to identify and exclude repeating page headers and footers. Filter them either by building the page list and using ToJson to identify the margin bands, or at the chunking stage by dropping short leading and trailing lines from each page chunk.

For documents with consistent header and footer heights, the most reliable approach is to filter blocks by their bounding box position using ParseDocument:
using System.Linq; // required for Where() and ToList()

ParsedDocument parsed = PdfExtractor.ParseDocument("document.pdf");

foreach (ParsedPage page in parsed.Pages)
{
    // Exclude blocks in the top and bottom 60pt margin bands
    var bodyBlocks = page.Blocks
        .Where(b => b.BoundingBox.Y0 > 60 && b.BoundingBox.Y1 < (page.Height - 60))
        .ToList();

    // Render body blocks only
}
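The chunk-level alternative mentioned above (dropping short leading and trailing lines that repeat across pages) needs no layout information at all. The following is a minimal sketch, assuming the per-page Markdown is already available as plain strings, for example collected from chunk.Text. `HeaderFooterFilter`, `StripRepeatedEdges`, and the `maxLength` and `minPages` thresholds are hypothetical names chosen for illustration and are not part of PDF4LLM.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of PDF4LLM): operates on plain page strings.
public static class HeaderFooterFilter
{
    // Removes the first and/or last non-blank line of each page when that line
    // is short and occupies the same position on at least minPages pages.
    public static List<string> StripRepeatedEdges(
        IReadOnlyList<string> pages, int maxLength = 80, int minPages = 2)
    {
        var split = pages
            .Select(p => p.Replace("\r\n", "\n").Split('\n').ToList())
            .ToList();
        var topCounts = CountEdges(split, fromTop: true, maxLength);
        var bottomCounts = CountEdges(split, fromTop: false, maxLength);

        var result = new List<string>();
        foreach (var lines in split)
        {
            int first = lines.FindIndex(IsNotBlank);
            int last = lines.FindLastIndex(IsNotBlank);
            var kept = lines.Where((line, i) =>
                !(i == first && IsRepeated(line, topCounts, maxLength, minPages)) &&
                !(i == last && IsRepeated(line, bottomCounts, maxLength, minPages)));
            result.Add(string.Join("\n", kept).Trim());
        }
        return result;
    }

    private static bool IsNotBlank(string line) => !string.IsNullOrWhiteSpace(line);

    // Digit runs are collapsed so "Page 1" and "Page 2" count as the same line.
    private static string Key(string line) => Regex.Replace(line.Trim(), @"\d+", "#");

    private static bool IsRepeated(
        string line, Dictionary<string, int> counts, int maxLength, int minPages)
        => line.Trim().Length <= maxLength
           && counts.TryGetValue(Key(line), out int n) && n >= minPages;

    // Counts how often each short edge line (top or bottom) occurs across pages.
    private static Dictionary<string, int> CountEdges(
        List<List<string>> pages, bool fromTop, int maxLength)
    {
        var counts = new Dictionary<string, int>();
        foreach (var lines in pages)
        {
            int index = fromTop ? lines.FindIndex(IsNotBlank) : lines.FindLastIndex(IsNotBlank);
            if (index < 0 || lines[index].Trim().Length > maxLength) continue;
            string key = Key(lines[index]);
            counts[key] = counts.TryGetValue(key, out int n) ? n + 1 : 1;
        }
        return counts;
    }
}
```

This is a text heuristic, so it can misfire: a short body line that opens several pages and differs only in its digits (such as a numbered caption) would also be removed. Where block coordinates are available, the bounding box filter is the safer choice.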
using MuPDF.NET;
using PDF4LLM;
using System.Collections.Generic;
using System.IO;

// Ensure the image and Markdown output directories exist
Directory.CreateDirectory("assets/");
Directory.CreateDirectory("output");

// Extract the first five pages with images
string mdText = PdfExtractor.ToMarkdown(
    "report.pdf",
    pages: new List<int> { 0, 1, 2, 3, 4 }, // first five pages only
    writeImages: true,                      // extract images to disk
    imagePath: "assets/",                   // image output directory
    imageFormat: "png"                      // image format
);

// Save the full output as a single Markdown file
File.WriteAllText("output/report.md", mdText, System.Text.Encoding.UTF8);
To save each page as a separate file, use LlamaMarkdownReader for per-page output:
var reader = PdfExtractor.LlamaMarkdownReader();
var chunks = reader.LoadData("report.pdf");

Directory.CreateDirectory("output");

foreach (var chunk in chunks)
{
    int pageNum = (int)chunk.ExtraInfo["page"];
    string filePath = $"output/page-{pageNum}.md";
    File.WriteAllText(filePath, chunk.Text, System.Text.Encoding.UTF8);
}
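With the page-{pageNum}.md naming scheme, page-10.md sorts before page-2.md in most file listings. Zero-padding the page number keeps the per-page files in page order. The sketch below assumes the value read from ExtraInfo["page"] is a non-negative integer; `PageFiles` and `PageFileName` are hypothetical helper names, not part of PDF4LLM.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of PDF4LLM) for sortable per-page file names.
public static class PageFiles
{
    // Pads the page number to at least three digits, widening the padding
    // when the document has more pages than three digits can hold.
    public static string PageFileName(string directory, int pageNum, int pageCount)
    {
        int width = Math.Max(3, pageCount.ToString().Length);
        string padded = pageNum.ToString().PadLeft(width, '0');
        return Path.Combine(directory, $"page-{padded}.md");
    }
}
```

Inside the loop, the path could then be built with PageFiles.PageFileName("output", pageNum, totalPages), where totalPages is the number of chunks returned by the reader.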
For the full API signature including all parameters and return types, see the ToMarkdown() API reference.