ToJson() returns document content as structured data rather than a Markdown string. Every text block, image, and table on each page is represented as a JSON object with positional metadata attached. This makes it the right choice when you need to:
Build a custom rendering or post-processing pipeline
Access bounding box coordinates for text and image regions
Detect headings, bold text, or other styled elements programmatically
Pass structured layout data to a downstream model or search index
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
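Before relying on specific key names, it can help to look at the shape of the returned JSON. The following is a minimal sketch, not the library's own tooling: it uses System.Text.Json from the .NET base library and runs on a small hand-written sample string, so no PDF is needed. The sample borrows the pages, boxes, and textlines keys that the span-walking example on this page reads; the bbox key and all values are illustrative assumptions, and Summarize is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hand-written stand-in for PdfExtractor.ToJson(...) output.
// The "bbox" key and every value here are illustrative assumptions.
string sampleJson = @"{
  ""pages"": [
    { ""boxes"": [ { ""bbox"": [72, 72, 300, 90], ""textlines"": [] } ] }
  ]
}";

using (JsonDocument document = JsonDocument.Parse(sampleJson))
{
    JsonElement root = document.RootElement;
    Console.WriteLine("root: " + Summarize(root));

    JsonElement firstPage = root.GetProperty("pages")[0];
    Console.WriteLine("first page: " + Summarize(firstPage));

    JsonElement firstBox = firstPage.GetProperty("boxes")[0];
    Console.WriteLine("first box: " + Summarize(firstBox));
}

// Hypothetical helper: describes a JSON value by kind,
// listing keys for objects and the item count for arrays.
static string Summarize(JsonElement element)
{
    switch (element.ValueKind)
    {
        case JsonValueKind.Object:
            var names = new List<string>();
            foreach (JsonProperty property in element.EnumerateObject())
                names.Add(property.Name);
            return "object with keys: " + string.Join(", ", names);
        case JsonValueKind.Array:
            return "array of " + element.GetArrayLength() + " item(s)";
        default:
            return element.ValueKind.ToString();
    }
}
```

To inspect real output, replace sampleJson with the string returned by PdfExtractor.ToJson and adjust the property names to whatever the summary reports.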
Spans are the most granular unit in the JSON output. Each span represents a run of text sharing the same font, size, and style flags. This lets you identify headings, bold text, and other styled elements programmatically:
using System;
using System.Linq;
using Newtonsoft.Json.Linq;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
JToken root = JToken.Parse(json);

// Accept either a top-level array of pages or an object with a "pages" array.
JArray pages = root switch
{
    JArray arr => arr,
    JObject obj when obj["pages"] is JArray arr => arr,
    _ => throw new InvalidOperationException("Expected a JSON array or an object containing a 'pages' array.")
};

// Walk pages -> boxes -> textlines -> spans, skipping any level that is missing.
foreach (JObject page in pages)
{
    foreach (JObject box in (page["boxes"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
    {
        foreach (JObject line in (box["textlines"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
        {
            foreach (JObject span in (line["spans"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
            {
                string text = span["text"]!.Value<string>()!;
                float size = span["size"]!.Value<float>();
                int flags = span["flags"]!.Value<int>();

                if (size >= 14)
                    Console.WriteLine($"Heading candidate: \"{text}\" (size {size})");

                if ((flags & 16) != 0) // bold flag
                    Console.WriteLine($"Bold text: \"{text}\"");
            }
        }
    }
}
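For example, a span with the text "This is bold" set in the font MinionPro-Bold reports flags = 20. Because 20 = 16 + 4, the value is consistent with bold (16) plus serifed (4) text. In plain English, the extracted styling for the three sample spans is:
"Italic text." → italic
"Hello World!" → regular
"This is bold" → bold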
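The flag arithmetic can be wrapped in a small decoder. This is a sketch under stated assumptions rather than part of the PDF4LLM API: DescribeFlags is a hypothetical helper, bits 16 (bold) and 4 (serifed) come from the example on this page, and bits 1, 2, and 8 are assumed to follow the PyMuPDF span-flag convention (superscript, italic, monospaced). Verify them against your own output before relying on them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: turns a span's "flags" bitmask into readable style names.
static string DescribeFlags(int flags)
{
    var styles = new List<string>();
    if ((flags & 1) != 0) styles.Add("superscript"); // assumed bit meaning
    if ((flags & 2) != 0) styles.Add("italic");      // assumed bit meaning
    if ((flags & 4) != 0) styles.Add("serifed");     // from this page's example
    if ((flags & 8) != 0) styles.Add("monospaced");  // assumed bit meaning
    if ((flags & 16) != 0) styles.Add("bold");       // from this page's example
    return styles.Count == 0 ? "regular" : string.Join(" + ", styles);
}

Console.WriteLine(DescribeFlags(20)); // prints "serifed + bold"
Console.WriteLine(DescribeFlags(0));  // prints "regular"
```

Because the serifed bit describes the typeface rather than emphasis, an unstyled span in a serif font would be reported as "serifed" by this helper, not "regular". Mask the value with 2 and 16 if only italic and bold matter.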
using System.IO;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
File.WriteAllText("output.json", json, System.Text.Encoding.UTF8);
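Specify the encoding explicitly when writing to disk so that the output format is stated in the code rather than left to a default. In .NET, the two-argument File.WriteAllText overload writes UTF-8 without a byte order mark, while passing System.Text.Encoding.UTF8 writes UTF-8 with one. Both preserve non-ASCII characters such as accented letters, CJK characters, and symbols; corruption typically appears when a file is written or later read with a legacy code page, which is most common on Windows.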
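One way to check that non-ASCII content survives the write is a round trip through a temporary file. The sketch below is only an illustration: SurvivesRoundTrip is a hypothetical helper, and the JSON string is a hand-written stand-in for real ToJson output, written with Unicode escapes so the example does not depend on the source file's own encoding.

```csharp
using System;
using System.IO;
using System.Text;

// Stand-in for PdfExtractor.ToJson(...) output containing non-ASCII text:
// an accented letter, three CJK characters, and the euro sign.
string json = "{\"text\": \"caf\u00e9 \u65e5\u672c\u8a9e \u20ac\"}";

Console.WriteLine(SurvivesRoundTrip(json)
    ? "Round trip OK"
    : "Round trip changed the text");

// Hypothetical helper: writes the content as UTF-8 to a temporary file,
// reads it back as UTF-8, and reports whether the text is unchanged.
static bool SurvivesRoundTrip(string content)
{
    string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".json");
    try
    {
        File.WriteAllText(path, content, Encoding.UTF8);
        return File.ReadAllText(path, Encoding.UTF8) == content;
    }
    finally
    {
        File.Delete(path); // no-op if the file was never created
    }
}
```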