Overview

ToJson() returns document content as structured data rather than a Markdown string. Every text block, image, and table on each page is represented as a JSON object with positional metadata attached. This makes it the right choice when you need to:
  • Build a custom rendering or post-processing pipeline
  • Access bounding box coordinates for text and image regions
  • Detect headings, bold text, or other styled elements programmatically
  • Pass structured layout data to a downstream model or search index
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");

Output structure

The return value is a JSON array — one object per processed page. See the JSON schema guide for a full field reference.

Working with bounding boxes

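In the examples on this page, each box exposes its bounding rectangle as four coordinates, x0, y0, x1, and y1, and each span carries a bbox field: a four-element array [x0, y0, x1, y1]. The following example reads the box-level coordinates together with the boxclass label: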
using System;
using System.Linq;
using Newtonsoft.Json.Linq;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
JToken root = JToken.Parse(json);
JArray pages = root switch
{
    JArray arr => arr,
    JObject obj when obj["pages"] is JArray arr => arr,
    _ => throw new InvalidOperationException("Expected a JSON array or an object containing a 'pages' array.")
};

foreach (JObject page in pages)
{
    int pageNum = page["page_number"]!.Value<int>();
    Console.WriteLine($"\nPage {pageNum}");

    foreach (JObject box in (page["boxes"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
    {
        float x0 = box["x0"]!.Value<float>();
        float y0 = box["y0"]!.Value<float>();
        float x1 = box["x1"]!.Value<float>();
        float y1 = box["y1"]!.Value<float>();
        string boxClass = box["boxclass"]?.Value<string>() ?? "unknown";

        Console.WriteLine($"{boxClass} at ({x0:F1}, {y0:F1}) -> ({x1:F1}, {y1:F1})");
    }
}
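
Once the four coordinates are read, most layout logic reduces to rectangle arithmetic. The following is a minimal sketch, not part of PDF4LLM: the helpers Width, Height, and Overlaps are hypothetical names, the coordinates are made up for illustration, and it assumes x0 <= x1 and y0 <= y1.

```csharp
using System;

// Hypothetical helpers (not part of PDF4LLM); a box is the tuple (x0, y0, x1, y1).
// Assumes x0 <= x1 and y0 <= y1.
static float Width((float X0, float Y0, float X1, float Y1) b) => b.X1 - b.X0;
static float Height((float X0, float Y0, float X1, float Y1) b) => b.Y1 - b.Y0;

// True when the two rectangles share interior area; touching edges do not count.
static bool Overlaps((float X0, float Y0, float X1, float Y1) a,
                     (float X0, float Y0, float X1, float Y1) b) =>
    a.X0 < b.X1 && b.X0 < a.X1 && a.Y0 < b.Y1 && b.Y0 < a.Y1;

var box = (X0: 72f, Y0: 100f, X1: 300f, Y1: 150f);    // illustrative box
var region = (X0: 0f, Y0: 0f, X1: 200f, Y1: 120f);    // illustrative area of interest

Console.WriteLine($"{Width(box):F1} x {Height(box):F1}, overlaps region: {Overlaps(box, region)}");
```

A check like Overlaps can be used to keep only the boxes that fall inside a page region, for example to skip headers and footers.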

Extracting span-level data

Spans are the most granular unit in the JSON output. Each span represents a run of text sharing the same font, size, and style flags. This lets you identify headings, bold text, and other styled elements programmatically:
using System;
using System.Linq;
using Newtonsoft.Json.Linq;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
JToken root = JToken.Parse(json);
JArray pages = root switch
{
    JArray arr => arr,
    JObject obj when obj["pages"] is JArray arr => arr,
    _ => throw new InvalidOperationException("Expected a JSON array or an object containing a 'pages' array.")
};

foreach (JObject page in pages)
{
    foreach (JObject box in (page["boxes"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
    {
        foreach (JObject line in (box["textlines"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
        {
            foreach (JObject span in (line["spans"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
            {
                string text = span["text"]!.Value<string>()!;
                float size = span["size"]!.Value<float>();
                int flags = span["flags"]!.Value<int>();

                if (size >= 14)
                    Console.WriteLine($"Heading candidate: \"{text}\" (size {size})");

                if ((flags & 16) != 0)   // bold flag
                    Console.WriteLine($"Bold text: \"{text}\"");
            }
        }
    }
}
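
A fixed threshold such as size >= 14 depends on the base font size of the document. One alternative heuristic, sketched here as an assumption rather than anything PDF4LLM prescribes, is to treat the most frequent span size as body text and flag spans that are noticeably larger. The names BodySize and IsHeadingCandidate and the 1.2 ratio are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Most frequent font size, taken as the body-text size (heuristic).
// Throws InvalidOperationException if `sizes` is empty.
static float BodySize(IEnumerable<float> sizes) =>
    sizes.GroupBy(s => s)
         .OrderByDescending(g => g.Count())
         .ThenBy(g => g.Key)          // deterministic tie-break: the smaller size wins
         .First().Key;

// A span is a heading candidate when it is at least `ratio` times the body size.
static bool IsHeadingCandidate(float size, float bodySize, float ratio = 1.2f) =>
    size >= bodySize * ratio;

// Illustrative sizes, as they might be collected from span["size"] in the loop above.
float[] sizes = { 10f, 10f, 10f, 18f, 10f, 12.5f };
float body = BodySize(sizes);

foreach (float s in sizes.Distinct())
    Console.WriteLine($"size {s}: heading candidate = {IsHeadingCandidate(s, body)}");
```

This takes two passes over the spans (one to collect sizes, one to classify), which is the cost of not hard-coding a point size.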

Font flags reference

The flags field is a bitmask encoding font properties:
Bit   Value   Meaning
0     1       Superscript
1     2       Italic
2     4       Serifed font
3     8       Monospaced font
4     16      Bold

Example interpretation

Consider the following span JSON:
"spans": [
  {
    "size": 12,
    "flags": 6,
    "font": "MinionPro-It",
    "text": "Italic text.",
    "bbox": [72, 435.9, 122.6, 444.6]
  },
  {
    "size": 12,
    "flags": 0,
    "font": "Arial",
    "text": "Hello World!",
    "bbox": [122.6, 436.3, 184.8, 444.6]
  },
  {
    "size": 12,
    "flags": 20,
    "font": "MinionPro-Bold",
    "text": "This is bold",
    "bbox": [187.5, 436.0, 246.0, 444.6]
  }
]

flags = 6

"Italic text." in font MinionPro-It: 6 = 2 + 4, consistent with italic + serifed text.

flags = 0

"Hello World!" in font Arial: 0 is consistent with regular (unstyled) text.

flags = 20

"This is bold" in font MinionPro-Bold: 20 = 16 + 4, consistent with bold + serifed text.

In plain English, the extracted styling is:
  • "Italic text." → italic
  • "Hello World!" → regular
  • "This is bold" → bold
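
The same reading can be reproduced in code. This sketch decodes a flags value into style names using the bit values from the font flags reference; DescribeFlags is a hypothetical helper, not a PDF4LLM API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: names of the styles whose bits are set in `flags`.
static List<string> DescribeFlags(int flags)
{
    // (bit value, meaning) pairs from the font flags reference table.
    (int Value, string Name)[] known =
    {
        (1, "superscript"), (2, "italic"), (4, "serifed"),
        (8, "monospaced"), (16, "bold"),
    };

    var names = new List<string>();
    foreach (var (value, name) in known)
        if ((flags & value) != 0)
            names.Add(name);
    return names;
}

// The three flags values from the sample spans.
foreach (int flags in new[] { 6, 0, 20 })
{
    var names = DescribeFlags(flags);
    Console.WriteLine($"{flags}: {(names.Count == 0 ? "regular" : string.Join(" + ", names))}");
}
// prints:
// 6: italic + serifed
// 0: regular
// 20: serifed + bold
```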

Page selection

As with ToMarkdown(), you can limit extraction to specific pages:
string json = PdfExtractor.ToJson("document.pdf", pages: new List<int> { 0, 1, 2 });
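
For longer runs of pages, the list can be generated rather than typed out. A small sketch, assuming only that the pages argument accepts any List<int> as in the call above; PageRange is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: `count` consecutive page indices starting at `first`.
static List<int> PageRange(int first, int count) =>
    Enumerable.Range(first, count).ToList();

List<int> pages = PageRange(0, 3);               // same indices as { 0, 1, 2 }
Console.WriteLine(string.Join(", ", pages));     // prints "0, 1, 2"

// The list would then be passed exactly as in the call above:
// string json = PdfExtractor.ToJson("document.pdf", pages: pages);
```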

Saving JSON output

Write the result to a .json file:
using System.IO;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");

File.WriteAllText("output.json", json, System.Text.Encoding.UTF8);
Passing the encoding explicitly makes the intent visible at the call site. Note that the two-argument File.WriteAllText overload already writes UTF-8 (without a byte order mark) rather than a platform default encoding, whereas System.Text.Encoding.UTF8 prepends a byte order mark. Non-Latin characters such as accented letters, CJK characters, and symbols are preserved either way.
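
If a downstream consumer rejects a leading byte order mark (BOM), the file can be written with new UTF8Encoding(false) instead. The sketch below writes the same string both ways and inspects the first three bytes; the file names, the stand-in JSON string, and the StartsWithBom helper are illustrative.

```csharp
using System;
using System.IO;
using System.Text;

string json = "[{\"text\": \"café 漢字\"}]";   // stand-in for the ToJson() result

string withBom = Path.Combine(Path.GetTempPath(), "output-bom.json");
string noBom = Path.Combine(Path.GetTempPath(), "output-nobom.json");

File.WriteAllText(withBom, json, Encoding.UTF8);            // prepends EF BB BF
File.WriteAllText(noBom, json, new UTF8Encoding(false));    // no byte order mark

// True when the file begins with the UTF-8 byte order mark.
static bool StartsWithBom(string path)
{
    byte[] b = File.ReadAllBytes(path);
    return b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF;
}

Console.WriteLine($"Encoding.UTF8 wrote a BOM: {StartsWithBom(withBom)}");
Console.WriteLine($"UTF8Encoding(false) wrote a BOM: {StartsWithBom(noBom)}");

// Both files decode back to the same text.
Console.WriteLine(File.ReadAllText(withBom) == File.ReadAllText(noBom));
```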

Full example — building a custom text pipeline

using System;
using System.Linq;
using Newtonsoft.Json.Linq;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
JToken root = JToken.Parse(json);
JArray pages = root switch
{
    JArray arr => arr,
    JObject obj when obj["pages"] is JArray arr => arr,
    _ => throw new InvalidOperationException("Expected a JSON array or an object containing a 'pages' array.")
};

static object ParseSpanFlags(int flags) => new
{
    Superscript = (flags & 1) != 0,
    Italic = (flags & 2) != 0,
    Serifed = (flags & 4) != 0,
    Monospaced = (flags & 8) != 0,
    Bold = (flags & 16) != 0,
};

foreach (JObject page in pages)
{
    int pageNum = page["page_number"]!.Value<int>();
    Console.WriteLine($"\nPage {pageNum}");

    foreach (JObject box in (page["boxes"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
    {
        foreach (JObject line in (box["textlines"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
        {
            foreach (JObject span in (line["spans"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
            {
                string text = span["text"]?.Value<string>() ?? "";
                int flags = span["flags"]?.Value<int>() ?? 0;
                var styles = ParseSpanFlags(flags);

                Console.WriteLine(new { text, flags, styles });
            }
        }
    }
}
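
As one possible next stage of such a pipeline, decoded flags can drive a simple inline rendering. The sketch below is an illustration under stated assumptions: it parses the sample spans from the "Example interpretation" section with System.Text.Json (swapped in for Newtonsoft here only so the snippet runs without extra packages), uses a C# 11 raw string literal, and RenderSpan with its asterisk markers is a hypothetical rule, not how PDF4LLM produces Markdown.

```csharp
using System;
using System.Text;
using System.Text.Json;

// Spans copied from the "Example interpretation" sample (bbox omitted for brevity).
string spansJson = """
[
  { "size": 12, "flags": 6,  "font": "MinionPro-It",   "text": "Italic text." },
  { "size": 12, "flags": 0,  "font": "Arial",          "text": "Hello World!" },
  { "size": 12, "flags": 20, "font": "MinionPro-Bold", "text": "This is bold" }
]
""";

// Hypothetical rendering rule: bold -> **text**, italic -> *text*, otherwise unchanged.
static string RenderSpan(string text, int flags)
{
    if ((flags & 16) != 0) text = $"**{text}**";
    if ((flags & 2) != 0) text = $"*{text}*";
    return text;
}

var line = new StringBuilder();
using (JsonDocument doc = JsonDocument.Parse(spansJson))
{
    foreach (JsonElement span in doc.RootElement.EnumerateArray())
    {
        string text = span.GetProperty("text").GetString() ?? "";
        int flags = span.GetProperty("flags").GetInt32();
        if (line.Length > 0) line.Append(' ');   // separate spans with a single space
        line.Append(RenderSpan(text, flags));
    }
}

Console.WriteLine(line.ToString());
// prints: *Italic text.* Hello World! **This is bold**
```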

For the full API signature, see the ToJson() API reference.

Next steps

JSON Schema

Full field descriptions for every object in the JSON output.

Extract Markdown

Preserve structure and formatting for LLM pipelines.

Extract Text

Get clean, plain text output.