Extract JSON

Overview

ToJson() returns document content as structured data rather than a Markdown string. Every text block, image, and table on each page is represented as a JSON object with positional metadata attached. This makes it the right choice when you need to:

Build a custom rendering or post-processing pipeline
Access bounding box coordinates for text and image regions
Detect headings, bold text, or other styled elements programmatically
Pass structured layout data to a downstream model or search index

using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");

Output structure

The return value is a JSON array — one object per processed page. See the JSON schema guide for a full field reference.

Working with bounding boxes

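Every element is positioned by a bounding rectangle with corners (x0, y0) and (x1, y1). In the examples on this page, layout boxes expose these as separate x0, y0, x1, and y1 fields, while spans carry them as a four-element bbox array [x0, y0, x1, y1].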

using System;
using System.Linq;
using Newtonsoft.Json.Linq;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
JToken root = JToken.Parse(json);
JArray pages = root switch
{
    JArray arr => arr,
    JObject obj when obj["pages"] is JArray arr => arr,
    _ => throw new InvalidOperationException("Expected a JSON array or an object containing a 'pages' array.")
};

foreach (JObject page in pages)
{
    int pageNum = page["page_number"]!.Value<int>();
    Console.WriteLine($"\nPage {pageNum}");

    foreach (JObject box in (page["boxes"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
    {
        float x0 = box["x0"]!.Value<float>();
        float y0 = box["y0"]!.Value<float>();
        float x1 = box["x1"]!.Value<float>();
        float y1 = box["y1"]!.Value<float>();
        string boxClass = box["boxclass"]?.Value<string>() ?? "unknown";

        Console.WriteLine($"{boxClass} at ({x0:F1}, {y0:F1}) -> ({x1:F1}, {y1:F1})");
    }
}
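Once the coordinates are available, simple geometry follows directly. The sketch below computes each box's width and height and keeps only boxes that lie fully inside a given rectangle. It is a minimal illustration, not part of the library: the helper names Size and IsInside, the sample coordinates, and the boxclass values are all made up, and the boxes are built by hand rather than taken from ToJson() so the snippet runs on its own. Which page edge y = 0 corresponds to is not stated here, so the region is described only by its coordinates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json.Linq;

// Hand-written stand-ins for entries of a page's "boxes" array.
// Only the fields read by the loop above are included; the
// boxclass values are placeholders, not confirmed library output.
var boxes = new List<JObject>
{
    new JObject { ["x0"] = 72.0, ["y0"] = 60.0,  ["x1"] = 540.0, ["y1"] = 90.0,  ["boxclass"] = "text" },
    new JObject { ["x0"] = 72.0, ["y0"] = 120.0, ["x1"] = 300.0, ["y1"] = 400.0, ["boxclass"] = "picture" },
    new JObject { ["x0"] = 72.0, ["y0"] = 720.0, ["x1"] = 540.0, ["y1"] = 740.0, ["boxclass"] = "text" },
};

// Hypothetical helper: width and height of a box from its corner coordinates.
static (double Width, double Height) Size(JObject b) =>
    (b.Value<double>("x1") - b.Value<double>("x0"),
     b.Value<double>("y1") - b.Value<double>("y0"));

// Hypothetical helper: true when the box lies entirely inside
// the rectangle (rx0, ry0) -> (rx1, ry1).
static bool IsInside(JObject b, double rx0, double ry0, double rx1, double ry1) =>
    b.Value<double>("x0") >= rx0 && b.Value<double>("y0") >= ry0 &&
    b.Value<double>("x1") <= rx1 && b.Value<double>("y1") <= ry1;

// Keep only boxes whose rectangle fits inside (0, 0) -> (612, 200).
var regionBoxes = boxes.Where(b => IsInside(b, 0, 0, 612, 200)).ToList();

foreach (JObject hit in regionBoxes)
{
    var (width, height) = Size(hit);
    Console.WriteLine($"{hit.Value<string>("boxclass")}: {width} x {height}");
}
```

The same two helpers work unchanged on the JObject instances produced by the loop above, since they read only x0, y0, x1, and y1.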

Extracting span-level data

Spans are the most granular unit in the JSON output. Each span represents a run of text sharing the same font, size, and style flags. This lets you identify headings, bold text, and other styled elements programmatically:

using System;
using System.Linq;
using Newtonsoft.Json.Linq;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
JToken root = JToken.Parse(json);
JArray pages = root switch
{
    JArray arr => arr,
    JObject obj when obj["pages"] is JArray arr => arr,
    _ => throw new InvalidOperationException("Expected a JSON array or an object containing a 'pages' array.")
};

foreach (JObject page in pages)
{
    foreach (JObject box in (page["boxes"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
    {
        foreach (JObject line in (box["textlines"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
        {
            foreach (JObject span in (line["spans"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
            {
                string text = span["text"]!.Value<string>()!;
                float size = span["size"]!.Value<float>();
                int flags = span["flags"]!.Value<int>();

                if (size >= 14)
                    Console.WriteLine($"Heading candidate: \"{text}\" (size {size})");

                if ((flags & 16) != 0)   // bold flag
                    Console.WriteLine($"Bold text: \"{text}\"");
            }
        }
    }
}
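A fixed threshold such as size >= 14 depends on the document's typography. One alternative, shown here as a sketch and not as the library's method, is to estimate the body font size first and treat noticeably larger spans as heading candidates. The helper names BodySize and HeadingCandidates, the 1.2 ratio, and the sample spans are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json.Linq;

// Hand-written stand-ins for span objects, using the same
// "text", "size", and "flags" fields read in the loop above.
var spans = new List<JObject>
{
    new JObject { ["text"] = "Quarterly report", ["size"] = 18.0, ["flags"] = 20 },
    new JObject { ["text"] = "Revenue grew in every region during the quarter.", ["size"] = 11.0, ["flags"] = 4 },
    new JObject { ["text"] = "Outlook", ["size"] = 14.0, ["flags"] = 20 },
    new JObject { ["text"] = "Growth is expected to continue next year.", ["size"] = 11.0, ["flags"] = 4 },
};

// Hypothetical helper: the font size that covers the most characters
// is taken as the body size.
static double BodySize(IEnumerable<JObject> items) =>
    items.GroupBy(s => s.Value<double>("size"))
         .OrderByDescending(g => g.Sum(s => (s.Value<string>("text") ?? "").Length))
         .First().Key;

// Hypothetical helper: text of spans at least `ratio` times the body size.
static List<string> HeadingCandidates(IEnumerable<JObject> items, double ratio = 1.2)
{
    double body = BodySize(items);
    return items.Where(s => s.Value<double>("size") >= body * ratio)
                .Select(s => s.Value<string>("text") ?? "")
                .ToList();
}

Console.WriteLine($"Estimated body size: {BodySize(spans)}");
foreach (string heading in HeadingCandidates(spans))
    Console.WriteLine($"Heading candidate: \"{heading}\"");
```

With the sample data, the 11-point spans dominate by character count, so the 18-point and 14-point spans are reported as candidates. The ratio is a tuning knob and may need adjusting per document.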

Font flags reference

The flags field is a bitmask encoding font properties:

Bit	Value	Meaning
0	`1`	Superscript
1	`2`	Italic
2	`4`	Serifed font
3	`8`	Monospaced font
4	`16`	Bold

Example interpretation

Consider the following span JSON:

"spans": [
  {
    "size": 12,
    "flags": 6,
    "font": "MinionPro-It",
    "text": "Italic text.",
    "bbox": [72, 435.9, 122.6, 444.6]
  },
  {
    "size": 12,
    "flags": 0,
    "font": "Arial",
    "text": "Hello World!",
    "bbox": [122.6, 436.3, 184.8, 444.6]
  },
  {
    "size": 12,
    "flags": 20,
    "font": "MinionPro-Bold",
    "text": "This is bold",
    "bbox": [187.5, 436.0, 246.0, 444.6]
  }
]

flags = 6

On "Italic text." (font MinionPro-It), flags = 6 decomposes as 6 = 2 + 4, which is consistent with italic, serifed text.

flags = 0

On "Hello World!" (font Arial), flags = 0 is consistent with regular (unstyled) text.

flags = 20

On "This is bold" (font MinionPro-Bold), flags = 20 decomposes as 20 = 16 + 4, which is consistent with bold, serifed text.

In plain English, the extracted styling is:

"Italic text." → italic
"Hello World!" → regular
"This is bold" → bold
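The three decompositions can be reproduced mechanically by testing each bit from the flags table. The helper name DescribeFlags below is hypothetical and exists only for this illustration; the bit values are the ones listed in the table.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: turns a flags bitmask into a readable description,
// using the bit values from the font flags table.
static string DescribeFlags(int flags)
{
    var names = new List<string>();
    if ((flags & 1) != 0) names.Add("superscript");
    if ((flags & 2) != 0) names.Add("italic");
    if ((flags & 4) != 0) names.Add("serifed");
    if ((flags & 8) != 0) names.Add("monospaced");
    if ((flags & 16) != 0) names.Add("bold");
    return names.Count == 0 ? "regular" : string.Join(" + ", names);
}

foreach (int value in new[] { 6, 0, 20 })
    Console.WriteLine($"flags = {value}: {DescribeFlags(value)}");

// Prints:
// flags = 6: italic + serifed
// flags = 0: regular
// flags = 20: serifed + bold
```

Names are emitted in bit order, so 20 reads as "serifed + bold" rather than "bold + serifed".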

Page selection

As with ToMarkdown(), you can limit extraction to specific pages:

string json = PdfExtractor.ToJson("document.pdf", pages: new List<int> { 0, 1, 2 });
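For longer ranges, the list can be generated instead of typed out. The sketch below assumes that the pages argument takes zero-based indexes, which the { 0, 1, 2 } example suggests but this page does not state explicitly; check the API reference before relying on it. PageRange is a hypothetical helper, not a library function.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: converts an inclusive 1-based page range (as shown
// in a PDF viewer) into a list of 0-based indexes.
// Assumption: the pages parameter expects 0-based indexes.
static List<int> PageRange(int firstPage, int lastPage)
{
    if (firstPage < 1 || lastPage < firstPage)
        throw new ArgumentOutOfRangeException(nameof(firstPage), "Expected 1 <= firstPage <= lastPage.");
    return Enumerable.Range(firstPage - 1, lastPage - firstPage + 1).ToList();
}

List<int> selected = PageRange(1, 3);
Console.WriteLine(string.Join(", ", selected)); // 0, 1, 2

// The list could then be passed as in the call above:
// string json = PdfExtractor.ToJson("document.pdf", pages: selected);
```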

Saving JSON output

Write the result to a .json file:

using System.IO;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");

File.WriteAllText("output.json", json, System.Text.Encoding.UTF8);

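Passing System.Text.Encoding.UTF8 makes the encoding explicit at the call site. In .NET, the two-argument File.WriteAllText overload also writes UTF-8 (without a byte order mark) rather than a platform-specific encoding, whereas Encoding.UTF8 prepends a byte order mark. Either form preserves non-Latin content such as accented letters, CJK characters, and symbols; choose one so that the output is consistent for downstream consumers.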
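Some JSON consumers mishandle a leading byte order mark (BOM), which Encoding.UTF8 emits when passed to File.WriteAllText. If that matters for your pipeline, one option is to pass a UTF8Encoding instance configured without a BOM. The sketch below uses a hand-written JSON string and a temporary file path as stand-ins so that it runs without a PDF; both are placeholders.

```csharp
using System;
using System.IO;
using System.Text;

// Stand-in for the string returned by ToJson(); includes non-Latin text.
string sampleJson = "[{\"text\": \"café 日本語\"}]";
string path = Path.Combine(Path.GetTempPath(), "pdf4llm-output.json");

// UTF8Encoding(false) writes UTF-8 without a byte order mark.
File.WriteAllText(path, sampleJson, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));

// Check the first bytes for the UTF-8 BOM (EF BB BF) and read the text back.
byte[] bytes = File.ReadAllBytes(path);
bool hasBom = bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
string roundTrip = File.ReadAllText(path, Encoding.UTF8);

Console.WriteLine($"BOM present: {hasBom}");                       // BOM present: False
Console.WriteLine($"Round trip intact: {roundTrip == sampleJson}"); // Round trip intact: True

File.Delete(path);
```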

Full example — building a custom text pipeline

using System;
using System.Linq;
using Newtonsoft.Json.Linq;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
JToken root = JToken.Parse(json);
JArray pages = root switch
{
    JArray arr => arr,
    JObject obj when obj["pages"] is JArray arr => arr,
    _ => throw new InvalidOperationException("Expected a JSON array or an object containing a 'pages' array.")
};

static object ParseSpanFlags(int flags) => new
{
    Superscript = (flags & 1) != 0,
    Italic = (flags & 2) != 0,
    Serifed = (flags & 4) != 0,
    Monospaced = (flags & 8) != 0,
    Bold = (flags & 16) != 0,
};

foreach (JObject page in pages)
{
    int pageNum = page["page_number"]!.Value<int>();
    Console.WriteLine($"\nPage {pageNum}");

    foreach (JObject box in (page["boxes"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
    {
        foreach (JObject line in (box["textlines"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
        {
            foreach (JObject span in (line["spans"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
            {
                string text = span["text"]?.Value<string>() ?? "";
                int flags = span["flags"]?.Value<int>() ?? 0;
                var styles = ParseSpanFlags(flags);

                Console.WriteLine(new { text, flags, styles });
            }
        }
    }
}
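As one possible next stage for such a pipeline, decoded flags can drive inline markup. The sketch below wraps bold and italic spans in Markdown-style markers and joins the spans of a line. ToInlineMarkup is a hypothetical helper, the spans reuse the text and flags values from the example interpretation above, and joining with a single space is an assumption about how adjacent spans should be separated.

```csharp
using System;
using System.Linq;
using Newtonsoft.Json.Linq;

// Hand-written stand-in for a line's "spans" array, reusing the text and
// flags values from the example interpretation section.
var sampleSpans = new JArray(
    new JObject { ["text"] = "Italic text.", ["flags"] = 6 },
    new JObject { ["text"] = "Hello World!", ["flags"] = 0 },
    new JObject { ["text"] = "This is bold", ["flags"] = 20 });

// Hypothetical helper: wraps a span's text in Markdown-style markers
// according to its bold (16) and italic (2) bits.
static string ToInlineMarkup(JObject span)
{
    string text = span.Value<string>("text") ?? "";
    int flags = span.Value<int>("flags"); // 0 when the field is missing
    if (string.IsNullOrWhiteSpace(text)) return text;

    bool bold = (flags & 16) != 0;
    bool italic = (flags & 2) != 0;
    if (bold && italic) return $"***{text}***";
    if (bold) return $"**{text}**";
    if (italic) return $"*{text}*";
    return text;
}

// Assumption: spans on a line are separated by a single space.
string lineText = string.Join(" ", sampleSpans.OfType<JObject>().Select(ToInlineMarkup));
Console.WriteLine(lineText);
// *Italic text.* Hello World! **This is bold**
```

Whitespace-only spans are returned unwrapped so that markers never surround empty text.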

For the full API signature, see the ToJson() API reference.

Next steps

JSON Schema

Full field descriptions for every object in the JSON output.

Extract Markdown

Preserve structure and formatting for LLM pipelines.

Extract Text

Get clean, plain text output.
