# Extract JSON

> Use [ToJson()](/dotnet/api/PdfExtractor#tojson) to get bounding boxes, layout data, and structured page content for custom pipelines.

<div id="apiIndicatorBadge">
  <div class="inner dotnet" />
</div>

## Overview

`ToJson()` returns document content as structured data rather than a Markdown string. Every text block, image, and table on each page is represented as a JSON object with positional metadata attached.

This makes it the right choice when you need to:

* Build a custom rendering or post-processing pipeline
* Access bounding box coordinates for text and image regions
* Detect headings, bold text, or other styled elements programmatically
* Pass structured layout data to a downstream model or search index

```csharp theme={null}
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
```

***

## Output structure

The return value is a JSON array — one object per processed page. See the [JSON schema guide](/dotnet/reference/JSON-schema) for a full field reference.
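
The code samples on this page tolerate two root shapes: a bare array of pages, or an object that wraps the array in a `pages` property. A minimal sketch of factoring that check into one reusable function follows. `GetPages` is a hypothetical helper, not part of PDF4LLM, and the inline sample JSON is hand-written for illustration; only the `page_number` and `boxes` names are taken from the examples on this page.

```csharp
using System;
using Newtonsoft.Json.Linq;

// Hand-written sample input for illustration; not real ToJson() output.
string sample = "[{\"page_number\": 1, \"boxes\": []}, {\"page_number\": 2, \"boxes\": []}]";

JArray pages = GetPages(sample);
Console.WriteLine($"Pages: {pages.Count}");

// Hypothetical helper: returns the page array whether the root is a bare
// array or an object that wraps the array in a "pages" property.
static JArray GetPages(string json)
{
    JToken root = JToken.Parse(json);
    return root switch
    {
        JArray arr => arr,
        JObject obj when obj["pages"] is JArray arr => arr,
        _ => throw new InvalidOperationException(
            "Expected a JSON array or an object containing a 'pages' array.")
    };
}
```

Keeping the shape check in one place means the rest of a pipeline can work with a `JArray` of pages without repeating the `switch`.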

***

## Working with bounding boxes

In the examples on this page, each layout box exposes its bounding rectangle through four coordinate fields: `x0`, `y0`, `x1`, and `y1`. Spans carry the same four values as a `bbox` array, `[x0, y0, x1, y1]`.

```csharp theme={null}
using System;
using System.Linq;
using Newtonsoft.Json.Linq;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
JToken root = JToken.Parse(json);
JArray pages = root switch
{
    JArray arr => arr,
    JObject obj when obj["pages"] is JArray arr => arr,
    _ => throw new InvalidOperationException("Expected a JSON array or an object containing a 'pages' array.")
};

foreach (JObject page in pages)
{
    int pageNum = page["page_number"]!.Value<int>();
    Console.WriteLine($"\nPage {pageNum}");

    foreach (JObject box in (page["boxes"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
    {
        float x0 = box["x0"]!.Value<float>();
        float y0 = box["y0"]!.Value<float>();
        float x1 = box["x1"]!.Value<float>();
        float y1 = box["y1"]!.Value<float>();
        string boxClass = box["boxclass"]?.Value<string>() ?? "unknown";

        Console.WriteLine($"{boxClass} at ({x0:F1}, {y0:F1}) -> ({x1:F1}, {y1:F1})");
    }
}
```
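
Once the coordinates are parsed, simple geometry covers most filtering tasks, such as keeping only the boxes inside a region of interest. A minimal sketch with plain values follows. `Rect`, `Contains`, and the sample coordinates are hypothetical, and the sketch assumes `x0 <= x1` and `y0 <= y1`, which this page does not confirm.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample rectangles standing in for parsed box coordinates (illustrative values).
var boxes = new List<Rect>
{
    new Rect(72f, 72f, 540f, 100f),    // falls outside the region below
    new Rect(72f, 120f, 300f, 400f),
    new Rect(310f, 120f, 540f, 400f),
};

// Region of interest: keep only boxes that lie entirely inside it.
var region = new Rect(0f, 110f, 612f, 792f);

foreach (Rect box in boxes.Where(b => region.Contains(b)))
    Console.WriteLine($"{box.Width:F1} x {box.Height:F1} at ({box.X0:F1}, {box.Y0:F1})");

// Hypothetical value type; assumes X0 <= X1 and Y0 <= Y1.
readonly record struct Rect(float X0, float Y0, float X1, float Y1)
{
    public float Width => X1 - X0;
    public float Height => Y1 - Y0;
    public float Area => Width * Height;

    // True when 'other' lies entirely within this rectangle (edges inclusive).
    public bool Contains(Rect other) =>
        other.X0 >= X0 && other.Y0 >= Y0 && other.X1 <= X1 && other.Y1 <= Y1;
}
```

The same type could be populated inside the loop above from the `x0`, `y0`, `x1`, and `y1` values of each box.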

***

## Extracting span-level data

Spans are the most granular unit in the JSON output. Each span represents a run of text sharing the same font, size, and style flags. This lets you identify headings, bold text, and other styled elements programmatically:

```csharp theme={null}
using System;
using System.Linq;
using Newtonsoft.Json.Linq;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
JToken root = JToken.Parse(json);
JArray pages = root switch
{
    JArray arr => arr,
    JObject obj when obj["pages"] is JArray arr => arr,
    _ => throw new InvalidOperationException("Expected a JSON array or an object containing a 'pages' array.")
};

foreach (JObject page in pages)
{
    foreach (JObject box in (page["boxes"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
    {
        foreach (JObject line in (box["textlines"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
        {
            foreach (JObject span in (line["spans"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
            {
                string text = span["text"]!.Value<string>()!;
                float size = span["size"]!.Value<float>();
                int flags = span["flags"]!.Value<int>();

                // Heuristic: 14 is an arbitrary cut-off; tune it for your documents.
                if (size >= 14)
                    Console.WriteLine($"Heading candidate: \"{text}\" (size {size})");

                if ((flags & 16) != 0)   // bold flag
                    Console.WriteLine($"Bold text: \"{text}\"");
            }
        }
    }
}
```
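
A fixed threshold such as `size >= 14` depends on each document's typography. One alternative heuristic, sketched here under assumptions, treats the most common span size as body text and flags anything noticeably larger. `FindBodySize`, the 1.2 ratio, and the sample values are hypothetical choices, not part of PDF4LLM.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// (text, size) pairs standing in for parsed spans; values are illustrative.
var spans = new List<(string Text, float Size)>
{
    ("Quarterly report", 18f),
    ("Revenue grew in the third quarter.", 11f),
    ("Costs were flat.", 11f),
    ("Outlook", 14f),
    ("We expect moderate growth.", 11f),
};

float bodySize = FindBodySize(spans.Select(s => s.Size));
const float ratio = 1.2f; // arbitrary cut-off; tune per document set

foreach (var span in spans.Where(s => s.Size >= bodySize * ratio))
    Console.WriteLine($"Heading candidate: \"{span.Text}\" (size {span.Size})");

// Hypothetical helper: the most frequent size, rounded to one decimal place
// so that tiny floating-point differences fall into the same bucket.
// Throws InvalidOperationException if 'sizes' is empty.
static float FindBodySize(IEnumerable<float> sizes) =>
    sizes.GroupBy(s => MathF.Round(s, 1))
         .OrderByDescending(g => g.Count())
         .ThenBy(g => g.Key) // tie-break toward the smaller size
         .First()
         .Key;
```

Counting spans is a crude measure of what counts as body text; weighting each size by the length of its text would be a natural refinement.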

### Font flags reference

The `flags` field is a bitmask encoding font properties:

| Bit | Value | Meaning         |
| --- | ----- | --------------- |
| 0   | `1`   | Superscript     |
| 1   | `2`   | Italic          |
| 2   | `4`   | Serifed font    |
| 3   | `8`   | Monospaced font |
| 4   | `16`  | Bold            |

#### Example interpretation

Consider the following span JSON:

```json theme={null}
"spans": [
  {
    "size": 12,
    "flags": 6,
    "font": "MinionPro-It",
    "text": "Italic text.",
    "bbox": [72, 435.9, 122.6, 444.6]
  },
  {
    "size": 12,
    "flags": 0,
    "font": "Arial",
    "text": "Hello World!",
    "bbox": [122.6, 436.3, 184.8, 444.6]
  },
  {
    "size": 12,
    "flags": 20,
    "font": "MinionPro-Bold",
    "text": "This is bold",
    "bbox": [187.5, 436.0, 246.0, 444.6]
  }
]
```

#### flags = 6

`flags = 6` on `"Italic text."` with font `MinionPro-It`

`6 = 2 + 4`

Consistent with italic + serifed text.

#### flags = 0

`flags = 0` on `"Hello World!"` with font `Arial`

`0` is consistent with regular (unstyled) text.

#### flags = 20

`flags = 20` on `"This is bold"` with font `MinionPro-Bold`

`20 = 16 + 4`

Consistent with bold + serifed text.

In plain English, the extracted styling is:

* `"Italic text."` → italic, serifed
* `"Hello World!"` → regular
* `"This is bold"` → bold, serifed
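
The bit arithmetic above maps naturally onto a C# `[Flags]` enum. A minimal sketch follows; `SpanStyle` is a hypothetical type defined here, not part of PDF4LLM, with values copied from the font flags table.

```csharp
using System;

// Decode the three flag values from the example above.
foreach (int raw in new[] { 6, 0, 20 })
{
    var style = (SpanStyle)raw;
    Console.WriteLine($"{raw} -> {style}");
    Console.WriteLine($"  bold: {style.HasFlag(SpanStyle.Bold)}, italic: {style.HasFlag(SpanStyle.Italic)}");
}

// Hypothetical enum mirroring the font flags table.
[Flags]
enum SpanStyle
{
    None = 0,
    Superscript = 1,
    Italic = 2,
    Serifed = 4,
    Monospaced = 8,
    Bold = 16,
}
```

With `[Flags]` applied, `ToString()` lists the names of the set bits, so `6` prints as `Italic, Serifed` and `20` as `Serifed, Bold`.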

***

## Page selection

As with [`ToMarkdown()`](/dotnet/api/PdfExtractor#tomarkdown), you can limit extraction to specific pages:

```csharp theme={null}
using System.Collections.Generic;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf", pages: new List<int> { 0, 1, 2 });
```
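
For longer runs of pages, the list can be generated rather than typed out. A small sketch using LINQ follows; it assumes only that `pages` accepts a `List<int>`, as in the call above, and builds the same list of values (0, 1, 2).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Enumerable.Range(start, count) yields 'count' consecutive integers from 'start'.
List<int> firstThree = Enumerable.Range(0, 3).ToList();

Console.WriteLine(string.Join(", ", firstThree));

// The list can then be passed as in the call above:
// string json = PdfExtractor.ToJson("document.pdf", pages: firstThree);
```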

***

## Saving JSON output

Write the result to a `.json` file:

```csharp theme={null}
using System.IO;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");

File.WriteAllText("output.json", json, System.Text.Encoding.UTF8);
```

<Tip>
  The two-argument `File.WriteAllText(path, contents)` overload already writes UTF-8, without a byte order mark (BOM), on every platform. Passing `System.Text.Encoding.UTF8` makes the encoding explicit but prepends a BOM, which some JSON parsers reject. If a downstream consumer needs BOM-free output, use the two-argument overload or pass `new System.Text.UTF8Encoding(false)`.
</Tip>
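
To see which bytes a given overload actually puts on disk, you can inspect the start of the file for the UTF-8 byte order mark (`EF BB BF`). A minimal sketch using only the .NET base class library follows; `StartsWithUtf8Bom` is a hypothetical helper, and the sample string stands in for real `ToJson()` output.

```csharp
using System;
using System.IO;
using System.Text;

string json = "[{\"text\": \"café 日本語\"}]"; // stand-in for ToJson() output
string path = Path.GetTempFileName();

try
{
    // Two-argument overload: UTF-8 without a BOM.
    File.WriteAllText(path, json);
    Console.WriteLine($"default overload, BOM: {StartsWithUtf8Bom(path)}");

    // Encoding.UTF8 emits a BOM at the start of the file.
    File.WriteAllText(path, json, Encoding.UTF8);
    Console.WriteLine($"Encoding.UTF8, BOM: {StartsWithUtf8Bom(path)}");

    // UTF8Encoding(false) is explicit about the encoding and omits the BOM.
    File.WriteAllText(path, json, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));
    Console.WriteLine($"UTF8Encoding(false), BOM: {StartsWithUtf8Bom(path)}");
}
finally
{
    File.Delete(path);
}

// Hypothetical helper: true when the file begins with EF BB BF.
static bool StartsWithUtf8Bom(string file)
{
    byte[] bytes = File.ReadAllBytes(file);
    return bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
}
```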

***

## Full example — building a custom text pipeline

```csharp theme={null}
using System;
using System.Linq;
using Newtonsoft.Json.Linq;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
JToken root = JToken.Parse(json);
JArray pages = root switch
{
    JArray arr => arr,
    JObject obj when obj["pages"] is JArray arr => arr,
    _ => throw new InvalidOperationException("Expected a JSON array or an object containing a 'pages' array.")
};

static object ParseSpanFlags(int flags) => new
{
    Superscript = (flags & 1) != 0,
    Italic = (flags & 2) != 0,
    Serifed = (flags & 4) != 0,
    Monospaced = (flags & 8) != 0,
    Bold = (flags & 16) != 0,
};

foreach (JObject page in pages)
{
    int pageNum = page["page_number"]!.Value<int>();
    Console.WriteLine($"\nPage {pageNum}");

    foreach (JObject box in (page["boxes"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
    {
        foreach (JObject line in (box["textlines"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
        {
            foreach (JObject span in (line["spans"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
            {
                string text = span["text"]?.Value<string>() ?? "";
                int flags = span["flags"]?.Value<int>() ?? 0;
                var styles = ParseSpanFlags(flags);

                Console.WriteLine(new { text, flags, styles });
            }
        }
    }
}
```
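
One thing a custom pipeline might do with the decoded flags is rebuild inline emphasis. The following is a sketch under assumptions: `ToInlineMarkdown` is a hypothetical helper, the spans are hard-coded stand-ins for parsed JSON, and the Markdown mapping is an illustrative choice rather than a description of how `ToMarkdown()` works.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// (text, flags) pairs standing in for the spans of one text line.
var line = new List<(string Text, int Flags)>
{
    ("Italic text.", 6),
    ("Hello World!", 0),
    ("This is bold", 20),
};

Console.WriteLine(string.Join(" ", line.Select(s => ToInlineMarkdown(s.Text, s.Flags))));

// Hypothetical helper: wraps text in Markdown emphasis based on the bold (16)
// and italic (2) bits; other bits are ignored.
static string ToInlineMarkdown(string text, int flags)
{
    bool bold = (flags & 16) != 0;
    bool italic = (flags & 2) != 0;

    if (string.IsNullOrWhiteSpace(text)) return text; // nothing to emphasize
    if (bold && italic) return $"***{text}***";
    if (bold) return $"**{text}**";
    if (italic) return $"*{text}*";
    return text;
}
```

For the three sample spans this prints `*Italic text.* Hello World! **This is bold**`.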

***

<Note>
  For the full API signature, see the [ToJson() API reference](/dotnet/api/PdfExtractor#tojson).
</Note>

***

## Next steps

<CardGroup cols={2}>
  <Card title="JSON Schema" icon="file-code" href="/dotnet/reference/JSON-schema">
    Full field descriptions for every object in the JSON output.
  </Card>

  <Card title="Extract Markdown" icon="markdown" href="/dotnet/guides/extract-Markdown">
    Preserve structure and formatting for LLM pipelines.
  </Card>

  <Card title="Extract Text" icon="text" href="/dotnet/guides/extract-Text">
    Get clean, plain text output.
  </Card>
</CardGroup>
