Overview

ToJson() returns a JSON string representing a single parsed PDF: its pages, layout boxes, text content, tables, images, and metadata. Deserialise it with your preferred library to traverse the hierarchy.
using Newtonsoft.Json.Linq;
using PDF4LLM;

string  json = PdfExtractor.ToJson("document.pdf");
JObject root = JObject.Parse(json);
This page documents every object and field in the output hierarchy.
Positional coordinates are in PDF points (1 point = 1/72 inch). The origin (0, 0) is the top-left corner of the page.
{
  "filename": "hello-world.pdf",
  "page_count": 2,
  "toc": [],
  "pages": [
    {
      "page_number": 1,
      "width": 595.2,
      "height": 841.92,
      "boxes": [
        {
          "x0": 72,
          "y0": 72,
          "x1": 334.47,
          "y1": 273.38,
          "boxclass": "picture",
          "image": "images/hello-world.pdf-0001-00.png",
          "table": null,
          "textlines": []
        },
        {
          "x0": 70.69,
          "y0": 295.88,
          "x1": 197.28,
          "y1": 304.63,
          "boxclass": "text",
          "image": null,
          "table": null,
          "textlines": [
            {
              "bbox": [70.69, 295.88, 197.28, 304.63],
              "spans": [
                {
                  "size": 12,
                  "flags": 0,
                  "font": "Arial",
                  "color": 0,
                  "alpha": 255,
                  "text": "Hello World!",
                  "origin": [70.69, 304.47],
                  "bbox": [70.69, 295.88, 136.09, 304.61],
                  "line": 0,
                  "block": 0,
                  "dir": [1, 0]
                },
                {
                  "size": 12,
                  "flags": 20,
                  "font": "MinionPro-Bold",
                  "color": 0,
                  "alpha": 255,
                  "text": "This is bold",
                  "origin": [138.83, 304.47],
                  "bbox": [138.83, 296.03, 197.28, 304.63],
                  "line": 0,
                  "block": 0,
                  "dir": [1, 0]
                }
              ]
            }
          ]
        }
      ],
      "full_ocred": false,
      "text_ocred": false,
      "fulltext": [...],
      "words": [],
      "links": []
    },
    {
      "page_number": 2,
      "width": 595.2,
      "height": 841.92,
      "boxes": [
        {
          "x0": 72,
          "y0": 72,
          "x1": 524,
          "y1": 118,
          "boxclass": "table",
          "image": null,
          "table": {
            "bbox": [71.15, 72.19, 523.22, 117.68],
            "row_count": 3,
            "col_count": 4,
            "cells": [
              [[71.15, 72.19, 184.6, 87.36], [184.6, 72.19, 297.16, 87.36], ...],
              ...
            ],
            "extract": [
              ["A",  "B",  "C",  "D" ],
              ["A1", "B1", "C1", "D1"],
              ["A2", "B2", "C2", "D2"]
            ],
            "markdown": "|A|B|C|D|\n|---|---|---|---|\n|A1|B1|C1|D1|\n|A2|B2|C2|D2|\n\n"
          },
          "textlines": null
        }
      ],
      "full_ocred": false,
      "text_ocred": false,
      "fulltext": [...],
      "words": [],
      "links": []
    }
  ],
  "metadata": {
    "format": "PDF 1.6",
    "title": "",
    "author": "",
    "subject": "",
    "keywords": "",
    "creator": "",
    "producer": "",
    "creationDate": "D:20240722172345Z",
    "modDate": "D:20260318153118Z",
    "trapped": "",
    "encryption": null
  }
}

Root object

The top-level object returned for every extraction.
{
  "filename": "hello-world.pdf",
  "page_count": 2,
  "toc": [],
  "pages": [...],
  "metadata": {...}
}
filename
string
The name of the source PDF file that was parsed.
page_count
number
Total number of pages in the PDF.
toc
array
Table of contents entries extracted from the PDF. Each entry is an array of [page_index, title, page_number]. Empty when the PDF has no bookmarks or outline.
pages
array
Array of page objects, one per page in the PDF.
metadata
object
PDF document metadata. See metadata object.

Accessing the root in C#

JObject root      = JObject.Parse(PdfExtractor.ToJson("document.pdf"));
string  filename  = root["filename"]!.Value<string>()!;
int     pageCount = root["page_count"]!.Value<int>();
JArray  pages     = (JArray)root["pages"]!;
JArray  toc       = (JArray)root["toc"]!;

Page object

Represents a single page of the PDF. Found in pages[].
{
  "page_number": 1,
  "width": 595.2,
  "height": 841.92,
  "boxes": [...],
  "fulltext": [...],
  "full_ocred": false,
  "text_ocred": false,
  "words": [],
  "links": []
}
page_number
number
1-based index of this page within the document.
width
number
Page width in PDF points. A standard A4 page is 595.28 pt wide.
height
number
Page height in PDF points. A standard A4 page is 841.89 pt tall.
boxes
array
Detected content regions on the page. Each entry is a box object. Boxes are classified as "text", "picture", or "table".
fulltext
array
Raw text blocks extracted directly from the PDF’s content stream, independent of the layout box structure. Each entry is a fulltext block. Reflects the logical reading order as encoded in the PDF’s internal stream.
full_ocred
boolean
true if the entire page was processed through OCR because no native text layer was found.
text_ocred
boolean
true if individual text regions on the page were OCR’d rather than extracted natively.
words
array
Word-level bounding boxes. Empty in the default output; populated when extractWords is enabled.
links
array
Hyperlinks found on the page. Empty when no links are present.

Iterating pages in C#

JObject root  = JObject.Parse(PdfExtractor.ToJson("document.pdf"));
JArray  pages = (JArray)root["pages"]!;

foreach (JObject page in pages)
{
    int    pageNumber = page["page_number"]!.Value<int>();
    double width      = page["width"]!.Value<double>();
    double height     = page["height"]!.Value<double>();
    bool   wasOcred   = page["full_ocred"]!.Value<bool>();

    Console.WriteLine($"Page {pageNumber} ({width}×{height}pt, OCR: {wasOcred})");
}
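
The sample output reports 595.2 by 841.92 pt, slightly off the nominal A4 dimensions quoted above, so an exact equality check against a standard size would fail. A minimal sketch of a tolerant comparison follows; the helper name `IsA4` and the 1 pt tolerance are assumptions made for illustration, not part of PDF4LLM.

```csharp
using System;

// Hypothetical helper (not a PDF4LLM API): compares page dimensions against
// nominal A4 (595.28 x 841.89 pt) within a tolerance, in either orientation.
static bool IsA4(double width, double height, double tolerancePt = 1.0)
{
    const double A4Short = 595.28;
    const double A4Long  = 841.89;

    // Sort the sides so portrait and landscape pages are treated alike.
    double shortSide = Math.Min(width, height);
    double longSide  = Math.Max(width, height);

    return Math.Abs(shortSide - A4Short) <= tolerancePt
        && Math.Abs(longSide  - A4Long)  <= tolerancePt;
}

// Dimensions taken from the sample output on this page.
Console.WriteLine(IsA4(595.2, 841.92));   // True
Console.WriteLine(IsA4(612, 792));        // False (US Letter)
```

In the page loop above, the `width` and `height` values read from each page object can be passed straight to such a check.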

Box object

A detected content region on a page. Found in pages[].boxes[]. Boxes are the primary layout unit. Each box covers a rectangular area and is classified into one of three types: "text", "picture", or "table". Which fields are populated depends on boxclass.
{
  "x0": 72,
  "y0": 72,
  "x1": 334.47,
  "y1": 273.38,
  "boxclass": "picture",
  "image": "images/hello-world.pdf-0001-00.png",
  "table": null,
  "textlines": []
}
{
  "x0": 70.69,
  "y0": 295.88,
  "x1": 197.28,
  "y1": 304.63,
  "boxclass": "text",
  "image": null,
  "table": null,
  "textlines": [...]
}
{
  "x0": 72,
  "y0": 72,
  "x1": 524,
  "y1": 118,
  "boxclass": "table",
  "image": null,
  "table": {...},
  "textlines": null
}
x0
number
Left edge of the box in PDF points, measured from the left of the page.
y0
number
Top edge of the box in PDF points, measured from the top of the page.
x1
number
Right edge of the box in PDF points.
y1
number
Bottom edge of the box in PDF points.
boxclass
string
Classification of the content region. One of:
  • "text" — contains text lines and spans
  • "picture" — contains an embedded image or graphic
  • "table" — contains a detected table structure
image
string | null
Relative path to the extracted image file when boxclass is "picture". null for all other box types.
table
object | null
A table object when boxclass is "table". null for all other box types.
textlines
array | null
Array of textline objects when boxclass is "text". Empty array [] for picture boxes. null for table boxes.

Iterating boxes by type in C#

// `pages` is the JArray obtained in the previous example: (JArray)root["pages"]!
foreach (JObject page in pages)
{
    foreach (JObject box in page["boxes"]!)
    {
        string boxclass = box["boxclass"]!.Value<string>()!;

        switch (boxclass)
        {
            case "text":
                foreach (JObject line in box["textlines"]!)
                    foreach (JObject span in line["spans"]!)
                        Console.WriteLine(span["text"]!.Value<string>());
                break;

            case "picture":
                string? imagePath = box["image"]?.Value<string>();
                Console.WriteLine($"Image: {imagePath}");
                break;

            case "table":
                var rows = box["table"]!["extract"]!
                    .ToObject<List<List<string>>>()!;
                Console.WriteLine($"Table: {rows.Count} rows");
                break;
        }
    }
}
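
Because box coordinates are in PDF points with a top-left origin, mapping a box onto a rendered page image is a uniform scale by dpi / 72. The sketch below shows that conversion. `ToPixelRect` is a hypothetical helper, and it assumes the page image was rendered at the given DPI with the same top-left origin and with no rotation or crop offset.

```csharp
using System;

// Hypothetical helper (not a PDF4LLM API): scales a box from PDF points to
// pixel coordinates. Assumes the page image was rendered at `dpi` with the
// same top-left origin and no rotation or crop offset.
static (int Left, int Top, int Width, int Height) ToPixelRect(
    double x0, double y0, double x1, double y1, double dpi)
{
    double scale = dpi / 72.0;   // 1 pt = 1/72 inch

    // Round outwards so the pixel rectangle fully covers the original box.
    int left   = (int)Math.Floor(x0 * scale);
    int top    = (int)Math.Floor(y0 * scale);
    int right  = (int)Math.Ceiling(x1 * scale);
    int bottom = (int)Math.Ceiling(y1 * scale);

    return (left, top, right - left, bottom - top);
}

// Picture box from the sample output, mapped onto a 144 DPI render.
var rect = ToPixelRect(72, 72, 334.47, 273.38, dpi: 144);
Console.WriteLine(rect);   // (144, 144, 525, 403)
```

The same function applies to any `[x0, y0, x1, y1]` array in the output, such as table cell or span bounding boxes.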

Table object

Structured data for a detected table. Found in boxes[].table when boxclass is "table".
{
  "bbox": [71.15, 72.19, 523.22, 117.68],
  "row_count": 3,
  "col_count": 4,
  "cells": [
    [[71.15, 72.19, 184.6, 87.36], [184.6, 72.19, 297.16, 87.36], ...],
    ...
  ],
  "extract": [
    ["A",  "B",  "C",  "D" ],
    ["A1", "B1", "C1", "D1"],
    ["A2", "B2", "C2", "D2"]
  ],
  "markdown": "|A|B|C|D|\n|---|---|---|---|\n|A1|B1|C1|D1|\n|A2|B2|C2|D2|\n\n"
}
bbox
number[4]
Bounding box of the entire table as [x0, y0, x1, y1] in PDF points.
row_count
number
Number of rows in the table, including any header row.
col_count
number
Number of columns in the table.
cells
array
A 3D array of cell bounding boxes. cells[row][col] gives [x0, y0, x1, y1] for that cell in PDF points. Useful for mapping extracted text back to exact positions on the page.
extract
array
A 2D array of cell text values. extract[row][col] gives the string content of that cell. The first row is typically the header row.
markdown
string
The table pre-rendered as a Markdown pipe table string, ready for display or further processing.

Accessing table data in C#

// `box` is a box object whose boxclass is "table" (see the loop in the Box object section).
JObject tableObj = (JObject)box["table"]!;
int     rowCount = tableObj["row_count"]!.Value<int>();
int     colCount = tableObj["col_count"]!.Value<int>();

var rows = tableObj["extract"]!.ToObject<List<List<string>>>()!;

// Header row
Console.WriteLine(string.Join(" | ", rows[0]));

// Data rows (Skip requires `using System.Linq;`)
foreach (var row in rows.Skip(1))
    Console.WriteLine(string.Join(" | ", row));

// Pre-rendered Markdown
string md = tableObj["markdown"]!.Value<string>()!;
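
When downstream code needs to look cells up by column name rather than by index, the `extract` array can be reshaped into one dictionary per data row. The following is a sketch only: `ToRecords` is a hypothetical helper, it assumes the first row is the header (which the schema describes as typical, not guaranteed), and duplicate header names overwrite one another.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json.Linq;

// Hypothetical helper (not a PDF4LLM API): turns table["extract"] into one
// dictionary per data row, keyed by the header row. Assumes row 0 is the
// header; a null header cell falls back to "col<N>", and duplicate header
// names overwrite earlier values.
static List<Dictionary<string, string?>> ToRecords(JObject table)
{
    var rows = table["extract"]!.ToObject<List<List<string?>>>()!;
    var records = new List<Dictionary<string, string?>>();
    if (rows.Count == 0) return records;

    List<string?> header = rows[0];
    foreach (var row in rows.Skip(1))
    {
        var record = new Dictionary<string, string?>();
        for (int col = 0; col < header.Count && col < row.Count; col++)
            record[header[col] ?? $"col{col}"] = row[col];
        records.Add(record);
    }
    return records;
}

// Inline sample mirroring the table object above, so the sketch runs
// without a PDF.
var sample = JObject.Parse(@"{
  ""extract"": [
    [""A"",  ""B"",  ""C"",  ""D""],
    [""A1"", ""B1"", ""C1"", ""D1""],
    [""A2"", ""B2"", ""C2"", ""D2""]
  ]
}");

var records = ToRecords(sample);
Console.WriteLine(records[1]["C"]);   // C2
```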

Textline object

A single line of text within a text box. Found in boxes[].textlines[].
{
  "bbox": [70.69, 295.88, 197.28, 304.63],
  "spans": [...]
}
bbox
number[4]
Bounding box of this text line as [x0, y0, x1, y1] in PDF points.
spans
array
Array of span objects. A single line is typically split into multiple spans wherever the font, size, or style changes.
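
Since one visual line may arrive as several spans, recovering the line's text means joining the span texts back together. A minimal sketch follows. `LineText` is a hypothetical helper, and joining with a single space is an assumption: it suits the sample on this page but would insert a stray space if a style change falls in the middle of a word.

```csharp
using System;
using System.Linq;
using Newtonsoft.Json.Linq;

// Hypothetical helper (not a PDF4LLM API): joins the text of every span in a
// textline. Assumes spans are in reading order and that a single space is an
// acceptable separator between adjacent spans.
static string LineText(JObject line) =>
    string.Join(" ", line["spans"]!
        .Select(span => (span["text"]?.Value<string>() ?? "").Trim())
        .Where(text => text.Length > 0));

// Inline sample trimmed to the fields this sketch reads.
var line = JObject.Parse(@"{
  ""bbox"": [70.69, 295.88, 197.28, 304.63],
  ""spans"": [
    { ""text"": ""Hello World!"", ""font"": ""Arial"" },
    { ""text"": ""This is bold"", ""font"": ""MinionPro-Bold"" }
  ]
}");

Console.WriteLine(LineText(line));   // Hello World! This is bold
```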

Span object

The smallest unit of text, sharing a single consistent style. Found in textlines[].spans[] and fulltext[].lines[].spans[]. A span break occurs at any change of font, size, weight, colour, or style — so a line reading “Hello World! This is bold” produces two separate spans. See Font Flags Reference for how to interpret the flags field.
{
  "size": 12,
  "flags": 0,
  "bidi": 0,
  "char_flags": 16,
  "font": "Arial",
  "color": 0,
  "alpha": 255,
  "ascender": 0.8,
  "descender": -0.2,
  "text": "Hello World!",
  "origin": [70.69, 304.47],
  "bbox": [70.69, 295.88, 136.09, 304.61],
  "line": 0,
  "block": 0,
  "dir": [1, 0]
}
{
  "size": 12,
  "flags": 20,
  "bidi": 0,
  "char_flags": 24,
  "font": "MinionPro-Bold",
  "color": 0,
  "alpha": 255,
  "ascender": 0.8,
  "descender": -0.2,
  "text": "This is bold",
  "origin": [138.83, 304.47],
  "bbox": [138.83, 296.03, 197.28, 304.63],
  "line": 0,
  "block": 0,
  "dir": [1, 0]
}
text
string
The actual text content of this span.
font
string
Full PostScript font name, e.g. "Arial", "MinionPro-Bold", "Aptos". The font name often encodes weight and style — e.g. -Bold, -It.
size
number
Font size in points.
flags
number
Bitmask of font style flags. Common values:
  • 0 — regular
  • 4 — serifed font (bit 2)
  • 16 — bold (bit 4)
  • 20 — bold + serifed (bits 2 and 4)
See Font Flags Reference for the full bitmask table.
char_flags
number
Additional character-level flags from the MuPDF structured text API. Refer to the MuPDF structured-text header for the enumeration.
color
number
Text colour as a packed RGB integer. 0 is black (#000000). Decode with: r = (color >> 16) & 0xFF, g = (color >> 8) & 0xFF, b = color & 0xFF.
alpha
number
Opacity of the text from 0 (fully transparent) to 255 (fully opaque).
ascender
number
Font ascender as a fraction of the font size. Typically 0.8, meaning the ascender reaches 80% of the em above the baseline.
descender
number
Font descender as a fraction of the font size. Typically -0.2, meaning the descender extends 20% of the em below the baseline.
bbox
number[4]
Tight bounding box of the rendered glyphs as [x0, y0, x1, y1] in PDF points.
origin
number[2]
The text origin point [x, y] — the position of the baseline at the start of the span, in PDF points.
bidi
number
Unicode bidirectional level. 0 for standard left-to-right text.
line
number
Index of the line this span belongs to within its parent block.
block
number
Index of the block this span belongs to within the page’s content stream.
dir
number[2]
Text direction as a unit vector [x, y]. [1, 0] is standard left-to-right horizontal text. [0, -1] indicates top-to-bottom vertical text.

Reading span data in C#

// `line` is a textline object taken from box["textlines"].
foreach (JObject span in line["spans"]!)
{
    string text  = span["text"]!.Value<string>()!;
    float  size  = span["size"]!.Value<float>();
    int    flags = span["flags"]!.Value<int>();
    string font  = span["font"]!.Value<string>()!;

    bool isBold    = (flags & 16) != 0;
    bool isSerifed = (flags & 4)  != 0;

    var bbox = span["bbox"]!.ToObject<float[]>()!;
    Console.WriteLine($"{text} ({font}, {size}pt, bold:{isBold}) @ [{string.Join(", ", bbox)}]");
}
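
The bit layout given for the color field can be turned into a CSS-style hex string with a few shifts and masks. This sketch applies exactly the decoding formula from the field description; `ToHexColor` is a hypothetical helper name, and the second sample value is made up for illustration.

```csharp
using System;

// Hypothetical helper (not a PDF4LLM API): unpacks a span's packed RGB
// integer into "#RRGGBB", using the bit layout from the color field
// description (red in bits 16-23, green in bits 8-15, blue in bits 0-7).
static string ToHexColor(int color)
{
    int r = (color >> 16) & 0xFF;
    int g = (color >> 8)  & 0xFF;
    int b =  color        & 0xFF;
    return $"#{r:X2}{g:X2}{b:X2}";
}

Console.WriteLine(ToHexColor(0));          // #000000 (black, as in the sample spans)
Console.WriteLine(ToHexColor(0xFF8000));   // #FF8000 (made-up orange)
```

In the span loop above, the input would come from `span["color"]!.Value<int>()`.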

Fulltext block

A raw text block from the PDF content stream, independent of visual layout. Found in pages[].fulltext[]. The fulltext array captures text in the order it appears in the PDF’s internal stream, which may differ from the visual reading order delivered by boxes. Each block contains one or more lines, and each line contains spans.
{
  "type": 0,
  "number": 0,
  "flags": 0,
  "bbox": [70.69, 295.88, 197.28, 304.63],
  "lines": [
    {
      "spans": [...],
      "wmode": 0,
      "dir": [1, 0],
      "bbox": [70.69, 295.88, 197.28, 304.63]
    }
  ]
}
type
number
Block type. 0 indicates a text block.
number
number
Sequential index of this block within the page’s content stream.
flags
number
Block-level flags. 0 for standard text blocks.
bbox
number[4]
Bounding box of the entire block as [x0, y0, x1, y1] in PDF points.
lines
array
Array of line objects within this block. Each line contains:
  • spans — array of span objects
  • wmode — writing mode (0 = horizontal, 1 = vertical)
  • dir — line direction vector, e.g. [1, 0] for left-to-right
  • bbox — bounding box of the line as [x0, y0, x1, y1]
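
The block, line, and span nesting described above can be flattened into plain text with three small steps. This is a sketch under stated assumptions: the helper names are hypothetical, only blocks with type 0 are kept, every text block is assumed to carry a lines array, and spans are joined with a single space. The sample data is made up and trimmed to the fields the code reads, including a non-text block added only to exercise the filter.

```csharp
using System;
using System.Linq;
using Newtonsoft.Json.Linq;

// Hypothetical helpers (not PDF4LLM APIs).

// One output line per line object; spans joined with a single space.
static string LineToText(JToken line) =>
    string.Join(" ", line["spans"]!
        .Select(span => (span["text"]?.Value<string>() ?? "").Trim()));

// Lines of a block separated by newlines.
static string BlockToText(JToken block) =>
    string.Join("\n", block["lines"]!.Select(line => LineToText(line)));

// Text blocks (type 0) only, separated by a blank line.
static string FulltextToPlainText(JArray fulltext) =>
    string.Join("\n\n", fulltext
        .Where(block => block["type"]?.Value<int>() == 0)
        .Select(block => BlockToText(block)));

// Made-up sample, trimmed to the fields used here.
var fulltext = JArray.Parse(@"[
  { ""type"": 0, ""number"": 0, ""lines"": [
      { ""spans"": [ { ""text"": ""Hello World!"" }, { ""text"": ""This is bold"" } ] }
  ] },
  { ""type"": 1, ""number"": 1 },
  { ""type"": 0, ""number"": 2, ""lines"": [
      { ""spans"": [ { ""text"": ""Second block"" } ] }
  ] }
]");

Console.WriteLine(FulltextToPlainText(fulltext));
```

The result follows content-stream order, so it may differ from text gathered by walking boxes in layout order.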

Metadata object

PDF document-level metadata. Found at the root as metadata.
{
  "format": "PDF 1.6",
  "title": "",
  "author": "",
  "subject": "",
  "keywords": "",
  "creator": "",
  "producer": "",
  "creationDate": "D:20240722172345Z",
  "modDate": "D:20260318153118Z",
  "trapped": "",
  "encryption": null
}
format
string
PDF version string, e.g. "PDF 1.4" or "PDF 1.6".
title
string
Document title as set in the PDF’s document properties. Empty string if not set.
author
string
Document author. Empty string if not set.
subject
string
Document subject. Empty string if not set.
keywords
string
Keywords associated with the document. Empty string if not set.
creator
string
The application that originally created the document before any PDF conversion, e.g. "Microsoft Word". Empty string if not set.
producer
string
The application that produced or last saved the PDF file, e.g. "macOS Quartz PDFContext". Empty string if not set.
creationDate
string
Creation timestamp in PDF date format: D:YYYYMMDDHHmmSSOHH'mm'. Example: "D:20240722172345Z" = 22 July 2024, 17:23:45 UTC.
modDate
string
Last modification timestamp in the same PDF date format.
trapped
string
PDF trapping status. Rarely set in practice; empty string if not applicable.
encryption
string | null
Encryption details if the PDF is encrypted. null for unencrypted documents.

Reading metadata in C#

JObject root     = JObject.Parse(PdfExtractor.ToJson("document.pdf"));
JObject meta     = (JObject)root["metadata"]!;

string format    = meta["format"]!.Value<string>()!;
string title     = meta["title"]!.Value<string>()!;
string author    = meta["author"]!.Value<string>()!;
string created   = meta["creationDate"]!.Value<string>()!;

Console.WriteLine($"{title} by {author} ({format}, created {created})");
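
The creationDate and modDate strings are not in a format that the standard .NET date parsers accept directly, so they need a small parser. The sketch below handles the D:YYYYMMDDHHmmSSOHH'mm' layout described above. `ParsePdfDate` is a hypothetical helper, and treating a missing offset as UTC is an assumption made here for simplicity.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper (not a PDF4LLM API): parses D:YYYYMMDDHHmmSSOHH'mm'.
// Fields after the year are optional. Assumption: a missing offset is
// treated as UTC. Returns null for empty or malformed input.
static DateTimeOffset? ParsePdfDate(string? value)
{
    if (string.IsNullOrEmpty(value)) return null;

    var m = Regex.Match(value,
        @"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?(?:([Zz+\-])(?:(\d{2})'?(\d{2})?'?)?)?$");
    if (!m.Success) return null;

    // Reads a captured numeric group, or a default when the field is absent.
    int Part(int group, int fallback) =>
        m.Groups[group].Success
            ? int.Parse(m.Groups[group].Value, CultureInfo.InvariantCulture)
            : fallback;

    var offset = TimeSpan.Zero;
    string sign = m.Groups[7].Value;
    if (sign == "+" || sign == "-")
    {
        offset = new TimeSpan(Part(8, 0), Part(9, 0), 0);
        if (sign == "-") offset = offset.Negate();
    }

    try
    {
        return new DateTimeOffset(Part(1, 1), Part(2, 1), Part(3, 1),
                                  Part(4, 0), Part(5, 0), Part(6, 0), offset);
    }
    catch (ArgumentException)
    {
        return null;   // out-of-range field, e.g. month 13
    }
}

// Value taken from the sample metadata on this page.
DateTimeOffset? parsedCreated = ParsePdfDate("D:20240722172345Z");
Console.WriteLine(parsedCreated?.ToString("o"));   // 2024-07-22T17:23:45.0000000+00:00
```

Returning null rather than throwing keeps the helper usable on documents where the date fields are empty strings.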

See also

Chunk schema

Schema for pageChunks: true output from ToMarkdown().

Extract JSON guide

Working walkthrough with filtering and pipeline examples.

ToJson()

Full API reference for ToJson().

Tables guide

Extracting and working with table blocks.