Overview
ToJson() returns a JSON string representing a single parsed PDF — its pages, layout boxes, text content, tables, images, and metadata. Deserialise it with your preferred library to traverse the hierarchy.
Positional coordinates are in PDF points (1 point = 1/72 inch). The origin (0, 0) is the top-left corner of the page.
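Because every coordinate is expressed in points, converting to physical units or to pixels at a given resolution is plain arithmetic. The helper below is a minimal sketch; `PdfPoints` and its method names are hypothetical and not part of the library.

```csharp
using System;

public static class PdfPoints
{
    // 1 point = 1/72 inch, as stated above.
    public const double PointsPerInch = 72.0;

    public static double ToInches(double points) => points / PointsPerInch;

    public static double ToMillimetres(double points) => points / PointsPerInch * 25.4;

    // Pixel position at a given rendering resolution (dots per inch).
    public static int ToPixels(double points, double dpi) =>
        (int)Math.Round(points * dpi / PointsPerInch);
}
```

Since the origin is the top-left corner, y values grow downwards, which matches most raster image conventions, so no axis flip is needed when mapping to pixels.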
Root object
The top-level object returned for every extraction.
The name of the source PDF file that was parsed.
Total number of pages in the PDF.
Table of contents entries extracted from the PDF. Each entry is an array of [page_index, title, page_number]. Empty when the PDF has no bookmarks or outline.
Array of page objects, one per page in the PDF.
PDF document metadata. See metadata object.
Accessing the root in C#
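A minimal sketch using System.Text.Json, which is one possible choice of deserialiser. It assumes `json` holds the string returned by ToJson() and touches only the `pages` and `metadata` properties named on this page. `RootReader` and its methods are hypothetical helpers.

```csharp
using System.Text.Json;

public static class RootReader
{
    // Counts the entries of the root-level "pages" array; 0 if it is missing.
    public static int CountPages(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;
        if (root.ValueKind == JsonValueKind.Object
            && root.TryGetProperty("pages", out JsonElement pages)
            && pages.ValueKind == JsonValueKind.Array)
        {
            return pages.GetArrayLength();
        }
        return 0;
    }

    // True when the root carries a "metadata" object.
    public static bool HasMetadata(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;
        return root.ValueKind == JsonValueKind.Object
            && root.TryGetProperty("metadata", out JsonElement meta)
            && meta.ValueKind == JsonValueKind.Object;
    }
}
```

Using TryGetProperty rather than GetProperty keeps the sketch tolerant of missing keys instead of throwing.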
Page object
Represents a single page of the PDF. Found in pages[].
1-based index of this page within the document.
Page width in PDF points. A standard A4 page is 595.28 pt wide.
Page height in PDF points. A standard A4 page is 841.89 pt tall.
Detected content regions on the page. Each entry is a box object. Boxes are classified as "text", "picture", or "table".
Raw text blocks extracted directly from the PDF's content stream, independent of the layout box structure. Each entry is a fulltext block. The order follows the PDF's internal stream, which may differ from the visual reading order.
true if the entire page was processed through OCR because no native text layer was found.
true if individual text regions on the page were OCR'd rather than extracted natively.
Word-level bounding boxes. Empty in the default output; populated when extractWords is enabled.
Hyperlinks found on the page. Empty when no links are present.
Iterating pages in C#
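A minimal sketch, again assuming System.Text.Json and a `json` string from ToJson(). It walks `pages[]` and reports how many `boxes` and `fulltext` entries each page holds. `PageWalker` is a hypothetical helper, and `Position` is the zero-based position in the array, not the 1-based index field described above.

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class PageWalker
{
    // Length of the named array property, or 0 when absent or not an array.
    private static int ArrayLength(JsonElement obj, string name)
    {
        if (obj.ValueKind == JsonValueKind.Object
            && obj.TryGetProperty(name, out JsonElement value)
            && value.ValueKind == JsonValueKind.Array)
        {
            return value.GetArrayLength();
        }
        return 0;
    }

    public static List<(int Position, int Boxes, int FulltextBlocks)> Summarise(string json)
    {
        var summary = new List<(int Position, int Boxes, int FulltextBlocks)>();
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;
        if (root.ValueKind != JsonValueKind.Object
            || !root.TryGetProperty("pages", out JsonElement pages)
            || pages.ValueKind != JsonValueKind.Array)
        {
            return summary; // no pages array: nothing to report
        }

        int position = 0;
        foreach (JsonElement page in pages.EnumerateArray())
        {
            summary.Add((position, ArrayLength(page, "boxes"), ArrayLength(page, "fulltext")));
            position++;
        }
        return summary;
    }
}
```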
Box object
A detected content region on a page. Found in pages[].boxes[].
Boxes are the primary layout unit. Each box covers a rectangular area and is classified into one of three types: "text", "picture", or "table". Which fields are populated depends on boxclass.
Left edge of the box in PDF points, measured from the left of the page.
Top edge of the box in PDF points, measured from the top of the page.
Right edge of the box in PDF points.
Bottom edge of the box in PDF points.
Classification of the content region. One of:
"text": contains text lines and spans
"picture": contains an embedded image or graphic
"table": contains a detected table structure
Relative path to the extracted image file when boxclass is "picture". null for all other box types.
A table object when boxclass is "table". null for all other box types.
Array of textline objects when boxclass is "text". Empty array [] for picture boxes. null for table boxes.
Iterating boxes by type in C#
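A minimal sketch, assuming System.Text.Json and a `json` string from ToJson(). It visits every entry of `pages[].boxes[]` and tallies them by `boxclass`. `BoxCounter` is a hypothetical helper. The same loop is a natural place to branch on the class and read `textlines` or `table` as appropriate.

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class BoxCounter
{
    // Returns a count of boxes per boxclass value ("text", "picture", "table").
    public static Dictionary<string, int> CountByClass(string json)
    {
        var counts = new Dictionary<string, int>();
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;
        if (root.ValueKind != JsonValueKind.Object
            || !root.TryGetProperty("pages", out JsonElement pages)
            || pages.ValueKind != JsonValueKind.Array)
        {
            return counts;
        }

        foreach (JsonElement page in pages.EnumerateArray())
        {
            if (page.ValueKind != JsonValueKind.Object
                || !page.TryGetProperty("boxes", out JsonElement boxes)
                || boxes.ValueKind != JsonValueKind.Array)
            {
                continue; // page without a boxes array
            }

            foreach (JsonElement box in boxes.EnumerateArray())
            {
                if (box.ValueKind != JsonValueKind.Object
                    || !box.TryGetProperty("boxclass", out JsonElement cls)
                    || cls.ValueKind != JsonValueKind.String)
                {
                    continue; // box without a string boxclass
                }

                string name = cls.GetString() ?? "";
                counts[name] = counts.TryGetValue(name, out int seen) ? seen + 1 : 1;
            }
        }
        return counts;
    }
}
```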
Table object
Structured data for a detected table. Found in boxes[].table when boxclass is "table".
Bounding box of the entire table as [x0, y0, x1, y1] in PDF points.
Number of rows in the table, including any header row.
Number of columns in the table.
A 3D array of cell bounding boxes. cells[row][col] gives [x0, y0, x1, y1] for that cell in PDF points. Useful for mapping extracted text back to exact positions on the page.
A 2D array of cell text values. extract[row][col] gives the string content of that cell. The first row is typically the header row.
The table pre-rendered as a Markdown pipe table string, ready for display or further processing.
Accessing table data in C#
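A minimal sketch, assuming System.Text.Json. It takes the JSON of a single table object (the value found at `boxes[].table`) and copies its `extract` array into rows of strings. `TableReader` is a hypothetical helper. Treating non-string cells as empty strings is an assumption made here for robustness, not documented behaviour.

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class TableReader
{
    // Reads the 2D "extract" array of a table object into rows of strings.
    public static List<List<string>> ReadRows(string tableJson)
    {
        var rows = new List<List<string>>();
        using JsonDocument doc = JsonDocument.Parse(tableJson);
        JsonElement table = doc.RootElement;
        if (table.ValueKind != JsonValueKind.Object
            || !table.TryGetProperty("extract", out JsonElement extract)
            || extract.ValueKind != JsonValueKind.Array)
        {
            return rows;
        }

        foreach (JsonElement row in extract.EnumerateArray())
        {
            var cells = new List<string>();
            if (row.ValueKind == JsonValueKind.Array)
            {
                foreach (JsonElement cell in row.EnumerateArray())
                {
                    // Non-string cells (for example null) become empty strings.
                    cells.Add(cell.ValueKind == JsonValueKind.String ? (cell.GetString() ?? "") : "");
                }
            }
            rows.Add(cells);
        }
        return rows;
    }
}
```

Because the first row is typically the header, `rows[0]` can serve as column names and `rows[1..]` as data. The `cells` array can be read with the same nested loops when positions are needed.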
Textline object
A single line of text within a text box. Found in boxes[].textlines[].
Bounding box of this text line as [x0, y0, x1, y1] in PDF points.
Array of span objects. A single line is typically split into multiple spans wherever the font, size, or style changes.
Span object
The smallest unit of text, sharing a single consistent style. Found in textlines[].spans[] and fulltext[].lines[].spans[].
A span break occurs at any change of font, size, weight, colour, or style. A line reading "Hello World! This is bold", in which only "This is bold" is set in bold, therefore produces two separate spans. See Font Flags Reference for how to interpret the flags field.
The actual text content of this span.
Full PostScript font name, e.g. "Arial", "MinionPro-Bold", "Aptos". The font name often encodes weight and style, e.g. -Bold, -It.
Font size in points.
Bitmask of font style flags from the PDF spec. Common values:
0: regular
4: serifed font (bit 2)
16: bold (bit 4)
20: bold + serifed (bits 2 and 4)
Additional character-level flags from the MuPDF structured text API. Refer to the MuPDF structured-text header for the enumeration.
Text colour as a packed RGB integer. 0 is black (#000000). Decode with: r = (color >> 16) & 0xFF, g = (color >> 8) & 0xFF, b = color & 0xFF.
Opacity of the text from 0 (fully transparent) to 255 (fully opaque).
Font ascender as a fraction of the font size. Typically 0.8, meaning the ascender reaches 80% of the em above the baseline.
Font descender as a fraction of the font size. Typically -0.2, meaning the descender extends 20% of the em below the baseline.
Tight bounding box of the rendered glyphs as [x0, y0, x1, y1] in PDF points.
The text origin point [x, y]: the position of the baseline at the start of the span, in PDF points.
Unicode bidirectional level. 0 for standard left-to-right text.
Index of the line this span belongs to within its parent block.
Index of the block this span belongs to within the page's content stream.
Text direction as a unit vector [x, y]. [1, 0] is standard left-to-right horizontal text. [0, -1] indicates top-to-bottom vertical text.
Reading span data in C#
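A minimal sketch of interpreting two span values once they have been read as integers: the style bitmask (`flags`) and the packed RGB colour. The bit positions and the shift-and-mask decoding follow the descriptions above. `SpanStyle` and its members are hypothetical names.

```csharp
public static class SpanStyle
{
    public const int SerifedBit = 4;   // bit 2
    public const int BoldBit = 16;     // bit 4

    public static bool IsSerifed(int flags) => (flags & SerifedBit) != 0;

    public static bool IsBold(int flags) => (flags & BoldBit) != 0;

    // Splits a packed RGB integer into its three 8-bit channels.
    public static (int R, int G, int B) DecodeColour(int colour) =>
        ((colour >> 16) & 0xFF, (colour >> 8) & 0xFF, colour & 0xFF);

    // Formats the colour as a #RRGGBB hex string.
    public static string ToHex(int colour)
    {
        var (r, g, b) = DecodeColour(colour);
        return $"#{r:X2}{g:X2}{b:X2}";
    }
}
```

Testing individual bits rather than comparing against the listed values keeps the check correct when other bits are also set, as in the combined value 20.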
Fulltext block
A raw text block from the PDF content stream, independent of visual layout. Found in pages[].fulltext[].
The fulltext array captures text in the order it appears in the PDF’s internal stream, which may differ from the visual reading order delivered by boxes. Each block contains one or more lines, and each line contains spans.
Block type from the PDF spec. 0 indicates a text block.
Sequential index of this block within the page's content stream.
Block-level flags. 0 for standard text blocks.
Bounding box of the entire block as [x0, y0, x1, y1] in PDF points.
Array of line objects within this block. Each line contains:
spans: array of span objects
wmode: writing mode (0 = horizontal, 1 = vertical)
dir: line direction vector, e.g. [1, 0] for left-to-right
bbox: bounding box of the line as [x0, y0, x1, y1]
Metadata object
PDF document-level metadata. Found at the root as metadata.
PDF version string, e.g. "PDF 1.4" or "PDF 1.6".
Document title as set in the PDF's document properties. Empty string if not set.
Document author. Empty string if not set.
Document subject. Empty string if not set.
Keywords associated with the document. Empty string if not set.
The application that originally created the document before any PDF conversion, e.g. "Microsoft Word". Empty string if not set.
The application that produced or last saved the PDF file, e.g. "macOS Quartz PDFContext". Empty string if not set.
Creation timestamp in PDF date format: D:YYYYMMDDHHmmSSOHH'mm'. Example: "D:20240722172345Z" = 22 July 2024, 17:23:45 UTC.
Last modification timestamp in the same PDF date format.
PDF trapping status. Rarely set in practice; empty string if not applicable.
Encryption details if the PDF is encrypted. null for unencrypted documents.
Reading metadata in C#
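The timestamp fields are the part of the metadata that needs the most handling, so the sketch below converts a PDF date string of the form D:YYYYMMDDHHmmSSOHH'mm' into a DateTimeOffset. `PdfDate` is a hypothetical helper. It accepts only strings with the full 14-digit date and time, and it treats a missing offset as UTC, which is a simplifying assumption.

```csharp
using System;
using System.Globalization;

public static class PdfDate
{
    public static bool TryParse(string value, out DateTimeOffset result)
    {
        result = default;
        if (string.IsNullOrEmpty(value)) return false;

        // Drop the optional "D:" prefix.
        string s = value.StartsWith("D:", StringComparison.Ordinal) ? value.Substring(2) : value;
        if (s.Length < 14) return false;

        if (!DateTime.TryParseExact(s.Substring(0, 14), "yyyyMMddHHmmss",
                CultureInfo.InvariantCulture, DateTimeStyles.None, out DateTime stamp))
        {
            return false;
        }

        // Offset part: "Z", or a sign followed by HH'mm'. Anything else counts as UTC here.
        TimeSpan offset = TimeSpan.Zero;
        string tail = s.Substring(14);
        if (tail.Length >= 3 && (tail[0] == '+' || tail[0] == '-'))
        {
            string digits = tail.Substring(1).Replace("'", "");
            if (digits.Length < 2
                || !int.TryParse(digits.Substring(0, 2), NumberStyles.None,
                        CultureInfo.InvariantCulture, out int hours)
                || hours > 14)
            {
                return false;
            }

            int minutes = 0;
            if (digits.Length >= 4
                && (!int.TryParse(digits.Substring(2, 2), NumberStyles.None,
                        CultureInfo.InvariantCulture, out minutes)
                    || minutes > 59))
            {
                return false;
            }

            offset = new TimeSpan(hours, minutes, 0);
            if (offset > TimeSpan.FromHours(14)) return false; // beyond DateTimeOffset's range
            if (tail[0] == '-') offset = offset.Negate();
        }

        result = new DateTimeOffset(stamp, offset);
        return true;
    }
}
```

Empty strings, which the fields above use for unset values, simply return false, so callers can fall back to "unknown" without special-casing.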
See also
Chunk schema
Schema for pageChunks: true output from ToMarkdown().
Extract JSON guide
Working walkthrough with filtering and pipeline examples.
ToJson()
Full API reference for ToJson().
Tables guide
Extracting and working with table blocks.