Overview
to_json() returns document content as structured data rather than a Markdown string. Every text block, image, table, and drawing on each page is represented as a dictionary with positional and styling metadata attached.
This makes it the right choice when you need to:
- Build a custom rendering or post-processing pipeline
- Access bounding box coordinates for text regions
- Preserve font, size, and color information
- Pass structured layout data to a downstream ML model or search index
Output Structure
The return value holds a list of page objects, one per page, along with file metadata. See the JSON Schema for a full field reference.
Working with Bounding Boxes
Every block, line, and span carries a bbox field: a four-element list [x0, y0, x1, y1] describing the rectangle that bounds that element.
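The sketch below shows a few small helpers for working with bbox values. It relies only on the [x0, y0, x1, y1] layout described above and assumes x0 <= x1 and y0 <= y1. The helper names and sample coordinates are hypothetical and are not part of the library.

```python
from typing import Iterable, Sequence


def bbox_size(bbox: Sequence[float]) -> tuple[float, float]:
    """Return (width, height) of an [x0, y0, x1, y1] rectangle."""
    x0, y0, x1, y1 = bbox
    return x1 - x0, y1 - y0


def bbox_contains(outer: Sequence[float], inner: Sequence[float]) -> bool:
    """Return True if `inner` lies entirely within `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def bbox_union(boxes: Iterable[Sequence[float]]) -> list[float]:
    """Return the smallest rectangle that encloses every box in `boxes`."""
    boxes = list(boxes)
    return [
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    ]


# Hypothetical sample values, not taken from a real document.
block_bbox = [72.0, 100.0, 540.0, 160.0]
span_bbox = [72.0, 100.0, 200.0, 114.0]

width, height = bbox_size(span_bbox)
inside = bbox_contains(block_bbox, span_bbox)
```

Helpers like these are typically enough to filter elements by region, for example keeping only spans that fall inside a known header or table area.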
Extracting Span-Level Data
Spans are the most granular unit in the JSON output. Each span represents a run of text that shares the same font, size, and color. This lets you identify headings, bold text, and other styled elements programmatically.
Font Flags Reference
The flags field is a bitmask encoding font properties:
| Bit | Value | Meaning |
|---|---|---|
| 0 | 1 | Superscript |
| 1 | 2 | Italic |
| 2 | 4 | Serifed font |
| 3 | 8 | Monospaced font |
| 4 | 16 | Bold |
Example Interpretation
Consider three spans taken from the JSON output:
flags = 6 on “Italic text.” with font MinionPro-It
6 = 2 + 4, which is consistent with italic + serifed text.
flags = 0 on “Hello World!” with font Arial
0 has no bits set, which is consistent with regular text in PyMuPDF’s span flag scheme.
flags = 20 on “This is bold” with font MinionPro-Bold
20 = 16 + 4, which is consistent with bold + serifed text.
So the extracted styling in plain English is:
“Italic text.” → italic
“Hello World!” → regular
“This is bold” → bold
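The interpretation above can be reproduced with a few bitwise tests. This is a minimal sketch: the bit values come from the Font Flags Reference table, while the helper names and the "text" and "font" key names are assumptions made for illustration (only the flags field is named in this guide).

```python
# Bit values copied from the Font Flags Reference table.
FLAG_NAMES = {
    1: "superscript",
    2: "italic",
    4: "serifed",
    8: "monospaced",
    16: "bold",
}


def decode_flags(flags: int) -> list[str]:
    """Return the names of all properties whose bit is set in `flags`."""
    return [name for value, name in FLAG_NAMES.items() if flags & value]


def describe_style(flags: int) -> str:
    """Reduce a flags value to bold / italic / regular, ignoring font-family bits."""
    bold = bool(flags & 16)
    italic = bool(flags & 2)
    if bold and italic:
        return "bold italic"
    if bold:
        return "bold"
    if italic:
        return "italic"
    return "regular"


# The three spans discussed above; the "text" and "font" keys are assumed names.
spans = [
    {"text": "Italic text.", "font": "MinionPro-It", "flags": 6},
    {"text": "Hello World!", "font": "Arial", "flags": 0},
    {"text": "This is bold", "font": "MinionPro-Bold", "flags": 20},
]

for span in spans:
    print(span["text"], "->", describe_style(span["flags"]), decode_flags(span["flags"]))
```

Testing individual bits with `&` rather than comparing against whole values such as 6 or 20 keeps the check correct when several properties are set at once.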
Page Selection
As with to_markdown(), you can limit extraction to specific pages.
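The sketch below prepares a page selection. It assumes that to_json() accepts a pages argument holding 0-based page numbers, mirroring to_markdown(); that call shape is an assumption and is left commented out, so check the API reference before relying on it. The to_zero_based helper and the file name are hypothetical.

```python
def to_zero_based(pages_1_based, page_count):
    """Convert human-friendly 1-based page numbers to sorted 0-based indices."""
    indices = sorted({p - 1 for p in pages_1_based})
    for i in indices:
        if not 0 <= i < page_count:
            raise ValueError(f"page {i + 1} is outside 1..{page_count}")
    return indices


# Pages 1, 2 and 5 of a hypothetical 10-page document.
selected = to_zero_based([1, 2, 5], page_count=10)

# Assumed call shape, mirroring to_markdown(); verify against the API reference:
# import pymupdf4llm
# result = pymupdf4llm.to_json("input.pdf", pages=selected)
```

Validating the selection up front gives a clear error for an out-of-range page instead of a failure partway through extraction.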
Saving JSON Output
Write the result to a .json file using Python’s json module.
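A minimal sketch of the save step follows. It uses a hypothetical stand-in for the to_json() result and handles two cases, because the exact return type should be confirmed in the API reference: a Python structure is serialized with json.dump, while an already-serialized string is validated and written unchanged. The save_json helper is illustrative, not part of the library.

```python
import json
import tempfile
from pathlib import Path


def save_json(result, path):
    """Write `result` to `path` as UTF-8 JSON and return the path."""
    path = Path(path)
    if isinstance(result, str):
        # Already serialized: confirm it parses, then write it unchanged.
        json.loads(result)
        path.write_text(result, encoding="utf-8")
    else:
        with path.open("w", encoding="utf-8") as fh:
            # ensure_ascii=False keeps non-ASCII document text readable.
            json.dump(result, fh, ensure_ascii=False, indent=2)
    return path


# Hypothetical stand-in for the value returned by to_json().
result = [{"page": 0, "blocks": []}]

# A temporary directory keeps the sketch free of side effects.
out_dir = Path(tempfile.mkdtemp())
out = save_json(result, out_dir / "output.json")
```

Passing ensure_ascii=False matters for documents with accented or non-Latin text, which would otherwise be written as \u escape sequences.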
Full Example: Building a Custom Text Pipeline
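One way such a pipeline might look is sketched below. It flattens spans into records suitable for a search index and labels each one as heading, bold, or body. The nesting of "blocks", "lines", and "spans" keys and the span fields "text" and "size" are assumptions made for illustration; consult the JSON Schema for the actual field names. The sample data is hypothetical, and treating any span larger than the most common font size as a heading is a simple heuristic, not the library's behavior.

```python
from collections import Counter

BOLD = 16  # bit value from the Font Flags Reference table


def iter_spans(pages):
    """Yield (page_index, span) for every span in the assumed page layout."""
    for page_index, page in enumerate(pages):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    yield page_index, span


def build_records(pages):
    """Flatten spans into records and label each as heading, bold, or body."""
    spans = [(i, s) for i, s in iter_spans(pages) if s.get("text", "").strip()]
    if not spans:
        return []
    # Heuristic: treat the most common font size as the body size.
    body_size = Counter(s["size"] for _, s in spans).most_common(1)[0][0]
    records = []
    for page_index, span in spans:
        if span["size"] > body_size:
            role = "heading"
        elif span["flags"] & BOLD:
            role = "bold"
        else:
            role = "body"
        records.append({
            "page": page_index,
            "text": span["text"].strip(),
            "bbox": span["bbox"],
            "role": role,
        })
    return records


# Hypothetical sample shaped like the assumed output, not real extractor data.
pages = [{
    "blocks": [
        {"bbox": [72, 72, 540, 100], "lines": [
            {"bbox": [72, 72, 540, 100], "spans": [
                {"text": "Quarterly Report", "size": 18.0, "flags": 20,
                 "bbox": [72, 72, 300, 100]},
            ]},
        ]},
        {"bbox": [72, 120, 540, 150], "lines": [
            {"bbox": [72, 120, 540, 134], "spans": [
                {"text": "Revenue grew ", "size": 11.0, "flags": 4,
                 "bbox": [72, 120, 150, 134]},
                {"text": "strongly", "size": 11.0, "flags": 20,
                 "bbox": [150, 120, 200, 134]},
            ]},
            {"bbox": [72, 136, 540, 150], "spans": [
                {"text": "in every region.", "size": 11.0, "flags": 4,
                 "bbox": [72, 136, 170, 150]},
            ]},
        ]},
    ],
}]

records = build_records(pages)
headings = [r["text"] for r in records if r["role"] == "heading"]
```

Each record keeps its page index and bbox, so a downstream index or model can link a piece of text back to its position in the source document.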
For the full API signature, see the to_json() API reference.
Next Steps
JSON Schema
Full field descriptions for every object in the JSON output.
Extract Markdown
Preserve structure and formatting for LLM pipelines.
Extract Text
Get clean, plain text output.
Tables
Table block structure explained.