Extract JSON

Overview

to_json() returns document content as structured data rather than a Markdown string. Every text block, image, table, and drawing on each page is represented as a dictionary with positional and styling metadata attached. This makes it the right choice when you need to:

Build a custom rendering or post-processing pipeline
Access bounding box coordinates for text regions
Preserve font, size, and color information
Pass structured layout data to a downstream ML model or search index

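import json
import pymupdf4llm

json_str = pymupdf4llm.to_json("document.pdf")  # to_json() returns a JSON string
data = json.loads(json_str)  # parse it into Python objects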

Output Structure

to_json() returns a JSON string. Once parsed, the top-level object holds file metadata together with a pages list that contains one page object per page. See the JSON Schema for a full field reference.

Working with Bounding Boxes

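Every line and span carries a bbox field: a four-element list [x0, y0, x1, y1] describing the rectangle that bounds that element. Top-level boxes expose the same four coordinates as separate x0, y0, x1, and y1 keys, which is what the loop below reads.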

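import json
import pymupdf4llm

json_str = pymupdf4llm.to_json("document.pdf")
data = json.loads(json_str)

# Pages live under the "pages" key; enumerate() supplies the page index
for page_num, page in enumerate(data.get("pages", [])):
    print(f"\nPage {page_num}")

    for block in page.get("boxes", []):
        # Each box exposes its rectangle corners as x0, y0, x1, y1
        print(f"Block at ({block['x0']:.1f}, {block['y0']:.1f}) → ({block['x1']:.1f}, {block['y1']:.1f})")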
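
Once you have a rectangle, most layout logic reduces to simple arithmetic on its four numbers. The following is a minimal sketch, not part of the library: bbox_size and bbox_contains are hypothetical helper names, the span rectangle is copied from the "Italic text." span in the Example Interpretation sample, and the region of interest is an invented value.

```python
def bbox_size(bbox):
    """Return (width, height) of an [x0, y0, x1, y1] rectangle."""
    x0, y0, x1, y1 = bbox
    return x1 - x0, y1 - y0


def bbox_contains(outer, inner):
    """Return True if rectangle `inner` lies entirely inside `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


# Rectangle of the "Italic text." span from the sample JSON on this page
span_bbox = [72, 435.93597412109375, 122.60799407958984, 444.6239929199219]
width, height = bbox_size(span_bbox)
print(f"{width:.1f} x {height:.1f}")

# Hypothetical region of interest, invented for this illustration
region = [0, 400, 300, 500]
print(bbox_contains(region, span_bbox))
```

The same containment test can be used to keep only the spans that fall inside a header, footer, or column area that you define yourself.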

Extracting Span-Level Data

Spans are the most granular unit in the JSON output. Each span represents a run of text that shares the same font, size, and color. This lets you identify headings, bold text, and other styled elements programmatically:

import json
import pymupdf4llm

json_str = pymupdf4llm.to_json("document.pdf")
data = json.loads(json_str)

for page in data.get("pages", []):
    for block in page.get("boxes", []):
        if block["boxclass"] != "text":
            continue  # only text boxes carry textlines and spans
        for line in block["textlines"]:
            for span in line["spans"]:
                # 14 is an arbitrary example threshold; tune it per document
                if span["size"] >= 14:
                    print(f"Heading candidate: {span['text']!r} (size {span['size']})")
                if span["flags"] & 16:  # bit 4 (value 16) is the bold flag
                    print(f"Bold text: {span['text']!r}")
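
A fixed size threshold such as 14 will not suit every document. One possible alternative, sketched below, treats the most common font size as the body size and reports spans that are noticeably larger. This is a heuristic, not library behavior: iter_spans, heading_candidates, and the ratio parameter are hypothetical, the sample data is invented, and "picture" is only a placeholder for a non-text boxclass value.

```python
from collections import Counter


def iter_spans(data):
    """Yield every span found in the text boxes of a parsed to_json() result."""
    for page in data.get("pages", []):
        for block in page.get("boxes", []):
            if block.get("boxclass") != "text":
                continue
            for line in block.get("textlines") or []:
                yield from line.get("spans", [])


def heading_candidates(data, ratio=1.2):
    """Return the text of spans at least `ratio` times larger than the body size."""
    spans = list(iter_spans(data))
    if not spans:
        return []
    # Weight each size by character count so long body text outweighs short titles
    sizes = Counter()
    for span in spans:
        sizes[round(span["size"], 1)] += len(span["text"])
    body_size = sizes.most_common(1)[0][0]
    return [s["text"] for s in spans if s["size"] >= body_size * ratio]


# Invented sample that mimics the pages -> boxes -> textlines -> spans layout
sample = {"pages": [{"boxes": [
    {"boxclass": "text", "textlines": [{"spans": [
        {"size": 18, "flags": 20, "text": "Quarterly Report"}]}]},
    {"boxclass": "text", "textlines": [{"spans": [
        {"size": 11, "flags": 4, "text": "Revenue grew steadily across all regions."}]}]},
    {"boxclass": "picture"},
]}]}

print(heading_candidates(sample))
```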

Font Flags Reference

The flags field is a bitmask encoding font properties:

Bit	Value	Meaning
0	`1`	Superscript
1	`2`	Italic
2	`4`	Serifed font
3	`8`	Monospaced font
4	`16`	Bold

Example Interpretation

If we consider the following JSON:

"spans": 
[
  {
    "size": 12,
    "flags": 6,
    "bidi": 0,
    "char_flags": 16,
    "font": "MinionPro-It",
    "color": 0,
    "alpha": 255,
    "ascender": 0.800000011920929,
    "descender": -0.20000000298023224,
    "text": "Italic text.",
    "origin": [
      72,
      444.47998046875
    ],
    "bbox": [
      72,
      435.93597412109375,
      122.60799407958984,
      444.6239929199219
    ],
    "line": 0,
    "block": 0,
    "dir": [
      1,
      0
    ]
  },
  {
    "size": 12,
    "flags": 0,
    "bidi": 0,
    "char_flags": 16,
    "font": "Arial",
    "color": 0,
    "alpha": 255,
    "ascender": 0.800000011920929,
    "descender": -0.20000000298023224,
    "text": "Hello World!",
    "origin": [
      122.625,
      444.47998046875
    ],
    "bbox": [
      122.625,
      436.31787109375,
      184.802001953125,
      444.59130859375
    ],
    "line": 0,
    "block": 0,
    "dir": [
      1,
      0
    ]
  },
  {
    "size": 12,
    "flags": 20,
    "bidi": 0,
    "char_flags": 24,
    "font": "MinionPro-Bold",
    "color": 0,
    "alpha": 255,
    "ascender": 0.800000011920929,
    "descender": -0.20000000298023224,
    "text": "This is bold",
    "origin": [
      187.53399658203125,
      444.47998046875
    ],
    "bbox": [
      187.53399658203125,
      436.0439758300781,
      245.98001098632812,
      444.635986328125
    ],
    "line": 0,
    "block": 0,
    "dir": [
      1,
      0
    ]
  }
]

A practical parse of the span flags might look like this:

flags = 6

“Italic text.” uses the font MinionPro-It. 6 = 2 + 4, which is consistent with italic, serifed text.

flags = 0

“Hello World!” uses the font Arial. A value of 0 means no bits are set, which is consistent with regular text in PyMuPDF’s span flag scheme.

flags = 20

“This is bold” uses the font MinionPro-Bold. 20 = 16 + 4, which is consistent with bold, serifed text.

So the extracted styling in plain English is:

“Italic text.” → italic
“Hello World!” → regular
“This is bold” → bold
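
The arithmetic above can be checked mechanically. The sketch below decodes the three sample values against the bit table in the Font Flags Reference; FLAG_NAMES and flag_names are hypothetical names used only for this illustration.

```python
# Bit values taken from the Font Flags Reference table
FLAG_NAMES = {
    1: "superscript",
    2: "italic",
    4: "serifed",
    8: "monospaced",
    16: "bold",
}


def flag_names(flags):
    """Return the names of all properties whose bit is set in `flags`."""
    return [name for bit, name in FLAG_NAMES.items() if flags & bit]


for value in (6, 0, 20):
    # An empty result means no bits are set, i.e. regular text
    print(value, flag_names(value) or ["regular"])
```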

Page Selection

As with to_markdown(), you can limit extraction to specific pages:

data = pymupdf4llm.to_json("document.pdf", pages=[0, 1, 2])
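
Readers usually think in 1-based page numbers, while pages=[0, 1, 2] suggests that the parameter takes 0-based indices. Assuming that is the case, a small conversion helper avoids off-by-one mistakes. page_range is a hypothetical helper, not part of pymupdf4llm.

```python
def page_range(first, last):
    """Convert a 1-based, inclusive page range into a list of 0-based indices."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, last))


# Pages 1 through 3 as a reader would number them
print(page_range(1, 3))
```

The resulting list can then be passed as the pages argument.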

Saving JSON Output

Write the result to a .json file using Python’s json module:

import pymupdf4llm
import json
from pathlib import Path

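json_str = pymupdf4llm.to_json("document.pdf")
# to_json() returns a JSON string; parse it first so that json.dumps()
# re-serializes the document instead of quoting the string a second time
data = json.loads(json_str)

Path("output.json").write_text(
    json.dumps(data, indent=2, ensure_ascii=False),
    encoding="utf-8"
)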

Use ensure_ascii=False to write non-ASCII characters such as accented letters, CJK characters, and symbols as they are, rather than as \uXXXX escape sequences.
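
The effect of ensure_ascii is easiest to see side by side. The short sketch below uses only Python’s standard json module and an invented sample record; both outputs parse back to the same data, so the choice only affects how readable the file is.

```python
import json

# Invented record containing accented, CJK, and symbol characters
record = {"text": "Café 東京 ✓"}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # characters written as-is

print(escaped)
print(readable)

# Both forms decode to the same Python object
print(json.loads(escaped) == json.loads(readable))
```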

Full Example: Building a Custom Text Pipeline

import pymupdf4llm
import json

# parse the file 
json_str = pymupdf4llm.to_json("document.pdf")

# Convert JSON to Python Dictionary and iterate through the content
data = json.loads(json_str)

def parse_span_flags(flags: int):
    return {
        "superscript": bool(flags & 1),
        "italic": bool(flags & 2),
        "serifed": bool(flags & 4),
        "monospaced": bool(flags & 8),
        "bold": bool(flags & 16),
    }

# iterate through the document
for page_num, page in enumerate(data.get("pages", [])):
    print(f"\nPage {page_num}")

    for block in page.get("boxes", []):
        if block["boxclass"] == "text":
            for line in block["textlines"]:
                for span in line["spans"]:
                    text = span.get("text", "")
                    flags = span.get("flags", 0)
                    styles = parse_span_flags(flags)

                    print({
                        "text": text,
                        "flags": flags,
                        "styles": styles
                    })
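
For a search index or a downstream model, it is often more convenient to work with one record per line than with individual spans. The sketch below joins the spans of each line and computes the union of their bbox values. It assumes the pages, boxes, textlines, and spans layout used above; line_records is a hypothetical helper, and the sample data is invented.

```python
def line_records(data):
    """Flatten a parsed to_json() result into one record per text line."""
    records = []
    for page_num, page in enumerate(data.get("pages", [])):
        for block in page.get("boxes", []):
            if block.get("boxclass") != "text":
                continue
            for line in block.get("textlines") or []:
                spans = line.get("spans", [])
                if not spans:
                    continue
                boxes = [span["bbox"] for span in spans]
                records.append({
                    "page": page_num,
                    "text": "".join(span["text"] for span in spans),
                    # Union rectangle that encloses every span on the line
                    "bbox": [
                        min(b[0] for b in boxes),
                        min(b[1] for b in boxes),
                        max(b[2] for b in boxes),
                        max(b[3] for b in boxes),
                    ],
                })
    return records


# Invented sample with two spans on a single line
sample_doc = {"pages": [{"boxes": [{"boxclass": "text", "textlines": [{"spans": [
    {"text": "Hello ", "flags": 0, "bbox": [72, 436, 100, 445]},
    {"text": "World!", "flags": 16, "bbox": [100, 435, 140, 446]},
]}]}]}]}

for record in line_records(sample_doc):
    print(record)
```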

For the full API signature, see the to_json() API reference.

Next Steps

JSON Schema

Full field descriptions for every object in the JSON output.

Extract Markdown

Preserve structure and formatting for LLM pipelines.

Extract Text

Get clean, plain text output.

Tables

Table block structure explained.
