Overview

to_json() returns document content as structured data rather than a Markdown string. Every text block, image, table, and drawing on each page is represented as a dictionary with positional and styling metadata attached. This makes it the right choice when you need to:
  • Build a custom rendering or post-processing pipeline
  • Access bounding box coordinates for text regions
  • Preserve font, size, and color information
  • Pass structured layout data to a downstream ML model or search index
import pymupdf4llm

data = pymupdf4llm.to_json("document.pdf")

Output Structure

to_json() returns a JSON string. As the examples on this page show, parsing it yields file metadata together with a list of page objects, one per page, under the pages key. See the JSON Schema for a full field reference.

Working with Bounding Boxes

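Every block, line, and span carries the rectangle that bounds it as four coordinates: x0, y0, x1, y1. In the span sample later on this page they appear as a bbox list [x0, y0, x1, y1]; the box-level example below reads them from the x0, y0, x1, and y1 keys of each box.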
import json

import pymupdf4llm

json_str = pymupdf4llm.to_json("document.pdf")
data = json.loads(json_str)

# Pages live under the "pages" key; enumerate() supplies the page index
for page_num, page in enumerate(data.get("pages", [])):
    print(f"\nPage {page_num}")

    for block in page.get("boxes", []):
        print(f"Block at ({block['x0']:.1f}, {block['y0']:.1f}) → ({block['x1']:.1f}, {block['y1']:.1f})")
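As a rough illustration of what the four coordinates give you, the sketch below derives width, height, and area from a [x0, y0, x1, y1] list. The helper name bbox_size is hypothetical (it is not part of pymupdf4llm), and the sample values are copied from the "Italic text." span shown later on this page.

```python
def bbox_size(bbox):
    """Return (width, height, area) for a [x0, y0, x1, y1] rectangle."""
    x0, y0, x1, y1 = bbox
    width = x1 - x0
    height = y1 - y0
    return width, height, width * height


# Sample bbox copied from the "Italic text." span on this page.
width, height, area = bbox_size(
    [72, 435.93597412109375, 122.60799407958984, 444.6239929199219]
)
print(f"{width:.1f} x {height:.1f}")
```

The same helper could be applied to a box by passing [block["x0"], block["y0"], block["x1"], block["y1"]].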


Extracting Span-Level Data

Spans are the most granular unit in the JSON output. Each span represents a run of text that shares the same font, size, and color. This lets you identify headings, bold text, and other styled elements programmatically:
import json

import pymupdf4llm

json_str = pymupdf4llm.to_json("document.pdf")
data = json.loads(json_str)

for page in data.get("pages", []):
    for block in page.get("boxes", []):
        # Only text boxes carry textlines with spans
        if block["boxclass"] == "text":
            for line in block["textlines"]:
                for span in line["spans"]:
                    print(span)  # full span dictionary
                    if span["size"] >= 14:
                        print(f"Heading candidate: {span['text']!r} (size {span['size']})")
                    if span["flags"] & 2**4:  # bit 4 (value 16) is the bold flag
                        print(f"Bold text: {span['text']!r}")
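The fixed threshold of 14 in the loop above is document-specific. One possible refinement, sketched here under the assumption that body text is the font size covering the most characters, is to flag spans that are noticeably larger than that size. The names iter_spans and heading_candidates and the ratio of 1.2 are illustrative choices, and the sample dictionary is hand-written in the shape the examples on this page use, not real extractor output.

```python
from collections import Counter


def iter_spans(data):
    """Yield every span from text boxes, following the structure used above."""
    for page in data.get("pages", []):
        for block in page.get("boxes", []):
            if block.get("boxclass") != "text":
                continue
            for line in block.get("textlines", []):
                yield from line.get("spans", [])


def heading_candidates(data, ratio=1.2):
    """Return texts of spans at least `ratio` times the dominant font size."""
    spans = list(iter_spans(data))
    if not spans:
        return []
    # Weight each size by character count so short headings do not dominate.
    sizes = Counter()
    for span in spans:
        sizes[span["size"]] += len(span["text"])
    body_size = sizes.most_common(1)[0][0]
    return [s["text"] for s in spans if s["size"] >= body_size * ratio]


# Hand-written sample, not real extractor output.
sample = {"pages": [{"boxes": [
    {"boxclass": "text", "textlines": [{"spans": [{"size": 18, "text": "Title"}]}]},
    {"boxclass": "text", "textlines": [{"spans": [{"size": 11, "text": "Body text goes here."}]}]},
]}]}
print(heading_candidates(sample))
```

A relative threshold like this tends to carry over between documents better than an absolute size, though it will still misfire on pages made up mostly of large text.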

Font Flags Reference

The flags field is a bitmask encoding font properties:
Bit   Value   Meaning
0     1       Superscript
1     2       Italic
2     4       Serifed font
3     8       Monospaced font
4     16      Bold

Example Interpretation

If we consider the following JSON:
"spans": 
[
  {
    "size": 12,
    "flags": 6,
    "bidi": 0,
    "char_flags": 16,
    "font": "MinionPro-It",
    "color": 0,
    "alpha": 255,
    "ascender": 0.800000011920929,
    "descender": -0.20000000298023224,
    "text": "Italic text.",
    "origin": [
      72,
      444.47998046875
    ],
    "bbox": [
      72,
      435.93597412109375,
      122.60799407958984,
      444.6239929199219
    ],
    "line": 0,
    "block": 0,
    "dir": [
      1,
      0
    ]
  },
  {
    "size": 12,
    "flags": 0,
    "bidi": 0,
    "char_flags": 16,
    "font": "Arial",
    "color": 0,
    "alpha": 255,
    "ascender": 0.800000011920929,
    "descender": -0.20000000298023224,
    "text": "Hello World!",
    "origin": [
      122.625,
      444.47998046875
    ],
    "bbox": [
      122.625,
      436.31787109375,
      184.802001953125,
      444.59130859375
    ],
    "line": 0,
    "block": 0,
    "dir": [
      1,
      0
    ]
  },
  {
    "size": 12,
    "flags": 20,
    "bidi": 0,
    "char_flags": 24,
    "font": "MinionPro-Bold",
    "color": 0,
    "alpha": 255,
    "ascender": 0.800000011920929,
    "descender": -0.20000000298023224,
    "text": "This is bold",
    "origin": [
      187.53399658203125,
      444.47998046875
    ],
    "bbox": [
      187.53399658203125,
      436.0439758300781,
      245.98001098632812,
      444.635986328125
    ],
    "line": 0,
    "block": 0,
    "dir": [
      1,
      0
    ]
  }
]
A practical reading of the span flags might look like this:

flags = 6

“Italic text.” uses the font MinionPro-It with flags = 6. Since 6 = 2 + 4, this is consistent with italic + serifed text.

flags = 0

“Hello World!” uses the font Arial with flags = 0, which is consistent with regular text in PyMuPDF’s span flag scheme.

flags = 20

“This is bold” uses the font MinionPro-Bold with flags = 20. Since 20 = 16 + 4, this is consistent with bold + serifed text.

So the extracted styling in plain English is:
  • “Italic text.” → italic
  • “Hello World!” → regular
  • “This is bold” → bold
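The arithmetic above can be checked mechanically. The following is a minimal sketch that decodes the three flag values from the example using the bit values in the Font Flags Reference table; FLAG_BITS and flag_names are hypothetical names introduced only for this illustration.

```python
# Bit values copied from the Font Flags Reference table on this page.
FLAG_BITS = {
    1: "superscript",
    2: "italic",
    4: "serifed",
    8: "monospaced",
    16: "bold",
}


def flag_names(flags):
    """Return the names of all properties whose bit is set in `flags`."""
    return [name for value, name in FLAG_BITS.items() if flags & value]


for flags in (6, 0, 20):
    print(flags, flag_names(flags))
```

An empty list corresponds to the "regular" case, where none of the listed bits is set.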

Page Selection

As with to_markdown(), you can limit extraction to specific pages:
data = pymupdf4llm.to_json("document.pdf", pages=[0, 1, 2])
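Page numbers shown in PDF viewers usually start at 1, whereas the example above starts at 0. Assuming the pages argument takes 0-based indices, as that example suggests, a small conversion helper might look like the sketch below. The name to_page_indices is hypothetical and not part of pymupdf4llm.

```python
def to_page_indices(first, last):
    """Convert an inclusive 1-based page range to a list of 0-based indices."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, last))


# Viewer pages 1 through 3 map to the list used in the example above.
pages = to_page_indices(1, 3)
print(pages)
```

The resulting list could then be passed as pages=pages in the call shown above.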

Saving JSON Output

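Write the result to a .json file. The other examples on this page treat the return value of to_json() as a JSON string, so parse it first and re-serialize it with Python’s json module to get indented output:
import json
from pathlib import Path

import pymupdf4llm

json_str = pymupdf4llm.to_json("document.pdf")
data = json.loads(json_str)  # parse first so json.dumps() does not re-encode the string as one quoted value

Path("output.json").write_text(
    json.dumps(data, indent=2, ensure_ascii=False),
    encoding="utf-8"
)
Use ensure_ascii=False to write non-ASCII characters such as accented letters, CJK characters, and symbols as-is instead of as \uXXXX escapes.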
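To see what the ensure_ascii setting changes, here is a small standard-library-only sketch. It serializes a hand-written sample dictionary (not real extractor output) both ways, then writes the readable form to a temporary file and reads it back to confirm nothing is lost.

```python
import json
import tempfile
from pathlib import Path

# Hand-written sample, not real extractor output.
sample = {"text": "Café 東京"}

escaped = json.dumps(sample)                       # default: ensure_ascii=True
readable = json.dumps(sample, ensure_ascii=False)  # keeps characters as-is

print(escaped)                       # ASCII-only, with \uXXXX escapes
print(len(escaped) > len(readable))  # the escaped form is longer

# Round-trip through a UTF-8 file to confirm both forms carry the same data.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "output.json"
    out.write_text(readable, encoding="utf-8")
    restored = json.loads(out.read_text(encoding="utf-8"))

print(restored == sample)
```

Both forms parse back to the same data; the difference is only in how readable and how large the file on disk is.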

Full Example: Building a Custom Text Pipeline

import pymupdf4llm
import json

# parse the file 
json_str = pymupdf4llm.to_json("document.pdf")

# Convert JSON to Python Dictionary and iterate through the content
data = json.loads(json_str)

def parse_span_flags(flags: int):
    return {
        "superscript": bool(flags & 1),
        "italic": bool(flags & 2),
        "serifed": bool(flags & 4),
        "monospaced": bool(flags & 8),
        "bold": bool(flags & 16),
    }

# iterate through the document
for page_num, page in enumerate(data.get("pages", [])):
    print(f"\nPage {page_num}")

    for block in page.get("boxes", []):
        if block["boxclass"] == "text":
            for line in block["textlines"]:
                for span in line["spans"]:
                    text = span.get("text", "")
                    flags = span.get("flags", 0)
                    styles = parse_span_flags(flags)

                    print({
                        "text": text,
                        "flags": flags,
                        "styles": styles
                    })
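As one possible next stage for such a pipeline, the sketch below turns the spans of a line into a single string with lightweight emphasis markers. The function names style_span and line_to_markup, the Markdown-style markers, and joining spans with a single space are all illustrative assumptions. The sample line reuses the text and flags values from the Example Interpretation section.

```python
def style_span(span):
    """Wrap span text in emphasis markers based on its flags bitmask."""
    text = span.get("text", "")
    flags = span.get("flags", 0)
    if flags & 2:       # italic
        text = f"*{text}*"
    if flags & 16:      # bold
        text = f"**{text}**"
    return text


def line_to_markup(line):
    """Join the styled spans of one line, separated by single spaces."""
    return " ".join(style_span(span) for span in line.get("spans", []))


# text and flags values copied from the example spans on this page.
line = {"spans": [
    {"text": "Italic text.", "flags": 6},
    {"text": "Hello World!", "flags": 0},
    {"text": "This is bold", "flags": 20},
]}
print(line_to_markup(line))
```

Joining with a fixed space is a simplification. A more careful version could compare neighbouring span bbox values to decide whether a gap is really present.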

For the full API signature, see the to_json() API reference.

Next Steps

JSON Schema

Full field descriptions for every object in the JSON output.

Extract Markdown

Preserve structure and formatting for LLM pipelines.

Extract Text

Get clean, plain text output.

Tables

Table block structure explained.