Overview

PyMuPDF4LLM includes automatic table detection. When a table is found on a page, it is extracted and rendered as a GitHub-flavoured Markdown table in to_markdown() output, or returned as a structured block in to_json() output. Table extraction is enabled by default — no configuration required.
import pymupdf4llm

md_text = pymupdf4llm.to_markdown("document.pdf")
print(md_text)
A detected table will appear in the Markdown output like this:
| A | B | C | D |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 0 | 1 | 2 | 3 |
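If only the Markdown string is available, tables of this shape can be pulled back out of it with plain string handling. The following is a minimal sketch, not part of PyMuPDF4LLM: `find_markdown_tables` is a hypothetical helper, and it assumes every table row starts and ends with a pipe and that cells contain no escaped pipes.

```python
def find_markdown_tables(md_text):
    """Return a list of tables; each table is a list of rows (lists of cell strings).

    Hypothetical helper: assumes rows start and end with "|" and that
    cells do not contain escaped pipes.
    """
    tables, current = [], []
    for line in md_text.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cells = [c.strip() for c in stripped[1:-1].split("|")]
            # Skip the delimiter row (e.g. |---|---|), which holds only dashes and colons.
            if all(c and set(c) <= set("-:") for c in cells):
                continue
            current.append(cells)
        elif current:
            # A non-table line closes the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


sample = """Some text before.

| A | B | C | D |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 0 | 1 | 2 | 3 |

Some text after.
"""

tables = find_markdown_tables(sample)
for row in tables[0]:
    print(row)
```

For anything beyond quick inspection, the structured data from to_json() is likely a more reliable source than re-parsing Markdown.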

How Table Detection Works

PyMuPDF4LLM detects tables by analysing the visual structure of the page — looking for ruled lines, column alignment, and consistent row spacing. It does not rely on tagged PDF structure, so it works on both tagged and untagged PDFs. Detection handles:
  • Tables with explicit borders (ruled lines on all sides)
  • Tables with partial borders (header rule only, or row dividers only)
  • Borderless tables detected through column alignment and whitespace
  • Multi-line cell content
  • Merged header cells
Tables that span multiple pages may not be detected perfectly in all cases. If a table is not rendering as expected, see Troubleshooting below.

Accessing Raw Table Data

When using to_json(), detected tables are returned as "table" blocks with full cell-level data including bounding boxes:
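import json

import pymupdf4llm

# to_json() returns a JSON string; parse it before walking the structure
json_str = pymupdf4llm.to_json("document.pdf")
data = json.loads(json_str)

for page_num, page in enumerate(data.get("pages", [])):
    print(f"\nPage {page_num}")

    # Each page lists its layout blocks under "boxes"
    for block in page.get("boxes", []):
        if block["boxclass"] == "table":
            print(f"Table details: {block['table']}")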

Table Block Structure

{
  "boxclass": "table",
  "table": {
    "bbox": ["x0", "y0", "x1", "y1"],
    "row_count": 3,
    "col_count": 4,
    "cells": [],
    "extract": [
      ["A", "B", "C", "D"],
      ["A1", "B1", "C1", "D1"],
      ["A2", "B2", "C2", "D2"]
    ],
    "markdown": "|A|B|C|D|\n|---|---|---|---|\n|A1|B1|C1|D1|\n|A2|B2|C2|D2|\n\n"
  }
}
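A block of this shape is straightforward to turn into row records. The sketch below works on a plain dictionary that mirrors the "table" object shown above, so it runs without a PDF. `table_to_records` is a hypothetical helper, and treating the first row of "extract" as the header is an assumption that will not hold for every table.

```python
def table_to_records(table):
    """Convert a table block's "extract" rows into a list of dicts.

    Hypothetical helper: assumes the first row is the header. Duplicate
    header names would overwrite each other in the resulting dicts.
    """
    rows = table.get("extract") or []
    if not rows:
        return []
    header, *body = rows
    keys = ["" if h is None else str(h) for h in header]
    return [dict(zip(keys, row)) for row in body]


# Sample data copied from the table block structure shown in this guide.
table = {
    "row_count": 3,
    "col_count": 4,
    "extract": [
        ["A", "B", "C", "D"],
        ["A1", "B1", "C1", "D1"],
        ["A2", "B2", "C2", "D2"],
    ],
}

records = table_to_records(table)
for record in records:
    print(record)
```

With real output, `table` would be the `block["table"]` value of any block whose "boxclass" is "table".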

Troubleshooting

Table Not Detected

If a table is being returned as plain text rather than a table block:
  • The table may be borderless with inconsistent spacing — ensure that use_layout(True) is enabled to improve detection
  • The table may be an image (scanned) — enable OCR and check whether cells are being recognised
  • The table may be very small or have only one column

Incorrect Column Splitting

If columns are being merged or split incorrectly, the table may have irregular spacing. Accessing the raw data via to_json() and post-processing it manually often gives better results than relying on the Markdown rendering.
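As one possible form of such post-processing, the rows from a table block's "extract" field can be cleaned up and re-rendered as Markdown. This is a minimal sketch under stated assumptions: `rows_to_markdown` is a hypothetical helper, and it assumes that cells may be None, may contain line breaks, and that rows may have uneven lengths.

```python
def rows_to_markdown(rows):
    """Render a list of rows as a GitHub-flavoured Markdown table.

    Hypothetical helper: pads short rows, collapses whitespace inside
    cells, and escapes pipe characters so they do not split columns.
    """
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def clean(cell):
        text = "" if cell is None else str(cell)
        text = " ".join(text.split())  # collapse newlines and repeated spaces
        return text.replace("|", "\\|")

    padded = [[clean(c) for c in r] + [""] * (width - len(r)) for r in rows]
    header, *body = padded
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * width]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines) + "\n"


# Illustrative rows with a multi-line cell, an empty cell, a pipe, and a short row.
rows = [["A", "B", "C"], ["multi\nline", None, "x|y"], ["short"]]
md = rows_to_markdown(rows)
print(md)
```

Any column merging or splitting specific to a document would be applied to `rows` before rendering; the helper only normalises and formats.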
For the full API signature, see the to_markdown() API reference and to_json() API reference.

Next Steps

OCR

Control automatic OCR behaviour and adaptors.

Extract JSON

Full guide to working with the JSON output format.

Extract Markdown

Markdown extraction with all common options.

JSON Schema

Complete field reference for the JSON output structure.