Overview
PyMuPDF4LLM includes automatic table detection. When a table is found on a page, it is extracted and rendered as a GitHub-flavoured Markdown table into_markdown() output, or returned as a structured block in to_json() output.
Table extraction is enabled by default — no configuration required.
How Table Detection Works
PyMuPDF4LLM detects tables by analysing the visual structure of the page — looking for ruled lines, column alignment, and consistent row spacing. It does not rely on tagged PDF structure, so it works on both tagged and untagged PDFs. Detection handles:- Tables with explicit borders (ruled lines on all sides)
- Tables with partial borders (header rule only, or row dividers only)
- Borderless tables detected through column alignment and whitespace
- Multi-line cell content
- Merged header cells
Tables that span multiple pages may not be detected perfectly in all cases. If a table is not rendering as expected, see Troubleshooting below.
Accessing Raw Table Data
When usingto_json(), detected tables are returned as "table" blocks with full cell-level data including bounding boxes:
Table Block Structure
Troubleshooting
Table Not Detected
If a table is being returned as plain text rather than a table block:- The table may be borderless with inconsistent spacing — ensure that
use_layout(True)is enabled to improve detection - The table may be an image (scanned) — enable OCR and check whether cells are being recognised
- The table may be very small or have only one column
Incorrect Column Splitting
If columns are being merged or split incorrectly, the table may have irregular spacing. Accessing the raw data viato_json() and post-processing it manually often gives better results than relying on the Markdown rendering.
For the full API signature, see the
to_markdown() API reference and to_json() API reference.Next Steps
OCR
Control automatic OCR behaviour and adaptors.
Extract JSON
Full guide to working with the JSON output format.
Extract Markdown
Markdown extraction with all common options.
JSON Schema
Complete field reference for the JSON output structure.