Overview
PDF4LLM includes automatic table detection. When a table is found on a page, it is extracted and rendered as a GitHub-flavoured Markdown table in ToMarkdown() output, or returned as a structured "table" block in ToJson() output.
Table extraction is enabled by default — no configuration required.
How table detection works
PDF4LLM detects tables by analysing the visual structure of the page: ruled lines, column alignment, and consistent row spacing. It does not rely on tagged PDF structure, so it works on both tagged and untagged PDFs. Detection handles:
- Tables with explicit borders (ruled lines on all sides)
- Tables with partial borders (header rule only, or row dividers only)
- Borderless tables detected through column alignment and whitespace
- Multi-line cell content
- Merged header cells
Tables that span multiple pages may not be detected perfectly in all cases. If a table is not rendering as expected, see Troubleshooting below.
Accessing raw table data
When using ToJson(), detected tables are returned as "table" blocks with full cell-level data.
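As a minimal sketch of pulling those blocks out, the Python below assumes the string returned by ToJson() has been saved or passed along as JSON text. The nesting of pages and blocks is not described here, so the hypothetical helper find_tables makes no assumption about it: it walks the whole structure and collects every object whose type is "table".

```python
import json
from typing import Any, Iterator


def iter_blocks(node: Any) -> Iterator[dict]:
    """Yield every JSON object that carries a "type" field, at any depth."""
    if isinstance(node, dict):
        if "type" in node:
            yield node
        for value in node.values():
            yield from iter_blocks(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_blocks(item)


def find_tables(json_text: str) -> list[dict]:
    """Return all blocks whose type is "table" from ToJson()-style output."""
    document = json.loads(json_text)
    return [b for b in iter_blocks(document) if b.get("type") == "table"]
```

Each returned dictionary can then be read through its bbox and content fields. If your output nests blocks in a known place, a direct lookup is simpler than a full walk.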
Table block structure
Each "table" block in the JSON output has the following shape:
| Field | Type | Description |
|---|---|---|
| type | string | Always "table" for table blocks. |
| bbox | [x0, y0, x1, y1] | Bounding box of the entire table in PDF coordinates. |
| content | string[][] | Two-dimensional array of cell text. Rows first, columns within each row. |
content[0] is typically the header row, but it is not explicitly flagged as such. Treat content[0] as the header for tables that clearly have column labels, and validate against a sample of your documents.
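For tables that do have column labels, one way to use content[0] as the header is to turn each remaining row into a dictionary keyed by those labels. This is only a sketch: rows_to_records is a hypothetical helper, and padding short rows while ignoring extra cells is a design choice made here, not something the library prescribes.

```python
def rows_to_records(content: list[list[str]]) -> list[dict[str, str]]:
    """Treat content[0] as the header and map each later row onto it."""
    if not content:
        return []
    header = [cell.strip() for cell in content[0]]
    records = []
    for row in content[1:]:
        # Pad short rows with empty strings; cells beyond the header width are ignored.
        padded = list(row) + [""] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records
```

Duplicate column labels would overwrite one another in the dictionary, so check the header for repeats before relying on this shape.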
Extracting tables to CSV
Use the content array from ToJson() to export table data directly to CSV.
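Because content is already a list of rows, the export can be sketched with Python's standard csv module. The helper name table_to_csv is hypothetical, and the input is assumed to be the content array of one "table" block taken from parsed ToJson() output.

```python
import csv
import io


def table_to_csv(content: list[list[str]]) -> str:
    """Serialise a table block's content array as CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    # The writer quotes cells containing commas, quotes, or line breaks.
    writer.writerows(content)
    return buffer.getvalue()
```

To write a file instead, pass a handle opened with open(path, "w", newline="", encoding="utf-8") to csv.writer so that multi-line cells keep their line breaks intact.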
Multi-page tables
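A table that continues across a page boundary comes back as one "table" block per page. The sketch below joins such fragments, assuming their content arrays have already been collected in reading order. stitch_fragments is a hypothetical helper: it joins neighbouring fragments with the same column count and drops a continuation fragment's first row only when it repeats the header.

```python
def stitch_fragments(fragments: list[list[list[str]]]) -> list[list[list[str]]]:
    """Merge consecutive table fragments that share a column count.

    `fragments` holds the content arrays of table blocks in reading order.
    """
    merged: list[list[list[str]]] = []
    for rows in fragments:
        if not rows:
            continue
        previous = merged[-1] if merged else None
        if previous and len(previous[0]) == len(rows[0]):
            # Skip the first row only if it repeats the earlier header.
            start = 1 if rows[0] == previous[0] else 0
            previous.extend(rows[start:])
        else:
            # Different width (or first fragment): start a new table.
            merged.append([list(r) for r in rows])
    return merged
```

Unrelated tables that happen to share a column count would also be joined, so compare headers or bbox positions as well if your documents contain such cases.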
Tables that span across page boundaries are not automatically merged. Each page’s fragment is returned as a separate table block. To stitch them together, match on column count and append rows manually, skipping the header row on continuation pages.
Troubleshooting
Table not detected
If a table is being returned as plain text rather than a "table" block, use ToJson() to inspect the raw layout on that page and confirm how the blocks are classified. Common causes:
- The table is borderless with inconsistent column spacing — the layout engine could not find a reliable grid
- The table is an image (scanned) — enable OCR and check whether cells are being recognised
- The table has only one column, or is very narrow, and was classified as a text block
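One quick way to confirm how blocks were classified is to tally the type values in the JSON output. The sketch below assumes you pass it the JSON text for the page or document in question. count_block_types is a hypothetical helper that counts every object with a string type field, wherever it is nested.

```python
import json
from collections import Counter


def count_block_types(json_text: str) -> Counter:
    """Count how many blocks of each type appear in ToJson()-style output."""
    counts: Counter = Counter()
    stack = [json.loads(json_text)]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            block_type = node.get("type")
            if isinstance(block_type, str):
                counts[block_type] += 1
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return counts
```

A count of zero for "table" on a page where you expected one suggests the region was classified as another block type, which narrows the problem to one of the causes listed above.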
Incorrect column splitting
If columns are being merged or split incorrectly, the table may have irregular spacing or proportional fonts that disrupt alignment detection. Accessing the raw content array via ToJson() and post-processing it manually often gives better results than relying on the Markdown rendering for these cases.
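As an illustration of such post-processing, the sketch below repairs one common symptom: a single logical column that was split in two. merge_columns is a hypothetical helper, and the column index has to come from your own inspection of the content array. Columns that were wrongly merged need a document-specific splitting rule and are not covered here.

```python
def merge_columns(content: list[list[str]], left: int, sep: str = " ") -> list[list[str]]:
    """Join column `left` with the column to its right in every row."""
    fixed = []
    for row in content:
        if left + 1 < len(row):
            # Drop empty halves so the separator is not left dangling.
            joined = sep.join(part for part in (row[left], row[left + 1]) if part)
            row = row[:left] + [joined] + row[left + 2:]
        fixed.append(list(row))
    return fixed
```

Rows too short to contain both columns are passed through unchanged, and the input array is not modified.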
Merged cells
Tables with horizontally or vertically merged cells (a single cell spanning multiple columns or rows) are not fully represented in the content array: the merged cell’s text is preserved, but the span relationship is flattened. Use ParseDocument() if you need to inspect cell structure at a lower level, or handle the span reconstruction in your own post-processing step.
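A possible starting point for span reconstruction in a header row is sketched below. It rests on an assumption that this page does not confirm: that a horizontally merged cell leaves its text in the leftmost position and empty strings in the positions it covered. Inspect your own content arrays before relying on it. fill_horizontal_spans is a hypothetical helper.

```python
def fill_horizontal_spans(row: list[str]) -> list[str]:
    """Copy each non-empty cell into the empty cells that follow it."""
    filled, last = [], ""
    for cell in row:
        if cell.strip():
            last = cell
        filled.append(last)
    return filled
```

Apply this only to rows known to contain merged cells, since genuinely empty cells in data rows would be overwritten as well.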
Next steps
- OCR: Enable OCR for scanned tables that contain no selectable text.
- Extract JSON: Full guide to working with the JSON output format.
- Extract Markdown: Markdown extraction with all common options.
- JSON Schema: Complete field reference for the JSON output structure.