
Overview

PDF4LLM includes automatic table detection. When a table is found on a page, it is extracted and rendered as a GitHub-flavoured Markdown table in ToMarkdown() output, or returned as a structured "table" block in ToJson() output. Table extraction is enabled by default — no configuration required.
using PDF4LLM;

string mdText = PdfExtractor.ToMarkdown("document.pdf");
Console.WriteLine(mdText);
A detected table will appear in the Markdown output like this:
| A | B | C | D |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 0 | 1 | 2 | 3 |
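If you only have the Markdown string and want the table rows back as data, the pipe-table lines can be picked out with a small parser. The following is a minimal sketch, not part of PDF4LLM: `MarkdownTables` and `Extract` are hypothetical names, and it assumes cells contain no escaped pipe characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MarkdownTables
{
    // Returns each pipe table found in the Markdown text as a list of rows.
    // Assumption: cells do not contain escaped pipes ("\|").
    public static List<List<List<string>>> Extract(string markdown)
    {
        var tables = new List<List<List<string>>>();
        List<List<string>>? current = null;

        foreach (var rawLine in markdown.Split('\n'))
        {
            var line = rawLine.Trim();
            bool isRow = line.Length > 1 && line.StartsWith("|") && line.EndsWith("|");
            if (!isRow)
            {
                current = null; // a non-table line ends the current table
                continue;
            }

            var cells = line.Substring(1, line.Length - 2)
                            .Split('|')
                            .Select(c => c.Trim())
                            .ToList();

            // The second line of a table is the delimiter row (e.g. |---|:--:|); skip it.
            bool isDelimiter = current != null && current.Count == 1 &&
                cells.All(c => c.Length > 0 && c.All(ch => ch == '-' || ch == ':'));
            if (isDelimiter) continue;

            if (current == null)
            {
                current = new List<List<string>>();
                tables.Add(current);
            }
            current.Add(cells);
        }
        return tables;
    }
}
```

Calling `MarkdownTables.Extract(mdText)` on the output above would yield one table whose first row is the header. For anything beyond simple tables, the structured ToJson() output is likely the more reliable source.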

How table detection works

PDF4LLM detects tables by analysing the visual structure of the page — looking for ruled lines, column alignment, and consistent row spacing. It does not rely on tagged PDF structure, so it works on both tagged and untagged PDFs. Detection handles:
  • Tables with explicit borders (ruled lines on all sides)
  • Tables with partial borders (header rule only, or row dividers only)
  • Borderless tables detected through column alignment and whitespace
  • Multi-line cell content
  • Merged header cells
Tables that span multiple pages are detected page by page and are not merged automatically; see Multi-page tables below. If a table is not rendering as expected, see Troubleshooting below.
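To give a feel for what "column alignment" means for borderless tables, here is an illustrative sketch. It is not PDF4LLM's algorithm, which this page does not describe; `ColumnAlignment`, `FindColumnEdges`, and the 3-point tolerance are all assumptions made for the example. It groups the left edges of text fragments into candidate columns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnAlignment
{
    // Groups left-edge x coordinates into candidate columns. After sorting,
    // an edge starts a new column when it lies more than `tolerance` points
    // to the right of the previous edge. Returns the first edge of each group.
    public static List<double> FindColumnEdges(IEnumerable<double> leftEdges, double tolerance = 3.0)
    {
        var columns = new List<double>();
        double? previous = null;

        foreach (var x in leftEdges.OrderBy(v => v))
        {
            if (previous == null || x - previous.Value > tolerance)
                columns.Add(x);
            previous = x;
        }
        return columns;
    }
}
```

Fragments whose left edges fall into the same few groups on row after row suggest a grid. Inconsistent spacing produces many small groups, which is one reason a borderless table can go undetected.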

Accessing raw table data

When using ToJson(), detected tables are returned as "table" blocks with full cell-level data:
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json.Linq;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
JToken root = JToken.Parse(json);
JArray pages = root switch
{
    JArray arr => arr,
    JObject obj when obj["pages"] is JArray arr => arr,
    _ => throw new InvalidOperationException("Expected a JSON array or an object containing a 'pages' array.")
};

foreach (JObject page in pages)
{
    int pageNum = page["page_number"]!.Value<int>();
    Console.WriteLine($"\nPage {pageNum}");

    foreach (JObject box in (page["boxes"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
    {
        if (!string.Equals(box["boxclass"]?.Value<string>(), "table", StringComparison.Ordinal)) continue;

        var tableToken = box["table"];
        var rows = tableToken switch
        {
            // Legacy/simple shape: "table" is directly a 2D array.
            JArray arr => arr.ToObject<List<List<string>>>() ?? [],
            // Current shape: "table" is an object and text content is in "extract".
            JObject obj when obj["extract"] is JArray extract =>
                extract.ToObject<List<List<string>>>() ?? [],
            _ => []
        };
        int rowCount = rows.Count;
        int columnCount = rows.Count > 0 ? rows.Max(r => r?.Count ?? 0) : 0;
        Console.WriteLine($"Table: {rowCount} rows × {columnCount} columns");

        foreach (var row in rows)
            Console.WriteLine(string.Join(" | ", row ?? []));
    }
}

Table block structure

Each "table" block in the JSON output has the following shape:
{
  "type": "table",
  "bbox": [72.0, 200.0, 523.0, 420.0],
  "content": [
    ["A",  "B",  "C",  "D" ],
    ["A1", "B1", "C1", "D1"],
    ["A2", "B2", "C2", "D2"]
  ]
}
| Field | Type | Description |
|---|---|---|
| type | string | Always "table" for table blocks. |
| bbox | [x0, y0, x1, y1] | Bounding box of the entire table in PDF coordinates. |
| content | string[][] | Two-dimensional array of cell text. Rows first, columns within each row. |
The first row in content is typically the header row, but is not explicitly flagged as such — treat content[0] as the header for tables that clearly have column labels, and validate against a sample of your documents.
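When the first row does hold column labels, mapping each later row to a dictionary keyed by those labels makes downstream code easier to read. A minimal sketch follows; `TableRecords` and `ToRecords` are hypothetical helpers, and the `column_N` fallback name for blank or duplicate labels is an arbitrary choice for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TableRecords
{
    // Treats rows[0] as the header and maps every later row to a dictionary.
    public static List<Dictionary<string, string>> ToRecords(List<List<string>> rows)
    {
        var records = new List<Dictionary<string, string>>();
        if (rows.Count < 2) return records; // no data rows

        var header = rows[0];
        foreach (var row in rows.Skip(1))
        {
            var record = new Dictionary<string, string>();
            for (int i = 0; i < header.Count; i++)
            {
                // Fall back to a positional name for blank or duplicate labels.
                string key = string.IsNullOrWhiteSpace(header[i]) || record.ContainsKey(header[i])
                    ? $"column_{i}"
                    : header[i];
                // Short rows are padded with empty strings.
                record[key] = i < row.Count ? row[i] ?? "" : "";
            }
            records.Add(record);
        }
        return records;
    }
}
```

With the sample block above, `ToRecords(rows)[0]["B"]` would be `"B1"`.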

Extracting tables to CSV

Use the content array from ToJson() to export table data directly to CSV:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Newtonsoft.Json.Linq;
using PDF4LLM;

string json = PdfExtractor.ToJson("document.pdf");
JToken root = JToken.Parse(json);
JArray pages = root switch
{
    JArray arr => arr,
    JObject obj when obj["pages"] is JArray arr => arr,
    _ => throw new InvalidOperationException("Expected a JSON array or an object containing a 'pages' array.")
};

var csvLines = new List<string>();

foreach (JObject page in pages)
{
    foreach (JObject box in (page["boxes"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
    {
        if (!string.Equals(box["boxclass"]?.Value<string>(), "table", StringComparison.Ordinal)) continue;

        var tableToken = box["table"];
        var rows = tableToken switch
        {
            // Legacy/simple shape: "table" is directly a 2D array.
            JArray arr => arr.ToObject<List<List<string>>>() ?? [],
            // Current shape: "table" is an object and text content is in "extract".
            JObject obj when obj["extract"] is JArray extract =>
                extract.ToObject<List<List<string>>>() ?? [],
            _ => []
        };

        foreach (var row in rows)
        {
            // Quote cells that contain commas, quotes, or line breaks;
            // treat null cells as empty.
            var escaped = row.Select(cell =>
            {
                cell ??= "";
                return cell.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0
                    ? $"\"{cell.Replace("\"", "\"\"")}\""
                    : cell;
            });
            csvLines.Add(string.Join(",", escaped));
        }

        csvLines.Add(""); // blank line between tables
    }
}

File.WriteAllLines("tables.csv", csvLines, System.Text.Encoding.UTF8);

Multi-page tables

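Tables that cross page boundaries are not merged automatically. Each page’s fragment is returned as a separate table block. To stitch the fragments together, compare column counts and append the rows yourself, skipping the header row on continuation pages. The sample below assumes the header repeats on every page; where it does not, the first data row of each continuation fragment would be dropped.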
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json.Linq;
using PDF4LLM;

string json = PdfExtractor.ToJson("report.pdf");
JToken root = JToken.Parse(json);
JArray pages = root switch
{
    JArray arr => arr,
    JObject obj when obj["pages"] is JArray arr => arr,
    _ => throw new InvalidOperationException("Expected a JSON array or an object containing a 'pages' array.")
};

var mergedRows = new List<List<string>>();
int? prevColCount = null;

foreach (JObject page in pages)
{
    foreach (JObject box in (page["boxes"] as JArray)?.OfType<JObject>() ?? Enumerable.Empty<JObject>())
    {
        if (!string.Equals(box["boxclass"]?.Value<string>(), "table", StringComparison.Ordinal)) continue;

        var tableToken = box["table"];
        var rows = tableToken switch
        {
            // Legacy/simple shape: "table" is directly a 2D array.
            JArray arr => arr.ToObject<List<List<string>>>() ?? [],
            // Current shape: "table" is an object and text content is in "extract".
            JObject obj when obj["extract"] is JArray extract =>
                extract.ToObject<List<List<string>>>() ?? [],
            _ => []
        };
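        // Skip empty or unrecognised table blocks; rows[0] would throw otherwise.
        if (rows.Count == 0) continue;
        int colCount = rows[0].Count;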

        if (colCount == prevColCount)
        {
            // Continuation page — skip the header row
            mergedRows.AddRange(rows.Skip(1));
        }
        else
        {
            // New table or first page — include the header
            mergedRows.AddRange(rows);
        }

        prevColCount = colCount;
    }
}

Console.WriteLine($"Merged table: {mergedRows.Count} rows");
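A slightly more defensive variant keeps unrelated tables apart and drops a continuation's first row only when it actually repeats the header. This is a sketch under stated assumptions rather than a library feature: `TableStitcher` and `Stitch` are hypothetical names, and the input is the per-block row lists in document order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TableStitcher
{
    // Merges consecutive fragments that have the same column count.
    // The first row of a continuation is dropped only when it repeats the header.
    public static List<List<List<string>>> Stitch(IEnumerable<List<List<string>>> fragments)
    {
        var tables = new List<List<List<string>>>();

        foreach (var fragment in fragments)
        {
            if (fragment.Count == 0) continue; // nothing to merge

            bool continues = tables.Count > 0 &&
                tables[tables.Count - 1][0].Count == fragment[0].Count;

            if (!continues)
            {
                // Different width (or first fragment): start a new table.
                tables.Add(new List<List<string>>(fragment));
                continue;
            }

            var target = tables[tables.Count - 1];
            bool repeatsHeader = target[0].SequenceEqual(fragment[0]);
            target.AddRange(repeatsHeader ? fragment.Skip(1) : fragment);
        }
        return tables;
    }
}
```

Two unrelated tables that happen to share a column count would still be joined by this heuristic. Comparing page numbers or bounding boxes as well may help if that occurs in your documents.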

Troubleshooting

Table not detected

If a table is being returned as plain text rather than a "table" block, use ToJson() to inspect the raw layout on that page and confirm how the blocks are classified:
// Set this to the index of the page you want to inspect.
int suspectPageIndex = 0;

string json = PdfExtractor.ToJson(
    "document.pdf",
    pages: new List<int> { suspectPageIndex }
);
// Open the JSON and look for the table content.
// If it appears as multiple "type": "text" blocks rather than a
// single "type": "table" block, the layout engine did not detect
// a tabular structure.
Common causes:
  • The table is borderless with inconsistent column spacing — the layout engine could not find a reliable grid
  • The table is an image (scanned) — enable OCR and check whether cells are being recognised
  • The table has only one column, or is very narrow, and was classified as a text block
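Rather than reading the JSON by eye, you can tally how each page's blocks were classified. The sketch below uses System.Text.Json instead of the Newtonsoft parser used elsewhere on this page, purely to stay dependency-free. `LayoutInspector` and `CountBoxClasses` are hypothetical names, and the "pages", "boxes", and "boxclass" field names are taken from the samples above, so check them against your actual output.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class LayoutInspector
{
    // Counts how many blocks of each "boxclass" appear on each page.
    public static List<Dictionary<string, int>> CountBoxClasses(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;
        // Accept either a top-level array of pages or an object with "pages".
        var pages = root.ValueKind == JsonValueKind.Array ? root : root.GetProperty("pages");

        var result = new List<Dictionary<string, int>>();
        foreach (var page in pages.EnumerateArray())
        {
            var counts = new Dictionary<string, int>();
            if (page.ValueKind == JsonValueKind.Object &&
                page.TryGetProperty("boxes", out var boxes) &&
                boxes.ValueKind == JsonValueKind.Array)
            {
                foreach (var box in boxes.EnumerateArray())
                {
                    if (box.ValueKind != JsonValueKind.Object) continue;
                    string name =
                        box.TryGetProperty("boxclass", out var cls) && cls.ValueKind == JsonValueKind.String
                            ? cls.GetString() ?? "(none)"
                            : "(none)";
                    counts[name] = counts.TryGetValue(name, out var n) ? n + 1 : 1;
                }
            }
            result.Add(counts);
        }
        return result;
    }
}
```

A page that shows many text blocks and no table block where you expected a table points to one of the causes listed above.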

Incorrect column splitting

If columns are being merged or split incorrectly, the table may have irregular spacing or proportional fonts that disrupt alignment detection. Accessing the raw content array via ToJson() and post-processing it manually often gives better results than relying on the Markdown rendering for these cases.
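As an example of such post-processing, the sketch below pads ragged rows to a common width and rejoins a column that was split in two. `ColumnRepair`, `Pad`, and `MergeWithNext` are hypothetical helpers, and which column index to merge is something you would decide per document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnRepair
{
    // Pads short rows with empty strings so every row has the same width.
    public static List<List<string>> Pad(List<List<string>> rows)
    {
        int width = rows.Count == 0 ? 0 : rows.Max(r => r.Count);
        return rows
            .Select(r => r.Concat(Enumerable.Repeat("", width - r.Count)).ToList())
            .ToList();
    }

    // Joins column `index` with the column to its right, for tables where one
    // logical column was split in two. Rows too short to contain both columns
    // are copied unchanged.
    public static List<List<string>> MergeWithNext(List<List<string>> rows, int index)
    {
        var result = new List<List<string>>();
        foreach (var row in rows)
        {
            var copy = new List<string>(row);
            if (index >= 0 && index + 1 < copy.Count)
            {
                copy[index] = $"{copy[index]} {copy[index + 1]}".Trim();
                copy.RemoveAt(index + 1);
            }
            result.Add(copy);
        }
        return result;
    }
}
```

Both helpers return new lists and leave the input untouched, so they can be chained or compared against the original rows.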

Merged cells

Tables with horizontally or vertically merged cells (a single cell spanning multiple columns or rows) are not fully represented in the content array — the merged cell’s text is preserved but the span relationship is flattened. Use ParseDocument() if you need to inspect cell structure at a lower level, or handle the span reconstruction in your own post-processing step.
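One simple form of span reconstruction is to copy a merged header's text across the cells it covered. The sketch below assumes that a flattened horizontal span leaves its text in the first cell and empty strings in the cells to its right. This page does not confirm that behaviour, so verify it on your own documents; `SpanFill` and `FillRight` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class SpanFill
{
    // Copies the nearest non-empty value on the left into each empty cell.
    // Assumption: cells covered by a horizontal span come back as empty strings.
    public static List<string> FillRight(List<string> row)
    {
        var filled = new List<string>(row.Count);
        string last = "";
        foreach (var cell in row)
        {
            if (!string.IsNullOrWhiteSpace(cell)) last = cell;
            filled.Add(last);
        }
        return filled;
    }
}
```

Apply this only to rows you know contain spans, such as a grouped header row, since it will also fill cells that were genuinely empty.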

Next steps

OCR

Enable OCR for scanned tables that contain no selectable text.

Extract JSON

Full guide to working with the JSON output format.

Extract Markdown

Markdown extraction with all common options.

JSON Schema

Complete field reference for the JSON output structure.