By default, PDF4LLM processes every page in a document. The pages parameter lets you specify exactly which pages to extract — as a List<int> of zero-based page indices. It is supported by ToMarkdown(), ToJson(), and ToText().
using System.Collections.Generic;
using PDF4LLM;

// Extract only the first three pages
string mdText = PdfExtractor.ToMarkdown("document.pdf", pages: new List<int> { 0, 1, 2 });
Page numbers in PDF4LLM are zero-based — the first page of a document is page 0, the second is page 1, and so on.
| Document page | pages index |
|---------------|-------------|
| Page 1        | 0           |
| Page 2        | 1           |
| Page 10       | 9           |
| Last page     | n - 1       |
Passing a page index that doesn’t exist in the document throws an exception. Always check the document’s page count (doc.PageCount) before constructing a dynamic page list.
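One way to guard against out-of-range indices is to build the list through a small helper. The following is a minimal sketch, not part of PDF4LLM: the PageSelection class and its ToPageIndices method are hypothetical names, and pageCount stands in for the value you would read from doc.PageCount. It converts 1-based page numbers to zero-based indices and keeps only those that exist in the document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumption: in real code this value would come from doc.PageCount.
int pageCount = 12;

// Pages 1, 2, and 10 exist in a 12-page document; page 99 does not.
List<int> pages = PageSelection.ToPageIndices(new[] { 1, 2, 10, 99 }, pageCount);
Console.WriteLine(string.Join(", ", pages)); // 0, 1, 9

// Hypothetical helper, not part of the PDF4LLM API.
public static class PageSelection
{
    // Converts 1-based page numbers to zero-based indices and drops
    // any index outside the valid range 0 .. pageCount - 1.
    public static List<int> ToPageIndices(IEnumerable<int> pageNumbers, int pageCount)
    {
        return pageNumbers
            .Select(n => n - 1)                   // Page 1 -> index 0
            .Where(i => i >= 0 && i < pageCount)  // keep only existing pages
            .Distinct()                           // remove duplicates
            .OrderBy(i => i)                      // return in document order
            .ToList();
    }
}
```

The resulting list can be passed as the pages argument. This sketch silently drops invalid pages; if a bad page number should be treated as an error in your application, throwing an ArgumentOutOfRangeException from the helper may be the better choice.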
When using LlamaMarkdownReader, the returned list contains only the chunks for the pages you want if you filter the results after loading. Each chunk’s ExtraInfo preserves the original page number from the document:
using System;
using System.Linq;
using PDF4LLM;

var reader = PdfExtractor.LlamaMarkdownReader();
var allChunks = reader.LoadData("document.pdf");

// Filter to pages 4, 5, and 6 after loading
var chunks = allChunks
    .Where(c => new[] { 4, 5, 6 }.Contains((int)c.ExtraInfo["page"]))
    .ToList();

foreach (var chunk in chunks)
{
    int page = (int)chunk.ExtraInfo["page"];
    Console.WriteLine($"Page {page}: {chunk.Text.Length} chars");
}
// Page 4: 1842 chars
// Page 5: 2103 chars
// Page 6: 987 chars
The page value in ExtraInfo reflects the original document page number, not the position in the returned list. Page 4 in the document is always reported as 4, regardless of how many pages were skipped.