By default, PyMuPDF4LLM processes every page in a document. The pages parameter lets you specify exactly which pages to extract — as a list of zero-based page indices. It is supported by to_markdown(), to_json(), and to_text().
    import pymupdf4llm

    # Extract only the first three pages
    md_text = pymupdf4llm.to_markdown("document.pdf", pages=[0, 1, 2])
Page numbers in PyMuPDF4LLM are zero-based — the first page of a document is page 0, the second is page 1, and so on.
| Document Page | pages Index |
| --- | --- |
| Page 1 | 0 |
| Page 2 | 1 |
| Page 10 | 9 |
| Last page | n - 1 |
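The mapping in the table can be sketched as a small conversion helper. The function name `to_page_indices` is hypothetical and not part of PyMuPDF4LLM; it only illustrates the subtract-one rule for page numbers as shown in a PDF viewer.

```python
def to_page_indices(page_numbers):
    """Convert 1-based page numbers (as shown in a viewer) to zero-based indices."""
    indices = []
    for number in page_numbers:
        if number < 1:
            # Viewer page numbers start at 1, so anything lower is a caller error.
            raise ValueError(f"page numbers start at 1, got {number}")
        indices.append(number - 1)
    return indices


# Pages 1, 2 and 10 of the document map to indices 0, 1 and 9.
print(to_page_indices([1, 2, 10]))  # prints [0, 1, 9]
```

The returned list can then be passed as the pages argument, for example `pages=to_page_indices([1, 2, 10])`.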
Passing a page index that doesn’t exist in the document will raise an error. Always check the document’s page count (doc.page_count) before constructing a dynamic page list.
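One way to guard a dynamic page list is to filter it against the page count before calling the extractor. The sketch below is a minimal example under stated assumptions: `clamp_pages` is a hypothetical helper, not a library function, and it conservatively drops negative indices as well, since the text above only describes zero-based indices.

```python
def clamp_pages(requested, page_count):
    """Keep indices that exist in a document of `page_count` pages.

    Order is preserved and duplicates are dropped.
    """
    seen = set()
    valid = []
    for index in requested:
        # Assumption: only 0 .. page_count - 1 are treated as valid indices.
        if 0 <= index < page_count and index not in seen:
            seen.add(index)
            valid.append(index)
    return valid


# With a 5-page document, indices 7 and -1 are dropped, and the repeated 2 is kept once.
print(clamp_pages([0, 2, 7, -1, 2], page_count=5))  # prints [0, 2]
```

With an open document, the second argument would be doc.page_count, and the result can be passed directly as pages. Silently dropping invalid indices is a design choice; raising an error with a clearer message may suit stricter pipelines better.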
When using page_chunks=True, the returned list contains chunks only for the pages you specified. The page value in each chunk's metadata reflects the original document page number, not the chunk's position in the returned list. Page 4 in the document is always reported as 4, regardless of how many pages were skipped.
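The difference between list position and reported page can be illustrated without opening a PDF. This sketch assumes each chunk is a dictionary with a "metadata" dictionary, and that the page value is stored under the key "page", as the wording above suggests. The exact key name may differ between versions, so it is left as a parameter. `original_pages` and `sample_chunks` are hypothetical stand-ins, not library objects.

```python
def original_pages(chunks, key="page"):
    """Return the page value recorded in each chunk's metadata, in list order."""
    return [chunk["metadata"][key] for chunk in chunks]


# Stand-in for the list returned with page_chunks=True; real chunks carry more fields.
sample_chunks = [
    {"metadata": {"page": 4}, "text": "..."},
    {"metadata": {"page": 9}, "text": "..."},
]

# The list position counts from 0, while the page value comes from the document.
for position, page in enumerate(original_pages(sample_chunks)):
    print(f"list position {position} -> document page {page}")
```

When a downstream step needs to cite or link back to a page, reading the value from the metadata is safer than inferring it from the chunk's index in the list.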
Combine page selection with a quick content check to skip pages that return no meaningful text:
    import pymupdf
    import pymupdf4llm

    doc = pymupdf.open("document.pdf")

    # Find pages that have selectable text
    non_blank = [
        i for i in range(doc.page_count)
        if doc[i].get_text().strip()
    ]

    print(f"{len(non_blank)} of {doc.page_count} pages contain text")

    md_text = pymupdf4llm.to_markdown(doc, pages=non_blank)
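A non-empty check still lets through pages that hold only a page number or a stray header. A possible refinement, sketched here, is a minimum-length test on the extracted text. The helper name `has_meaningful_text` and the default threshold of 20 characters are illustrative assumptions, not library features or recommended values.

```python
def has_meaningful_text(text, min_chars=20):
    """Return True if `text` has at least `min_chars` non-whitespace characters."""
    # Remove all whitespace so blank lines and indentation do not count.
    visible = "".join(text.split())
    return len(visible) >= min_chars


print(has_meaningful_text("   \n  "))  # prints False
print(has_meaningful_text("A paragraph with enough characters to count."))  # prints True
```

In the list comprehension above, the condition `doc[i].get_text().strip()` could be swapped for `has_meaningful_text(doc[i].get_text())`, with the threshold tuned to the documents being processed.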
The pages parameter is supported by to_markdown(), to_json(), and to_text(). For full API signatures see the API Reference.