to_markdown() is the primary extraction function in PyMuPDF4LLM. It reads a document and returns its content as a Markdown string, preserving headings, lists, tables, code blocks, images, and reading order as closely as possible.
Set page_chunks=True to return a list of per-page dictionaries instead of a single concatenated string. Each chunk contains the page’s Markdown text and its associated metadata:
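```python
import pymupdf4llm

# With page_chunks=True the result is a list with one dictionary per page
chunks = pymupdf4llm.to_markdown("document.pdf", page_chunks=True)

for chunk in chunks:
    print(f"Page {chunk['metadata']['page']}")
    print(chunk["text"])
```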
This is the recommended mode for RAG pipelines and LLM ingestion workflows. See Chunk Schema for more details on the structure of the returned dictionaries.
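As a rough illustration of how per-page chunks might feed an ingestion step, the sketch below flattens them into simple records and skips pages without text. It relies only on the "text" and "metadata" / "page" keys shown above; the helper name chunks_to_records, the record layout, and the hand-written sample_chunks list are assumptions made for this example and are not part of PyMuPDF4LLM.

```python
from typing import Any


def chunks_to_records(chunks: list[dict[str, Any]], source: str) -> list[dict[str, Any]]:
    """Flatten per-page chunks into simple records suitable for indexing.

    Hypothetical helper: the record fields ("id", "page", "text") are an
    illustrative choice, not a format defined by the library.
    """
    records = []
    for chunk in chunks:
        text = chunk["text"].strip()
        if not text:
            continue  # skip pages that produced no Markdown text
        page = chunk["metadata"]["page"]
        records.append({"id": f"{source}#page={page}", "page": page, "text": text})
    return records


# Hand-written stand-in for the result of to_markdown(..., page_chunks=True);
# real chunks carry more metadata than the single key used here.
sample_chunks = [
    {"metadata": {"page": 1}, "text": "# Title\n\nIntro paragraph.\n"},
    {"metadata": {"page": 2}, "text": "   \n"},
    {"metadata": {"page": 3}, "text": "## Results\n\nSome text.\n"},
]

records = chunks_to_records(sample_chunks, source="report.pdf")
for record in records:
    print(record["id"], len(record["text"]))
```

Keeping the page number in each record means a retrieved passage can be traced back to the page it came from.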
```python
import pymupdf4llm
from pathlib import Path

chunks = pymupdf4llm.to_markdown(
    "report.pdf",
    pages=[0, 1, 2, 3, 4],   # first five pages only
    page_chunks=True,        # return per-page dictionaries
    write_images=True,       # extract images to disk
    image_path="assets/",    # image output directory
    image_format="png",      # image format
    dpi=200,                 # image resolution
)

# Make sure the output directory exists before writing to it
output_dir = Path("output")
output_dir.mkdir(parents=True, exist_ok=True)

# Save each page as a separate Markdown file
for chunk in chunks:
    page_num = chunk["metadata"]["page"]
    (output_dir / f"page-{page_num}.md").write_text(chunk["text"], encoding="utf-8")
```
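Files named page-1.md, page-2.md, ..., page-10.md do not sort in page order alphabetically, which matters if the saved pages are read back later. The following sketch shows one possible way to load them in numeric order. The helper read_pages_in_order is hypothetical, and the demo writes its own small files to a temporary directory rather than relying on real extraction output.

```python
import re
import tempfile
from pathlib import Path


def read_pages_in_order(directory: Path) -> list[tuple[int, str]]:
    """Return (page_number, markdown) pairs sorted numerically by page number."""
    pattern = re.compile(r"page-(\d+)\.md$")
    pages = []
    for path in directory.glob("page-*.md"):
        match = pattern.search(path.name)
        if match:
            pages.append((int(match.group(1)), path.read_text(encoding="utf-8")))
    return sorted(pages)  # tuples sort by page number first


# Demo with placeholder files that follow the page-<n>.md naming scheme
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    for n in (1, 2, 10):
        (out / f"page-{n}.md").write_text(f"# Page {n}\n", encoding="utf-8")
    pages = read_pages_in_order(out)

page_numbers = [number for number, _ in pages]
print(page_numbers)  # [1, 2, 10]
```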