Signature

pymupdf4llm.to_markdown(doc: pymupdf.Document | str, *, 
    detect_bg_color: bool = True, 
    dpi: int = 150, 
    embed_images: bool = False, 
    extract_words: bool = False, 
    filename: str | None = None, 
    fontsize_limit: float = 3, 
    footer: bool = True, 
    force_ocr: bool = False, 
    force_text: bool = True, 
    graphics_limit: int = None, 
    hdr_info: Any = None, 
    header: bool = True, 
    ignore_alpha: bool = False, 
    ignore_code: bool = False, 
    ignore_graphics: bool = False, 
    ignore_images: bool = False, 
    image_format: str = "png", 
    image_path: str = "", 
    image_size_limit: float = 0.05, 
    margins: float | list = 0, 
    ocr_dpi: int = 300, 
    ocr_function: callable = None, 
    ocr_language: str = "eng", 
    page_chunks: bool = False, 
    page_height: float = None, 
    page_separators: bool = False, 
    page_width: float = 612, 
    pages: list | range | None = None, 
    show_progress: bool = False, 
    table_strategy: str = "lines_strict", 
    use_glyphs: bool = False, 
    use_ocr: bool = True, 
    write_images: bool = False
) -> str | list[dict]

Parameters

doc
str | pymupdf.Document
required
Path to the document file, or an already-opened pymupdf.Document instance. Supports PDF, XPS, eBooks, and — with PyMuPDF Pro — Office formats.
detect_bg_color
bool
default:"True"
Does a simple check for the general background color of the pages. If any text or vector has this color it will be ignored. May increase detection accuracy.
dpi
int
default:"150"
Desired image resolution in dots per inch. Relevant only if write_images=True or embed_images=True.
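To give a feel for what a dpi value means in practice, the sketch below converts a size in points (1 point = 1/72 inch) into pixels. The helper name pixel_size is hypothetical and not part of pymupdf4llm; it only illustrates the arithmetic, and the actual size of a written image depends on the page area being rasterised.

```python
def pixel_size(width_pt: float, height_pt: float, dpi: int = 150) -> tuple[int, int]:
    """Convert a size in points (1/72 inch) to pixels at the given resolution."""
    scale = dpi / 72  # there are always 72 points per inch
    return round(width_pt * scale), round(height_pt * scale)

# A full Letter page (612 x 792 points) at 150 dpi and at 300 dpi
print(pixel_size(612, 792))       # (1275, 1650)
print(pixel_size(612, 792, 300))  # (2550, 3300)
```

Doubling the dpi doubles each dimension, so the pixel count grows by a factor of four.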
embed_images
bool
default:"False"
Like write_images, but images are included in the Markdown text as base64-encoded strings. Mutually exclusive with write_images — ignores image_path. May drastically increase the size of the Markdown output.
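The size increase comes from base64 itself, which represents every 3 bytes of binary data as 4 text characters. The sketch below only demonstrates that overhead with the standard library; embedded_length is a hypothetical helper, and the zero-filled buffer stands in for real image data.

```python
import base64

def embedded_length(raw: bytes) -> int:
    """Number of base64 characters needed to embed the given raw bytes."""
    return len(base64.b64encode(raw))

raw = bytes(3000)  # stand-in for 3,000 bytes of image data
print(embedded_length(raw))  # 4000
```

A document with many large images can therefore produce Markdown that is roughly a third larger than the combined size of the image data alone.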
extract_words
bool
default:"False"
Enforces page_chunks=True and adds a "words" key to each page dictionary. Its value is a list of words as delivered by PyMuPDF’s Page.get_text("words"), in the same sequence as the extracted text.
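As a minimal sketch of how the "words" list might be consumed, the code below rebuilds text lines from word tuples by grouping on block and line number. The sample tuples are invented for illustration and lines_from_words is a hypothetical helper; in real use the list would come from a page dictionary's "words" key.

```python
from itertools import groupby

# Invented sample in the documented tuple layout:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
words = [
    (72.0, 72.0, 100.0, 84.0, "Hello", 0, 0, 0),
    (104.0, 72.0, 140.0, 84.0, "world", 0, 0, 1),
    (72.0, 90.0, 110.0, 102.0, "Second", 0, 1, 0),
    (114.0, 90.0, 135.0, 102.0, "line", 0, 1, 1),
]

def lines_from_words(words):
    """Join consecutive words that share the same block and line number."""
    return [
        " ".join(w[4] for w in group)
        for _, group in groupby(words, key=lambda w: (w[5], w[6]))
    ]

print(lines_from_words(words))  # ['Hello world', 'Second line']
```

Because groupby only merges consecutive items, the sketch relies on the words already being in reading order.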
filename
str
Overwrites or sets the desired image file name for written images. Useful when the document is provided as a memory object with no inherent file name.
fontsize_limit
float
default:"3"
Minimum font size to consider for text extraction. Text with a font size below this threshold is excluded. Default of 3 means only text with a font size >= 3 will be extracted.
footer
bool
default:"True"
Controls whether footer text is included in the extraction output. Set to False to omit repetitive footer content that does not add value to the extracted data.
force_ocr
bool
default:"False"
When True, OCR is applied to all pages regardless of their content. Useful for documents known to be image-based that do not meet the default criteria for automatic OCR. When False, OCR is only applied to pages that meet the default criteria.
Requires ocr_function to be specified — an exception will be raised if it is not.
force_text
bool
default:"True"
Generate text output even when text overlaps images or graphics. The overlapping text will appear after the respective image in the output.
graphics_limit
int
Limits processing of vector graphics elements. If the number of vector graphics on a page exceeds this threshold, all vector graphics on that page are ignored. Useful for scientific documents or pages that simulate text via graphics commands, which can contain tens of thousands of objects and cause intolerable runtimes.
hdr_info
callable | object | bool
Custom header detection logic. Accepts a callable or an object with a get_header_id method. The callable (or method) must accept a text span dictionary (as returned by extractDICT) and a keyword argument page (the owning Page object), and must return either "" or a string of up to 6 # characters followed by a space.
If None, a full document scan is performed to identify the most popular font sizes and derive header levels from them. To skip header detection entirely, pass hdr_info=False or hdr_info=lambda s, page=None: "".
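A minimal sketch of such a callable, assuming header levels can be derived from font size alone. The thresholds 20 and 14 are arbitrary assumptions chosen for illustration, and header_by_size is a hypothetical name; only the span's "size" key is read.

```python
def header_by_size(span: dict, page=None) -> str:
    """Return a Markdown header prefix based on the span's font size."""
    size = span.get("size", 0)
    if size >= 20:   # assumed threshold for top-level titles
        return "# "
    if size >= 14:   # assumed threshold for section headings
        return "## "
    return ""        # body text gets no header prefix

print(repr(header_by_size({"size": 24, "text": "Title"})))    # '# '
print(repr(header_by_size({"size": 15, "text": "Section"})))  # '## '
print(repr(header_by_size({"size": 10, "text": "Body"})))     # ''
```

The function would then be passed as hdr_info=header_by_size. A fixed mapping like this avoids the full document scan, at the cost of having to know the document's font sizes in advance.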
header
bool
default:"True"
Controls whether header text is included in the extraction output. Set to False to omit repetitive header content that does not add value to the extracted data.
ignore_alpha
bool
default:"False"
When True, includes text that is completely transparent. By default, transparent text is ignored, which usually increases detection accuracy.
ignore_code
bool
default:"False"
When True, monospaced text lines are not given special formatting and no code blocks are generated. Automatically set to True when extract_words=True.
ignore_graphics
bool
default:"False"
Disregards all vector graphics on the page. Useful for crowded pages such as presentation slides, and speeds up processing time. Automatically disables table detection.
ignore_images
bool
default:"False"
Disregards all images on the page. Useful for crowded pages such as presentation slides, and speeds up processing time.
image_format
str
default:"png"
Desired output format for extracted images, specified as a file extension. Common values are "png" and "jpg". All PyMuPDF supported output image formats are accepted.
image_path
str
Directory in which to save extracted images. Relevant only when write_images=True. Defaults to the script’s directory.
image_size_limit
float
default:"0.05"
A value in the range 0 <= value < 1. Images are excluded from output if their width is less than or equal to image_size_limit × page width, or their height is less than or equal to image_size_limit × page height. The default of 0.05 means an image’s width and height must each exceed 5% of the page dimensions to be included.
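The rule above can be sketched as a small predicate. This is an illustration of the documented condition only, not the library's code; image_is_kept and its parameter names are hypothetical.

```python
def image_is_kept(img_w, img_h, page_w, page_h, limit=0.05):
    """True if both image dimensions exceed limit x the page dimensions."""
    if not 0 <= limit < 1:
        raise ValueError("limit must satisfy 0 <= limit < 1")
    return img_w > limit * page_w and img_h > limit * page_h

# Letter page of 612 x 792 points: the thresholds are 30.6 and 39.6 points
print(image_is_kept(100, 100, 612, 792))  # True
print(image_is_kept(20, 100, 612, 792))   # False, width 20 is below 30.6
```

Setting the limit to 0 keeps every image with a non-zero width and height.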
margins
float | list[float]
default:"0"
Page border margins. Only content within the specified margins is considered for extraction. Accepts a single float or a sequence of 2 or 4 floats:
  • A single float f expands to (f, f, f, f) — equal margins on all sides.
  • A 2-element sequence (top, bottom) expands to (0, top, 0, bottom).
  • A 4-element sequence specifies (left, top, right, bottom) directly.
  • The default 0 reads the full page with no margins applied.
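The expansion rules listed above can be sketched as follows. The function expand_margins is a hypothetical helper written to mirror the documented behaviour, not the library's own implementation.

```python
def expand_margins(margins):
    """Normalise a margins value to (left, top, right, bottom)."""
    if isinstance(margins, (int, float)):
        return (margins,) * 4          # same margin on all four sides
    values = tuple(margins)
    if len(values) == 2:
        top, bottom = values
        return (0, top, 0, bottom)     # no left or right margin
    if len(values) == 4:
        return values                  # already (left, top, right, bottom)
    raise ValueError("margins must be a float or a sequence of 2 or 4 floats")

print(expand_margins(36))        # (36, 36, 36, 36)
print(expand_margins((50, 40)))  # (0, 50, 0, 40)
```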
ocr_dpi
int
default:"300"
Image resolution in dots per inch used when rendering a page to an intermediate image for OCR. Only relevant for pages determined to benefit from OCR (e.g. pages with little or no text, or pages largely covered by images or character-like vectors). Higher values may improve OCR precision but increase memory usage and processing time, and risk over-sharpening the image. The default of 300 should be sufficient for most documents.
ocr_function
callable
A custom OCR function to use in place of the built-in Tesseract engine. If None, Tesseract is used automatically. See the OCR engines documentation for the expected function signature.
ocr_language
str
default:"eng"
Language code passed to the Tesseract OCR engine. Defaults to "eng" (English). Multiple languages can be combined with a + separator — for example, "eng+deu" for English and German. Ensure the corresponding Tesseract language data files are installed before use.
page_chunks
bool
default:"False"
When True, returns a list of dictionaries, one per page, instead of a single string. Each dictionary has the following keys:
  • metadata: a dictionary containing the document’s standard metadata, enriched with three additional keys:
      Key          Description
      file_path    Source file name
      page_count   Total number of pages in the document
      page_number  1-based page number
  • toc_items: a list of Table of Contents entries pointing to this page. Each item has the format [lvl, title, pagenumber], where lvl is the hierarchy level, title is a string, and pagenumber is a 1-based integer.
  • tables: a list of tables detected on the page. Each item is a dictionary with keys bbox (a pymupdf.Rect in tuple format giving the table’s position), row_count, and col_count.
  • images: a list of images on the page, as returned by Page.get_image_info().
  • graphics: a list of bounding boxes for clustered vector graphics on the page, as returned by Page.cluster_drawings().
  • text: the page content as a Markdown string.
  • words: populated when extract_words=True. A list of tuples (x0, y0, x1, y1, "wordstring", bno, lno, wno) as returned by page.get_text("words"). The sequence of tuples matches the reading order of the Markdown text, including correct ordering across multi-column layouts and table row cells.
  • page_boxes: a list of layout boundary box dictionaries. Each dictionary has the following structure:
  {
    "index": 0,                      // 0-based reading order index
    "class": "text",                 // region type: "text", "picture", "table", etc.
    "bbox": [x0, y0, x1, y1],       // boundary box coordinates
    "pos": [start, stop]             // slice indices into chunk["text"]
  }
The pos field gives the character range of this box’s content within the page’s "text" string: box_text = chunk["text"][start:stop].
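As a small sketch of the slicing described above, the code below walks the boxes in reading order and extracts each box's text. The chunk dictionary is invented for illustration and box_texts is a hypothetical helper; a real chunk would come from to_markdown(..., page_chunks=True).

```python
# Invented chunk in the documented shape (all values are illustrative)
chunk = {
    "text": "# Title\n\nFirst paragraph.\n",
    "page_boxes": [
        {"index": 0, "class": "text", "bbox": [72, 72, 540, 100], "pos": [0, 9]},
        {"index": 1, "class": "text", "bbox": [72, 110, 540, 140], "pos": [9, 26]},
    ],
}

def box_texts(chunk):
    """Return (class, text) pairs by slicing chunk['text'] with each box's pos."""
    result = []
    for box in sorted(chunk["page_boxes"], key=lambda b: b["index"]):
        start, stop = box["pos"]
        result.append((box["class"], chunk["text"][start:stop]))
    return result

for box_class, text in box_texts(chunk):
    print(box_class, repr(text))
```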
page_height
float
default:"None"
Desired page height in points. Only relevant for reflowable documents — see page_width. When None, the document is treated as a single large page with no Markdown page separators in the output (or a single chunk if page_chunks=True).
page_separators
bool
default:"False"
When True, inserts a separator string --- end of page=n --- (wrapped with line breaks) at the end of each page’s output. Page numbers are 0-based. Intended for debugging purposes.
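When debugging, the separators make it possible to split the combined string back into pages. The sketch below assumes the separator is spelled exactly as described above; the regular expression, the sample string, and the split_pages helper are all illustrative assumptions, so check them against real output before relying on them.

```python
import re

# Assumed separator format: "--- end of page=n ---" surrounded by line breaks
SEPARATOR = re.compile(r"\s*--- end of page=(\d+) ---\s*")

def split_pages(md: str) -> dict[int, str]:
    """Map each page number to the Markdown that precedes its separator."""
    pages = {}
    start = 0
    for match in SEPARATOR.finditer(md):
        pages[int(match.group(1))] = md[start:match.start()]
        start = match.end()
    return pages

sample = (
    "Page one text\n\n--- end of page=0 ---\n\n"
    "Page two text\n\n--- end of page=1 ---\n\n"
)
print(split_pages(sample))  # {0: 'Page one text', 1: 'Page two text'}
```

For anything beyond debugging, page_chunks=True is the more direct way to get per-page output.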
page_width
float
default:"612"
Desired page width in points. Ignored for documents with fixed page dimensions such as PDF and XPS. For reflowable documents — e-books, Office files, and plain text files — which have no fixed page size, the default assumes Letter format width (612) with unlimited page height, meaning the entire document is treated as one large page.
pages
list[int]
default:"None"
The pages to include in the output, specified as 0-based page numbers. Accepts any Python sequence of integers. The sequence is automatically sorted and deduplicated. If None, all pages are processed.
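The sorting and deduplication can be pictured with the sketch below. It is an illustration of the documented behaviour rather than the library's code: normalize_pages is a hypothetical helper, and the rejection of negative numbers is an assumption.

```python
def normalize_pages(pages, page_count):
    """Sort and deduplicate 0-based page numbers; None selects every page."""
    if pages is None:
        return list(range(page_count))
    selected = sorted(set(pages))
    # Assumption: anything outside 0 .. page_count - 1 is treated as an error
    if selected and (selected[0] < 0 or selected[-1] >= page_count):
        raise ValueError("page number out of range")
    return selected

print(normalize_pages([3, 1, 1, 0], page_count=5))  # [0, 1, 3]
print(normalize_pages(range(2), page_count=5))      # [0, 1]
print(normalize_pages(None, page_count=3))          # [0, 1, 2]
```

One consequence of the sorting is that the output always follows document order, regardless of the order in which page numbers are listed.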
show_progress
bool
default:"False"
When True, displays a progress bar as pages are converted. Uses the tqdm package if installed, otherwise falls back to a built-in text-based progress bar.
table_strategy
str
default:"lines_strict"
The table detection strategy to use. The default "lines_strict" ignores background colours. Alternative strategies such as "lines" use all vector graphics objects for detection and may perform better on some documents. See Page.find_tables() for all available strategies.
use_glyphs
bool
default:"False"
When True, uses the glyph number of a character instead of its Unicode value for fonts that do not store Unicode mappings. Useful for documents with non-standard or symbolic fonts.
use_ocr
bool
default:"True"
When True, applies OCR to pages that meet the default criteria for OCR processing. See force_ocr to apply OCR unconditionally to all pages.
write_images
bool
default:"False"
When True, images and vector graphics are rasterised from their page area and saved to disk. Markdown image references are inserted inline at the corresponding positions. Text within those areas is excluded from the text output and appears only as part of the saved image.
If your document contains text rendered on top of full-page background images, set write_images=False to ensure that text is extracted rather than captured as part of the image.
When using PyMuPDF Layout mode, regions classified as "picture" by the layout module are treated as images regardless of whether they contain text, images, or vector graphics. If force_text=True is also set, text within those regions is still extracted and appended after the image reference.

Returns

str
string
When page_chunks=False (default). A single Markdown string containing the content of all extracted pages, concatenated in order.
list[dict]
list
When page_chunks=True. A list of dictionaries, one per extracted page, each with the following keys:
Key       Type  Description
text      str   Markdown content of the page
metadata  dict  Page metadata; see Chunk Schema for the full schema
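Code that has to handle both return types can normalise them first. The sketch below is one way to do that; as_single_markdown is a hypothetical helper, and joining pages with a blank line is an assumption rather than the library's own concatenation rule.

```python
def as_single_markdown(result) -> str:
    """Collapse either return type into one Markdown string."""
    if isinstance(result, str):
        return result  # page_chunks=False: already a single string
    # page_chunks=True: join the "text" value of each page dictionary
    return "\n\n".join(chunk["text"] for chunk in result)

print(as_single_markdown("# Title"))                            # # Title
print(repr(as_single_markdown([{"text": "a"}, {"text": "b"}])))  # 'a\n\nb'
```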

Raises

Exception          Condition
FileNotFoundError  doc is a path string that does not exist
ValueError         An index in pages is out of range for the document
ImportError        use_layout(True) is called but the layout dependency is not installed
ImportError        use_ocr=True or force_ocr=True but the OCR dependency is not installed

Examples

Minimal

import pymupdf4llm

md = pymupdf4llm.to_markdown("document.pdf")

Page chunks

chunks = pymupdf4llm.to_markdown(
    "document.pdf",
    page_chunks=True
)
for chunk in chunks:
    print(chunk["metadata"]["page_number"], chunk["text"][:100])

Images and specific pages

md = pymupdf4llm.to_markdown(
    "document.pdf",
    pages=[0, 1, 2],
    write_images=True,
    image_path="assets/",
    image_format="png",
    dpi=300
)

Force OCR on all pages

md = pymupdf4llm.to_markdown("scanned.pdf", force_ocr=True)

See Also

Extract Markdown Guide

Full walkthrough with all common options.

to_json()

Structured output with bounding boxes and layout data.

to_text()

Plain text output without Markdown syntax.