Signature
Parameters
Path to the document file, or an already-opened pymupdf.Document instance. Supports PDF, XPS, eBooks, and (with PyMuPDF Pro) Office formats.
Performs a simple check for the general background color of the pages. Any text or vector graphic in this color is ignored, which may increase detection accuracy.
Desired image resolution in dots per inch. Relevant only if write_images=True or embed_images=True.
Like write_images, but images are included in the Markdown text as base64-encoded strings. Mutually exclusive with write_images; ignores image_path. May drastically increase the size of the Markdown output.
Enforces page_chunks=True and adds a "words" key to each page dictionary. Its value is a list of words as delivered by PyMuPDF’s Page.get_text("words"), in the same sequence as the extracted text.
Overwrites or sets the desired image file name for written images. Useful when the document is provided as a memory object with no inherent file name.
Minimum font size to consider for text extraction. Text with a font size below this threshold is excluded. The default of 3 means only text with a font size >= 3 will be extracted.
Controls whether footer text is included in the extraction output. Set to False to omit repetitive footer content that does not add value to the extracted data.
When True, OCR is applied to all pages regardless of their content. Useful for documents known to be image-based that do not meet the default criteria for automatic OCR. When False, OCR is only applied to pages that meet the default criteria.
Generates text output even when text overlaps images or graphics. The overlapping text will appear after the respective image in the output.
Limits processing of vector graphics elements. If the number of vector graphics on a page exceeds this threshold, all vector graphics on that page are ignored. Useful for scientific documents or pages that simulate text via graphics commands, which can contain tens of thousands of objects and cause intolerable runtimes.
Custom header detection logic. Accepts a callable or an object with a get_header_id method. Must accept a text span dictionary (as returned by extractDICT) and a keyword argument page (the owning Page object), and must return either "" or a string of up to 6 # characters followed by a space. If None, a full document scan is performed to identify the most popular font sizes and derive header levels from them. To skip header detection entirely, pass hdr_info=False or hdr_info=lambda s, page=None: "".
Controls whether header text is included in the extraction output. Set to False to omit repetitive header content that does not add value to the extracted data.
When True, includes text that is completely transparent. By default, transparent text is ignored, which usually increases detection accuracy.
When True, monospaced text lines are not given special formatting and no code blocks are generated. Automatically set to True when extract_words=True.
Disregards all vector graphics on the page. Useful for crowded pages such as presentation slides, and speeds up processing time. Automatically disables table detection.
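The custom header callable described above for hdr_info can be sketched as follows. This is an illustration only: the function name and the font size thresholds are hypothetical, and the sketch assumes the span dictionary carries its font size under the "size" key.

```python
def size_based_header(span, page=None):
    """Map a text span to a Markdown header prefix based on its font size.

    The thresholds below are made-up values for illustration; tune them
    to the fonts actually used in your documents.
    """
    size = span.get("size", 0)  # font size of the span, 0 if missing
    if size >= 20:
        return "# "
    if size >= 16:
        return "## "
    if size >= 13:
        return "### "
    return ""  # body text: no header prefix


# Usage (assumes pymupdf4llm is installed; file name is hypothetical):
# import pymupdf4llm
# md = pymupdf4llm.to_markdown("report.pdf", hdr_info=size_based_header)
```

The return values follow the contract stated above: either an empty string or up to six # characters followed by a space.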
Disregards all images on the page. Useful for crowded pages such as presentation slides, and speeds up processing time.
Desired output format for extracted images, specified as a file extension. Common values are "png" and "jpg". All PyMuPDF supported output image formats are accepted.
Directory in which to save extracted images. Relevant only when write_images=True. Defaults to the script’s directory.
A value in the range 0 <= value < 1. Images are excluded from output if their width is less than or equal to image_size_limit × page width, or their height is less than or equal to image_size_limit × page height. The default of 0.05 means an image’s width and height must each exceed 5% of the page dimensions to be included.
Page border margins. Only content within the specified margins is considered for extraction. Accepts a single float or a sequence of 2 or 4 floats:
- A single float f expands to (f, f, f, f): equal margins on all sides.
- A 2-element sequence (top, bottom) expands to (0, top, 0, bottom).
- A 4-element sequence specifies (left, top, right, bottom) directly.
- The default 0 reads the full page with no margins applied.
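The margin expansion rules listed above can be written out in a few lines of plain Python. This is a sketch of the documented behavior, not the library's own code; the helper name expand_margins is hypothetical.

```python
def expand_margins(margins):
    """Expand a margins value to a (left, top, right, bottom) tuple."""
    # A single number applies equally to all four sides.
    if isinstance(margins, (int, float)):
        f = float(margins)
        return (f, f, f, f)

    values = tuple(float(v) for v in margins)
    if len(values) == 2:
        # (top, bottom): left and right margins stay at 0.
        top, bottom = values
        return (0.0, top, 0.0, bottom)
    if len(values) == 4:
        # Already (left, top, right, bottom).
        return values
    raise ValueError("margins must be a number or a sequence of 2 or 4 numbers")
```

For instance, expand_margins((36, 72)) yields (0.0, 36.0, 0.0, 72.0), which keeps the full page width but trims the top and bottom.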
Image resolution in dots per inch used when rendering a page to an intermediate image for OCR. Only relevant for pages determined to benefit from OCR (e.g. pages with little or no text, or pages largely covered by images or character-like vectors). Higher values may improve OCR precision but increase memory usage and processing time, and risk over-sharpening the image. The default of 300 should be sufficient for most documents.
A custom OCR function to use in place of the built-in Tesseract engine. If None, Tesseract is used automatically. See the OCR engines documentation for the expected function signature.
Language code passed to the Tesseract OCR engine. Defaults to "eng" (English). Multiple languages can be combined with a + separator, for example "eng+deu" for English and German. Ensure the corresponding Tesseract language data files are installed before use.
When
True, returns a list of dictionaries, one per page, instead of a single string. Each dictionary has the following keys:
- metadata: a dictionary containing the document’s standard metadata, enriched with three additional keys:

| Key | Description |
|---|---|
| file_path | Source file name |
| page_count | Total number of pages in the document |
| page_number | 1-based page number |

- toc_items: a list of Table of Contents entries pointing to this page. Each item has the format [lvl, title, pagenumber], where lvl is the hierarchy level, title is a string, and pagenumber is a 1-based integer.
- tables: a list of tables detected on the page. Each item is a dictionary with keys bbox (a pymupdf.Rect in tuple format giving the table’s position), row_count, and col_count.
- images: a list of images on the page, as returned by Page.get_image_info().
- graphics: a list of bounding boxes for clustered vector graphics on the page, as returned by Page.cluster_drawings().
- text: the page content as a Markdown string.
- words: populated when extract_words=True. A list of tuples (x0, y0, x1, y1, "wordstring", bno, lno, wno) as returned by page.get_text("words"). The sequence of tuples matches the reading order of the Markdown text, including correct ordering across multi-column layouts and table row cells.
- page_boxes: a list of layout boundary box dictionaries. The pos field of each dictionary gives the character range (start, stop) of that box’s content within the page’s "text" string: box_text = chunk["text"][start:stop].

Desired page height in points. Only relevant for reflowable documents; see
page_width. When None, the document is treated as a single large page with no Markdown page separators in the output (or a single chunk if page_chunks=True).
When True, inserts a separator string --- end of page=n --- (wrapped with line breaks) at the end of each page’s output. Page numbers are 0-based. Intended for debugging purposes.
Desired page width in points. Ignored for documents with fixed page dimensions such as PDF and XPS. For reflowable documents (e-books, Office files, and plain text files), which have no fixed page size, the default assumes Letter format width (612) with unlimited page height, meaning the entire document is treated as one large page.
The pages to include in the output, specified as 0-based page numbers. Accepts any Python sequence of integers. The sequence is automatically sorted and deduplicated. If None, all pages are processed.
When True, displays a progress bar as pages are converted. Uses the tqdm package if installed, otherwise falls back to a built-in text-based progress bar.
The table detection strategy to use. The default "lines_strict" ignores background colours. Alternative strategies such as "lines" use all vector graphics objects for detection and may perform better on some documents. See Page.find_tables() for all available strategies.
When True, uses the glyph number of a character instead of its Unicode value for fonts that do not store Unicode mappings. Useful for documents with non-standard or symbolic fonts.
When True, applies OCR to pages that meet the default criteria for OCR processing. See force_ocr to apply OCR unconditionally to all pages.
When True, images and vector graphics are rasterised from their page area and saved to disk. Markdown image references are inserted inline at the corresponding positions. Text within those areas is excluded from the text output and appears only as part of the saved image. When using PyMuPDF Layout mode, regions classified as "picture" by the layout module are treated as images regardless of whether they contain text, images, or vector graphics. If force_text=True is also set, text within those regions is still extracted and appended after the image reference.
Returns
When page_chunks=False (default): a single Markdown string containing the content of all extracted pages, concatenated in order.
When page_chunks=True: a list of dictionaries, one per extracted page, each with the following keys:

| Key | Type | Description |
|---|---|---|
| text | str | Markdown content of the page |
| metadata | dict | Page metadata; see Chunk Schema for the full schema |
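Because the return type depends on page_chunks, downstream code may want to accept both shapes. The following sketch shows one way to do that; the helper name as_single_markdown and the default separator are hypothetical choices, not part of the library.

```python
def as_single_markdown(result, separator="\n\n"):
    """Return one Markdown string for either return shape.

    `result` is either a str (page_chunks=False) or a list of page
    dictionaries that each carry a "text" key (page_chunks=True).
    """
    if isinstance(result, str):
        return result
    # Join the per-page Markdown strings in their given order.
    return separator.join(chunk["text"] for chunk in result)
```

This keeps the rest of a pipeline independent of which mode was used for extraction.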
Raises
| Exception | Condition |
|---|---|
| FileNotFoundError | doc is a path string that does not exist |
| ValueError | An index in pages is out of range for the document |
| ImportError | use_layout(True) but the layout dependency is not installed |
| ImportError | ocr=True or force_ocr=True but the OCR dependency is not installed |
Examples
Minimal
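A minimal sketch of the basic call, assuming the pymupdf4llm package is installed. The wrapper name convert_to_markdown and the file name in the usage comment are illustrative only; the import sits inside the function so the snippet can be loaded without the package.

```python
def convert_to_markdown(path):
    """Return the whole document as a single Markdown string."""
    # Imported lazily so this sketch can be loaded without the package.
    import pymupdf4llm

    # With default settings (page_chunks=False) the result is one string.
    return pymupdf4llm.to_markdown(path)


# Usage (hypothetical file name):
# markdown = convert_to_markdown("report.pdf")
```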
Page chunks
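A sketch of working with page_chunks=True, assuming the pymupdf4llm package is installed. It relies only on the "text" and "metadata" keys and the 1-based page_number described in this reference; the function name and file name are hypothetical.

```python
def iter_page_texts(path):
    """Yield (page_number, markdown_text) pairs, one per extracted page."""
    # Imported lazily so this sketch can be loaded without the package.
    import pymupdf4llm

    chunks = pymupdf4llm.to_markdown(path, page_chunks=True)
    for chunk in chunks:
        # page_number in the metadata is 1-based.
        yield chunk["metadata"]["page_number"], chunk["text"]


# Usage (hypothetical file name):
# for number, text in iter_page_texts("report.pdf"):
#     print(number, len(text))
```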
Images and specific pages
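A sketch combining image export with a page selection, assuming the pymupdf4llm package is installed. Only parameters named in this reference are used (pages, write_images, image_path); the function name, directory, and page numbers are hypothetical.

```python
def convert_with_images(path, image_dir, pages):
    """Convert selected pages and save their images to image_dir."""
    # Imported lazily so this sketch can be loaded without the package.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(
        path,
        pages=pages,           # 0-based page numbers
        write_images=True,     # save images and reference them in the Markdown
        image_path=image_dir,  # directory for the saved image files
    )


# Usage (hypothetical paths; converts the first three pages):
# markdown = convert_with_images("report.pdf", "images", [0, 1, 2])
```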
Force OCR on all pages
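A sketch of forcing OCR for every page, assuming the pymupdf4llm package and its OCR dependency are installed. The function name and file name are hypothetical.

```python
def convert_scanned(path):
    """Convert a document, applying OCR to all pages regardless of content."""
    # Imported lazily so this sketch can be loaded without the package.
    import pymupdf4llm

    # force_ocr=True bypasses the default per-page OCR criteria.
    return pymupdf4llm.to_markdown(path, force_ocr=True)


# Usage (hypothetical file name):
# markdown = convert_scanned("scanned_contract.pdf")
```

This is mainly useful for documents known to be image-based; for mixed documents the default per-page criteria avoid unnecessary OCR runs.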
See Also
Extract Markdown Guide
Full walkthrough with all common options.
to_json()
Structured output with bounding boxes and layout data.
to_text()
Plain text output without Markdown syntax.