v1.27.2.2

Major rework of OCR support:
  • Tesseract-OCR is now supported as a plugin in the ocr installation folder.
  • OCR support has been reworked to automatically choose the most appropriate OCR engine combination, depending on the availability of Python package rapidocr_onnxruntime and Tesseract’s language support files (“tessdata”).
  • Parameter force_ocr=True no longer requires ocr_function to be specified. If no OCR function is given, the best available plugin is chosen. An exception is raised only if none of the plugins is usable.
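
The selection described above can be pictured with a small sketch. It is not the library's implementation: the function names and the returned plugin labels are hypothetical, and locating tessdata through the TESSDATA_PREFIX environment variable is an assumption made only for illustration. Preferring the combined engine when both are present follows the v0.3.4 note about improved OCR via a combination of RapidOCR and Tesseract-OCR.

```python
import importlib.util
import os


def rapidocr_available() -> bool:
    """True if the rapidocr_onnxruntime package can be found for import."""
    return importlib.util.find_spec("rapidocr_onnxruntime") is not None


def tessdata_available() -> bool:
    """Assumption: tessdata is located via the TESSDATA_PREFIX variable."""
    folder = os.environ.get("TESSDATA_PREFIX", "")
    return bool(folder) and os.path.isdir(folder)


def choose_ocr_plugin(have_rapidocr: bool, have_tessdata: bool) -> str:
    """Return a hypothetical plugin label for the engines that are available."""
    if have_rapidocr and have_tessdata:
        return "rapidocr+tesseract"  # combined engine, assumed to be preferred
    if have_rapidocr:
        return "rapidocr"
    if have_tessdata:
        return "tesseract"
    # Mirrors the documented behaviour: fail only if no plugin is usable.
    raise RuntimeError("no usable OCR plugin found")
```

A caller could combine the probes as choose_ocr_plugin(rapidocr_available(), tessdata_available()).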

v1.27.2.1

PyMuPDF4LLM now automatically installs and uses pymupdf_layout.
  • Installing pymupdf4llm automatically installs pymupdf_layout. Exact versions of both pymupdf and pymupdf_layout are now pinned (previously pymupdf>=1.27.1 was used).
  • import pymupdf4llm automatically initialises layout support.
  • Layout can be disabled by calling pymupdf4llm.use_layout(False).

v0.3.4

  • #356 — Page chunk output under to_text() may fail for erroneous layout bboxes.
  • Added support for RapidOCR via a callable plugin.
  • Added support for improved OCR via a combination of RapidOCR and Tesseract-OCR.
  • Changed default DPI for OCR to 300 (was 400).
  • Added new parameter ocr_function=None. When not None, must be a callable that OCRs the page by giving it a text layer.
  • Added new parameter force_ocr=False to all extraction functions. Requires ocr_function to be set. When True, ocr_function is called for every page, bypassing the standard OCR worthiness check.

v0.2.9

  • #356 — Page chunk output under to_text() may fail for erroneous layout bboxes.
  • #355 — Image saving fails if the document filename contains folder specifications.
  • Added new top-level function get_key_values() to extract field names and values from Form PDFs. Always available regardless of whether PyMuPDF-Layout is active.
  • Removed the OpenCV dependency. It was previously used to determine whether a page is worth OCR’ing; NumPy is now used for these checks.

v0.2.8

  • #349 — Is it possible to change the OCR language when using -layout?
  • #352 — Does not respect the image_path keyword argument when write_images=True.
  • #353 — How do I filter out pixmaps with non-empty size but empty value?
  • Added new parameter ocr_language. A string passed directly to Tesseract-OCR — the user is responsible for correct Tesseract language code formatting.
  • Changed the format of the "page_boxes" key in page chunk dictionaries (layout mode). Now a list of dictionaries (was a list of lists). Each dictionary contains:
    • "index" — 0-based integer enumerating layout boxes in reading order
    • "class" — string denoting the bbox class ("table", "list-item", "section-header", etc.)
    • "bbox" — pymupdf.IRect of the layout boundary box
    • "pos" — (start, stop) tuple for slicing the bbox text from chunk["text"]
  • Multiple performance improvements, primarily around rectangle containment checks.
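
As an illustration of the new "page_boxes" format, the sketch below slices each box's text out of a page chunk. The chunk is hand-made sample data, box_texts is a hypothetical helper, and a plain tuple stands in for the pymupdf.IRect stored under "bbox".

```python
# Hand-made sample chunk; real chunks are produced with page_chunks=True.
chunk = {
    "text": "Revenue\nQ1 10\nQ2 12\n",
    "page_boxes": [
        {"index": 0, "class": "section-header",
         "bbox": (72, 72, 300, 90), "pos": (0, 7)},
        {"index": 1, "class": "table",
         "bbox": (72, 100, 300, 160), "pos": (8, 19)},
    ],
}


def box_texts(chunk):
    """Return (class, text) pairs in reading order, using each box's "pos"."""
    result = []
    for box in sorted(chunk["page_boxes"], key=lambda b: b["index"]):
        start, stop = box["pos"]
        result.append((box["class"], chunk["text"][start:stop]))
    return result


for box_class, text in box_texts(chunk):
    print(f"{box_class}: {text!r}")
```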

v0.2.7

  • #323 — page_chunks=True parameter was ignored in PyMuPDF-Layout mode.
  • to_markdown() and to_text() now both support page chunk output via page_chunks=True.

v0.2.6

  • Forum — List index out of range in document_layout.py.

v0.2.5

  • #341 — Broken Markdown parsing for a new line directly followed by 'o'.
  • New parameter table_format in to_text() (PyMuPDF-Layout only). Controls the appearance of tables in plain text output. Possible values are defined in tabulate.tabulate_formats. Default is "grid".
  • Optional dependencies can now be installed together: pip install pymupdf4llm[ocr,layout]. The "ocr" extra installs opencv-python for automatic OCR support in PyMuPDF-Layout mode.
  • Major rework of the heuristics that determine whether a page should be OCR’d.

v0.2.4

  • #335 — KeyError: "has_ocr_text".

v0.2.3

  • #332 — TypeError: to_markdown() got an unexpected keyword argument 'header'.
  • Output methods now accept a new parameter ocr_dpi=400 which sets the OCR resolution for full-page OCR.
  • The OCR detection heuristics are more fine-grained and now detect more OCR-worthy situations.
  • Resolved multiple performance issues, specifically for documents with very many images and extremely large StructTreeRoot objects.
  • Reflected layout-specific API changes in legacy code — NotImplementedError is now raised when layout-only features are used outside of layout mode.
  • Information messages during document parsing are now written to stdout collectively at the end of the parsing phase.
  • Added support for the page_separators parameter in legacy mode.

v0.2.1

  • #320 — ValueError: min() iterable argument is empty.
  • #319 — ValueError: min() arg is an empty sequence.
  • OCR invocation now differentiates between full-page OCR and text-only OCR. If a page contains text but the percentage of unreadable characters exceeds 90%, only the affected text span bounding boxes are OCR’d and replaced — rather than the whole page.
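
The distinction between the two OCR modes can be sketched as follows. This is only an illustration under stated assumptions: unreadable characters are taken to be the Unicode replacement character (U+FFFD), whitespace is excluded from the ratio, a page without any text is assumed to be a full-page OCR candidate, and the function names and return labels are hypothetical.

```python
REPLACEMENT_CHAR = "\ufffd"  # assumption: this marks an unreadable character


def unreadable_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are unreadable."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c == REPLACEMENT_CHAR for c in chars) / len(chars)


def ocr_mode(text: str, threshold: float = 0.9) -> str:
    """Pick a hypothetical OCR mode label for the extracted page text."""
    if not text.strip():
        return "full-page"  # assumption: no text at all, OCR the whole page
    if unreadable_ratio(text) > threshold:
        return "text-only"  # OCR only the affected text span boxes
    return "none"  # assumption: readable text needs no OCR
```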

v0.2.0

This release introduces full support for the PyMuPDF-Layout package — a radically new AI-based approach for detecting document page layouts. Highlights:
  • Greatly improved table detection
  • Support for list item hierarchy levels
  • Detection of page headers and footers
  • Improved detection of text paragraphs, titles, and section headers
  • New output options beyond Markdown: plain text (to_text()) and structured JSON (to_json())
  • Automatic OCR detection — invokes Tesseract when the page has little or no readable text, is mostly covered by images, or contains many character-sized vector graphics (requires Tesseract and opencv-python)
PyMuPDF-Layout is not open-source and carries its own licence. It also requires additional packages including onnxruntime, numpy, sympy, and opencv-python. Layout support remains opt-in. To activate it, import pymupdf_layout before importing pymupdf4llm:
import pymupdf_layout
import pymupdf4llm
  • When show_progress=True, the tqdm package is used automatically if installed. Falls back to a built-in text-based progress bar if not available.
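
The fallback behaviour for progress display follows a common optional-dependency pattern, sketched below. The progress function and its text output are hypothetical and do not reproduce the package's built-in progress bar.

```python
import sys


def progress(iterable, total=None):
    """Yield items from iterable, showing progress via tqdm if installed."""
    try:
        from tqdm import tqdm  # optional dependency
    except ImportError:
        tqdm = None

    if tqdm is not None:
        yield from tqdm(iterable, total=total)
        return

    # Fallback: a very simple text-based counter on one line.
    items = list(iterable)
    count = len(items)
    for i, item in enumerate(items, start=1):
        sys.stdout.write(f"\rProcessing {i}/{count}")
        sys.stdout.flush()
        yield item
    sys.stdout.write("\n")
```

Used as for page_number in progress(range(10)): ..., the loop body is the same whether or not tqdm is present.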

v0.0.27

  • #296 — A specific diagram incorrectly recognised as significant.
  • #294 — Unable to extract images from page.
  • #272 — Disappeared page breaks.
  • New parameter page_separators=False in to_markdown(). When True and page_chunks=False, a line --- end of page=nnn --- is appended to each page’s Markdown text. Page number is 0-based. Intended for debugging purposes.
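
A minimal sketch of the resulting output, assuming the separator text exactly as quoted above. The join_pages helper and the blank-line handling around the separator are assumptions, not the library's code.

```python
def join_pages(page_texts, page_separators=False):
    """Concatenate per-page Markdown, optionally appending a separator line."""
    parts = []
    for pno, text in enumerate(page_texts):  # pno is the 0-based page number
        parts.append(text)
        if page_separators:
            parts.append(f"\n--- end of page={pno} ---\n")
    return "".join(parts)


print(join_pages(["# Title\n", "Body text\n"], page_separators=True))
```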

v0.0.26

  • #289 — Content duplication with the latest version.
  • #275 — Text with background missing from output.
  • #262 — Markdown error parsing.
  • The PyMuPDF table module’s to_markdown() now outputs Markdown-styled cell text. Previously, table cells were extracted as plain text only.
  • TocHeaders is now a top-level import and can be used directly.
  • New parameter detect_bg_color=True in to_markdown(). Guesses the page background colour and ignores fill-only vectors matching it. Set to False to always consider fill vectors.
  • Text written with a Type 3 font is now always included. Previously it was treated as invisible and suppressed.
  • Package now includes the GNU AGPL 3.0 licence file. PyMuPDF4LLM is dual-licensed under GNU AGPL 3.0 and individual commercial licences.
  • Added versions_file.py to enforce a minimum PyMuPDF version at import time.

v0.0.25

  • #282 — Content duplication with the latest version.
  • #281 — Latest version returns empty text for some PDFs.
  • #280 — Cannot extract text when ignore_images=False.
  • #278 — Title words are fragmented.
  • #249 — Title duplication in Markdown format.
  • #202 — Bad rect issue.
  • Table module to_markdown() now outputs Markdown-styled cell text.
  • TocHeaders is now a top-level import.
  • Text written with a Type 3 font is now always included.

v0.0.24

  • Fixed UnboundLocalError.

v0.0.23

  • #265 — Code error correction.
  • #263 — table_strategy=None raises an error.
  • #261 — Wrong Markdown output in latest PyMuPDF versions.
  • High-speed vector graphics count: when graphics_limit is set, drawings are no longer extracted just for counting purposes.

v0.0.22

  • #251 — Images slightly larger than the page size are being ignored.
  • #255 — Single-row or single-column tables are skipped.
  • #258 — to_markdown() crashes on some documents.
  • Added class TocHeaders as an alternative way to identify headers.

v0.0.21

  • #116 — Handling graphical images and superscripts.

v0.0.20

  • #171 — Text rects overlap with tables and images that should be excluded.
  • #189 — The position of the extracted image is incorrect.
  • #238 — Text extraction missing when text is laid out around a picture.
  • New parameter ignore_images (bool). When True, images are not considered in any way. Useful for pages dense with images that prevent meaningful layout analysis (e.g. PowerPoint slides).
  • New parameter ignore_graphics (bool). When True, vector graphics are not considered except for table detection. Useful for pages dense with vector graphics (e.g. PowerPoint slides).
  • New parameter max_levels on IdentifyHeaders. Limits the number of header tag levels generated. Example: IdentifyHeaders(doc, max_levels=3) ensures at most three header levels are produced.
  • table_strategy=None now disables table detection entirely, which can significantly speed up processing on documents without tables.

v0.0.19

Includes fixes from v0.0.18.
  • #158 — Very long titles when converting to Markdown.
  • #155 — Inconsistent image extraction from image-only PDFs.
  • #161 — force_text parameter ignored.
  • #162 — to_markdown() not outputting all pages.
  • #173 — First column of table repeated before the actual table.
  • #187 — Unsolicited text particles.
  • #188 — Slow conversion to Markdown.
  • #191 — Text extraction stops mid-document.
  • #212 — Only one image extracted per page when multiple exist.
  • #213 — Replacement characters (�) appear after conversion.
  • #215 — Excessive time spent identifying text bboxes.
  • #218 — IndexError in get_raw_lines when processing PDFs with formulas.
  • #225 — Text with background missing from output.
  • #229 — Duplicated table content.
  • New parameter filename (str). Overwrites or sets the filename for saved images. Useful when the document is opened from memory.
  • New parameter use_glyphs (bool). When True, uses the glyph number of a character for fonts without a Unicode back-translation. Default False renders � in these cases.
  • Added strikethrough support: struck-out text is now detected and rendered as ~~text~~.
  • Improved background colour detection — if all four page corners share the same colour, that colour is assumed to be the background. Text and vectors in that colour are ignored.
  • Improved invisible text detection — text with an alpha value of 0 is now ignored.
  • Improved fake-bold detection — text mimicking bold appearance is now treated as standard bold in most cases.
  • Header detection now uses the largest font size on the line. All spans in a header line are rendered with uniform appearance.
  • Changed graphics_limit behaviour: previously, exceeding the limit caused the entire page to be skipped. Now only vector graphics outside table bounding boxes are ignored — images, text, and table content remain extractable.
  • Changed default for margins to 0. The previous default (0, 50, 0, 50) caused confusion by silently ignoring 50pt at the top and bottom of pages.
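
The four-corner background heuristic mentioned above can be sketched on a plain pixel grid. This is an illustration only: the guess_background name and the nested-list pixel representation are assumptions, and the real implementation works on rendered page data.

```python
def guess_background(pixels):
    """Return the shared corner colour, or None if the corners differ.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    """
    top, bottom = pixels[0], pixels[-1]
    corners = {top[0], top[-1], bottom[0], bottom[-1]}
    return corners.pop() if len(corners) == 1 else None


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
page = [
    [WHITE, WHITE, WHITE],
    [WHITE, BLACK, WHITE],
    [WHITE, WHITE, WHITE],
]
print(guess_background(page))
```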

v0.0.17

  • #147 — Error when page contains nothing but a table.
  • #81 — Issues with bullet points in PDFs.
  • #78 — Multi-column PDF text extraction.

v0.0.15

  • #138 — Table not extracted and some text order incorrect.
  • #135 — Problem with multiple columns in simple text.
  • #134 — Exclude images based on size threshold parameter.
  • #132 — Optionally embed images as base64 string.
  • #128 — Enhanced image embedding format.
  • New parameter embed_images (bool). Embeds images and vector graphics in the Markdown text as base64-encoded strings. Ignores write_images and image_path.
  • New parameter image_size_limit (float, default 0.05). Images are ignored if their width or height is smaller than this fraction (5% by default) of the corresponding page dimension.
  • Improved algorithm for determining text rectangle sequence on multi-column pages.
  • Header identification change: if more than six header levels are needed, all text larger than body text is treated as level 6 (######).
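
The image_size_limit rule can be written as a one-line check. The function below is a hypothetical sketch of that rule, not the library's code; dimensions are in points.

```python
def image_too_small(img_w, img_h, page_w, page_h, image_size_limit=0.05):
    """True if width or height is below the given fraction of the page."""
    return (img_w < image_size_limit * page_w
            or img_h < image_size_limit * page_h)


# US Letter page (612 x 792 pt): the limits are 30.6 pt and 39.6 pt.
print(image_too_small(30, 100, 612, 792))
print(image_too_small(100, 100, 612, 792))
```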

v0.0.13

  • #112 — Invalid bandwriter header dimensions/setup.
  • New parameter ignore_code. Suppresses special formatting of monospaced text — no code blocks are generated.
  • New parameter extract_words. Enforces page_chunks=True and adds a "words" list to each page dictionary.

v0.0.11

  • #90 — 'Quad' object has no attribute 'tl'.
  • #88 — Bug in is_significant function.
  • Extended the list of recognised bullet point characters.

v0.0.10

  • #73 — Bug in to_markdown internal function.
  • #74 — Minimum area for images and vector graphics.
  • #75 — Poor Markdown generation for a particular PDF.
  • #76 — Suggestion on useful API parameters.
  • Improved recognition of insignificant vector graphics — highlights and borders are now ignored.
  • New parameter image_format to control the format of saved images.
  • New parameter image_path to store images in a specific folder.
  • Images are not stored if they are contained within another image on the same page.
  • Images are not stored if their width or height is less than 5% of the corresponding page dimension.
  • All text is always written. When write_images=True, text on images or graphics can be suppressed by setting force_text=False.
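
The rule about nested images amounts to a rectangle containment test. The sketch below uses (x0, y0, x1, y1) tuples and hypothetical helper names; keeping both of two identical rectangles is an assumption made here, since the notes do not cover that case.

```python
def contains(outer, inner):
    """True if rectangle inner lies entirely within rectangle outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def drop_nested(rects):
    """Keep only rectangles not contained in a different rectangle."""
    kept = []
    for i, rect in enumerate(rects):
        nested = any(
            j != i and other != rect and contains(other, rect)
            for j, other in enumerate(rects)
        )
        if not nested:
            kept.append(rect)
    return kept


images = [(0, 0, 100, 100), (10, 10, 50, 50), (90, 90, 120, 120)]
print(drop_nested(images))
```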

v0.0.9

  • #71 — Unexpected results in pymupdf4llm when pymupdf works correctly.
  • #68 — Issue with text extraction near page footer.
  • Improved identification of scattered text span particles, addressing most out-of-sequence issues.
  • Rotated pages are now correctly processed.

v0.0.8

  • #65 — Fixed typo in pymupdf_rag.py.

v0.0.7

  • #54 — Mistakes in orchestrating sentences. Text extraction no longer uses the TEXT_DEHYPHENATE flag.
  • Improved vector graphics algorithm. Graphics with strokes only near the boundary box border (common in code snippets) are now more reliably classified as irrelevant.

v0.0.6

  • #55 — IndexError: list index out of range in helpers/multi_column.py.
  • #54 — Mistakes in orchestrating sentences.
  • #52 — Chunking of text files.
  • #41 / #40 — Improved page column detection (partial fix; complex layouts remain a challenge).
  • New parameter dpi to specify the resolution of extracted images.
  • New parameters page_width and page_height for processing reflowable documents (text files, Office, e-books).
  • New parameter graphics_limit to avoid spending runtime on low-value vector graphics content.
  • New parameter table_strategy to directly control the table detection strategy.