v1.27.2.2
Major rework of OCR support:
- Tesseract-OCR is now supported as a plugin in the ocr installation folder.
- OCR support has been reworked to automatically choose the most appropriate OCR engine combination, depending on the availability of the Python package rapidocr_onnxruntime and Tesseract’s language support files (“tessdata”).
- Parameter force_ocr=True no longer requires ocr_function to be specified. If no OCR function is given, the best available plugin is chosen. An exception is raised only if none of the plugins is usable.
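The selection rule above can be pictured with a small sketch. This is not the library's code: the function names, the returned labels, the preference order, and the use of the TESSDATA_PREFIX environment variable to locate tessdata are all assumptions made for illustration.

```python
import importlib.util
import os


def choose_ocr_engine(has_rapidocr: bool, has_tessdata: bool) -> str:
    """Return a label for the best usable engine (labels and order are hypothetical)."""
    if has_rapidocr and has_tessdata:
        return "rapidocr+tesseract"
    if has_rapidocr:
        return "rapidocr"
    if has_tessdata:
        return "tesseract"
    # Matches the note above: an error only when no plugin is usable.
    raise RuntimeError("no usable OCR plugin found")


def detect_availability() -> tuple[bool, bool]:
    """Probe the environment; checking TESSDATA_PREFIX is an assumption."""
    has_rapidocr = importlib.util.find_spec("rapidocr_onnxruntime") is not None
    tessdata = os.environ.get("TESSDATA_PREFIX", "")
    has_tessdata = bool(tessdata) and os.path.isdir(tessdata)
    return has_rapidocr, has_tessdata
```

Calling choose_ocr_engine(*detect_availability()) then either yields a label or raises, which mirrors the "exception only if nothing is usable" behaviour described above.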
v1.27.2.1
PyMuPDF4LLM now automatically installs and uses pymupdf_layout.
- Installing pymupdf4llm automatically installs pymupdf_layout. Exact versions of both pymupdf and pymupdf_layout are now pinned (previously pymupdf>=1.27.1 was used).
- import pymupdf4llm automatically initialises layout support.
- Layout can be disabled by calling pymupdf4llm.use_layout(False).
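A minimal usage sketch for opting out of layout support, assuming pymupdf4llm is installed. The wrapper function name is hypothetical, and the sketch assumes that to_markdown() falls back to the non-layout code path once use_layout(False) has been called.

```python
def to_markdown_without_layout(path: str) -> str:
    """Convert a PDF to Markdown with layout support switched off."""
    # Imported lazily so this module loads even without the package installed.
    import pymupdf4llm  # importing initialises layout support (see notes above)

    pymupdf4llm.use_layout(False)  # opt out again, as described above
    return pymupdf4llm.to_markdown(path)
```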
v0.3.4
Fixes
- #356: Page chunk output under to_text() may fail for erroneous layout bboxes.
Changes
- Added support for RapidOCR via a callable plugin.
- Added support for improved OCR via a combination of RapidOCR and Tesseract-OCR.
- Changed default DPI for OCR to 300 (was 400).
- Added new parameter ocr_function=None. When not None, it must be a callable that OCRs the page by giving it a text layer.
- Added new parameter force_ocr=False to all extraction functions. Requires ocr_function to be set. When True, ocr_function is called for every page, bypassing the standard OCR worthiness check.
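How ocr_function and force_ocr interact in this release can be sketched as follows. The page dictionaries, the "needs_ocr" flag (a stand-in for the OCR worthiness check), the function name, and the ValueError are all hypothetical; only the rule itself comes from the notes above.

```python
from typing import Callable, Iterable, Optional


def run_ocr(
    pages: Iterable[dict],
    ocr_function: Optional[Callable[[dict], None]] = None,
    force_ocr: bool = False,
) -> int:
    """Apply ocr_function where needed and return how many pages were OCR'd."""
    if force_ocr and ocr_function is None:
        # Rule stated for v0.3.4: force_ocr requires ocr_function to be set.
        raise ValueError("force_ocr=True requires ocr_function")
    count = 0
    for page in pages:
        # "needs_ocr" is a hypothetical stand-in for the worthiness check.
        if ocr_function is not None and (force_ocr or page["needs_ocr"]):
            ocr_function(page)
            count += 1
    return count
```

With force_ocr=True the worthiness flag is never consulted, so every page is passed to the callable.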
v0.2.9
Fixes
Changes
- Added new top-level function get_key_values() to extract field names and values from Form PDFs. Always available regardless of whether PyMuPDF-Layout is active.
- Removed the OpenCV dependency. It was previously used to determine whether a page is worth OCR’ing; NumPy is now used for these checks.
v0.2.8
Fixes
Changes
- Added new parameter ocr_language. A string passed directly to Tesseract-OCR; the user is responsible for correct Tesseract language code formatting.
- Changed the format of the "page_boxes" key in page chunk dictionaries (layout mode). Now a list of dictionaries (was a list of lists). Each dictionary contains:
  - "index": 0-based integer enumerating layout boxes in reading order
  - "class": string denoting the bbox class ("table", "list-item", "section-header", etc.)
  - "bbox": pymupdf.IRect of the layout boundary box
  - "pos": (start, stop) tuple for slicing the bbox text from chunk["text"]
- Multiple performance improvements, primarily around rectangle containment checks.
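The new "page_boxes" layout can be consumed as in the sketch below. The chunk is hand-built sample data, not real library output, and plain tuples stand in for pymupdf.IRect so that the example runs without PyMuPDF installed.

```python
# Hand-built sample chunk; real chunks come from page_chunks=True output.
chunk = {
    "text": "Quarterly report\n\nRevenue grew by 4%.\n",
    "page_boxes": [
        {"index": 0, "class": "section-header", "bbox": (72, 72, 300, 90), "pos": (0, 16)},
        {"index": 1, "class": "list-item", "bbox": (72, 100, 500, 120), "pos": (18, 37)},
    ],
}


def box_texts(chunk: dict) -> list[tuple[str, str]]:
    """Return (class, text) pairs in reading order by slicing chunk["text"]."""
    result = []
    for box in sorted(chunk["page_boxes"], key=lambda b: b["index"]):
        start, stop = box["pos"]
        result.append((box["class"], chunk["text"][start:stop]))
    return result


for box_class, text in box_texts(chunk):
    print(f"{box_class}: {text}")
```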
v0.2.7
Fixes
- #323: page_chunks=True parameter was ignored in PyMuPDF-Layout mode.
Changes
- to_markdown() and to_text() now both support page chunk output via page_chunks=True.
v0.2.6
Fixes
- Forum: List index out of range in document_layout.py.
v0.2.5
Fixes
- #341: Broken Markdown parsing for a new line directly followed by 'o'.
Changes
- New parameter table_format in to_text() (PyMuPDF-Layout only). Controls the appearance of tables in plain text output. Possible values are defined in tabulate.tabulate_formats. Default is "grid".
- Optional dependencies can now be installed together: pip install pymupdf4llm[ocr,layout]. The "ocr" extra installs opencv-python for automatic OCR support in PyMuPDF-Layout mode.
- Major rework of the heuristics that determine whether a page should be OCR’d.
v0.2.4
Fixes
- #335: KeyError: "has_ocr_text".
v0.2.3
Fixes
- #332: TypeError: to_markdown() got an unexpected keyword argument 'header'.
Changes
- Output methods now accept a new parameter ocr_dpi=400, which sets the OCR resolution for full-page OCR.
- The OCR detection heuristics are more fine-grained and now detect more OCR-worthy situations.
- Resolved multiple performance issues, specifically for documents with very many images and extremely large StructTreeRoot objects.
- Reflected layout-specific API changes in legacy code: NotImplementedError is now raised when layout-only features are used outside of layout mode.
- Information messages during document parsing are now written to stdout collectively at the end of the parsing phase.
- Added support for the page_separators parameter in legacy mode.
v0.2.1
Fixes
Changes
- OCR invocation now differentiates between full-page OCR and text-only OCR. If a page contains text but the percentage of unreadable characters exceeds 90%, only the affected text span bounding boxes are OCR’d and replaced, rather than the whole page.
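The text-only OCR rule can be illustrated with a rough sketch. It assumes that "unreadable" means the Unicode replacement character U+FFFD and that spans are plain strings; the function names and the way the ratio is computed are hypothetical and do not reflect the library's internal heuristics.

```python
REPLACEMENT = "\ufffd"  # assumption: "unreadable" means U+FFFD


def unreadable_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are unreadable."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c == REPLACEMENT for c in chars) / len(chars)


def spans_needing_ocr(spans: list[str], threshold: float = 0.9) -> list[int]:
    """Return indices of affected spans when the page exceeds the threshold."""
    if unreadable_ratio("".join(spans)) <= threshold:
        return []  # page text is readable enough; no text-only OCR
    return [i for i, span in enumerate(spans) if REPLACEMENT in span]
```

Only the spans that actually contain unreadable characters are selected, which corresponds to replacing individual span bounding boxes instead of OCR'ing the whole page.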
v0.2.0
This release introduces full support for the PyMuPDF-Layout package, a radically new AI-based approach for detecting document page layouts. Highlights:
- Greatly improved table detection
- Support for list item hierarchy levels
- Detection of page headers and footers
- Improved detection of text paragraphs, titles, and section headers
- New output options beyond Markdown: plain text (to_text()) and structured JSON (to_json())
- Automatic OCR detection: invokes Tesseract when the page has little or no readable text, is mostly covered by images, or contains many character-sized vector graphics (requires Tesseract and opencv-python)
PyMuPDF-Layout is not open-source and carries its own licence. It also requires additional packages including onnxruntime, numpy, sympy, and opencv-python. Layout support remains opt-in. To activate it, import pymupdf_layout before importing pymupdf4llm.
Changes
- When show_progress=True, the tqdm package is used automatically if installed. Falls back to a built-in text-based progress bar if not available.
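The optional-dependency pattern described for show_progress can be sketched like this. It is an illustration only: the progress() helper and its text output format are hypothetical and are not the package's built-in progress bar.

```python
import sys

try:
    from tqdm import tqdm  # optional dependency
except ImportError:
    tqdm = None


def progress(items, label="Processing"):
    """Yield items, reporting progress via tqdm or a plain-text fallback."""
    items = list(items)
    if tqdm is not None:
        yield from tqdm(items, desc=label)
        return
    total = len(items)
    for n, item in enumerate(items, start=1):
        # Rewrite the same console line with a simple counter.
        sys.stdout.write(f"\r{label}: {n}/{total}")
        sys.stdout.flush()
        yield item
    sys.stdout.write("\n")
```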
v0.0.27
Fixes
Changes
- New parameter page_separators=False in to_markdown(). When True and page_chunks=False, a line --- end of page=nnn --- is appended to each page’s Markdown text. Page number is 0-based. Intended for debugging purposes.
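When debugging, the separator lines make it possible to split the combined Markdown back into pages. The sketch below assumes the separator sits on a line of its own exactly in the form quoted above; surrounding whitespace in real output may differ, and split_pages() is a hypothetical helper.

```python
import re

# Matches a separator line such as "--- end of page=12 ---".
SEPARATOR = re.compile(r"^--- end of page=(\d+) ---[ \t]*$", re.MULTILINE)


def split_pages(markdown: str) -> dict[int, str]:
    """Map each 0-based page number to the Markdown text preceding its separator."""
    pages = {}
    start = 0
    for match in SEPARATOR.finditer(markdown):
        pages[int(match.group(1))] = markdown[start:match.start()].strip()
        start = match.end()
    return pages
```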
v0.0.26
Fixes
Changes
- The PyMuPDF table module’s to_markdown() now outputs Markdown-styled cell text. Previously, table cells were extracted as plain text only.
- TocHeaders is now a top-level import and can be used directly.
- New parameter detect_bg_color=True in to_markdown(). Guesses the page background colour and ignores fill-only vectors matching it. Set to False to always consider fill vectors.
- Text written with a Type 3 font is now always included. Previously it was treated as invisible and suppressed.
- Package now includes the GNU AGPL 3.0 licence file. PyMuPDF4LLM is dual-licensed under GNU AGPL 3.0 and individual commercial licences.
- Added versions_file.py to enforce a minimum PyMuPDF version at import time.
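An import-time minimum-version check of the kind mentioned in the last item could look roughly like the sketch below. The helper names and the version numbers in the usage note are made up for illustration; the actual contents of versions_file.py are not shown in these notes.

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '1.26.3' (or '1.26.0rc1') into a tuple of integers for comparison."""
    parts = []
    for part in text.split("."):
        match = re.match(r"\d+", part)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def require_minimum(installed: str, minimum: str) -> None:
    """Raise ImportError if the installed version is older than the minimum."""
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(f"PyMuPDF {installed} is too old; need {minimum} or newer")
```

Comparing integer tuples avoids the string-comparison trap where "1.9.0" would sort after "1.10.0". For example, require_minimum("1.26.3", "1.25.0") passes silently, while require_minimum("1.9.0", "1.10.0") raises.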
v0.0.25
Fixes
Changes
- Table module to_markdown() now outputs Markdown-styled cell text.
- TocHeaders is now a top-level import.
- Text written with a Type 3 font is now always included.
v0.0.24
Fixes
- Fixed UnboundLocalError.
v0.0.23
Fixes
Changes
- High-speed vector graphics count: when graphics_limit is set, drawings are no longer extracted just for counting purposes.
v0.0.22
Fixes
Changes
- Added class TocHeaders as an alternative way to identify headers.
v0.0.21
Fixes
- #116: Handling graphical images and superscripts.
v0.0.20
Fixes
Changes
- New parameter ignore_images (bool). When True, images are not considered in any way. Useful for pages dense with images that prevent meaningful layout analysis (e.g. PowerPoint slides).
- New parameter ignore_graphics (bool). When True, vector graphics are not considered except for table detection. Useful for pages dense with vector graphics (e.g. PowerPoint slides).
- New parameter max_levels on IdentifyHeaders. Limits the number of header tag levels generated. Example: IdentifyHeaders(doc, max_levels=3) ensures at most three header levels are produced.
- table_strategy=None now disables table detection entirely, which can significantly speed up processing on documents without tables.
v0.0.19
Fixes
Includes fixes from v0.0.18.
- #158: Very long titles when converting to Markdown.
- #155: Inconsistent image extraction from image-only PDFs.
- #161: force_text parameter ignored.
- #162: to_markdown() not outputting all pages.
- #173: First column of table repeated before the actual table.
- #187: Unsolicited text particles.
- #188: Slow conversion to Markdown.
- #191: Text extraction stops mid-document.
- #212: Only one image extracted per page when multiple exist.
- #213: Replacement characters (�) appear after conversion.
- #215: Excessive time spent identifying text bboxes.
- #218: IndexError in get_raw_lines when processing PDFs with formulas.
- #225: Text with background missing from output.
- #229: Duplicated table content.
Changes
- New parameter filename (str). Overwrites or sets the filename for saved images. Useful when the document is opened from memory.
- New parameter use_glyphs (bool). When True, uses the glyph number of a character for fonts without a Unicode back-translation. The default False renders � in these cases.
- Added strikethrough support: struck-out text is now detected and rendered as ~~text~~.
- Improved background colour detection: if all four page corners share the same colour, that colour is assumed to be the background. Text and vectors in that colour are ignored.
- Improved invisible text detection: text with an alpha value of 0 is now ignored.
- Improved fake-bold detection: text mimicking bold appearance is now treated as standard bold in most cases.
- Header detection now uses the largest font size on the line. All spans in a header line are rendered with uniform appearance.
- Changed graphics_limit behaviour: previously, exceeding the limit caused the entire page to be skipped. Now only vector graphics outside table bounding boxes are ignored; images, text, and table content remain extractable.
- Changed default for margins to 0. The previous default (0, 50, 0, 50) caused confusion by silently ignoring 50pt at the top and bottom of pages.
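The effect of the margins change in the last item can be seen in a small sketch. It assumes the 4-tuple is ordered (left, top, right, bottom), which matches the "top and bottom" description above; content_rect() is a hypothetical helper, not a library function.

```python
def content_rect(page_width: float, page_height: float, margins=0):
    """Return (x0, y0, x1, y1) of the page area that remains after margins."""
    if isinstance(margins, (int, float)):
        left = top = right = bottom = float(margins)
    else:
        left, top, right, bottom = margins  # assumed order: left, top, right, bottom
    return (left, top, page_width - right, page_height - bottom)


# Old default on a 595 x 842 pt page: 50pt cut off at the top and the bottom.
print(content_rect(595, 842, (0, 50, 0, 50)))
# New default: the whole page is considered.
print(content_rect(595, 842))
```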
v0.0.17
v0.0.15
Fixes
Changes
- New parameter embed_images (bool). Embeds images and vector graphics in the Markdown text as base64-encoded strings. Ignores write_images and image_path.
- New parameter image_size_limit (float, default 0.05). Images are ignored if their width or height is smaller than 5% of the corresponding page dimension.
- Improved algorithm for determining text rectangle sequence on multi-column pages.
- Header identification change: if more than six header levels are needed, all text larger than body text is treated as level 6 (######).
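Two of the rules above lend themselves to a short illustration: the image_size_limit test and the cap at six header levels. Both helpers are hypothetical sketches of the stated rules, not the package's implementation.

```python
def keep_image(img_w: float, img_h: float, page_w: float, page_h: float,
               image_size_limit: float = 0.05) -> bool:
    """Keep an image only if neither dimension falls below the size limit."""
    return (img_w >= image_size_limit * page_w
            and img_h >= image_size_limit * page_h)


def header_prefix(level: int) -> str:
    """Markdown header prefix, with any level beyond six mapped to level 6."""
    return "#" * min(level, 6) + " "
```

On a 595 x 842 pt page the default limit works out to 29.75 pt in width and 42.1 pt in height, so a 20 x 100 pt image would be ignored while a 100 x 100 pt image would be kept.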
v0.0.13
Fixes
- #112: Invalid bandwriter header dimensions/setup.
Changes
- New parameter ignore_code. Suppresses special formatting of monospaced text; no code blocks are generated.
- New parameter extract_words. Enforces page_chunks=True and adds a "words" list to each page dictionary.
v0.0.11
Changes
- Extended the list of recognised bullet point characters.
v0.0.10
Fixes
Changes
- Improved recognition of insignificant vector graphics: highlights and borders are now ignored.
- New parameter image_format to control the format of saved images.
- New parameter image_path to store images in a specific folder.
- Images are not stored if they are contained within another image on the same page.
- Images are not stored if their width or height is less than 5% of the corresponding page dimension.
- All text is always written. When write_images=True, text on images or graphics can be suppressed by setting force_text=False.
v0.0.9
Fixes
Changes
- Improved identification of scattered text span particles, addressing most out-of-sequence issues.
- Rotated pages are now correctly processed.
v0.0.8
Fixes
- #65: Fixed typo in pymupdf_rag.py.
v0.0.7
Fixes
- #54: Mistakes in orchestrating sentences. Text extraction no longer uses the TEXT_DEHYPHENATE flag.
Changes
- Improved vector graphics algorithm. Graphics with strokes only near the boundary box border (common in code snippets) are now more reliably classified as irrelevant.
v0.0.6
Fixes
Changes
- New parameter dpi to specify the resolution of extracted images.
- New parameters page_width and page_height for processing reflowable documents (text files, Office, e-books).
- New parameter graphics_limit to avoid spending runtime on low-value vector graphics content.
- New parameter table_strategy to directly control the table detection strategy.