Overview
PyMuPDF4LLM includes built-in OCR support for scanned documents and image-based PDFs. By default, OCR runs automatically when needed, so you don’t have to opt in. For more control, you can force OCR on specific pages, disable it entirely, or swap in a different OCR engine using the adaptor interface.
Hybrid OCR strategy
PyMuPDF4LLM applies OCR only when it is genuinely required to obtain the complete text of a PDF page. If a page already contains sufficient extractable text, OCR is skipped entirely, which avoids unnecessary work and eliminates the risk of degrading high-quality digital text. When OCR is needed, PyMuPDF4LLM automatically selects the most suitable OCR plugin available in the runtime environment, balancing detection accuracy with processing speed. Its built-in OCR plugins implement a Hybrid OCR strategy: only those regions lacking extractable, legible text are passed to the OCR engine. This selective approach typically reduces OCR processing time by around 50% while improving recognition accuracy, since the engine focuses exclusively on the problematic regions. The recognized text is then merged back into the original page, enriching it without disturbing existing digital content.
Auto-OCR Behaviour
PyMuPDF4LLM inspects each page before extracting text. If a page contains no selectable text, meaning all content is rasterised into images, OCR is triggered automatically for that page. Pages whose native text is complete and legible are not sent through OCR. This keeps processing fast and avoids degrading already-clean text.
How OCR is triggered
There are two scenarios where OCR is applied automatically:
- No text at all: if a page contains almost no text but is covered with images or many character-sized vectors, PyMuPDF4LLM checks whether text is likely to be detectable on the page. This distinguishes image-based text (e.g. a scanned document) from ordinary pictures such as photographs.
- Garbled text: if a page does contain text but too many characters are unreadable (e.g. "�����"), OCR is applied to the affected text areas only, not the full page. This preserves already-readable text, images, and vectors while recovering only what is broken.
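The two triggers described above can be sketched as a small decision function. This is an illustrative simplification, not PyMuPDF4LLM's actual implementation: the `PageStats` type, the `ocr_decision` name, and all thresholds are hypothetical, and the library's additional check for whether text is likely to be detectable in the images is left out.

```python
from dataclasses import dataclass

REPLACEMENT_CHAR = "\ufffd"  # the "�" glyph that marks an unreadable character


@dataclass
class PageStats:
    text: str              # text taken from the page's native text layer
    image_coverage: float  # fraction of the page area covered by images (0.0 to 1.0)


def ocr_decision(page, min_chars=20, min_image_coverage=0.5, max_garbled_ratio=0.1):
    """Return "none", "full", or "partial" for a page (hypothetical thresholds)."""
    chars = [c for c in page.text if not c.isspace()]
    # Scenario 1: almost no text, but the page is largely covered by images.
    if not chars or len(chars) < min_chars:
        return "full" if page.image_coverage >= min_image_coverage else "none"
    # Scenario 2: text exists, but too many characters are unreadable.
    garbled = sum(1 for c in chars if c == REPLACEMENT_CHAR)
    if garbled / len(chars) > max_garbled_ratio:
        return "partial"  # OCR only the affected regions
    return "none"
```

In this sketch, "full" means the whole page is a candidate for OCR, while "partial" means only the broken text areas would be sent to the engine.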
Forcing OCR
In some cases you may want to force OCR even on pages that contain selectable text, for example when the native text layer is corrupt, misencoded, or misaligned with the visual content. Use force_ocr=True to bypass the auto-detection check entirely.
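A minimal sketch of forcing OCR, assuming `force_ocr` is passed as a keyword argument to `pymupdf4llm.to_markdown`. The wrapper name `to_markdown_forced_ocr` and its `converter` parameter are hypothetical; the parameter only exists so the helper can be exercised without the package or a real PDF.

```python
def to_markdown_forced_ocr(pdf_path, converter=None):
    """Convert a PDF to Markdown with OCR forced on every processed page."""
    if converter is None:
        # Imported lazily so the helper can be defined without the package installed.
        import pymupdf4llm
        converter = pymupdf4llm.to_markdown
    # force_ocr=True skips the auto-detection check for each page.
    return converter(pdf_path, force_ocr=True)
```

Calling `to_markdown_forced_ocr("scanned.pdf")` would then run OCR regardless of whether the pages already carry a text layer.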
Disabling OCR
To prevent OCR from running at all, even on pages with no selectable text, set use_ocr=False.
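The automatic, forced, and disabled behaviours can be selected from one place. The sketch below assumes that `use_ocr` and `force_ocr` are keyword arguments of `pymupdf4llm.to_markdown`; the helper names `ocr_kwargs` and `convert` and the mode strings are hypothetical.

```python
def ocr_kwargs(mode="auto"):
    """Map a mode name to keyword arguments for the conversion call."""
    if mode == "auto":
        return {}                   # defaults: OCR runs only when needed
    if mode == "force":
        return {"force_ocr": True}  # OCR every processed page
    if mode == "off":
        return {"use_ocr": False}   # never run OCR
    raise ValueError(f"unknown OCR mode: {mode!r}")


def convert(pdf_path, mode="off"):
    """Convert a PDF to Markdown using the chosen OCR mode."""
    import pymupdf4llm  # imported lazily; requires the package to be installed
    return pymupdf4llm.to_markdown(pdf_path, **ocr_kwargs(mode))
```

With `mode="off"`, pages that consist only of images would yield no OCR text.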
OCR Engines
Other OCR engines (also known as OCR adaptors or plugins) can be used with PyMuPDF4LLM. See OCR Plugins for details on how to use different OCR engines with PyMuPDF4LLM, including Tesseract, RapidOCR, and how to implement your own custom OCR function.
OCR Language Support
When using the default Tesseract adaptor, you can specify one or more languages using Tesseract’s language codes. The default is "eng" (English). Make sure that the corresponding language data files are installed. To specify multiple languages, join their codes with a plus sign "+", for example "eng+deu" for English and German.
See: Tesseract Language Packs for further details.
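As a small sketch, the language string can be assembled from individual codes. The helper name `tesseract_languages` is hypothetical, and the keyword `ocr_language` in the commented usage is an assumption about the parameter name rather than something stated here.

```python
def tesseract_languages(*codes):
    """Join Tesseract language codes into one string such as "eng+deu"."""
    cleaned = [c.strip() for c in codes if c and c.strip()]
    # Fall back to the documented default when no code is given.
    return "+".join(cleaned) if cleaned else "eng"


languages = tesseract_languages("eng", "deu")

# Assumed usage (the keyword name `ocr_language` is an assumption):
# import pymupdf4llm
# md = pymupdf4llm.to_markdown("scan.pdf", ocr_language=languages)
```

Each code still needs its language data file installed for Tesseract to use it.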
Performance Tips
OCR is the most compute-intensive part of the extraction pipeline. A few ways to keep it fast:
- Process only the pages you need using the pages parameter to avoid running OCR on the entire document.
- Cache results: write the output to disk after the first run so you don’t re-process the same file.
- Use force_ocr=False (the default) so clean pages skip OCR entirely.
- Resize images before passing them to OCR: very high DPI scans can slow Tesseract down without improving accuracy.
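The first two tips can be combined in a small caching wrapper. This is a sketch under stated assumptions: `cached_markdown` and its cache layout are hypothetical, and `convert` is expected to be a callable such as `pymupdf4llm.to_markdown` that accepts a `pages` keyword.

```python
import hashlib
from pathlib import Path


def cached_markdown(pdf_path, cache_dir, convert, pages=None):
    """Return Markdown for pdf_path, reusing a cached copy when one exists."""
    pdf_path, cache_dir = Path(pdf_path), Path(cache_dir)
    # Key the cache on file contents and page selection, so edits invalidate it.
    digest = hashlib.sha256(pdf_path.read_bytes())
    digest.update(repr(pages).encode())
    cache_file = cache_dir / f"{digest.hexdigest()}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    # Cache miss: run the (possibly OCR-heavy) conversion once and store it.
    markdown = convert(str(pdf_path), pages=pages)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

For example, `cached_markdown("report.pdf", ".ocr_cache", pymupdf4llm.to_markdown, pages=[0, 1])` would process only the first two pages (page numbers are 0-based) and skip the conversion entirely on later runs with the same file and page selection.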
Next Steps
Tables
Table extraction explained.
Page Selection
Process only specific pages to speed up OCR-heavy documents.
Installation
Install Tesseract and the OCR optional dependency.
Images & Graphics
Extract embedded images alongside OCR’d text.