Overview
Most PDFs contain selectable text: characters stored as font glyphs with known positions. PDF4LLM extracts these directly, without OCR, at high speed.
Some PDFs don't. A document scanned on a photocopier, a fax saved as PDF, or a report exported from a system that rasterises each page before writing it contains no machine-readable text at all. Every page is just an image, and native extraction returns empty strings.
For these documents, PDF4LLM can invoke Tesseract OCR before running layout analysis. Pass useOcr: true to any extraction method and Tesseract will read the text from each page image before it is converted to Markdown, plain text, or JSON.
Prerequisites
OCR requires Tesseract to be installed on the host system and available on the PATH. PDF4LLM does not bundle Tesseract.
Windows
Download and run the installer from the UB Mannheim Tesseract builds, the most actively maintained Windows distribution. During installation, select any additional language packs you need. After installation, add the Tesseract installation directory to your PATH.
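macOS
Install Tesseract with Homebrew: brew install tesseract. Additional language packs are available through brew install tesseract-lang.
Linux (Debian / Ubuntu)
Install the engine with sudo apt-get install tesseract-ocr. Language packs are separate packages named tesseract-ocr-{code}.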
Verify Tesseract is on the PATH
PDF4LLM calls Tesseract as a subprocess. If Tesseract is installed but not on the PATH, you will get a TesseractNotFoundException at runtime. Confirm that it is reachable from the process running your application.
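One way to script that check is sketched below. The function name check_tesseract is illustrative and not part of PDF4LLM. It relies only on the standard command -v lookup and Tesseract's own --version flag, and it should be run in the same environment (service account, container, CI job) as the application.

```shell
# Report whether the tesseract binary is reachable from this process's PATH.
# check_tesseract is a hypothetical helper name, not a PDF4LLM command.
check_tesseract() {
  if command -v tesseract >/dev/null 2>&1; then
    echo "found: $(command -v tesseract)"
  else
    echo "not found"
    return 1
  fi
}

# Print the version banner when available; otherwise explain what to fix.
if check_tesseract; then
  # Older Tesseract releases print the banner on stderr, hence 2>&1.
  tesseract --version 2>&1 | head -n 1
else
  echo "Install Tesseract or add its directory to PATH" >&2
fi
```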
Basic usage
Pass useOcr: true to ToMarkdown, ToText, or ParseDocument.
Specifying a language
Tesseract uses language-specific data files to improve recognition accuracy. The default is English ("eng"). For other languages, pass the matching Tesseract language code to ocrLanguage.
Common Tesseract language codes
| Language | Code | Language | Code |
|---|---|---|---|
| English | eng | Russian | rus |
| French | fra | Arabic | ara |
| German | deu | Hindi | hin |
| Spanish | spa | Japanese | jpn |
| Italian | ita | Korean | kor |
| Portuguese | por | Simplified Chinese | chi_sim |
| Dutch | nld | Traditional Chinese | chi_tra |
| Polish | pol | Turkish | tur |
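Before passing a code from this table, it can help to confirm that the matching language data is installed. The sketch below compares a requested language string against the output of tesseract --list-langs. The helper name missing_langs is hypothetical, and the script assumes the first line of --list-langs output is a header, as in recent Tesseract releases.

```shell
# Print every code from a "+"-separated language string (e.g. "eng+fra")
# that is absent from a list of installed languages (one code per line).
# missing_langs is a hypothetical helper name.
missing_langs() {
  requested=$1
  installed=$2
  echo "$requested" | tr '+' '\n' | while IFS= read -r code; do
    [ -z "$code" ] && continue
    printf '%s\n' "$installed" | grep -qxF -- "$code" || echo "$code"
  done
}

# With Tesseract installed, take the list from `tesseract --list-langs`.
# Assumption: the first output line is a header, so it is skipped.
if command -v tesseract >/dev/null 2>&1; then
  installed=$(tesseract --list-langs 2>/dev/null | tail -n +2)
  missing_langs "eng+fra" "$installed"
fi
```

An empty result means every requested language is available. Any code printed needs its language pack installed first.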
Multi-language documents
If a document contains text in more than one language on the same page, pass a list of language codes joined with +, for example eng+fra.
Performance
OCR is significantly slower than native text extraction. Tesseract rasterises each page to an image and runs a trained neural network over it. This typically takes 1–5 seconds per page depending on page dimensions, resolution, and hardware, compared to milliseconds for native extraction. At that rate, a 200-page scanned document takes roughly 3 to 17 minutes. Plan for this in your pipeline.
Process only the pages that need OCR
The most impactful optimisation is to apply OCR only to pages that actually need it. Many documents are partially scanned: a cover page or appendix may be a rasterised image while the body contains selectable text. Identify scanned pages by checking whether native extraction returns useful content, for example by treating a page with fewer than 50 characters of native text as scanned. Tune that threshold to your documents. Pages with only a page number or a short chapter title will score low on native extraction, so tune conservatively to avoid classifying lightly-populated text pages as scanned.
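The threshold test itself is simple enough to sketch independently of any extraction API. In the shell example below, needs_ocr is a hypothetical helper and the text argument stands in for whatever native extraction returned for one page. The 50-character default is an assumption taken from the threshold discussed above.

```shell
# Decide whether a page's natively extracted text is too sparse to trust.
# needs_ocr is a hypothetical helper; the default threshold of 50 is an
# assumption and should be tuned per document set.
needs_ocr() {
  text=$1
  threshold=${2:-50}
  # Count non-whitespace characters so blank lines and indentation
  # do not make an empty page look populated.
  chars=$(( $(printf '%s' "$text" | tr -d '[:space:]' | wc -m) ))
  [ "$chars" -lt "$threshold" ]
}

# Example: a page holding only a page number versus a page with real text.
if needs_ocr "  12  "; then
  echo "page A: run OCR"
fi
if needs_ocr "Quarterly revenue rose across all regions, led by strong growth in services."; then
  echo "page B: run OCR"
else
  echo "page B: native extraction is enough"
fi
```

Counting only non-whitespace characters is a design choice: it prevents pages that contain nothing but layout whitespace from passing the check.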
Process pages in parallel
For large all-scanned documents, parallelise across pages. Each call to ToMarkdown with a single-page pages list is independent.
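A process-level variant of this idea is sketched below in shell. It is not the library's own pattern: ocr_page is a hypothetical stand-in for a program of yours that opens the PDF and calls ToMarkdown for one page, and the page count and batch size are assumed values. Each page runs as a separate background job and writes to its own file, then the files are joined in page order.

```shell
# Hypothetical per-page worker. In a real pipeline this would be your own
# program that opens the PDF and converts exactly one page with OCR enabled.
ocr_page() {
  echo "## page $1"
}

total_pages=8   # assumed page count for the demo
max_jobs=4      # pages processed concurrently per batch
workdir=$(mktemp -d)

page=1
while [ "$page" -le "$total_pages" ]; do
  # Zero-padded file names keep lexical order equal to page order.
  ocr_page "$page" > "$workdir/page-$(printf '%04d' "$page").md" &
  # After launching a full batch, wait for it before starting the next.
  if [ $(( page % max_jobs )) -eq 0 ]; then
    wait
  fi
  page=$(( page + 1 ))
done
wait  # catch the final partial batch

# Reassemble the per-page results in page order.
cat "$workdir"/page-*.md > "$workdir/combined.md"
```

Because every worker is its own process and opens the file itself, no document object is shared between tasks. The cost is one process start per page.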
Confirm whether your version of MuPDF.NET supports concurrent access to a shared Document object before using this pattern. If it does not, open a separate Document per task using the file path overload to avoid race conditions.
Show progress for long runs
For documents where OCR will take a noticeable amount of time, enable progress reporting.
Mixed documents
A mixed document contains both selectable-text pages and scanned image pages. The recommended pattern produces a single Markdown string in page order, using native extraction where possible and OCR where not. It calls ToText once per page as a cheap probe, since native extraction is fast, then calls ToMarkdown with the appropriate setting. The total cost is one native-speed pass over every page plus OCR only on the pages that require it.
OCR accuracy
Tesseract accuracy depends heavily on the quality of the input image. Several factors affect results.
Resolution: Tesseract is trained on 300 DPI images. Scans below 200 DPI produce noticeably worse results, especially for small or condensed text. If you control the scanning process, scan at 300 DPI minimum.
Skew: Pages rotated even a few degrees during scanning significantly reduce accuracy. Most modern scanners de-skew automatically; if yours doesn't, apply de-skewing pre-processing before extraction.
Noise and artefacts: Coffee stains, smudges, fax compression artefacts, and paper grain all reduce accuracy. These cannot be corrected within PDF4LLM. If accuracy is critical for your use case, apply image pre-processing (binarisation, noise removal, contrast enhancement) to extracted page images before passing them to Tesseract.
Font type: Tesseract performs best on standard serif and sans-serif fonts. Handwriting, decorative fonts, and highly stylised typefaces are recognised poorly and should not be expected to produce reliable output.
Language selection: Using the wrong language model reduces accuracy even for text that looks superficially similar between languages. Always set ocrLanguage to match the document language.
Diagnosing poor accuracy
Use ToText with OCR to inspect raw recognition output without the added complexity of Markdown formatting. Common symptoms and their likely causes:
| What you see | Likely cause |
|---|---|
| l / 1 / I confusion | Low resolution or thin font strokes |
| 0 / O confusion | Low resolution or sans-serif font at small size |
| Missing spaces between words | Page DPI below 200 or high background noise |
| Garbled non-Latin characters | Wrong ocrLanguage or missing language pack |
| Entire paragraphs absent | Region classified as an image, not a text block |
| Correct words in wrong order | Multi-column layout not linearised correctly post-OCR |
OCR in containerised environments
When running PDF4LLM in Docker or a CI pipeline, Tesseract must be present in the container image. Install it in your Dockerfile along with any language packs you need.
Tessdata path
Tesseract looks for language data files in the directory specified by the TESSDATA_PREFIX environment variable, or in the default system location (/usr/share/tesseract-ocr/*/tessdata/ on Debian/Ubuntu). If you install language data files to a custom location, set the variable explicitly in your Dockerfile.
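Outside a Dockerfile, the same variable can be set and checked from a shell. The sketch below assumes Tesseract 4 or later, where TESSDATA_PREFIX points directly at the directory holding the .traineddata files. The path /opt/tessdata is a placeholder, and has_traineddata is a hypothetical helper name.

```shell
# Return success if <dir>/<code>.traineddata exists for a language code.
# has_traineddata is a hypothetical helper name.
has_traineddata() {
  [ -f "$1/$2.traineddata" ]
}

# /opt/tessdata is a placeholder; substitute your own custom location.
TESSDATA_PREFIX=${TESSDATA_PREFIX:-/opt/tessdata}
export TESSDATA_PREFIX

# Check the languages the application will request through ocrLanguage.
for code in eng fra; do
  if has_traineddata "$TESSDATA_PREFIX" "$code"; then
    echo "$code: ok"
  else
    echo "$code: missing from $TESSDATA_PREFIX"
  fi
done
```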
Troubleshooting
TesseractNotFoundException at runtime
Tesseract is not on the PATH for the process running your application. Verify with tesseract --version in the same environment — not just your interactive shell. In Docker, check with docker run --rm your-image tesseract --version.
Empty or near-empty output despite useOcr: true
The pages may already contain selectable text that is being extracted natively without invoking OCR. Run ToText without useOcr first and check whether content is returned. If OCR is invoked but still returns nothing, the scan DPI is likely very low — check the source image resolution.
Garbled or nonsensical text
The most common cause is a mismatched ocrLanguage. Confirm the document language and set the correct Tesseract code. For non-Latin scripts (Arabic, CJK, Devanagari), ensure the appropriate language pack is installed and the correct code is used.
OCR is extremely slow
Processing time scales linearly with page count. Use the page-filtering pattern to restrict OCR to only scanned pages. For bulk pipelines, distribute work across multiple workers rather than processing large documents serially.
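Because the cost is linear, a rough time budget is simple arithmetic. The sketch below uses the 1–5 seconds per page range quoted in the Performance section. The function name estimate_ocr_minutes is illustrative, and the per-page figures are planning assumptions rather than measurements.

```shell
# Estimate total OCR time in whole minutes (rounded up) for a page count
# and an assumed per-page cost in seconds.
estimate_ocr_minutes() {
  pages=$1
  seconds_per_page=$2
  echo $(( (pages * seconds_per_page + 59) / 60 ))
}

# Bracket a 200-page scanned document with the fast and slow per-page costs.
echo "best case:  $(estimate_ocr_minutes 200 1) min"
echo "worst case: $(estimate_ocr_minutes 200 5) min"
```

For 200 pages this rounds up to 4 and 17 minutes. Measuring a few representative pages on the target hardware gives a better per-page figure than the generic range.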
Tables in scanned documents are not being detected
Table detection from OCR output relies on the spatial alignment of recognised character positions, which is less reliable than detecting tables in native PDF text. For scanned documents with critical table data, inspect ToJson output to see how the blocks were classified, and consider building a custom table renderer from ParseDocument for these cases.
Language pack missing error from Tesseract
Install the required pack for your platform. Debian/Ubuntu: sudo apt-get install tesseract-ocr-{code}. macOS: brew install tesseract-lang. Windows: re-run the Tesseract installer and select the language from the component list.
Next steps
Tables
Table extraction explained.
Page Selection
Process only specific pages to speed up OCR-heavy documents.
Installation
Install Tesseract and the OCR optional dependency.
Images & Graphics
Extract embedded images alongside OCR’d text.