Overview
Tesseract identifies languages by short codes, most of them three-letter ISO 639-2 codes such as fra or deu; a few add a suffix, as in chi_sim for Simplified Chinese. English (eng) is installed by default on most platforms. For any other language, you need to install the corresponding language pack before pymupdf4llm can use it for OCR.
A full list of supported language codes is available on the Tesseract tessdata repository.
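To see which language packs are already present, you can ask Tesseract itself with `tesseract --list-langs`. The sketch below wraps that call in Python. The helper names `parse_list_langs` and `installed_languages` are illustrative and not part of pymupdf4llm, and the assumption that the header line is the only line containing spaces is based on typical Tesseract output.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> list:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Assumption: the header line contains spaces, language codes never do.
        if line and " " not in line:
            codes.append(line)
    return codes


def installed_languages() -> list:
    """Return the installed language codes, or an empty list if Tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Assumption: depending on the build, the list may arrive on stdout or stderr,
    # so both streams are parsed.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

Calling `installed_languages()` before and after installing a pack is a quick way to confirm that the new code is visible to Tesseract.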
Linux
Language pack installation varies slightly by distribution.
- Ubuntu / Debian
- Fedora / RHEL
- Arch Linux
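Each of these distributions names its language packages differently. The sketch below maps a Tesseract language code to a package name. Only the Ubuntu / Debian pattern is stated in this guide; the Fedora (`tesseract-langpack-`) and Arch (`tesseract-data-`) prefixes are assumptions based on those distributions' usual naming, so confirm them with your package manager's search command. The function name `language_package` is hypothetical.

```python
# Package-name patterns per distribution family.
# Assumption: the Fedora and Arch patterns are not taken from this guide.
PACKAGE_PATTERNS = {
    "debian": "tesseract-ocr-{code}",
    "fedora": "tesseract-langpack-{code}",
    "arch": "tesseract-data-{code}",
}


def language_package(distro: str, langcode: str) -> str:
    """Return the likely package name for a Tesseract language code."""
    if distro not in PACKAGE_PATTERNS:
        raise ValueError(f"unknown distribution family: {distro!r}")
    # Debian-style names use hyphens where Tesseract codes use underscores
    # (chi_sim becomes chi-sim).
    code = langcode.replace("_", "-") if distro == "debian" else langcode
    return PACKAGE_PATTERNS[distro].format(code=code)
```

The resulting name is what you would pass to `apt install`, `dnf install`, or `pacman -S` respectively.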
On Ubuntu and Debian, language packages are named tesseract-ocr-<langcode>, for example tesseract-ocr-fra for French or tesseract-ocr-chi-sim for Simplified Chinese.
Manual installation (all distros)
If a language pack is not available through your package manager, download the .traineddata file directly from GitHub and copy it to your Tesseract data directory:
| Distribution | Path |
|---|---|
| Ubuntu / Debian | /usr/share/tesseract-ocr/4.00/tessdata/ (Tesseract 4) or /usr/share/tesseract-ocr/5/tessdata/ (Tesseract 5) |
| Fedora / RHEL | /usr/share/tesseract/tessdata/ |
| Arch Linux | /usr/share/tessdata/ |
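The manual download can also be scripted. The following is a minimal sketch, assuming the files are served from the `main` branch of the tessdata repository under the URL pattern shown; the function names are illustrative. Writing into a system directory such as those in the table usually requires root privileges.

```python
import urllib.request
from pathlib import Path

# Assumption: raw files are reachable under this base URL on the main branch.
TESSDATA_BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/main"


def traineddata_url(langcode: str) -> str:
    """Build the download URL for one language file."""
    return f"{TESSDATA_BASE_URL}/{langcode}.traineddata"


def download_traineddata(langcode: str, tessdata_dir: str) -> Path:
    """Download <langcode>.traineddata into the given tessdata directory."""
    target = Path(tessdata_dir) / f"{langcode}.traineddata"
    urllib.request.urlretrieve(traineddata_url(langcode), str(target))
    return target


# Example (needs write access to the directory):
# download_traineddata("fra", "/usr/share/tessdata/")
```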
Windows
During installation (recommended)
The Tesseract Windows installer from UB Mannheim lets you select additional language packs during setup. When you reach the Choose Components screen, expand Additional language data and tick the languages you need.
After installation (manual)
If Tesseract is already installed, download language packs manually:
- Go to github.com/tesseract-ocr/tessdata
- Download the .traineddata file for your language (e.g. fra.traineddata for French)
- Copy the file into your Tesseract tessdata folder, typically inside the Tesseract installation directory.
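If you are unsure where the tessdata folder is on your machine, the sketch below makes an educated guess. It assumes that the `TESSDATA_PREFIX` environment variable, when set, points at the tessdata folder itself (the Tesseract 4 and later convention), and otherwise that the folder sits next to the `tesseract` executable, as it does in a default UB Mannheim install. The function name `find_tessdata_dir` is hypothetical.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def find_tessdata_dir(environ=None, which=shutil.which) -> Optional[Path]:
    """Guess the tessdata folder; returns None when nothing can be inferred."""
    environ = os.environ if environ is None else environ
    # 1. An explicit TESSDATA_PREFIX wins (assumed to be the tessdata folder itself).
    prefix = environ.get("TESSDATA_PREFIX")
    if prefix:
        return Path(prefix)
    # 2. Otherwise assume a tessdata folder beside the executable.
    exe = which("tesseract")
    if exe:
        return Path(exe).parent / "tessdata"
    return None
```

The returned path is only a guess, so check that it exists and already contains eng.traineddata before copying new files into it.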
The Chocolatey (choco install tesseract) package only includes English. All additional languages must be added manually using the steps above.
Verify the install
Open Command Prompt or PowerShell and run tesseract --list-langs, then confirm that the new language code appears in the list.
macOS
The recommended approach on macOS is Homebrew. There are two options depending on how much disk space you want to use.
Install all languages at once
The tesseract-lang formula (brew install tesseract-lang) bundles Tesseract with every available language pack.
Install specific languages
If you only need a few languages, install tesseract first (brew install tesseract) and then manually download the .traineddata files you need, for example fra.traineddata for French.
Replace fra with your target language code and adjust the tessdata path to match what brew info tesseract reports on your machine.
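As a rough way to derive that path programmatically, the sketch below asks Homebrew for its prefix and appends `share/tessdata`. This layout is an assumption about a standard Homebrew install, and the function names are illustrative; the output of `brew info tesseract` remains the authoritative source.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def homebrew_tessdata_dir(brew_prefix: str) -> Path:
    """Assumed tessdata location for a given Homebrew prefix."""
    return Path(brew_prefix) / "share" / "tessdata"


def detect_homebrew_prefix() -> Optional[str]:
    """Return the output of `brew --prefix`, or None if brew is not installed."""
    brew = shutil.which("brew")
    if brew is None:
        return None
    result = subprocess.run(
        [brew, "--prefix"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None
```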
If you installed Tesseract via MacPorts instead of Homebrew, use port install tesseract-<langcode>, for example sudo port install tesseract-fra.
Using a language with pymupdf4llm
Once a language pack is installed, pass its code to to_markdown() via the language parameter, for example language="fra" for French.
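A minimal sketch of that call follows. It relies on the `language` parameter as described in this guide; the wrapper name `ocr_to_markdown` and the file name `scanned_fr.pdf` are hypothetical, and pymupdf4llm is imported inside the function so the snippet can be loaded even where the package is not installed.

```python
def ocr_to_markdown(pdf_path: str, langcode: str = "eng") -> str:
    """Convert a PDF to Markdown, running OCR with the given Tesseract language code."""
    if not langcode:
        raise ValueError("langcode must be a non-empty Tesseract code such as 'fra'")
    # Imported here so that a missing package only fails at call time.
    import pymupdf4llm

    # `language` is the parameter named in this guide.
    return pymupdf4llm.to_markdown(pdf_path, language=langcode)


# Example (hypothetical file name):
# markdown = ocr_to_markdown("scanned_fr.pdf", "fra")
```

If the call fails with a missing-language error, the usual cause is that the matching .traineddata file is not in the tessdata folder Tesseract is reading from.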
Common language codes
| Language | Code |
|---|---|
| English | eng |
| French | fra |
| German | deu |
| Spanish | spa |
| Italian | ita |
| Portuguese | por |
| Simplified Chinese | chi_sim |
| Traditional Chinese | chi_tra |
| Japanese | jpn |
| Korean | kor |
| Arabic | ara |
| Russian | rus |
| Hindi | hin |