Tesseract Language Packs

Overview

Tesseract identifies languages by codes that mostly follow the three-letter ISO 639-2 standard, with a few script-specific variants such as chi_sim. English (eng) is installed by default on most platforms. For any other language, you need to install the corresponding language pack before pymupdf4llm can use it for OCR. A full list of supported language codes is available in the Tesseract tessdata repository.

To see which languages are already installed on your system, run tesseract --list-langs in your terminal.
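The same check can be scripted. The following is a minimal sketch, assuming the usual output layout of tesseract --list-langs (one header line followed by one language code per line); the helper names parse_list_langs and installed_languages are hypothetical and not part of pymupdf4llm or Tesseract.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        parts = line.split()
        # Language codes are single tokens; this skips blank lines and the
        # multi-word "List of available languages ..." header (assumed layout).
        if len(parts) != 1:
            continue
        langs.append(parts[0])
    return langs


def installed_languages() -> list[str]:
    """Return installed language codes, or [] if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some older releases print the list to stderr, so read both streams.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

Calling installed_languages() returns a plain list, so a check such as "fra" in installed_languages() can guard an OCR run before it starts.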

Linux

Language pack installation varies slightly by distribution.

Ubuntu / Debian

# List all available language packs
apt-cache search tesseract-ocr

# Install a specific language (e.g. German)
sudo apt install tesseract-ocr-deu

# Install all available languages at once
sudo apt install tesseract-ocr-all

Language packages follow the naming pattern tesseract-ocr-<langcode>, for example tesseract-ocr-fra for French or tesseract-ocr-chi-sim for Simplified Chinese. Note that the package name uses a hyphen where the Tesseract code (chi_sim) uses an underscore.

Fedora / RHEL

# Search for available language packs
dnf search tesseract

# Install a specific language (e.g. German)
sudo dnf install tesseract-langpack-deu

# Install all language packs (quoted so the shell does not expand the wildcard)
sudo dnf install 'tesseract-langpack-*'

On Fedora, packages are named tesseract-langpack-<langcode>.

Arch Linux

# Search for available language packs
pacman -Ss tesseract-data

# Install a specific language (e.g. German)
sudo pacman -S tesseract-data-deu

On Arch, packages are named tesseract-data-<langcode>.
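The three naming patterns can be captured in a small lookup, which is handy when a setup script has to build the package name for whichever distribution it runs on. This is only a sketch: the distro labels and the package_name helper are hypothetical, and it assumes that Fedora and Arch use the Tesseract code unchanged, while Debian-style names replace underscores with hyphens as in the tesseract-ocr-chi-sim example.

```python
# Prefixes taken from the naming patterns described in this section.
# The dictionary keys are labels chosen for this sketch only.
PACKAGE_PREFIXES = {
    "debian": "tesseract-ocr-",
    "fedora": "tesseract-langpack-",
    "arch": "tesseract-data-",
}


def package_name(distro: str, lang_code: str) -> str:
    """Build the language-pack package name for a distro family."""
    if distro not in PACKAGE_PREFIXES:
        raise ValueError(f"unknown distro label: {distro!r}")
    # Debian/Ubuntu names use hyphens (chi_sim -> chi-sim).
    # Assumption: Fedora and Arch keep the Tesseract code as-is.
    suffix = lang_code.replace("_", "-") if distro == "debian" else lang_code
    return PACKAGE_PREFIXES[distro] + suffix
```

If in doubt, confirm the generated name with the search command for your distribution before installing.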

Manual installation (all distros)

If a language pack is not available through your package manager, download the .traineddata file directly from GitHub and copy it to your Tesseract data directory:

# Download language pack (e.g. French)
curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \
  -o fra.traineddata

# Copy to tessdata directory (path varies by distro and Tesseract version)
sudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
# or
sudo cp fra.traineddata /usr/share/tessdata/

Common tessdata locations on Linux:

Distribution	Path
Ubuntu / Debian	`/usr/share/tesseract-ocr/4.00/tessdata/`
Fedora / RHEL	`/usr/share/tesseract/tessdata/`
Arch Linux	`/usr/share/tessdata/`
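Because the location differs between systems, a script can probe for it instead of hard-coding one path. The sketch below checks the TESSDATA_PREFIX environment variable first and then the paths from the table above. The find_tessdata_dir helper is hypothetical, and it assumes TESSDATA_PREFIX points at the tessdata directory itself, which may not hold for older Tesseract releases.

```python
import os
from pathlib import Path

# Candidate paths copied from the table above.
DEFAULT_CANDIDATES = [
    "/usr/share/tesseract-ocr/4.00/tessdata/",
    "/usr/share/tesseract/tessdata/",
    "/usr/share/tessdata/",
]


def find_tessdata_dir(candidates=DEFAULT_CANDIDATES, env=None):
    """Return the first existing tessdata directory as a Path, or None."""
    env = os.environ if env is None else env
    # Assumption: when set, TESSDATA_PREFIX names the tessdata directory.
    prefix = env.get("TESSDATA_PREFIX")
    paths = ([prefix] if prefix else []) + list(candidates)
    for candidate in paths:
        path = Path(candidate)
        if path.is_dir():
            return path
    return None
```

A return value of None means none of the known locations exist, in which case the path has to be found by hand.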

Windows

During installation (recommended)

The Tesseract Windows installer from UB Mannheim lets you select additional language packs during setup. When you reach the Choose Components screen, expand Additional language data and tick the languages you need.

After installation (manual)

If Tesseract is already installed, download language packs manually:

1. Go to github.com/tesseract-ocr/tessdata
2. Download the .traineddata file for your language (e.g. fra.traineddata for French)
3. Copy the file into your Tesseract tessdata folder, typically:

C:\Program Files\Tesseract-OCR\tessdata\

The Chocolatey package (choco install tesseract) includes only English. All additional languages must be added manually using the steps above.
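The manual download and copy can also be done from Python. This is a minimal sketch using only the standard library. It assumes the same raw-file URL pattern as the curl example in the Linux section, and the helper names traineddata_url and download_traineddata are hypothetical.

```python
import urllib.request
from pathlib import Path

# Assumption: same raw-file URL pattern as the curl example for Linux.
TESSDATA_BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/main"


def traineddata_url(lang_code: str) -> str:
    """Build the download URL for one language pack."""
    return f"{TESSDATA_BASE_URL}/{lang_code}.traineddata"


def download_traineddata(lang_code: str, tessdata_dir: str) -> Path:
    """Download <lang_code>.traineddata into tessdata_dir and return its path."""
    target = Path(tessdata_dir) / f"{lang_code}.traineddata"
    # Writing under Program Files usually requires an elevated shell.
    urllib.request.urlretrieve(traineddata_url(lang_code), str(target))
    return target


# Example call (not executed here):
# download_traineddata("fra", r"C:\Program Files\Tesseract-OCR\tessdata")
```

Pass the folder your own installation uses if it differs from the typical path.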

Verify the install

Open Command Prompt or PowerShell and run:

tesseract --list-langs

Your newly installed language should appear in the output.

macOS

The recommended approach on macOS is Homebrew. There are two options depending on how much disk space you want to use.

Install all languages at once

The tesseract-lang formula bundles Tesseract with every available language pack:

brew install tesseract-lang

Install specific languages

If you only need a few languages, install tesseract first and then manually download the .traineddata files you need:

# Install Tesseract engine only
brew install tesseract

# Find the tessdata directory
brew info tesseract
# Look for a line like: /opt/homebrew/share/tessdata

# Download a specific language pack (e.g. French)
curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \
  -o /opt/homebrew/share/tessdata/fra.traineddata

Replace fra with your target language code and adjust the tessdata path to match what brew info tesseract reports on your machine.

If you installed Tesseract via MacPorts instead of Homebrew, use sudo port install tesseract-<langcode>, for example sudo port install tesseract-fra.

Using a language with pymupdf4llm

Once a language pack is installed, pass its code to to_markdown() via the ocr_language parameter. To use several languages at once, join their codes with a plus sign:

import pymupdf4llm

# Single language
md = pymupdf4llm.to_markdown("document.pdf", ocr_language="fra")

# Multiple languages
md = pymupdf4llm.to_markdown("document.pdf", ocr_language="eng+fra+deu")
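When the language list comes from configuration or user input, it can help to confirm that every requested pack is installed before building the combined string. The helper below is a hypothetical sketch, not part of pymupdf4llm. It takes the installed codes as an argument, for example the list printed by tesseract --list-langs.

```python
def build_ocr_language(requested, installed):
    """Join language codes with '+' after checking each one is installed."""
    missing = [code for code in requested if code not in installed]
    if missing:
        raise ValueError(
            "Missing Tesseract language packs: " + ", ".join(missing)
        )
    return "+".join(requested)


# Example use with the call shown above (not executed here):
# language = build_ocr_language(["eng", "fra"], installed={"eng", "fra", "osd"})
# md = pymupdf4llm.to_markdown("document.pdf", ocr_language=language)
```

Failing early with the names of the missing packs is usually easier to act on than an error raised in the middle of an OCR run.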

Common language codes

Language	Code
English	`eng`
French	`fra`
German	`deu`
Spanish	`spa`
Italian	`ita`
Portuguese	`por`
Simplified Chinese	`chi_sim`
Traditional Chinese	`chi_tra`
Japanese	`jpn`
Korean	`kor`
Arabic	`ara`
Russian	`rus`
Hindi	`hin`

For the full list of supported languages and their codes, see the Tesseract tessdata repository.
