# Tesseract Language Packs

> How to install additional Tesseract language packs on Windows, macOS, and Linux for use with PDF4LLM OCR.

<div id="apiIndicatorBadge">
  <div class="inner dotnet" />
</div>

## Overview

Tesseract identifies languages by short codes. Most are three-letter [ISO 639-2](https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes) codes such as `fra` or `deu`; a few carry a suffix for a script or variant, such as `chi_sim`. English (`eng`) is installed by default on most platforms. For any other language, you need to install the corresponding language pack before PDF4LLM can use it for OCR.

A full list of supported language codes is available on the [Tesseract tessdata repository](https://github.com/tesseract-ocr/tessdata).

<Tip>
  To see which languages are already installed on your system, run `tesseract --list-langs` in your terminal.
</Tip>
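
In scripts, the same check can be automated. The sketch below assumes the listing format of `tesseract --list-langs` (a header line followed by one language code per line); `lang_in_list` is a hypothetical helper name, not part of Tesseract.

```bash
# lang_in_list (hypothetical helper): reads a `tesseract --list-langs`
# listing on stdin and succeeds only if the given code is on its own line.
lang_in_list() {
  grep -qx -- "$1"
}

# Only query Tesseract if it is actually on the PATH.
if command -v tesseract >/dev/null 2>&1; then
  # stderr is merged in case a given release prints the listing there.
  if tesseract --list-langs 2>&1 | lang_in_list fra; then
    echo "fra is installed"
  else
    echo "fra is missing"
  fi
fi
```

Matching whole lines (`grep -x`) keeps a short code such as `fr` from matching `fra`, and the header line never matches a bare code.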

***

## Windows

### During installation (recommended)

The Tesseract Windows installer from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki) lets you select additional language packs during setup. When you reach the **Choose Components** screen, expand **Additional language data** and tick the languages you need.

### After installation (manual)

If Tesseract is already installed, download language packs manually:

1. Go to [github.com/tesseract-ocr/tessdata](https://github.com/tesseract-ocr/tessdata)
2. Download the `.traineddata` file for your language (e.g. `fra.traineddata` for French)
3. Copy the file into your Tesseract `tessdata` folder, typically:

```
C:\Program Files\Tesseract-OCR\tessdata\
```

<Note>
  The Chocolatey (`choco install tesseract`) package only includes English. All additional languages must be added manually using the steps above.
</Note>

### Verify the install

Open Command Prompt or PowerShell and run:

```powershell theme={null}
tesseract --list-langs
```

Your newly installed language should appear in the output.

***

## Linux

Language pack installation varies slightly by distribution.

<Tabs>
  <Tab title="Ubuntu / Debian">
    ```bash theme={null}
    # List all available language packs
    apt-cache search tesseract-ocr

    # Install a specific language (e.g. German)
    sudo apt install tesseract-ocr-deu

    # Install all available languages at once
    sudo apt install tesseract-ocr-all
    ```

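    Language packages follow the naming pattern `tesseract-ocr-<langcode>`, for example `tesseract-ocr-fra` for French or `tesseract-ocr-chi-sim` for Simplified Chinese. Note that the package name uses a hyphen (`chi-sim`) where the Tesseract language code uses an underscore (`chi_sim`).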
  </Tab>

  <Tab title="Fedora / RHEL">
    ```bash theme={null}
    # Search for available language packs
    dnf search tesseract

    # Install a specific language (e.g. German)
    sudo dnf install tesseract-langpack-deu

    # Install all language packs (quote the glob so the shell passes it to dnf unexpanded)
    sudo dnf install 'tesseract-langpack-*'
    ```

    On Fedora, packages are named `tesseract-langpack-<langcode>`.
  </Tab>

  <Tab title="Arch Linux">
    ```bash theme={null}
    # Search for available language packs
    pacman -Ss tesseract-data

    # Install a specific language (e.g. German)
    sudo pacman -S tesseract-data-deu
    ```

    On Arch, packages are named `tesseract-data-<langcode>`.
  </Tab>
</Tabs>
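
The naming patterns above can be captured in a small helper for provisioning scripts. This is a minimal sketch: `langpack_name` and the family labels `debian`, `fedora`, and `arch` are hypothetical names chosen for illustration. The Fedora and Arch branches append the code unchanged, which is an assumption for codes that contain an underscore; confirm the exact name with `dnf search` or `pacman -Ss` before installing.

```bash
# langpack_name (hypothetical helper): prints the package name for a
# Tesseract language code on a given distro family.
langpack_name() {
  case "$1" in
    # Debian-style names use hyphens where the Tesseract code has underscores.
    debian) printf 'tesseract-ocr-%s\n' "$(printf '%s' "$2" | tr '_' '-')" ;;
    # Assumption: Fedora and Arch keep the code exactly as Tesseract spells it.
    fedora) printf 'tesseract-langpack-%s\n' "$2" ;;
    arch)   printf 'tesseract-data-%s\n' "$2" ;;
    *)      echo "unknown distro family: $1" >&2; return 1 ;;
  esac
}

langpack_name debian chi_sim   # prints tesseract-ocr-chi-sim
langpack_name fedora deu       # prints tesseract-langpack-deu
```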

### Manual installation (all distros)

If a language pack is not available through your package manager, download the `.traineddata` file directly from GitHub and copy it to your Tesseract data directory:

```bash theme={null}
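# Download language pack (e.g. French); -f makes curl fail on HTTP errors
# instead of saving an error page as the .traineddata file
curl -fL https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \
  -o fra.traineddata

# Copy to tessdata directory (path varies by distro and Tesseract version)
# Debian/Ubuntu: the version directory is 4.00 for Tesseract 4 and 5 for Tesseract 5
sudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
# or
sudo cp fra.traineddata /usr/share/tessdata/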
```

Common tessdata locations on Linux:

| Distribution                  | Path                                      |
| ----------------------------- | ----------------------------------------- |
| Ubuntu / Debian (Tesseract 4) | `/usr/share/tesseract-ocr/4.00/tessdata/` |
| Ubuntu / Debian (Tesseract 5) | `/usr/share/tesseract-ocr/5/tessdata/`    |
| Fedora / RHEL                 | `/usr/share/tesseract/tessdata/`          |
| Arch Linux                    | `/usr/share/tessdata/`                    |
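
Because the location differs between distributions, a script can probe the candidates instead of hard-coding one. The sketch below is one way to do that; `find_tessdata` is a hypothetical helper, and the `/usr/share/tesseract-ocr/5/tessdata` entry is an assumption covering Debian and Ubuntu releases that package Tesseract 5.

```bash
# find_tessdata (hypothetical helper): prints the first candidate directory
# that contains at least one .traineddata file; fails if none does.
find_tessdata() {
  for dir in "$@"; do
    for f in "$dir"/*.traineddata; do
      # If the glob matched nothing, $f is the literal pattern and -e fails.
      if [ -e "$f" ]; then
        printf '%s\n' "${dir%/}"
        return 0
      fi
      break
    done
  done
  return 1
}

find_tessdata \
  /usr/share/tesseract-ocr/4.00/tessdata \
  /usr/share/tesseract-ocr/5/tessdata \
  /usr/share/tesseract/tessdata \
  /usr/share/tessdata \
  || echo "no tessdata directory found" >&2
```

The printed directory is where a manually downloaded `.traineddata` file would be copied.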

***

## macOS

The recommended approach on macOS is [Homebrew](https://brew.sh). There are two options depending on how much disk space you want to use.

### Install all languages at once

The `tesseract-lang` formula bundles Tesseract with every available language pack:

```bash theme={null}
brew install tesseract-lang
```

### Install specific languages

If you only need a few languages, install `tesseract` first and then manually download the `.traineddata` files you need:

```bash theme={null}
# Install Tesseract engine only
brew install tesseract

# Find the tessdata directory
brew info tesseract
# Look for a line like: /opt/homebrew/share/tessdata

# Download a specific language pack (e.g. French)
curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \
  -o /opt/homebrew/share/tessdata/fra.traineddata
```

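Replace `fra` with your target language code and adjust the tessdata path to match what `brew info tesseract` reports on your machine. Homebrew installs under `/opt/homebrew` on Apple silicon and `/usr/local` on Intel Macs, so the directory is usually `$(brew --prefix)/share/tessdata`.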

<Note>
  If you installed Tesseract via MacPorts instead of Homebrew, use `sudo port install tesseract-<langcode>`, for example `sudo port install tesseract-fra`.
</Note>

***

## Docker

For containerised .NET applications, install language packs in your `Dockerfile` alongside Tesseract:

```dockerfile theme={null}
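# Build stage: publishes the app (assumes the project sits at the build-context root)
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish -c Release -o /app/publish

# Runtime stage
FROM mcr.microsoft.com/dotnet/runtime:8.0

# Install Tesseract with specific language packs
RUN apt-get update && apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-eng \
    tesseract-ocr-fra \
    tesseract-ocr-deu \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
# Copy the published output from the build stage above
COPY --from=build /app/publish .

ENTRYPOINT ["dotnet", "MyApp.dll"]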
```

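If the language you need is not available through `apt`, copy the `.traineddata` file directly into the image. The version directory in the path depends on the Tesseract package in the base image (`4.00` for Tesseract 4 packages, `5` for Tesseract 5 packages such as those in Debian 12, which the `runtime:8.0` image is based on). Run `ls /usr/share/tesseract-ocr/` inside the container to confirm.

```dockerfile theme={null}
# Copy a manually downloaded .traineddata file into the image
COPY fra.traineddata /usr/share/tesseract-ocr/5/tessdata/fra.traineddata
```

<Tip>
  If Tesseract cannot find your language data at runtime, set the `TESSDATA_PREFIX` environment variable to the directory containing your `.traineddata` files:

  ```dockerfile theme={null}
  ENV TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata/
  ```
</Tip>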

***

## Using a language with PDF4LLM

Once a language pack is installed, pass its code to `ToMarkdown()`, `ToText()`, or `ParseDocument()` via the `ocrLanguage` parameter:

```csharp theme={null}
using PDF4LLM;

// Single language
string md = PdfExtractor.ToMarkdown("document.pdf", useOcr: true, ocrLanguage: "fra");

// Multiple languages combined with + (separate variable name so both calls compile in one scope)
string mdMulti = PdfExtractor.ToMarkdown("document.pdf", useOcr: true, ocrLanguage: "eng+fra+deu");
```

The same parameter works across all OCR-capable methods:

```csharp theme={null}
// Plain text with OCR
string text = PdfExtractor.ToText("document.pdf", useOcr: true, ocrLanguage: "jpn");

// Structured layout with OCR
ParsedDocument parsed = PdfExtractor.ParseDocument("document.pdf", useOcr: true, ocrLanguage: "ara");
```
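
Before passing a combined string such as `eng+fra+deu`, it can help to confirm that every code in it has a language file on disk. The shell sketch below does that as a pre-flight check; `missing_langs` is a hypothetical helper, and it assumes you already know which tessdata directory Tesseract reads from on your system.

```bash
# missing_langs (hypothetical helper): given a language string such as
# "eng+fra+deu" and a tessdata directory, prints each code that has no
# <code>.traineddata file there. Returns non-zero if any code is missing.
missing_langs() {
  rc=0
  # Split the string on "+" by turning the separators into spaces.
  for code in $(printf '%s' "$1" | tr '+' ' '); do
    if [ ! -f "$2/$code.traineddata" ]; then
      printf '%s\n' "$code"
      rc=1
    fi
  done
  return "$rc"
}

# Example: the directory is an assumption; substitute your own tessdata path.
missing_langs "eng+fra+deu" /usr/share/tessdata \
  || echo "install the language packs listed above" >&2
```

The check only looks for files; it does not verify that a `.traineddata` file is valid or matches your Tesseract version.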

***

## Common language codes

| Language            | Code      |
| ------------------- | --------- |
| English             | `eng`     |
| French              | `fra`     |
| German              | `deu`     |
| Spanish             | `spa`     |
| Italian             | `ita`     |
| Portuguese          | `por`     |
| Simplified Chinese  | `chi_sim` |
| Traditional Chinese | `chi_tra` |
| Japanese            | `jpn`     |
| Korean              | `kor`     |
| Arabic              | `ara`     |
| Russian             | `rus`     |
| Hindi               | `hin`     |

For the full list of supported languages and their codes, see the [Tesseract tessdata repository](https://github.com/tesseract-ocr/tessdata).
