Overview

get_key_values() parses a PDF and extracts structured data from every form field (widget) it contains. It returns a dictionary keyed by field name, with one entry per field that records the field's current value and the pages on which it appears.
This method is only meaningful for Form PDFs, that is, documents that contain interactive widgets. For non-form PDFs, it returns an empty dictionary.

Signature

pymupdf4llm.get_key_values(doc: str | pymupdf.Document) -> dict[str, dict]

Parameters

doc
str | pymupdf.Document
required
Path to the document file, or an already-opened pymupdf.Document instance. Supports PDF, XPS, eBooks, and — with PyMuPDF Pro — Office formats.

Return Value

Returns a dictionary that maps each field name to a dictionary describing that field:
{
    field_name:             # Full field name; nested components separated by dots
    {
        "value": str,       # The current field value, cast to string
        "pages": list,      # 0-based page number(s) where the field appears
    }
    ...
}

Field Dictionary Properties

field_name
string
required
The fully-qualified name of the form field. For hierarchical forms, parent and child names are separated by dots (e.g. "section1.address.city").
value
string
required
The field’s current value, always represented as a string regardless of the original widget type (text, checkbox, radio button, etc.).
pages
list[int]
required
A list of zero-based page indices where this field is present. A field can appear on multiple pages (e.g. when a master field has multiple instances across pages).
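
Because hierarchical names are dot-separated, the flat result can be regrouped into a nested structure if that suits downstream code better. The following is a minimal sketch that assumes the mapping shape shown under Return Value; nest_fields is a hypothetical helper, not part of pymupdf4llm, and the sample data stands in for a real call.

```python
def nest_fields(result):
    """Regroup {"a.b": {...}} into {"a": {"b": {...}}} by splitting names on dots."""
    tree = {}
    for name, field in result.items():
        node = tree
        *parents, leaf = name.split(".")
        for part in parents:
            # Create intermediate levels on demand
            node = node.setdefault(part, {})
        node[leaf] = field
    return tree


# Sample data in the documented shape (stand-in for get_key_values output)
sample = {
    "applicant.name": {"value": "Jane Smith", "pages": [0]},
    "applicant.email": {"value": "jane@example.com", "pages": [0]},
    "terms_accepted": {"value": "Yes", "pages": [1]},
}

nested = nest_fields(sample)
print(nested["applicant"]["name"]["value"])
```

This sketch assumes no name is both a field in its own right and a parent of other fields; if a form does contain such a name, the two entries would be merged and need separate handling.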

Usage

Basic Example

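import pymupdf4llm

# Extract every form field from the PDF, keyed by field name
result = pymupdf4llm.get_key_values("my_form.pdf")

# Print each field's name, current value, and 0-based page indices
for key, field in result.items():
    print(key, field["value"], field["pages"])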

Example Output

Given a simple two-page application form, the output might look like:
{
  "applicant.name":  {"value": "Jane Smith",        "pages": [0]},
  "applicant.email": {"value": "jane@example.com",  "pages": [0]},
  "terms_accepted":  {"value": "Yes",               "pages": [1]},
  "signature":       {"value": "",                  "pages": [1]}
}

Behaviour Notes

If the document contains no widgets, get_key_values() returns an empty dictionary {}. It does not raise an error in this case, so it is safe to call on any supported document.
Regardless of the original widget type (text box, checkbox, radio group, dropdown, or signature), the value is always returned as a str. For empty fields, this is the empty string "".
A single logical field can appear on multiple pages. In this case the field has a single entry in the returned dictionary, and pages contains every page index where the field is rendered (e.g. [0, 2, 4]).
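
To illustrate the multi-page and empty-result behaviour described above, the sketch below inverts the result into a page-to-fields index. It assumes the mapping shape shown under Return Value; fields_by_page is a hypothetical helper, and the field name "header.ref" and its value are invented sample data.

```python
from collections import defaultdict


def fields_by_page(result):
    """Map each 0-based page index to the names of the fields rendered on it."""
    pages = defaultdict(list)
    for name, field in result.items():
        for pno in field["pages"]:
            pages[pno].append(name)
    return dict(pages)


# Invented sample: one field repeated on three pages, one empty signature field
sample = {
    "header.ref": {"value": "A-17", "pages": [0, 2, 4]},
    "signature": {"value": "", "pages": [4]},
}

index = fields_by_page(sample)
print(index)

# An empty result (no widgets) simply yields an empty index
print(fields_by_page({}))
```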

Common Use Cases

Form Data Extraction

Pull structured responses from filled PDF forms — employment applications, tax documents, surveys — without manual copying.
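
As one possible way to hand extracted responses to a spreadsheet or database import, the sketch below serialises a result to CSV using only the standard library. It assumes the mapping shape shown under Return Value; fields_to_csv and the semicolon-joined pages column are illustrative choices, not part of pymupdf4llm.

```python
import csv
import io


def fields_to_csv(result):
    """Serialise the field mapping to CSV text, one row per field."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["field_name", "value", "pages"])
    for name, field in result.items():
        # Join page indices with ';' so they fit in a single CSV cell
        pages = ";".join(str(p) for p in field["pages"])
        writer.writerow([name, field["value"], pages])
    return buf.getvalue()


# Sample data in the documented shape (stand-in for get_key_values output)
sample = {
    "applicant.name": {"value": "Jane Smith", "pages": [0]},
    "terms_accepted": {"value": "Yes", "pages": [0, 1]},
}

csv_text = fields_to_csv(sample)
print(csv_text)
```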

RAG Pre-processing

Augment your Retrieval-Augmented Generation pipeline with clean, structured form data alongside the text content from to_markdown().
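
One way to combine form data with page text is to render the fields as a Markdown table and append it to the string returned by to_markdown(). The sketch below covers only the rendering step and assumes the mapping shape shown under Return Value; fields_to_markdown is a hypothetical helper, and the table layout is an illustrative choice.

```python
def fields_to_markdown(result):
    """Render the field mapping as a Markdown table (pages are 0-based)."""
    lines = ["| Field | Value | Pages |", "| --- | --- | --- |"]
    for name, field in result.items():
        # Escape pipes so values cannot break the table; mark empty values
        value = field["value"].replace("|", "\\|") or "(empty)"
        pages = ", ".join(str(p) for p in field["pages"])
        lines.append(f"| {name} | {value} | {pages} |")
    return "\n".join(lines)


# Sample data in the documented shape (stand-in for get_key_values output)
sample = {
    "applicant.name": {"value": "Jane Smith", "pages": [0]},
    "signature": {"value": "", "pages": [1]},
}

table = fields_to_markdown(sample)
print(table)

# With a real document, the table could be appended to the page text, e.g.:
#   md = pymupdf4llm.to_markdown("my_form.pdf") + "\n\n" + table
```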

Data Validation

Check that required fields are filled before processing a submitted PDF form programmatically.
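
A required-field check might be sketched as follows, assuming the mapping shape shown under Return Value. missing_fields is a hypothetical helper, and the field name "applicant.phone" is invented for the example. The check treats an empty or whitespace-only string as unfilled; how an unchecked checkbox is represented is not stated here, so inspect real output before relying on this for checkboxes.

```python
def missing_fields(result, required):
    """Return the required field names that are absent or have an empty value."""
    missing = []
    for name in required:
        value = result.get(name, {}).get("value", "")
        if not value.strip():
            missing.append(name)
    return missing


# Sample data in the documented shape (stand-in for get_key_values output)
sample = {
    "applicant.name": {"value": "Jane Smith", "pages": [0]},
    "signature": {"value": "", "pages": [1]},
}

required = ["applicant.name", "signature", "applicant.phone"]
problems = missing_fields(sample, required)
print(problems)
```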

Form Auditing

Inventory all fields across a batch of PDF templates to confirm naming conventions and completeness.
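
A batch audit could be sketched like this, assuming each template's result has the mapping shape shown under Return Value. audit_templates, the file names, and the lowercase dotted naming convention encoded in NAME_PATTERN are all hypothetical; substitute whatever convention your templates follow.

```python
import re

# Hypothetical convention: lowercase snake_case components separated by dots
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)*")


def audit_templates(results_by_file, pattern=NAME_PATTERN):
    """Report the field count and non-conforming names for each template."""
    report = {}
    for filename, result in results_by_file.items():
        bad = sorted(name for name in result if not pattern.fullmatch(name))
        report[filename] = {"field_count": len(result), "bad_names": bad}
    return report


# Invented per-file results (stand-ins for get_key_values output)
results_by_file = {
    "application.pdf": {
        "applicant.name": {"value": "", "pages": [0]},
        "Terms Accepted": {"value": "", "pages": [1]},
    },
    "cover_letter.pdf": {},
}

report = audit_templates(results_by_file)
print(report)
```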

See Also

TocHeaders

Detect table-of-contents style heading structure.

IdentifyHeaders

Identify heading levels from font-size statistics across a document.

Extract Markdown

Practical guide to using margins in extraction.