PyMuPDF4LLM integrates with LangChain through a custom document loader that wraps to_markdown() and returns LangChain Document objects. Each document carries the page’s Markdown content in its page_content field and PyMuPDF4LLM’s page metadata in its metadata field.
```python
from langchain_pymupdf4llm import PyMuPDF4LLMLoader

loader = PyMuPDF4LLMLoader("document.pdf")
documents = loader.load()
```
For large documents, split pages into smaller chunks before embedding to improve retrieval precision. LangChain’s MarkdownHeaderTextSplitter is a natural fit because PyMuPDF4LLM output preserves Markdown headings:
```python
from langchain_pymupdf4llm import PyMuPDF4LLMLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

loader = PyMuPDF4LLMLoader("document.pdf")
documents = loader.load()

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "heading_1"),
        ("##", "heading_2"),
        ("###", "heading_3"),
    ]
)

chunks = []
for doc in documents:
    splits = splitter.split_text(doc.page_content)
    # Carry original page metadata forward into each chunk
    for split in splits:
        split.metadata.update(doc.metadata)
        chunks.append(split)

print(f"Created {len(chunks)} chunk(s) from {len(documents)} page(s)")
```
MarkdownHeaderTextSplitter produces semantically meaningful chunks by splitting on headings rather than character count. This works especially well with PyMuPDF4LLM output because heading structure is faithfully preserved.
You can also use RecursiveCharacterTextSplitter for a simpler fixed-size approach:
For large documents or memory-constrained environments, use lazy_load() to yield documents one at a time rather than loading everything into memory at once:
```python
from langchain_pymupdf4llm import PyMuPDF4LLMLoader

loader = PyMuPDF4LLMLoader("large-document.pdf")
for doc in loader.lazy_load():
    # Process one page at a time
    print(f"Page {doc.metadata['page']}: {len(doc.page_content)} chars")
```
Because each document carries source and page metadata, you can scope retrieval to specific pages or files using metadata filters:
```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# all_documents: the chunks or page documents produced earlier
vectorstore = Chroma.from_documents(all_documents, OpenAIEmbeddings())

# Retrieve only from a specific source file
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"source": "annual-report.pdf"},
    }
)
```