Overview
PyMuPDF4LLM’s extraction functions return strings or Python objects; writing them to disk is left to standard Python. The recommended approach is pathlib.Path, which is cross-platform and part of the standard library, so it needs no additional dependencies.
Saving Markdown
import pymupdf4llm
from pathlib import Path
md_text = pymupdf4llm.to_markdown("document.pdf")
Path("output.md").write_text(md_text, encoding="utf-8")
Always specify encoding="utf-8" when writing text files to ensure special characters, symbols, and non-Latin scripts are preserved correctly.
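Path.write_text() raises FileNotFoundError when the target's parent folder does not exist yet. The following is a minimal sketch of a wrapper that creates the folder first and always writes UTF-8; save_text is a hypothetical helper name used for illustration and is not part of PyMuPDF4LLM.

```python
from pathlib import Path
from typing import Union


def save_text(text: str, path: Union[str, Path]) -> Path:
    """Write text as UTF-8, creating missing parent directories first.

    Hypothetical helper for illustration; not part of PyMuPDF4LLM.
    """
    target = Path(path)
    # write_text() fails with FileNotFoundError if the parent folder is missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target


# Usage with extracted Markdown (md_text from pymupdf4llm.to_markdown):
# save_text(md_text, "output/docs/report.md")
```

Returning the resolved Path makes it easy to log or chain the result.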
Saving JSON
Use Python’s built-in json module to serialise the output before writing:
import pymupdf4llm
import json
from pathlib import Path
data = pymupdf4llm.to_json("document.pdf")
Path("output.json").write_text(
    json.dumps(data, indent=2, ensure_ascii=False),
    encoding="utf-8"
)
indent=2 produces human-readable JSON. For large documents where file size matters, omit it to write compact single-line JSON:
Path("output.json").write_text(
    json.dumps(data, ensure_ascii=False),
    encoding="utf-8"
)
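Two details may be worth guarding against when writing JSON. First, if your installed version of to_json() already returns a JSON string rather than a Python object (worth checking with type(data)), passing it to json.dumps() would encode it a second time. Second, json.dumps() still puts a space after each comma and colon unless separators is set. The sketch below handles both; save_json is a hypothetical helper name, and the treatment of a str result as already-serialised JSON is an assumption.

```python
import json
from pathlib import Path
from typing import Any


def save_json(data: Any, path: Path, pretty: bool = True) -> Path:
    """Write extraction output as UTF-8 JSON.

    Hypothetical helper for illustration; not part of PyMuPDF4LLM.
    """
    if isinstance(data, str):
        # Assumption: a str result is already serialised JSON.
        # Parsing it validates it and avoids encoding it a second time.
        data = json.loads(data)
    if pretty:
        payload = json.dumps(data, indent=2, ensure_ascii=False)
    else:
        # Compact separators drop the spaces after "," and ":".
        payload = json.dumps(data, ensure_ascii=False, separators=(",", ":"))
    path.write_text(payload, encoding="utf-8")
    return path


# Usage: save_json(pymupdf4llm.to_json("document.pdf"), Path("output.json"))
```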
Saving Plain Text
import pymupdf4llm
from pathlib import Path
text = pymupdf4llm.to_text("document.pdf")
Path("output.txt").write_text(text, encoding="utf-8")
Saving Page Chunks
When using page_chunks=True, you’ll typically want to save each page as a separate file. Use the page number from the chunk metadata to name each file:
import pymupdf4llm
from pathlib import Path
output_dir = Path("output/pages")
output_dir.mkdir(parents=True, exist_ok=True)
chunks = pymupdf4llm.to_markdown("document.pdf", page_chunks=True)
for chunk in chunks:
    page_num = chunk["metadata"]["page"]
    filepath = output_dir / f"page-{page_num}.md"
    filepath.write_text(chunk["text"], encoding="utf-8")
    print(f"Saved {filepath}")
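With unpadded page numbers, page-10.md sorts before page-2.md in most file listings. A possible variant pads the number with zeros so the files sort in page order. This sketch assumes each chunk has the shape used above (an integer at chunk["metadata"]["page"] and a string at chunk["text"]); save_chunks is a hypothetical helper name.

```python
from pathlib import Path


def save_chunks(chunks: list[dict], output_dir: Path) -> list[Path]:
    """Write one Markdown file per chunk, using zero-padded page numbers.

    Hypothetical helper for illustration; not part of PyMuPDF4LLM.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    pages = [chunk["metadata"]["page"] for chunk in chunks]
    # Pad to the width of the largest page number, with a minimum of 3 digits.
    width = max(3, len(str(max(pages, default=0))))
    saved = []
    for chunk in chunks:
        page_num = chunk["metadata"]["page"]
        filepath = output_dir / f"page-{page_num:0{width}d}.md"
        filepath.write_text(chunk["text"], encoding="utf-8")
        saved.append(filepath)
    return saved


# Usage:
# chunks = pymupdf4llm.to_markdown("document.pdf", page_chunks=True)
# save_chunks(chunks, Path("output/pages"))
```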
Saving with a Matching Filename
To derive the output filename from the input document automatically:
import pymupdf4llm
from pathlib import Path
input_path = Path("reports/annual-report-2025.pdf")
md_text = pymupdf4llm.to_markdown(str(input_path))
output_path = input_path.with_suffix(".md")
output_path.write_text(md_text, encoding="utf-8")
print(f"Saved to {output_path}")
# Saved to reports/annual-report-2025.md
Path.with_suffix() swaps the file extension cleanly, keeping the same directory and stem.
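A short illustration of how with_suffix() treats less tidy filenames may help here. The filenames are made up for the example.

```python
from pathlib import Path

# Only the final suffix is replaced; earlier dots stay part of the stem.
versioned = Path("summary.v2.pdf").with_suffix(".md")
print(versioned)  # summary.v2.md

# A name without an extension simply gains one.
no_ext = Path("README").with_suffix(".md")
print(no_ext)  # README.md
```

The new suffix must include the leading dot; with_suffix("md") raises ValueError.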
Saving to a Different Directory
To write output to a different folder while keeping the original filename:
import pymupdf4llm
from pathlib import Path
input_path = Path("source/document.pdf")
output_dir = Path("extracted")
output_dir.mkdir(parents=True, exist_ok=True)
md_text = pymupdf4llm.to_markdown(str(input_path))
output_path = output_dir / input_path.with_suffix(".md").name
output_path.write_text(md_text, encoding="utf-8")
print(f"Saved to {output_path}")
# Saved to extracted/document.md
Processing Multiple Files
To extract and save output for an entire folder of PDFs:
import pymupdf4llm
from pathlib import Path
input_dir = Path("documents/")
output_dir = Path("extracted/")
output_dir.mkdir(parents=True, exist_ok=True)
pdf_files = list(input_dir.glob("*.pdf"))
print(f"Found {len(pdf_files)} PDF(s)")
for pdf_path in pdf_files:
    print(f"Processing {pdf_path.name}...")
    try:
        md_text = pymupdf4llm.to_markdown(str(pdf_path))
        output_path = output_dir / pdf_path.with_suffix(".md").name
        output_path.write_text(md_text, encoding="utf-8")
        print(f"  ✓ Saved to {output_path}")
    except Exception as e:
        print(f"  ✗ Failed: {e}")
print("Done.")
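The loop above only looks at the top level of the input folder and flattens every result into one output folder. If the PDFs live in nested subfolders, two files with the same name would overwrite each other. One way to avoid that is to mirror the folder layout, sketched below. convert_tree is a hypothetical helper name, and the converter is passed in as a parameter (an assumption made so the sketch does not depend on any particular extraction function).

```python
from pathlib import Path
from typing import Callable


def convert_tree(input_dir: Path, output_dir: Path,
                 convert: Callable[[str], str]) -> list[Path]:
    """Convert every PDF under input_dir, mirroring its folder layout.

    Hypothetical helper for illustration; not part of PyMuPDF4LLM.
    """
    written = []
    for pdf_path in sorted(input_dir.rglob("*.pdf")):
        # Keep the path relative to input_dir so a/x.pdf and b/x.pdf
        # do not overwrite each other in the output folder.
        relative = pdf_path.relative_to(input_dir).with_suffix(".md")
        output_path = output_dir / relative
        output_path.parent.mkdir(parents=True, exist_ok=True)
        output_path.write_text(convert(str(pdf_path)), encoding="utf-8")
        written.append(output_path)
    return written


# Usage:
# convert_tree(Path("documents"), Path("extracted"), pymupdf4llm.to_markdown)
```

This version lets exceptions propagate; wrap the convert() call in try/except as in the loop above if one bad file should not stop the batch.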
Saving Images Alongside Markdown
When write_images=True is used, images are written to disk automatically during extraction:
import pymupdf4llm
from pathlib import Path
image_dir = Path("output/images")
image_dir.mkdir(parents=True, exist_ok=True)
md_text = pymupdf4llm.to_markdown(
    "document.pdf",
    write_images=True,
    image_path=str(image_dir),
    image_format="png",
    dpi=150
)
Path("output/document.md").write_text(md_text, encoding="utf-8")
Image paths in the Markdown output are relative to wherever the .md file is opened from. Keep your Markdown file and image directory in the same parent folder to ensure image links resolve correctly.
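A quick way to confirm that the links resolve after saving is to scan the Markdown file for image references and check each one on disk. The sketch below assumes the links use the standard ![alt](path) syntax with no spaces in the path, and that your viewer resolves relative links against the folder containing the .md file, as most do. missing_images is a hypothetical helper name.

```python
import re
from pathlib import Path

# Matches Markdown image syntax ![alt](target) and captures the target.
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def missing_images(md_path: Path) -> list[str]:
    """Return image targets in md_path that do not resolve to a file.

    Hypothetical helper for illustration; not part of PyMuPDF4LLM.
    """
    text = md_path.read_text(encoding="utf-8")
    missing = []
    for target in IMAGE_LINK.findall(text):
        if target.startswith(("http://", "https://", "data:")):
            continue  # remote or inline images are not checked on disk
        # Relative targets are resolved against the Markdown file's folder.
        if not (md_path.parent / target).is_file():
            missing.append(target)
    return missing


# Usage: print(missing_images(Path("output/document.md")))
```

An empty list means every local image link points at an existing file. If links are reported missing, compare the paths written into the Markdown with where the .md file was saved.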
| Output | Function | Extension | Write Method |
|---|---|---|---|
| Markdown | to_markdown() | .md | Path.write_text() |
| JSON | to_json() | .json | json.dumps() + Path.write_text() |
| Plain text | to_text() | .txt | Path.write_text() |
| Page chunks | to_markdown(page_chunks=True) | .md per page | Path.write_text() in a loop |
| Images | to_markdown(write_images=True) | .png / .jpeg | Written automatically |
Next Steps
Extract Markdown: Full walkthrough of to_markdown() with all common options.
Extract JSON: Bounding boxes and layout data for custom pipelines.
Extract Text: Plain text extraction and whitespace handling.
Images & Graphics: Controlling image extraction, DPI, format, and output path.