PDF Repair

Recover corrupted or broken PDFs with a three-stage repair engine — runs locally.

Three-stage repair engine

1. Strict parse

Validates PDF structure. If valid, outputs a clean optimised copy.

2. Lenient parse

Ignores invalid objects and recovers as much structure as possible.

3. Page rendering

Last resort: renders each page to an image and rebuilds the PDF. Text selection is lost but the document is viewable.

Stage 3 note: If structural recovery fails, each page is rendered as a JPEG image, so text will not be selectable in the output. Stages 1 and 2 preserve text on every page they recover.
Runs entirely in your browser. No uploads. Your files stay private.

How PDF Repair Works — Three-Stage Recovery from xref to Rasterization

PDF is a structured binary format: a header, a sequence of indirect objects, a cross-reference (xref) table that records the byte offset of every object, and a trailer. When any of those parts is damaged (by a failed download, an interrupted save, bit rot on archival media, or a corrupted USB transfer), most viewers refuse to open the file. PDF Repair attempts to rescue what is recoverable using a three-stage pipeline running entirely in your browser.
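To make the file layout concrete, here is a minimal sketch of a marker check that looks for the header, the `startxref` keyword, and the `%%EOF` marker. The names `indexOfAscii` and `inspectMarkers` are hypothetical, and the 1024-byte search windows are an assumption chosen for illustration; none of this is taken from the tool's own code.

```typescript
// Hypothetical triage helper: reports which top-level PDF markers are present.
interface MarkerReport {
  hasHeader: boolean;    // "%PDF-" found within the first 1024 bytes
  hasStartXref: boolean; // "startxref" keyword found anywhere
  hasEof: boolean;       // "%%EOF" found within the last 1024 bytes
}

// Returns the index of the first occurrence of an ASCII needle that lies
// entirely inside [from, to), or -1 when there is none.
function indexOfAscii(bytes: Uint8Array, needle: string, from = 0, to = bytes.length): number {
  const last = Math.min(to, bytes.length) - needle.length;
  for (let i = Math.max(0, from); i <= last; i++) {
    let j = 0;
    while (j < needle.length && bytes[i + j] === needle.charCodeAt(j)) j++;
    if (j === needle.length) return i;
  }
  return -1;
}

function inspectMarkers(bytes: Uint8Array): MarkerReport {
  const tailStart = Math.max(0, bytes.length - 1024);
  return {
    hasHeader: indexOfAscii(bytes, "%PDF-", 0, 1024) !== -1,
    hasStartXref: indexOfAscii(bytes, "startxref") !== -1,
    hasEof: indexOfAscii(bytes, "%%EOF", tailStart) !== -1,
  };
}
```

A file with a header but no `startxref` or `%%EOF` is the typical signature of a truncated download or an interrupted save: the objects may be intact while the trailer is missing.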
Stage 1 is strict parsing with pdf-lib. The library reads the file as the spec describes and produces a clean rewrite if the structure is valid. This succeeds on PDFs with minor formatting issues that strict viewers reject — non-compliant headers, missing whitespace, slightly malformed xref entries — and produces a standards-compliant output identical in content to the original.
Stage 2 is lenient parsing, also via pdf-lib but with the throwOnInvalidObject option disabled. The parser skips broken objects, reconstructs the xref table by scanning for object headers, and rebuilds the trailer. Text, fonts, hyperlinks, and vector graphics survive when their underlying objects are intact. Pages whose objects are unreadable are dropped from the output.
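The idea of rebuilding object offsets by scanning for object headers can be sketched as a small byte scanner. This is an illustration under stated assumptions, not pdf-lib's parser: `scanObjectHeaders`, `latestOffsets`, and the `ObjectEntry` type are hypothetical names, and the scanner assumes single-space separators inside each `N G obj` header.

```typescript
// Illustrative sketch only: names and tolerances are assumptions.
interface ObjectEntry {
  objectNumber: number;
  generation: number;
  offset: number; // byte offset of the first digit of the object number
}

// PDF whitespace bytes: NUL, TAB, LF, FF, CR, SPACE.
function isPdfWhitespace(b: number | undefined): boolean {
  return b === 0x00 || b === 0x09 || b === 0x0a || b === 0x0c || b === 0x0d || b === 0x20;
}

// Reads a run of ASCII digits at pos. Returns the parsed value and the index
// just past the run, or null when pos does not point at a digit.
function readInt(bytes: Uint8Array, pos: number): { value: number; next: number } | null {
  let i = pos;
  let value = 0;
  for (;;) {
    const b = bytes[i];
    if (b === undefined || b < 0x30 || b > 0x39) break;
    value = value * 10 + (b - 0x30);
    i++;
  }
  return i === pos ? null : { value, next: i };
}

// Finds every "<num> <gen> obj" header that starts at the beginning of the
// data or directly after whitespace.
function scanObjectHeaders(bytes: Uint8Array): ObjectEntry[] {
  const found: ObjectEntry[] = [];
  for (let i = 0; i < bytes.length; i++) {
    if (i > 0 && !isPdfWhitespace(bytes[i - 1])) continue;
    const num = readInt(bytes, i);
    if (num === null || bytes[num.next] !== 0x20) continue;
    const gen = readInt(bytes, num.next + 1);
    if (gen === null || bytes[gen.next] !== 0x20) continue;
    const k = gen.next + 1;
    // 0x6f 0x62 0x6a spells "obj".
    if (bytes[k] !== 0x6f || bytes[k + 1] !== 0x62 || bytes[k + 2] !== 0x6a) continue;
    // Reject longer words such as "objx": the keyword must end the token.
    const after = bytes[k + 3];
    const endsToken = after === undefined || isPdfWhitespace(after) ||
      after === 0x3c /* < */ || after === 0x5b /* [ */ || after === 0x28 /* ( */ || after === 0x2f /* / */;
    if (!endsToken) continue;
    found.push({ objectNumber: num.value, generation: gen.value, offset: i });
  }
  return found;
}

// Collapses the scan into one offset per object number. Later definitions
// overwrite earlier ones, mirroring how incremental updates supersede objects.
function latestOffsets(entries: ObjectEntry[]): Record<number, number> {
  const table: Record<number, number> = {};
  for (const e of entries) table[e.objectNumber] = e.offset;
  return table;
}
```

A scan like this can produce false positives when compressed stream data happens to contain a byte sequence that looks like an object header, so a real recovery pass would also need to check that each candidate parses as a complete object.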
Stage 3 is rasterization through pdfjs-dist. PDF.js is far more tolerant of corruption than pdf-lib because it was designed to render anything browsers might encounter. Each page that PDF.js can render is captured to a canvas, JPEG-encoded, and stitched into a fresh PDF using pdf-lib. This stage produces a viewable file even from heavily corrupted inputs but loses text selection, search, and copy-paste because every page becomes an image.
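The canvas size for a rasterized page follows from the page size in points and the target resolution. The sketch below is an assumption-labeled illustration: `planRaster` is a hypothetical helper, and the DPI value is a free parameter because the source does not state which resolution the tool uses. In PDF.js the analogous value is the `scale` passed to `page.getViewport`.

```typescript
// Hypothetical planning helper for one rasterized page.
interface RasterPlan {
  scale: number;     // multiplier applied to the page's point dimensions
  widthPx: number;
  heightPx: number;
  rgbaBytes: number; // uncompressed canvas pixels at 4 bytes each
}

function planRaster(widthPt: number, heightPt: number, dpi: number): RasterPlan {
  if (!(widthPt > 0) || !(heightPt > 0) || !(dpi > 0)) {
    throw new RangeError("page size and dpi must be positive");
  }
  const scale = dpi / 72; // default PDF user space is 72 units per inch
  const widthPx = Math.ceil(widthPt * scale);
  const heightPx = Math.ceil(heightPt * scale);
  return { scale, widthPx, heightPx, rgbaBytes: widthPx * heightPx * 4 };
}
```

For a US Letter page (612 × 792 points) at an assumed 144 DPI, `planRaster(612, 792, 144)` gives a 1224 × 1584 pixel canvas holding 7,755,264 bytes of uncompressed pixel data before JPEG encoding, which is why large documents can strain tab memory.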
The pipeline runs stages in order and stops at the first that succeeds. If Stage 1 produces a usable file, Stages 2 and 3 are skipped. If all three fail, the engine reports why each stage failed, usually because the file was truncated below a recoverable threshold or because the entire content stream was overwritten with zeros.
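The stop-at-first-success behaviour can be sketched as a generic runner. This is a minimal sketch rather than the tool's implementation: `Stage`, `RepairResult`, and `runPipeline` are hypothetical names, and the stage functions are injected so the control flow can be shown without any PDF library. In a real build, each `run` function would wrap the corresponding library calls, for example pdf-lib's `PDFDocument.load` with `throwOnInvalidObject` set to `true` for the strict stage.

```typescript
// Hypothetical types: one entry per repair stage, tried in array order.
type Stage = {
  name: string;
  run: (input: Uint8Array) => Promise<Uint8Array>;
};

type StageFailure = { stage: string; reason: string };

type RepairResult =
  | { ok: true; stage: string; output: Uint8Array }
  | { ok: false; failures: StageFailure[] };

// Runs each stage in order and returns the first successful output.
// A stage signals failure by throwing (or rejecting); the reason is recorded
// so that a total failure can report why every stage gave up.
async function runPipeline(input: Uint8Array, stages: Stage[]): Promise<RepairResult> {
  const failures: StageFailure[] = [];
  for (const stage of stages) {
    try {
      const output = await stage.run(input);
      return { ok: true, stage: stage.name, output };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push({ stage: stage.name, reason });
    }
  }
  return { ok: false, failures };
}
```

Because later stages are only reached when earlier ones throw, the expensive rasterization step never runs for files that strict or lenient parsing can handle.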
Realistic limits: PDFs that have been partially overwritten with garbage, encrypted with a key that is no longer available, or truncated to fewer than about 200 bytes are unrecoverable. Files damaged by bit-flips on a few bytes usually recover at Stage 1 or Stage 2. Files left incomplete when an application crashed mid-save often recover at Stage 2 with the last few pages missing.
All three stages run in your browser: pdf-lib handles parsing and rebuilding, and pdfjs-dist (with its Web Worker) handles rendering. No file leaves your device. After repair, run the result through PDF Compressor if Stage 3 was used, since rasterized pages are much larger than the original text/vector content.

Common Use Cases

01. Recover a truncated download

A PDF cut off mid-download often has a corrupt trailer but mostly intact pages. Stage 2 typically recovers the readable portion.

02. Fix bit rot on old archives

Backups on aging hard drives or optical media occasionally develop bit-flips that confuse strict parsers; lenient mode usually rebuilds them.

03. Salvage a partial save

When an application crashed mid-save, Stage 2 can recover content that was already written, often with only the final page or two missing.

04. Open non-compliant PDFs

Some PDFs from older or specialized generators don't pass strict spec validation. Stage 1 produces a clean, compliant copy that opens in every reader.

Frequently Asked Questions

Not every PDF can be repaired. Files whose content streams have been overwritten with zeros, that have been truncated below about 200 bytes, or whose encryption keys are missing are unrecoverable. The engine reports which stage failed and why.
Text remains selectable if recovery succeeds at Stage 1 or Stage 2, since those stages preserve the original text, fonts, and hyperlinks. Stage 3 rasterizes each page to a JPEG, so text is no longer selectable in the output. Run the result through PDF OCR to restore searchability.
Nothing is uploaded. All three stages (pdf-lib parsing, lenient reconstruction, and pdfjs-dist rendering) run entirely in your browser. Files stay in tab memory and are never transmitted.
Stage 2's lenient mode can sometimes recover partially encrypted-but-broken files. Fully encrypted files must be unlocked first with the PDF Password tool — pdf-lib cannot parse encrypted streams without the password.
Stage 3 embeds every page as a JPEG image, which is far less compact than compressed PDF text and vector data. A 2 MB text PDF might balloon to 20–40 MB after Stage 3. Run the output through PDF Compressor to reduce size.
Stage 1 and Stage 2 preserve form fields, annotations, and hyperlinks if the underlying objects are intact. Stage 3 flattens everything into the page image, losing interactivity.
Many viewers have stricter validation than the actual PDF spec requires. Stage 1 uses pdf-lib's parser plus a clean re-serialization, which produces a standards-compliant copy even when the source had cosmetic non-compliance issues.
Stages 1 and 2 are near-instant for most files. Stage 3 takes time roughly proportional to page count, about half a second per page on a modern laptop. A 50-page rasterization run takes 25–30 seconds.
There is no hard cap. PDFs up to about 200 MB repair comfortably. Larger files may pressure tab memory during Stage 3 rasterization since each page's canvas must be held while encoding.
Stage 1 and 2 preserve embedded font subsets exactly. Stage 3 rasterizes pages, so the font rendering is captured as pixels — the result looks identical but no longer carries the embedded font data.
