PDF OCR — Scan to Text

Extract selectable text from scanned PDFs and images — 13 languages, runs locally.

Runs entirely in your browser. No uploads. Your files stay private.

How Browser-Based OCR Works — Tesseract.js, DPI, and Language Models

PDF OCR converts pictures of words — scanned PDFs, photographs of pages, screenshots of receipts — into actual editable text using Tesseract.js, a WebAssembly port of Google's Tesseract OCR engine. The recognition runs entirely in your browser inside a Web Worker, so the main thread stays responsive while pages are processed sequentially.
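The sequential, one-page-at-a-time flow described above can be sketched as a small loop. This is a minimal illustration, not the tool's actual code: `ocrSequentially`, `PageResult`, and the injected `recognize` callback are hypothetical names. In a real page `recognize` would wrap a Tesseract.js worker; here it is passed in so the loop itself runs anywhere.

```typescript
// Hypothetical result shape for one page; names are illustrative only.
interface PageResult {
  page: number;       // 1-based page number
  text: string;       // recognized text for that page
  confidence: number; // 0-100
}

// Anything that turns one rendered page into text. Injected so this sketch
// does not depend on a particular OCR library being installed.
type Recognize<P> = (page: P) => Promise<{ text: string; confidence: number }>;

async function ocrSequentially<P>(
  pages: P[],
  recognize: Recognize<P>,
  onProgress: (done: number, total: number) => void = () => {},
): Promise<PageResult[]> {
  const results: PageResult[] = [];
  for (let i = 0; i < pages.length; i++) {
    // Await each page before starting the next: one page in flight at a time.
    const { text, confidence } = await recognize(pages[i] as P);
    results.push({ page: i + 1, text, confidence });
    onProgress(i + 1, pages.length); // per-page progress update
  }
  return results;
}
```

Because each `recognize` call is awaited before the next begins, memory use stays close to one rendered page at a time, at the cost of not using more than one worker.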
For PDF inputs, pdfjs-dist renders each page onto an HTML canvas at 192 DPI in Fast mode or 288 DPI in Best mode. Higher DPI gives Tesseract more pixels per glyph, which matters most for small print, faded photocopies, and 6–8 pt footnotes. Image inputs (PNG, JPEG, WebP) are passed directly into Tesseract without re-rendering.
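The DPI figures translate into canvas sizes with simple arithmetic, since PDF page dimensions are expressed in points (1/72 inch). The sketch below is illustrative: the function names are hypothetical, and the mapping of Fast/Best to 192/288 DPI is taken from the text above. With pdf.js, the resulting scale factor would typically be passed as the `scale` option of `page.getViewport`.

```typescript
const POINTS_PER_INCH = 72; // PDF user space: 1 pt = 1/72 inch

type RenderMode = "fast" | "best";
const DPI_BY_MODE: Record<RenderMode, number> = { fast: 192, best: 288 };

// Scale factor relative to the PDF's native 72 units per inch.
function scaleForDpi(dpi: number): number {
  return dpi / POINTS_PER_INCH;
}

// Canvas size in pixels for a page whose size is given in points.
function canvasSize(widthPt: number, heightPt: number, dpi: number) {
  return {
    width: Math.round((widthPt * dpi) / POINTS_PER_INCH),
    height: Math.round((heightPt * dpi) / POINTS_PER_INCH),
  };
}

// Approximate em height in pixels for a given font size in points.
function emHeightPx(fontSizePt: number, dpi: number): number {
  return (fontSizePt * dpi) / POINTS_PER_INCH;
}

// US Letter is 612 x 792 pt.
console.log(canvasSize(612, 792, DPI_BY_MODE.fast)); // { width: 1632, height: 2112 }
console.log(canvasSize(612, 792, DPI_BY_MODE.best)); // { width: 2448, height: 3168 }
console.log(emHeightPx(6, DPI_BY_MODE.fast));        // 16
console.log(emHeightPx(6, DPI_BY_MODE.best));        // 24
```

Best mode is 1.5× the linear resolution of Fast mode, which means 2.25× as many pixels per page. That is where most of the extra processing time comes from.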
Tesseract uses LSTM-based recognition models trained per-language. The first time you OCR in a new language, the model file (typically 2–4 MB compressed) downloads from a CDN and is cached by the browser; subsequent runs reuse the cached model and start almost instantly. Thirteen languages ship out of the box: English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Simplified Chinese, Japanese, Korean, Arabic, and Hindi.
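Each language name corresponds to a Tesseract model code. A minimal lookup might look like the sketch below. The codes are the standard Tesseract traineddata identifiers; the helper `toLangParam` is a hypothetical name, and whether this tool lets you combine languages in a single run is an assumption (Tesseract itself accepts several codes joined with `+`).

```typescript
// Standard Tesseract traineddata codes for the thirteen listed languages.
const TESSERACT_LANG_CODES = {
  English: "eng",
  Spanish: "spa",
  French: "fra",
  German: "deu",
  Italian: "ita",
  Portuguese: "por",
  Dutch: "nld",
  Russian: "rus",
  "Simplified Chinese": "chi_sim",
  Japanese: "jpn",
  Korean: "kor",
  Arabic: "ara",
  Hindi: "hin",
} as const;

type LanguageName = keyof typeof TESSERACT_LANG_CODES;

// Build a Tesseract language string such as "eng+rus" (hypothetical helper).
function toLangParam(languages: LanguageName[]): string {
  if (languages.length === 0) {
    throw new Error("at least one language is required");
  }
  const codes: string[] = languages.map((name) => TESSERACT_LANG_CODES[name]);
  // Drop duplicates while keeping the caller's order.
  const unique = codes.filter((code, i) => codes.indexOf(code) === i);
  return unique.join("+");
}

console.log(toLangParam(["English", "Russian"])); // "eng+rus"
```

Each distinct code implies one model download on first use, so a bilingual run fetches two model files.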
Output is plain UTF-8 text plus a confidence score (0–100) per page. Confidence under about 60 usually means the source image is too low-resolution or the wrong language was selected; try Best mode first, then verify the language. Tesseract preserves line breaks but does not reconstruct columns or tables. On multi-column pages the text is emitted one column after another, so paragraphs that span columns can come out in scrambled order.
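The confidence guidance above amounts to a small decision rule. The sketch below is one way to express it. The threshold of 60 comes from the text and is a rule of thumb rather than a hard limit; the function and type names are hypothetical.

```typescript
type OcrMode = "fast" | "best";
type Advice = "ok" | "retry-in-best-mode" | "check-language-or-rescan";

const LOW_CONFIDENCE = 60; // "about 60" in the text; treat as a heuristic

// Suggest a next step for one page based on its confidence score (0-100).
function adviseOnConfidence(confidence: number, mode: OcrMode): Advice {
  if (confidence >= LOW_CONFIDENCE) return "ok";
  // Try the higher-resolution render first, then look at language or source quality.
  return mode === "fast" ? "retry-in-best-mode" : "check-language-or-rescan";
}

// 1-based page numbers whose confidence falls below the threshold.
function pagesNeedingReview(confidences: number[]): number[] {
  const flagged: number[] = [];
  confidences.forEach((c, i) => {
    if (c < LOW_CONFIDENCE) flagged.push(i + 1);
  });
  return flagged;
}

console.log(adviseOnConfidence(45, "fast"));       // "retry-in-best-mode"
console.log(pagesNeedingReview([92, 41, 60, 59])); // [ 2, 4 ]
```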
Practical limits: Tesseract is good at clean printed text and acceptable on typewritten or moderately faded copy. It struggles with handwriting (cursive is essentially unreadable), heavily skewed scans, dense diacritics in languages whose models are less well trained, and stylized display fonts. For mission-critical OCR (legal discovery, archival digitization), commercial engines like ABBYY FineReader or Google Document AI still produce noticeably better results.
Long PDFs are processed page by page, with progress shown per page. A 100-page scanned PDF in Best mode takes roughly 2–4 minutes on a typical laptop because each page must be rasterized and run through the LSTM model. To speed things up, split very long PDFs with the PDF Splitter and run the pieces in parallel browser tabs.
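To plan a split, it helps to work out even page ranges and a rough time per piece. The sketch below is illustrative only: the function names are hypothetical, and the throughput figure is an assumption based on the FAQ's rough rates (Fast mode 1–2 pages per second, Best mode about half that). Parallel tabs compete for the same CPU cores, so the real speedup may be smaller than the arithmetic suggests.

```typescript
interface PageRange {
  first: number; // 1-based, inclusive
  last: number;  // 1-based, inclusive
}

// Split pages 1..totalPages into at most `parts` contiguous, near-equal ranges.
function splitIntoRanges(totalPages: number, parts: number): PageRange[] {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new RangeError("totalPages must be a positive integer");
  }
  if (!Number.isInteger(parts) || parts < 1) {
    throw new RangeError("parts must be a positive integer");
  }
  const n = Math.min(parts, totalPages);
  const base = Math.floor(totalPages / n);
  const extra = totalPages % n; // the first `extra` ranges get one more page
  const ranges: PageRange[] = [];
  let first = 1;
  for (let i = 0; i < n; i++) {
    const size = base + (i < extra ? 1 : 0);
    ranges.push({ first, last: first + size - 1 });
    first += size;
  }
  return ranges;
}

// Rough duration in seconds at an assumed throughput.
function estimateSeconds(pages: number, pagesPerSecond: number): number {
  return pages / pagesPerSecond;
}

console.log(splitIntoRanges(100, 4));
// [ {first: 1, last: 25}, {first: 26, last: 50}, {first: 51, last: 75}, {first: 76, last: 100} ]
console.log(estimateSeconds(100, 0.5)); // 200 (one tab, assumed 0.5 pages/s)
console.log(estimateSeconds(25, 0.5));  // 50  (per tab, if four tabs ran at full speed)
```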
Encrypted PDFs cannot be parsed by pdfjs-dist until they are unlocked — use the PDF Password tool first. Files never leave your device: pdfjs-dist runs in the main thread, Tesseract runs in a Web Worker, and the only network requests are the initial WASM and language model downloads.

Common Use Cases

01

Digitize scanned contracts

Turn a multi-page scanned PDF contract into editable text you can search, redline, paste into Word, or import into a contract management system.

02

Extract text from photos

Pull the words off a whiteboard photo, a snapshot of a printed page, or a screenshot of a slide where the text is not selectable.

03

Make image-only PDFs searchable

Convert scan-to-PDF outputs from a copier into plain text you can grep, full-text-index, or feed into a search engine.

04

Multi-language document processing

Recognize Latin, Cyrillic, CJK, Arabic, and Devanagari scripts using language-specific Tesseract models — including bilingual paperwork.

Frequently Asked Questions

Your files are not uploaded. OCR runs entirely in your browser: pdfjs-dist renders pages on a canvas, Tesseract.js does the recognition in a Web Worker, and only the initial WASM and language-model files are fetched from a CDN. Your document stays in tab memory.
Thirteen languages are supported out of the box: English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Simplified Chinese, Japanese, Korean, Arabic, and Hindi. Each language uses a Tesseract LSTM model that downloads on first use and caches in the browser.
The Tesseract WASM core (~2 MB) and the chosen language model (2–4 MB) are downloaded from a CDN and cached on first use. Subsequent runs in the same browser skip the download and start in under a second.
Fast mode renders PDF pages at ~192 DPI; Best mode at ~288 DPI. The same Tesseract engine reads both, but Best gives more pixels per glyph and produces noticeably better accuracy on small or faded text at the cost of about 2× the processing time.
Handwriting is not recognized reliably. Tesseract's standard LSTM models are trained on printed text. Cursive and casual handwriting come out as gibberish; carefully hand-printed block letters sometimes work but should not be relied on.
Clean printed text from a 300 DPI scan typically achieves 95–99% accuracy. Photographs of pages, faded photocopies, and small fonts drop to 80–90%. Confidence scores under 60 usually indicate the source is too noisy; re-scan the page or use Best mode.
Tesseract reads top-to-bottom and does not reconstruct page layout. Multi-column scans tend to read down one column and then the next, scrambling paragraphs. For column-aware OCR, a commercial engine is required.
Password-protected PDFs cannot be processed directly. Unlock the PDF first using the PDF Password tool, then run OCR on the unlocked file.
On a typical laptop, Fast mode processes 1–2 pages per second; Best mode is roughly half that. A 100-page PDF in Best mode is around 2–4 minutes. The progress bar updates per page.
The output is text only for now: plain UTF-8 you can copy or save as .txt. To embed the OCR text back into the PDF as a searchable layer, a command-line tool like OCRmyPDF is currently the best option.
