FireConvert

How to OCR a PDF — make a scanned PDF searchable, accurately

You have a scanned PDF. It looks right on screen, but Ctrl+F finds nothing, copy-paste gives you pictures, and every "PDF to Word" tool hands back a blank document. That's because the file has no text — just pictures of text. OCR (optical character recognition) is the step that turns those pictures back into words. Here's the honest version: what OCR actually does, why 300 DPI matters more than the engine, what kills accuracy, and how Tesseract, Google Vision, AWS Textract, and Acrobat's built-in recognizer actually compare.

The short version

  1. OCR turns pixels into text. The scanner or camera captured images; OCR looks at the images and writes a text layer beneath them. The PDF still shows the original image; the text is invisible-but-searchable.
  2. DPI is the single biggest variable. 300 DPI is the floor for reliable text; 600 DPI for forms with small numbers. Below 200 DPI, no engine on Earth saves you.
  3. Language matters. Tesseract ships 100+ language packs; install the right one. Mixed-language documents need both.
  4. Skew, shadow, and handwriting kill accuracy. A camera photo with a curved spine or finger-shadow drops Tesseract to 60%. De-skew first.
  5. Fastest one-off: drop the PDF in Google Drive, right-click → Open with Google Docs. Free OCR on 10 MB / 2 MB-per-page files.
  6. Best accuracy: Google Vision or AWS Textract ($1.50 per 1,000 pages) for anything you'll keep.

What OCR actually does

OCR is a two-stage pipeline. First, page segmentation: the engine looks at a page image, detects which rectangles contain text (vs a photo, chart, or blank space), and decides the reading order — left to right, top to bottom, across columns, across pages. Second, character recognition: for each text rectangle, a neural network (in modern engines) or a shape-matching classifier (in older ones) predicts which letter each glyph is.

The output is a stream of characters with bounding boxes. Good OCR pipelines then write those characters as an invisible text layer into the PDF, positioned exactly behind the visible image. The file still looks identical. But now Ctrl+F works, selection works, and any PDF-to-Word or PDF-to-Excel tool can read the text layer and ignore the pixels.
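You can triage whether a PDF already has a text layer before paying for OCR. A minimal sketch, assuming nothing beyond the Python standard library: scan the raw bytes for the PDF text-showing operators (Tj/TJ). `looks_searchable` is an illustrative helper, not a library API, and it gives false negatives on real files whose content streams are Flate-compressed — treat it as a quick check, not a guarantee.

```python
def looks_searchable(pdf_bytes: bytes) -> bool:
    """Rough heuristic: a scan-only PDF draws images but never uses
    the text-showing operators Tj or TJ. Compressed content streams
    hide these operators, so a False here is not conclusive."""
    return b"Tj" in pdf_bytes or b"TJ" in pdf_bytes

# A page that only draws an image (no text layer):
image_only = b"%PDF-1.4 ... q 612 0 0 792 0 0 cm /Im0 Do Q ..."
# A page with an invisible text layer (3 Tr = invisible render mode):
with_text  = b"%PDF-1.4 ... BT 3 Tr /F1 10 Tf (hello) Tj ET ..."

print(looks_searchable(image_only))  # False
print(looks_searchable(with_text))   # True
```

If this says True, Ctrl+F should already work and you need text extraction, not OCR.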

The failure cases are both stages:

  • Segmentation can misread a two-column article as one flowing column, interleaving left and right. It can read table cells in the wrong order. It can miss rotated text entirely.
  • Recognition confuses look-alike characters: O vs 0, I vs l vs 1, rn vs m, cl vs d. Small type, low DPI, and ornamental fonts amplify every one.

The DPI cliff

If you remember one thing from this post, remember this: OCR accuracy is not a smooth curve from DPI to quality. It's a cliff. Below ~150 DPI, the engine guesses. At 200 DPI, it's legible but noisy. At 300 DPI, accuracy jumps sharply into the 95%+ range. At 600 DPI, you're gaining ~1 point for a huge file size. Here's the shape on a clean typewritten English test document with Tesseract 5:

Word accuracy by scan DPI (the chart's data points):

  • 72 DPI: 42%
  • 150 DPI: 83%
  • 300 DPI: 97%
  • 600 DPI: 98%

The DPI cliff: 150 → 300 is where accuracy becomes usable.
Tesseract 5 word accuracy on a clean English typewritten test page, by scan DPI. Source: measured on 2026-04-22, 10-page test corpus, LSTM mode, no pre-processing beyond de-skew.

The takeaway: the cheapest way to improve OCR accuracy is not a better engine. It's a better scan. A $100 scanner at 300 DPI out-performs a $10,000-a-month SaaS OCR on a phone camera photo at 144 DPI. If you control the scan, go 300 DPI minimum. If the file came in at 150 DPI, every engine will stumble on the same characters.

Recommended DPI by content type:

  • Ordinary typewritten documents — 300 DPI.
  • Small type, footnotes, dense academic text — 400 DPI.
  • Forms with hand-printed numbers, receipts — 600 DPI; the numbers are small and the margin for error is zero.
  • Photographs / mixed-content pages — 300 DPI is enough; the photo still compresses well at that resolution.
  • Architectural drawings, maps — 600+ DPI; tiny labels need every pixel.

Phone cameras are a special case. A 12-megapixel iPhone shooting a letter-sized page from 30 cm delivers roughly 400 DPI at the center — but edge sharpness falls off, and hand-held skew and shadow typically cost 10+ accuracy points. Phone-scanner apps (Adobe Scan, Microsoft Lens, iOS Notes' Scan Documents) all de-skew and flatten; they're dramatically better than raw photos.
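The ~400 DPI figure is easy to sanity-check yourself. A back-of-envelope sketch (`effective_dpi` is a hypothetical helper, not from any library): when the page fills the camera frame, the effective resolution is limited by whichever axis delivers fewer pixels per inch.

```python
def effective_dpi(width_px: int, height_px: int,
                  page_w_in: float, page_h_in: float) -> float:
    """Effective scan resolution when a camera frame is mapped onto a
    page: DPI is capped by the axis with fewer pixels per inch."""
    return min(width_px / page_w_in, height_px / page_h_in)

# 12 MP sensor (4032 x 3024) shooting a letter page (8.5 x 11 in),
# long edge of the frame along the long edge of the page:
dpi = effective_dpi(4032, 3024, 11.0, 8.5)
print(round(dpi))  # 356: just above the 300 DPI floor, with no margin
```

That is why edge falloff and skew hurt so much on phone photos: the center barely clears the cliff, so any degradation at the margins drops below it.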

What kills OCR accuracy

In rough order of how badly each one hurts:

  • Low DPI. Covered above. The dominant factor.
  • Skew. Pages scanned at an angle. 1-2° of skew: Tesseract handles it. 5°+: accuracy drops fast. Fix: pre-process with a de-skew step (Acrobat does this automatically; unpaper or ScanTailor do it free).
  • Handwriting. Print handwriting: 60-75% accuracy on AWS Textract / Google Vision, 30-50% on Tesseract. Cursive: 30-60% Vision/Textract, near-random on Tesseract. If the document is mostly handwritten, budget for hand-review or use a specialist (Google Cloud Document AI, Textract custom).
  • Multi-column layouts. Newspaper, magazine, academic journal pages. Tesseract's automatic page segmentation (tesseract --psm 3, the default) usually handles clean two-column text, but it can still flow dense columns as one interleaved paragraph. Fix: crop to one column at a time, or experiment with other --psm modes for unusual layouts.
  • Shadows and uneven lighting. A finger casting a shadow, a bent page spine. Pre-process to normalize brightness. Acrobat, Adobe Scan, and Microsoft Lens all do this; Tesseract does not.
  • Ornamental or non-standard fonts. Old typewriter fonts, blackletter, heavy italics. Accuracy drops 5-20 points. Training a custom Tesseract model is possible (tesstrain) but that's a weekend project.
  • Non-Latin scripts. Tesseract ships with good Chinese, Japanese, Arabic, Russian, etc., but you must install the language pack. Mixed-language documents need both installed.
  • Watermarks, stamps, colored backgrounds. "CONFIDENTIAL" at 45° across every page becomes text-on-text and kills recognition. Fix: convert to grayscale and binarize first.
  • JPEG compression artifacts. If the source PDF is a low-quality JPEG (quality 40-60), the compression noise itself gets read as text. Re-scan if you can.
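The grayscale-and-binarize fix mentioned above is commonly done with Otsu's method, which picks the threshold separating ink from background automatically. Here is a minimal pure-Python sketch for illustration — real pipelines use OpenCV or Pillow, and `otsu_threshold` is an illustrative helper, not a library call.

```python
def otsu_threshold(pixels):
    """Otsu's method: choose the grayscale threshold that maximizes
    between-class variance, splitting ink pixels from background.
    pixels: iterable of 0-255 grayscale values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = sum_bg = 0
    for t in range(256):
        w_bg += hist[t]           # background class: pixels <= t
        if w_bg == 0:
            continue
        w_fg = total - w_bg       # foreground class: pixels > t
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mu_bg = sum_bg / w_bg
        mu_fg = (total_sum - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mu_bg - mu_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Bimodal page: dark ink (value 30) on a light gray background (200).
page = [30] * 200 + [200] * 800
t = otsu_threshold(page)
binarized = [0 if p <= t else 255 for p in page]
print(t)  # 30: the threshold lands at the ink mode; everything above becomes white
```

A 45° "CONFIDENTIAL" watermark in pale gray falls on the background side of the threshold and disappears, which is exactly why binarization rescues recognition on stamped pages.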

OCR engines — the honest field guide

Tesseract

Open-source, free, command-line. Tesseract 5 (LSTM-based) hits ~97% word accuracy on clean 300-DPI English typewritten text — within a few points of the commercial APIs. Strengths: free, runs locally (privacy), scriptable, 100+ languages. Weaknesses: sensitive to skew and shadows (you pre-process), handwriting is near-unusable, table detection is basic, setup is CLI-only.

One-liner after install:

tesseract input.png output -l eng pdf

For a multi-page PDF, pair with pdftoppm to rasterize first, then concatenate. The ocrmypdf wrapper (Python, MIT) does the whole pipeline — de-skew, binarize, OCR, rebuild PDF — in one command:

ocrmypdf input.pdf output.pdf

Google Cloud Vision / Document AI

Cloud API. $1.50 per 1,000 pages for standard; Document AI's form parser is separate and more expensive. Strengths: best accuracy on handwriting we've tested (85%+ on printed hand lettering), handles skew and lighting variation gracefully, returns rich layout info (paragraphs, tables, reading order). Weaknesses: files upload to Google (compliance concern for some industries), cost adds up on large archives, not offline.

AWS Textract

Cloud API. $1.50 per 1,000 pages text-only; $15 per 1,000 pages for table + form extraction. Strengths: table extraction is the best in the industry — it reconstructs row/column structure, not just text. It also handles forms (key-value pairs) natively. Weaknesses: the same uploads-to-cloud caveat as Vision. Our pick if you're OCR'ing scanned bank statements, invoices, or forms at scale.

Adobe Acrobat Pro — Recognize Text

$19.99/mo. Strengths: desktop, no upload, integrated with every other Acrobat operation (export to Word, redaction, forms), sensible defaults, handles skew and multi-column correctly. Weaknesses: expensive for occasional use, English and major EU languages only by default (other language packs available but awkward to install).

Microsoft OneNote / Word OCR

Free with a Microsoft 365 subscription. Open the PDF in OneNote, right-click an image → Copy Text from Picture. Word 2019+ also opens scanned PDFs with "recognize text" under the hood. Quality: mid-tier — better than Tesseract defaults, not as good as Vision. Nice if you're already paying for 365.

Google Drive / Docs

Free. Upload the PDF to Drive, right-click → Open with Google Docs. Google runs its OCR on the first 10 MB / first ~50 pages and returns a Docs document with the text. The quality is surprisingly good — it's the same engine behind Vision. Perfect for one-off scans where you don't need a batch pipeline.

Scanned-PDF-to-searchable-PDF vs PDF-to-Word

People often conflate these two. They're different:

  • Searchable PDF — the PDF still looks identical (you see the scan), but a hidden text layer sits behind each page. Ctrl+F works, selection works, every PDF reader handles it. This is what ocrmypdf and Acrobat's "Recognize Text" produce. Target when you want to archive the original look and also be able to search.
  • PDF to Word — the PDF is converted into a .docx file. The original page layout is approximated in Word's paragraph / table model. Fonts may shift, tables may reflow. Target when you want to edit the content, not just search.

The pipeline: OCR the scan to get a searchable PDF, then run PDF to Word against the searchable PDF. If you skip the OCR, PDF-to-Word hands you back a DOCX with only images in it — no editable text.

Workflow A — one-off scan, cheapest path

  1. Drop the PDF into Google Drive.
  2. Right-click → Open with → Google Docs. Wait ~30 seconds.
  3. You get a Docs document with the OCR'd text. Copy-paste or File → Download.
  4. If you need a searchable PDF (not a Docs file), install ocrmypdf locally (brew install ocrmypdf / pip install ocrmypdf) and run the one-liner above.

Workflow B — batch of 100+ scanned documents

  1. Scan at 300 DPI, grayscale (not color — half the file size, same OCR accuracy).
  2. ocrmypdf in a shell loop: for f in *.pdf; do ocrmypdf "$f" "ocr_$f"; done.
  3. Output: searchable PDFs, same look, Ctrl+F works.
  4. If you need extracted tables, feed the searchable PDFs into PDF-to-Excel workflow (Acrobat batch, Tabula, or Textract).
  5. For privacy-sensitive content: stay local. ocrmypdf / Tesseract / Acrobat desktop — nothing leaves your machine.
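The shell loop in step 2 has a wart: run it twice and it tries to OCR its own `ocr_*.pdf` outputs. A small Python wrapper can plan the jobs idempotently — a sketch assuming ocrmypdf is installed and on PATH; `plan_jobs` and `run` are hypothetical helpers, though `--skip-text` is a real ocrmypdf flag that leaves pages with existing text alone.

```python
import subprocess
from pathlib import Path

def plan_jobs(pdfs):
    """Map input PDFs to ocr_-prefixed outputs, skipping files that
    are themselves outputs and inputs already processed earlier."""
    have = set(pdfs)
    jobs = []
    for name in sorted(pdfs):
        if name.startswith("ocr_"):
            continue              # this file is an output, not an input
        out = f"ocr_{name}"
        if out in have:
            continue              # already processed on a previous run
        jobs.append((name, out))
    return jobs

def run(folder="."):
    # Assumes ocrmypdf is on PATH; --skip-text skips pages that
    # already carry a text layer.
    pdfs = [p.name for p in Path(folder).glob("*.pdf")]
    for src, dst in plan_jobs(pdfs):
        subprocess.run(["ocrmypdf", "--skip-text", src, dst], check=True)

print(plan_jobs(["b.pdf", "ocr_a.pdf", "a.pdf"]))
# [('b.pdf', 'ocr_b.pdf')] — a.pdf is skipped because ocr_a.pdf exists
```

Rerunning after a crash then picks up exactly where the last run stopped.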

Workflow C — mobile / phone scan

  1. Use a dedicated scanner app, not the raw camera: Adobe Scan (free), Microsoft Lens (free), or iOS Notes "Scan Documents".
  2. These apps auto-detect page edges, de-skew, sharpen, and binarize — pre-processing the image better at capture time than any desktop OCR tool can after the fact.
  3. Adobe Scan has built-in OCR (free tier, English only). Microsoft Lens exports to OneNote or Word with OCR applied.
  4. For other languages, export the processed PDF out of the scanner app, then OCR with ocrmypdf locally.

When to OCR vs when to re-type

Under one page, hand-written, or a single number you need? Re-type. OCR's setup cost isn't worth it.

Anything over three pages, printed, and English? OCR. Tesseract+ocrmypdf takes 30 seconds to install and 5 seconds per page to run.

Hundreds of pages, handwritten, mission-critical (legal discovery, medical records)? Hire a service (AWS Textract or Google Document AI wrapped in a pipeline) or use a specialist vendor. Don't trust Tesseract on cursive.

Privacy — what leaves your machine

If the PDF contains PHI, PII, trade secrets, or anything under NDA:

  • Local, no network: Tesseract, ocrmypdf, Acrobat Pro desktop. Nothing uploads.
  • Uploads (check your policy first): Google Drive OCR, Google Vision, AWS Textract, Microsoft 365 cloud OCR, Smallpdf's OCR, iLovePDF's OCR. All process server-side. Some offer data-residency controls on enterprise plans.

For healthcare / legal / finance: the PDF password guide covers why even an OCR'd PDF should probably not be in your cloud OCR vendor's logs for 90 days. Run local.

Honest compare — accuracy on our test set

Five tools, 10 pages of test content: typewritten English, a scanned bank statement, a 1970s-typewriter letter, a two-column academic paper, and one page of hand-printing.

Tool | Cost | Clean typewritten | Hand-printing | Privacy
---|---|---|---|---
Tesseract 5 / ocrmypdf | Free | ~97% | ~35% | Local; nothing uploads
Google Cloud Vision | $1.50 / 1k pages | ~99% | ~85% | Files upload to Google
AWS Textract | $1.50 / 1k pages (text) | ~99% | ~82% | Files upload to AWS
Acrobat Pro Recognize Text | $19.99/mo | ~98% | ~55% | Local desktop; nothing uploads
Google Drive "Open with Docs" | Free (10 MB cap) | ~98% | ~78% | Files upload to Google
FireConvert (us) | Free tier | Coming soon | Coming soon | Will run locally in-browser (no upload)

Honest picks: ocrmypdf for anything you control the machine for — it's free, accurate, local, and scripted. Google Vision or Textract for anything at scale where uploads are OK. Acrobat Pro if you're already in the Adobe ecosystem. Drive's free OCR for single-file, low-stakes.

Works well / doesn't work yet

Works well

  • Clean 300+ DPI typewritten English at 97-99% accuracy on any engine
  • Major EU languages (French, German, Spanish, Italian, Portuguese)
  • Chinese, Japanese, Korean at 90%+ on modern engines (not the 2005 stereotype)
  • Business documents at 300 DPI — contracts, reports, statements
  • Hand-printing on Google Vision / AWS Textract (~85%)

Doesn't work well

  • Cursive handwriting (<60% even on the best engines; human review is faster)
  • Sub-200-DPI scans (the cliff — no engine recovers)
  • Heavy ornamental fonts (blackletter, script)
  • Math equations (specialist engines exist — Mathpix, InftyReader)
  • Chemical structures, sheet music (domain-specific OCR needed)

Common questions

Will OCR make my PDF bigger?

Slightly — the text layer adds a few KB per page. ocrmypdf re-compresses the image during the process and often returns a file a few percent smaller. If size matters more than accuracy, compress the PDF after OCR.

Can I OCR a password-protected PDF?

No. Remove the user password first — see the PDF password guide — then OCR the unlocked file.

Does OCR change how the PDF looks?

Done correctly, no. The image stays visible, the recognized text lives behind it as an invisible layer. Printed output is identical. If your tool rasterizes-then-rebuilds the PDF, visible quality may change slightly; ocrmypdf avoids that by default.

What's the difference between OCR and text extraction?

OCR creates text from pixels (for scanned PDFs). Text extraction reads existing text objects (for native PDFs generated by software). If Ctrl+F already works in your PDF, you don't need OCR — you need a text extractor like PDF to Word or PDF to Excel.

Why does Ctrl+F still not find text after OCR?

Either the OCR step failed silently (low DPI, scan quality) or the tool wrote the text layer in a font your reader can't render at zero opacity. Open the PDF in a different reader (Acrobat Reader, Chrome's built-in viewer) and try again. If the text is there but garbled, re-scan at higher DPI.

Can I OCR a PDF on my iPhone / Android?

Yes. Adobe Scan (free, ads) and Microsoft Lens (free) both do OCR on the device. iOS Notes "Scan Documents" scans but doesn't do OCR directly; AirDrop to Mac and run Preview's Live Text, or pipe through the Drive workflow above.

How long does OCR take?

Tesseract: ~5 seconds per page on a modern laptop. Google Vision / Textract: a few seconds of round-trip per page via API. Acrobat Pro: similar. A 100-page book: roughly 10 minutes end-to-end with ocrmypdf; about the same via a cloud API but concurrent.
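Those numbers make wall-clock time easy to estimate. A back-of-envelope sketch (`ocr_eta_seconds` is a hypothetical helper; the per-page figures are the rough ones quoted above, not benchmarks):

```python
import math

def ocr_eta_seconds(pages: int, sec_per_page: float = 5.0,
                    concurrency: int = 1) -> float:
    """Back-of-envelope OCR wall-clock estimate. A local Tesseract run
    is modeled as serial (concurrency=1); cloud APIs let you keep many
    pages in flight at once."""
    return math.ceil(pages / concurrency) * sec_per_page

print(ocr_eta_seconds(100) / 60)           # ~8.3 minutes, serial local run
print(ocr_eta_seconds(100, 3.0, 10) / 60)  # 0.5 minutes with 10 concurrent requests
```

The gap between 8 minutes and 30 seconds is the whole argument for concurrency when you're processing an archive rather than a single book.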

Ready?

Our in-browser OCR tool is on the roadmap — it'll run locally via WASM once it ships. Until then: Google Drive "Open with Docs" for a one-off, ocrmypdf on your laptop for anything repeatable, Google Vision / AWS Textract at scale. And while you're here: our live PDF compressor, merger, and PDF to Word all run free, in-browser, no signup.