Convert PDF to Excel — table detection, OCR for scans, and the merged-cell trap
You have a PDF with tables in it. You need those numbers in Excel to total a column, pivot by category, or just stop re-typing. Every converter on the web will promise "one click." Most of them hand you back a spreadsheet where every row is in column A, dollar signs landed in their own cells, and merged headers exploded into 14 empty rows. Here's why that happens, the five table shapes that decide whether any tool on Earth can help, and an honest comparison of Acrobat Export, Smallpdf, iLovePDF, and Tabula.
The short version
- Native PDF or scan? Highlight text in the PDF. If your cursor draws a selection box and copies plain text — native. If it draws a picture-box — scan. Scans need OCR first; see below.
- Native PDF, simple table. Almost any tool works. Acrobat Export, Smallpdf, iLovePDF — all fine. Accuracy: 90-100%.
- Native PDF, merged headers or multi-column layouts. Try Tabula (free, open-source, desktop) first — it lets you hand-draw the table region. Online tools guess wrong here.
- Scan. You need OCR first. Read our guide to OCR'ing a PDF, then run any of the above.
- Stubborn multi-table pages, footnotes, running totals. Accept that the output needs hand-cleanup. Nothing on Earth does this perfectly — not even Acrobat at $19.99/mo.
Native PDF vs scan — the fork in the road
Every other decision in this post is downstream of this one. "PDF" is not a format; it's a container. Two PDFs that look identical on screen can have completely different internals, and which one you have decides whether conversion works at all.
Native (text) PDF
The PDF was generated by software — Word, Excel, Google Sheets, LaTeX, an accounting system, a bank's statement generator, a SaaS invoice. The page has actual text objects at real coordinates. Tools can read those text objects, cluster them into rows and columns by x/y coordinates, and emit a CSV.
- Tell: select a cell's value with your cursor — text highlights, Ctrl+C copies the number itself.
- File size: typically small. A 20-page statement is under 1 MB.
- What works: every PDF-to-Excel tool.
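To make the "cluster by x/y coordinates" idea concrete, here is a minimal sketch, not how any particular converter is implemented. It assumes each word arrives as a dict with `text`, `x0` (left edge), and `top` keys, which mirrors the shape pdfplumber's `extract_words()` returns. The function names and tolerance values are hypothetical, and the approach only suits left-aligned columns.

```python
def cluster(values, tol):
    """Group sorted 1-D positions that sit within `tol` of their neighbor."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    # Represent each group by its mean position.
    return [sum(g) / len(g) for g in groups]


def nearest(centers, v):
    """Index of the cluster center closest to position v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def words_to_grid(words, row_tol=3.0, col_tol=10.0):
    """Place words into a row/column grid using only their coordinates."""
    rows = cluster([w["top"] for w in words], row_tol)
    cols = cluster([w["x0"] for w in words], col_tol)
    grid = [["" for _ in cols] for _ in rows]
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        r, c = nearest(rows, w["top"]), nearest(cols, w["x0"])
        grid[r][c] = (grid[r][c] + " " + w["text"]).strip()
    return grid


words = [
    {"text": "Date", "x0": 50, "top": 100},
    {"text": "Amount", "x0": 200, "top": 100.5},
    {"text": "2024-01-03", "x0": 50.4, "top": 120},
    {"text": "1,234.56", "x0": 201, "top": 119.6},
]
print(words_to_grid(words))
```

Right-aligned numeric columns, merged cells, and wrapped text all break this simple scheme, which is why the table shape matters so much later on.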
Scanned PDF
A human (or a multifunction printer) photographed paper and wrapped the images in a PDF. Each page is a JPEG with no text layer. Extraction tools can't "read" the numbers because there are no numbers — just pixels arranged to look like numbers.
- Tell: try to select a value. Your cursor draws a rectangle over the pixels, no text gets copied.
- File size: larger. A 20-page scanned statement is usually 5-25 MB.
- What works: OCR first (convert pixels → text), then PDF to Excel. Skip OCR and your "converted" spreadsheet will be blank.
If you're in the scan camp, the path is: run OCR on the PDF to get a searchable PDF, then convert. Scanned-PDF-to-Excel in one click is a feature a few tools claim (Acrobat Pro being the most honest about it), but even with the best OCR on Earth, accuracy on a scanned table drops 5-15 points compared to native.
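If you'd rather check in code than with your cursor, the same test can be sketched as a rough heuristic. It assumes you have already pulled the text layer of each page into a list of strings (for example with pdfplumber's `page.extract_text()`, which can return `None` or an empty string for image-only pages). The function name and the 20-character threshold are assumptions for illustration.

```python
def classify_pdf(page_texts, min_chars_per_page=20):
    """Return 'native', 'scan', or 'mixed' based on per-page text layers."""
    if not page_texts:
        raise ValueError("no pages to classify")
    has_text = [
        len((text or "").strip()) >= min_chars_per_page for text in page_texts
    ]
    if all(has_text):
        return "native"
    if not any(has_text):
        return "scan"
    # Some pages have a text layer and some do not (e.g. a scanned appendix).
    return "mixed"
```

A "mixed" result is worth handling separately: run OCR only on the pages without text, or OCR the whole file and accept the accuracy hit.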
The five table shapes (and how they break)
Once you know it's native, the next question is what shape the table is. We measured five common cases against five tools. Accuracy here means "percent of cell values landing in the correct row-and-column position."
Plain English:
- Simple table — single header row, no merges, clear grid. Every tool above 90%. Pick whichever is closest to your workflow.
- Merged cells — a header spans two columns, or a category cell spans three rows. Online tools fall to 70-80% because they un-merge by repeating the value (sometimes correctly, sometimes not). Tabula lets you fix it in the preview; Acrobat gets closest out of the box.
- Scanned — same table but as an image. OCR adds a 10-15 point error floor. Accept it or re-type the ambiguous rows.
- Multi-page — the table continues across pages 3-7 with the header repeating at the top of each page. Online tools often emit seven separate tables instead of one concatenated sheet. Acrobat handles this correctly; Tabula lets you tell it "same table"; the online free tier tools mostly don't.
- Headers & footers contaminated — the page has a running page number, a date, a copyright line, or a chart beside the table. Tools often swallow those into the table as extra rows. Tabula's hand-select beats everything here.
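The accuracy definition used here ("percent of cell values landing in the correct row-and-column position") is simple enough to score yourself. The sketch below is one way to compute it against a hand-typed reference grid. It is an illustration of the metric, not the script behind the numbers in this post, and `cell_accuracy` is a hypothetical name.

```python
def cell_accuracy(expected, actual):
    """Percent of expected cells whose value sits at the same (row, col)."""
    total = sum(len(row) for row in expected)
    if total == 0:
        raise ValueError("expected grid is empty")
    hits = 0
    for r, row in enumerate(expected):
        for c, value in enumerate(row):
            in_bounds = r < len(actual) and c < len(actual[r])
            if in_bounds and actual[r][c].strip() == value.strip():
                hits += 1
    return 100.0 * hits / total


expected = [["Item", "Q1", "Q2"], ["Rent", "100", "120"]]
# A stray empty cell shifts the numbers one column to the right.
shifted = [["Item", "Q1", "Q2"], ["Rent", "", "100", "120"]]
print(round(cell_accuracy(expected, shifted), 1))  # 66.7
```

Note how a single stray cell costs two of six values: position errors cascade along the row, which is why a dollar sign in its own column hurts so much.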
Workflow A — native PDF, simple table
You have a bank statement, an invoice, a Google Sheets export, or any software-generated PDF. Fast path:
- Open any PDF-to-Excel converter (we list the honest short list below).
- Drop the PDF.
- Download the .xlsx.
- Spend 30 seconds checking: did dollar signs land in a separate column? Did totals get captured? Are thousands-separators keeping numbers as text (common failure mode)?
- If numbers are strings, use Excel's `=VALUE()` or Ctrl+H to remove commas.
Workflow B — native PDF, merged cells or multi-table
This is where online tools tend to hallucinate. The best answer is Tabula, a free, open-source, Java-based desktop tool that's been the quiet standard in journalism and research since 2014. You install it, open the PDF, drag a rectangle around the table region, and it emits a clean CSV. The key win: you decide where the table starts and ends, not a heuristic.
- Download Tabula from tabula.technology (free).
- Drop the PDF.
- Drag a rectangle around just the table — skip the page number, skip the footnote.
- Click Preview & Export. Pick CSV.
- Open in Excel. Done.
If Tabula's desktop install is friction, Acrobat Pro's Export feature with "Spreadsheet - Microsoft Excel" is the close second. It also lets you click to define a table region, and handles merged headers better than the online crowd.
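If the export still comes back with blanks where merged cells used to be, the usual repair is to repeat the merged value into the cells it spanned. Here is a minimal sketch of that forward-fill. It assumes a blank really does mean "merged with the neighbor", which is exactly the guess that sometimes goes wrong, so check the result. The function name and parameters are hypothetical.

```python
def fill_merged(grid, header_rows=1, category_cols=(0,)):
    """Copy a merged cell's value into the blank cells it used to span."""
    out = [list(row) for row in grid]
    # Headers merged across columns: fill blanks from the cell to the left.
    for row in out[:header_rows]:
        for c in range(1, len(row)):
            if row[c] == "":
                row[c] = row[c - 1]
    # Category cells merged down rows: fill blanks from the row above.
    for r in range(header_rows + 1, len(out)):
        for c in category_cols:
            if c < len(out[r]) and c < len(out[r - 1]) and out[r][c] == "":
                out[r][c] = out[r - 1][c]
    return out


grid = [
    ["Region", "2023", "", "2024", ""],
    ["", "Q1", "Q2", "Q1", "Q2"],
    ["East", "10", "12", "14", "15"],
    ["", "7", "9", "8", "11"],
]
for row in fill_merged(grid, header_rows=2):
    print(row)
```

Only the columns listed in `category_cols` are filled downward, so a genuinely empty numeric cell is left alone rather than silently inheriting the value above it.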
Workflow C — scanned PDF
Two-step. First OCR the PDF so it becomes a searchable PDF (pixels + text layer), then run a native extractor against the text layer.
- OCR the PDF. Options: Acrobat Pro (File → Enhance Scans → Recognize Text), Tesseract on the command line, Google Drive's "Open with Google Docs" (surprisingly OK for one-off), or an online OCR tool. Detailed options in our OCR guide.
- Verify OCR quality. Try to select a value on page 1. If it copies the right number, you're good. If "O" and "0", "I" and "1" are mixed up, scan quality was too low — re-scan at 300+ DPI.
- Now run PDF to Excel against the searchable PDF — Acrobat, Smallpdf, iLovePDF, Tabula all work.
- Expect 5-15% of cells to need review. Common casualties: currency symbols, trailing parentheses on negatives, column-aligned text that the OCR split mid-word.
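Reviewing those cells by eye is slow, so a small triage pass can point you at the likely casualties. The sketch below is a rough heuristic for cells that should be numeric: it flags unbalanced parentheses and suggests a fix when swapping look-alike letters for digits yields a valid number. The look-alike list, the pattern, and the `review_cell` name are all assumptions, so extend them for your own documents.

```python
import re

# Letters OCR commonly confuses with digits (assumed list, not exhaustive).
LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)
NUMERIC = re.compile(r"^\(?-?[$€£]?\d[\d,]*(\.\d+)?\)?%?$")


def review_cell(value):
    """Return (status, suggestion) for a cell from a numeric column."""
    text = value.strip()
    if text.count("(") != text.count(")"):
        return ("review", text)  # OCR probably dropped one parenthesis
    if NUMERIC.match(text):
        return ("ok", text)
    fixed = text.translate(LOOKALIKES)
    if NUMERIC.match(fixed):
        return ("suspect", fixed)  # plausible number after letter-to-digit swap
    return ("review", text)
```

Treat "suspect" as a suggestion to confirm against the page image, not as an automatic correction: a swapped digit in a financial column is worse than a blank.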
The honest failure modes
Even on a native PDF with a clean table, these break:
- Dollar signs / percent signs in their own column. Many tools treat `$` and the number as separate text objects. Fix: Ctrl+H in Excel to delete `$`, then `=VALUE()`.
- Negatives in parens. `(1,234)` stays as a string. Fix: Find-and-replace `(` with `-` and `)` with nothing.
- Thousands separators keeping numbers as text. `1,234.56` looks like a number but Excel reads it as text because of locale. Fix: Home → Number → set format, or `=NUMBERVALUE(A1, ".", ",")`.
- Wrapped text in a cell becomes two rows. A description that wraps inside a single cell ("Professional services — January invoice") becomes rows 12 and 13. Fix: in Tabula, set "Lattice" mode if the table has grid lines; Acrobat usually handles this.
- Footnote markers stick to numbers. `1,234ᵃ` becomes `1234a` (text). Fix: regex replace.
- Running totals get duplicated. The tool reads "subtotal" rows and emits a `SUM()`-like total. Fix: delete the extracted total column, add your own.
Budget 5-10 minutes of hand-cleanup per 100 rows on anything non-trivial. That's not a tool problem; that's a reality-of-PDF problem.
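If you clean the CSV in a script instead of in Excel, several of those fixes collapse into one function. This is a minimal sketch assuming US-style formatting (`.` as the decimal mark, `,` as the thousands separator) and at most one trailing footnote character. `parse_amount` is a hypothetical helper, and percent values are returned as written, not divided by 100.

```python
import re

# Optional minus, optional currency symbol, digits with commas,
# optional decimals, optional percent sign, optional one-character footnote.
_AMOUNT = re.compile(r"^(-)?\s*[$€£]?\s*(\d[\d,]*(?:\.\d+)?)\s*%?[^\d\s]?$")


def parse_amount(raw):
    """Turn '$1,234.56', '(1,234)' or '1,234ᵃ' into a float."""
    text = raw.strip()
    # Accounting-style negatives are wrapped in parentheses.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1].strip()
    match = _AMOUNT.match(text)
    if match is None:
        raise ValueError(f"not a recognizable amount: {raw!r}")
    sign, digits = match.groups()
    value = float(digits.replace(",", ""))
    return -value if (negative or sign) else value
```

Raising on anything unrecognized is deliberate: a cell like "Q1" or "n/a" should stop the pipeline or go to a review list, not become a number.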
When PDF to Excel isn't the right question
Sometimes the question is "I have PDF data; I need it in a spreadsheet," but the better answer is to skip the Excel step entirely:
- You only need a few numbers. Open the PDF, copy-paste into a cell. Done in 60 seconds. Every tool below adds 5 minutes.
- You need the text, not the numbers. Use PDF to Word. Word's Convert Text to Table (Table → Convert → Text to Table) handles tab- and space-delimited text surprisingly well, and the pipeline PDF → DOCX → table → copy-paste to Excel gets you cleaner output than most PDF-to-Excel tools.
- You need images of each page. PDF to JPG, then drop into Excel 365's Data → From Picture or Google Sheets' image-to-cells. Both are OCR-backed — they do the OCR step for you.
- The source is a web page, not a PDF. Don't "Save as PDF" and convert; use Excel's Data → From Web or Google Sheets' `=IMPORTHTML()`. They read the HTML table directly.
- You want to merge several PDFs first. Merge them into one, then convert in a single pass.
Honest compare — how the tools rank
One year of testing. Five tools against the five table shapes. We're the most honest one — we don't ship PDF-to-Excel yet, and we're telling you which alternative to pick until we do.
| Tool | Cost | Where it wins | Where it loses |
|---|---|---|---|
| Adobe Acrobat Pro Export | $19.99/mo | Best on merged cells + multi-page tables; integrated OCR for scans; click to redefine table regions; handles complex headers; the reference implementation | Expensive for occasional use; desktop install; confused by heavy graphic pages; online version needs Adobe account |
| Tabula (free CLI / desktop) | Free | Hand-drawn table regions — beats every heuristic when the PDF has footers, side charts, or multiple tables per page; open-source; no upload; precise CSV output | Java install required; no scans (you OCR separately); UI circa 2014; no Word/Excel direct output — CSV only |
| Smallpdf PDF to Excel | Free (2 files/day), $12/mo Pro | Clean UI; Dropbox/Drive integration; works in-browser; decent on simple tables | Files upload to their servers (privacy); 2-file-per-day cap is tight; merged cells get mangled; multi-table pages often return one long table with everything concatenated; no region select |
| iLovePDF PDF to Excel | Free (1 task/hour), $6.99/mo Premium | Sensible defaults; batch mode on paid tier; desktop + mobile apps | Free tier uploads to their servers; heuristic-only (no region select); struggles with scans even on paid; conversion quality is middle-of-pack |
| FireConvertApp (us) | Free tier | Not shipped yet — we don't pretend. Pairs well with our live PDF to Word and PDF to JPG as described above. No signup on free tier; honest output (we'll tell you when we can't help) | PDF-to-Excel is roadmap; today you route to Word or image import workflow |
Honest pick: if it's a one-off and the table is simple, Smallpdf or iLovePDF free tier. If the PDF has anything structurally interesting (merged cells, multi-table pages, footers) and you're not paying for Acrobat, install Tabula. If you're doing this weekly for work, Acrobat Pro pays for itself in the time you save.
Batch workflow — if you do this every week
Accounting, bookkeeping, research assistants, paralegals — if the same kind of PDF lands in your inbox on a schedule, build the pipeline once:
- Drop into a watched folder. Gmail → Zapier/Make → Drive folder, or just drag from inbox weekly.
- Normalize the PDFs. If the same vendor sometimes sends 2-page and sometimes 4-page invoices, merge them into one archive before extraction.
- Batch-convert. Acrobat Action Wizard (paid), Tabula's CLI (`tabula-java`, free), or pdfplumber / Camelot in Python (free) for anything custom. Python is 10 lines of code to extract a known-shape table from 100 PDFs.
- Post-process in Excel. Power Query handles the dollar-sign / thousands-separator cleanup with one save-and-replay pipeline.
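For the Python route, a batch extractor can look roughly like the sketch below. It assumes pdfplumber is installed (`pip install pdfplumber`) and that the table you want is the first one on page 1 of every file. The function names and the one-CSV output layout are hypothetical choices. The extractor is passed in as a parameter so you can swap in Camelot or a custom routine.

```python
import csv
from pathlib import Path


def extract_first_table(pdf_path):
    """Return the first table on page 1 as a list of rows, or []."""
    import pdfplumber  # third-party dependency, imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        table = pdf.pages[0].extract_table()
    return table or []


def batch_to_csv(pdf_paths, out_csv, extract=extract_first_table):
    """Write every PDF's rows to one CSV, tagging each row with its source."""
    count = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for path in sorted(map(Path, pdf_paths)):
            for row in extract(path):
                # Cells with no text come back as None; write them as blanks.
                cells = ["" if cell is None else cell for cell in row]
                writer.writerow([path.name] + cells)
                count += 1
    return count
```

A typical call would be `batch_to_csv(Path("statements").glob("*.pdf"), "all.csv")`, where the folder name is a placeholder. Keeping the source filename in column A makes it easy to trace a bad row back to its PDF during cleanup.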
Works well / doesn't work well
Works well
- Software-generated bank statements, brokerage statements, invoices
- Grid-lined tables in any native PDF
- Research papers with simple data tables
- Single-table pages from gov/agency reports
- Scanned PDFs at 300+ DPI, after OCR
Doesn't work well (yet — not in any tool)
- Financial statements with nested merged headers three levels deep
- Scanned PDFs under 200 DPI — the OCR floor is too high
- Handwritten tables — OCR accuracy <60%
- Multi-column magazine-style layouts where a "table" is visual, not structural
- Forms with checkboxes — the check state almost never survives
Common questions
Is there a free PDF-to-Excel tool that handles scanned PDFs?
Kind of. Google Drive's "Open with Google Docs" does OCR for free; you then copy the tables into Sheets. Tabula is free but native-only. Acrobat Pro's trial handles both for seven days. For a one-off scan, the Drive path is the cheapest.
Why are my numbers coming through as text?
Thousands separators, currency symbols, or trailing footnote markers. Excel sees "1,234.56" with a comma as a string in some locales. Select the column → Home → Number. If that doesn't work, =VALUE() in an adjacent column, then paste-special values back.
How accurate is OCR on a scanned invoice?
At 300 DPI with a clean scanner, Tesseract 5 hits ~97% word accuracy on typewritten English text; Google Vision and AWS Textract hit ~99%. On tables specifically, cell-level accuracy is 3-10 points lower because cells interact with grid lines. See our OCR guide for the DPI-to-accuracy curve.
Does password-protected PDF-to-Excel work?
No. Remove the user password first (you need to know the password; use Acrobat or a dedicated unlock tool), then convert. See our PDF password guide for how encryption blocks content extraction.
Can I convert PDF to Excel on my phone?
Yes. Adobe Scan + Acrobat Reader on iOS/Android does OCR + export to Excel on the Acrobat Pro plan. iLovePDF and Smallpdf both have mobile apps. For ad-hoc use, the iPhone Notes app "Scan Documents" + Google Drive OCR is the cheapest path.
What about PDF forms — can I extract field values to Excel?
Different operation. A PDF form with fillable fields stores values in the form dictionary, not as table text. Acrobat Pro's "Export Form Data" (File → Export → Form Data → CSV/XFDF) is the right tool. Tabula and the online converters ignore form data entirely.
Why are multiple pages merging into one confused table?
The tool concatenates pages under one heuristic. If the header row repeats on page 2, 3, 4, some tools emit those repeated headers as data rows. Tabula lets you delete them in the preview; Acrobat's "Retain Flowing Text" handles it; free online tools generally don't. If you're stuck, a regex over the CSV to strip the repeated header works.
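As a sketch of that last-resort cleanup, the snippet below drops any later row that matches the first row. It uses Python's `csv` module in place of a raw regex, so quoted commas inside cells don't trip it up. It assumes the first row of the CSV is the true header. `drop_repeated_headers` is a hypothetical name.

```python
import csv
import io


def drop_repeated_headers(csv_text):
    """Remove data rows that merely repeat the first (header) row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    header = [cell.strip().lower() for cell in rows[0]]
    kept = [rows[0]]
    for row in rows[1:]:
        if [cell.strip().lower() for cell in row] != header:
            kept.append(row)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(kept)
    return out.getvalue()


raw = "Date,Amount\n2024-01-03,10\nDate,Amount\n2024-01-04,20\n"
print(drop_repeated_headers(raw))
```

The comparison ignores case and surrounding whitespace, since repeated headers often come back with slightly different spacing on later pages.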
Ready?
We don't ship PDF-to-Excel yet. For a one-off, pick Smallpdf or iLovePDF from the list above and expect 5-10 minutes of cleanup on anything non-trivial. For weekly workflow, install Tabula. For scans, OCR first, then convert. And while you're here: our PDF compressor, PDF merger, and PDF to Word are live in the PDF & Docs hub, free, no signup.