Understanding PDF → EPUB
The conversion that's never as clean as you'd like.
Why PDF doesn't reverse cleanly into reflowable text, what the OCR fallback fixes (and breaks), and the manual cleanup that's usually unavoidable.
PDF is the wrong shape.
A PDF describes pages, not paragraphs. Text is positioned on the page by absolute coordinates, and line breaks are typographic decisions baked into the file. Extracting "paragraphs" means inferring structure that was deliberately discarded when the page was laid out. Converters do their best; the results are often uneven.
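To make the inference problem concrete, here is a minimal sketch of the kind of heuristic a converter might apply. It is not taken from any particular tool: the `Line` record, the `leading`, `gap_factor`, and `indent` thresholds, and both function names are assumptions chosen for illustration. It treats a large vertical gap or a first-line indent as a paragraph boundary and rejoins words hyphenated across line breaks.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    x: float  # left edge of the line, in points (hypothetical units)
    y: float  # distance from the top of the page, in points


def join_lines(texts):
    """Join the lines of one paragraph, rejoining words split by a hyphen.

    Caveat: a genuine compound that happens to break at its hyphen
    (for example "well-" / "known") is wrongly merged too.
    """
    out = ""
    for t in texts:
        t = t.strip()
        if out.endswith("-"):
            out = out[:-1] + t
        elif out:
            out += " " + t
        else:
            out = t
    return out


def group_paragraphs(lines, leading=14.0, gap_factor=1.5, indent=10.0):
    """Guess paragraph boundaries from line positions alone."""
    paragraphs, current, prev = [], [], None
    left_margin = min((l.x for l in lines), default=0.0)
    for line in sorted(lines, key=lambda l: l.y):
        if prev is not None:
            big_gap = (line.y - prev.y) > leading * gap_factor
            indented = (line.x - left_margin) >= indent
            if big_gap or indented:
                paragraphs.append(join_lines(current))
                current = []
        current.append(line.text)
        prev = line
    if current:
        paragraphs.append(join_lines(current))
    return paragraphs


page = [
    Line("A PDF describes pages, not para-", 72, 100),
    Line("graphs. Text is positioned.", 72, 114),
    Line("Line breaks are typographic", 82, 128),  # indented first line
    Line("decisions.", 72, 142),
]
for paragraph in group_paragraphs(page):
    print(paragraph)
```

Every threshold here is a guess about the typesetter's intent. A page with different leading or hanging indents needs different numbers, which is one reason the output varies so much from one PDF to the next.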
Two flavours of PDF.
Tagged PDF: the structure is embedded (paragraphs, headings, reading order, alt text for images), so conversion works almost cleanly.
Untagged PDF (most of them, especially older ones): the converter has to reconstruct the structure from visual cues such as font size, position, indentation, and whitespace. The result is imperfect, especially for academic papers with multi-column layouts, footnotes, and figures.
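One of those visual cues, font size, can be sketched in a few lines. This is an illustrative heuristic rather than the method of any specific converter: the `(text, font_size)` span format, the `classify_spans` name, and the `heading_ratio` threshold are all assumptions. It takes the most common size, weighted by character count, as body text and labels anything noticeably larger a heading.

```python
from collections import Counter


def classify_spans(spans, heading_ratio=1.2):
    """Label each (text, font_size) span as 'heading' or 'body'.

    The body size is taken to be the size that covers the most characters.
    """
    if not spans:
        return []
    sizes = Counter()
    for text, size in spans:
        sizes[round(size, 1)] += len(text)
    body_size = sizes.most_common(1)[0][0]
    return [
        ("heading" if size >= body_size * heading_ratio else "body", text)
        for text, size in spans
    ]


spans = [
    ("Chapter 1", 18.0),
    ("A PDF describes pages, not paragraphs.", 10.5),
    ("1.1 Scope", 10.5),  # a heading set in the body font
    ("Converters infer structure from visual cues.", 10.5),
]
for kind, text in classify_spans(spans):
    print(kind, "|", text)
```

The third span shows the weakness: a subheading set at body size is indistinguishable from body text by size alone, so it is labelled `body`.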
What converters do.
Calibre's PDF input plugin is the most-used open-source path. Adobe Acrobat's export-to-EPUB is the most thorough commercial option. Both: extract the text stream, segment into paragraphs by visual heuristics, identify headings by font size, embed images, build a TOC. Both: stumble on multi-column layouts, drop formatting that doesn't map (hand-typeset poetry comes out with the wrong line breaks), and merge or split paragraphs at unhelpful places.
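For the Calibre path, the conversion can be scripted through its `ebook-convert` command-line tool, which takes an input file and an output file. The wrapper below is a minimal sketch that assumes Calibre is installed and `ebook-convert` is on the PATH; the function names `build_command` and `convert` are illustrative, and no tuning options are passed.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, epub_path=None):
    """Build the ebook-convert argument list; output defaults to <name>.epub."""
    pdf = Path(pdf_path)
    epub = Path(epub_path) if epub_path else pdf.with_suffix(".epub")
    return ["ebook-convert", str(pdf), str(epub)]


def convert(pdf_path, epub_path=None):
    """Run Calibre's converter and return the path of the EPUB it was asked to write."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; install Calibre first")
    cmd = build_command(pdf_path, epub_path)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(cmd[-1])


print(build_command("textbook.pdf"))
# ['ebook-convert', 'textbook.pdf', 'textbook.epub']
```

A successful exit only means an EPUB was written. It says nothing about whether columns, footnotes, or headings came through intact, so the output still needs inspection.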
OCR for scanned PDFs.
A scanned-image PDF has no text layer, just bitmap images of pages. Conversion requires OCR (Tesseract, ABBYY) to extract the text first. Modern OCR is ~99 % accurate on clean type but drops to 90-95 % on faded scans. That's 5-10 wrong characters in every hundred; if a page holds roughly 2,000 characters, that comes to 100-200 errors per page, every page. Manual proofreading is unavoidable; OCR is the start of the work, not the end.
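The accuracy percentages only become tangible once they are multiplied by a page length. The sketch below does that arithmetic; the figure of 2,000 characters per page is an assumption for illustration, not a number from any OCR tool, and real pages vary widely.

```python
def expected_errors(chars_per_page, accuracy):
    """Expected number of wrong characters on a page at a given character accuracy."""
    return round(chars_per_page * (1.0 - accuracy))


CHARS_PER_PAGE = 2000  # assumed typical page; adjust for the actual book

for accuracy in (0.99, 0.95, 0.90):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.0%} accurate -> about {errors} wrong characters per page")
# 99% accurate -> about 20 wrong characters per page
# 95% accurate -> about 100 wrong characters per page
# 90% accurate -> about 200 wrong characters per page
```

Even the clean-type case leaves around twenty errors on every page under this assumption, which is why proofreading cannot be skipped.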
A worked attempt.
A 200-page academic textbook PDF, untagged, two columns, footnotes, figures. Calibre's conversion: takes 30 seconds, produces an EPUB. Inspection: chapter headings detected mostly correctly (one was wrong because the font matched body text). Columns merged into single flow — usable. Figures embedded but captions orphaned from images. Footnotes interleaved with body text — confusing. Total manual cleanup time to make publishable: 4-6 hours. The automated conversion does 80 % of the work; the last 20 % is hands-on.
200-page textbook, auto + manual: Calibre extracts; human fixes. 30 s automatic + 4-6 hours cleanup = 80/20 split.
When to find the original.
If the book was published in the last decade, an EPUB version probably exists somewhere: the store, the publisher's site, a library lending program. Buying the ebook usually costs less than the hours of human attention the conversion demands. Reach for PDF → EPUB only when the EPUB genuinely doesn't exist, or you have a unique PDF (scanned out-of-print book, lecture notes, archival documents). For those cases, the converter is the right tool; for everything else, it's the wrong starting point.
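The buy-versus-convert comparison is simple arithmetic once the cleanup time has a price. The sketch below uses the low end of the 4-6 hour cleanup estimate from the worked attempt; the hourly rate and the ebook price are hypothetical placeholders, and the function names are illustrative.

```python
def conversion_cost(cleanup_hours, hourly_rate):
    """Cost of the human cleanup time; the automated step is treated as free."""
    return cleanup_hours * hourly_rate


def worth_converting(ebook_price, cleanup_hours, hourly_rate):
    """True only if cleaning up a conversion is cheaper than buying the EPUB."""
    return conversion_cost(cleanup_hours, hourly_rate) < ebook_price


CLEANUP_HOURS = 4    # low end of the 4-6 hour estimate
HOURLY_RATE = 20.0   # hypothetical value of an hour of attention
EBOOK_PRICE = 40.0   # hypothetical store price

print(conversion_cost(CLEANUP_HOURS, HOURLY_RATE))                # 80.0
print(worth_converting(EBOOK_PRICE, CLEANUP_HOURS, HOURLY_RATE))  # False
```

When no EPUB is for sale at any price, the comparison does not apply, and conversion is the only route.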