Understanding PDF → Word
A page of glyphs becomes a stream of words.
Why PDF-to-Word is harder than it sounds, what gets preserved, and the difference between "extract text" and "reconstruct the document".
PDF is a print format.
A PDF page is a flat list of drawing instructions: "draw the glyph 'A' at coordinates (100, 700) in this font at this size". There's no concept of paragraphs, lists, headings, or reading order. A PDF can be perfectly legible on screen while being chaotic internally. Converting back to Word means inferring the structural information that was discarded when the PDF was produced: which paragraphs are headings, which lines are list items, which floating text is a footnote.
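To make the "flat list of drawing instructions" concrete, here is a minimal sketch in Python. The `Glyph` record and the sample page are hypothetical, invented for illustration; a real PDF content stream is more involved, but the point is the same: the stored order of glyphs need not match what a reader sees.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Glyph:
    char: str
    x: float     # left edge, in points
    y: float     # baseline, in points (origin at the bottom-left of the page)
    font: str
    size: float

# Hypothetical page: the glyphs of "Hi" over "yo", stored out of visual order.
page = [
    Glyph("o", 106.0, 680.0, "Helvetica", 12.0),
    Glyph("H", 100.0, 700.0, "Helvetica", 12.0),
    Glyph("y", 100.0, 680.0, "Helvetica", 12.0),
    Glyph("i", 108.0, 700.0, "Helvetica", 12.0),
]

def naive_text(glyphs):
    """Concatenate glyphs in the order they were stored."""
    return "".join(g.char for g in glyphs)

def positional_text(glyphs):
    """Sort top-to-bottom (descending y), then left-to-right, one line per baseline."""
    lines = {}
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        lines.setdefault(g.y, []).append(g.char)
    return "\n".join("".join(chars) for chars in lines.values())

print(naive_text(page))       # oHyi
print(positional_text(page))  # Hi, then yo on the next line
```

Sorting by position is enough for this toy page. It stops being enough as soon as the page has columns, tables, or footnotes.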
Three layers of fidelity.
The easiest output is plain text: a stream of words in something close to reading order, with no formatting. Accuracy is usually reasonable. The middle option is text with basic formatting: bold, italic, font size, and paragraphs detected from line spacing. That is harder. The hardest is a fully styled .docx with tables, columns, images, footnotes, and numbered lists. It needs a separate heuristic for each structural element, and is wrong roughly a third of the time on any PDF more complex than a one-column letter.
Reading order, the hardest part.
A two-column page stores each glyph with a position, not a place in the reading order. Without layout analysis, a converter that simply sorts by position reads straight across the columns (line 1 of column 1, line 1 of column 2, line 2 of column 1, line 2 of column 2) and produces nonsense. Good converters first cluster glyphs into columns by finding the vertical gutters, then flow each column top to bottom. Tables make it worse: a table can sometimes be detected by its grid lines, but a table without visible borders is ambiguous.
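The gutter approach can be sketched as follows. This is an illustration under simplifying assumptions, not how any particular converter works: the `Box` type, the `min_width` threshold of 20 points, and the sample coordinates are all made up, and each box stands for one already-assembled line of text.

```python
from typing import NamedTuple

class Box(NamedTuple):
    x0: float   # left edge
    x1: float   # right edge
    y: float    # baseline; larger means higher on the page
    text: str

def find_gutters(boxes, min_width=20.0):
    """Return the midpoints of horizontal gaps that no box crosses."""
    intervals = sorted((b.x0, b.x1) for b in boxes)
    gutters = []
    right_edge = intervals[0][1]
    for x0, x1 in intervals[1:]:
        if x0 - right_edge >= min_width:
            gutters.append((right_edge + x0) / 2)
        right_edge = max(right_edge, x1)
    return gutters

def reading_order(boxes, min_width=20.0):
    """Order boxes column by column, top to bottom within each column."""
    if not boxes:
        return []
    gutters = find_gutters(boxes, min_width)

    def column_of(box):
        # A box's column index is the number of gutters to its left.
        return sum(1 for g in gutters if box.x0 > g)

    ordered = sorted(boxes, key=lambda b: (column_of(b), -b.y, b.x0))
    return [b.text for b in ordered]

boxes = [
    Box(50, 280, 700, "L1"), Box(320, 550, 700, "R1"),
    Box(50, 270, 685, "L2"), Box(320, 540, 685, "R2"),
]
print(reading_order(boxes))  # ['L1', 'L2', 'R1', 'R2']
```

A full-width heading or a table that spans both columns would cover the gap and hide the gutter from this sketch, which is one reason real layout analysis works on regions of the page rather than the whole page at once.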
Scanned PDFs need OCR.
A "PDF" can be (a) a real PDF whose pages contain text glyphs, or (b) a PDF that's just an embedded scanned image. The first extracts cleanly; the second extracts to nothing because there is no text — just pixels. OCR (optical character recognition, via Tesseract or a cloud OCR service) converts the image back to text, with a variable error rate. Always check whether the source PDF is searchable before assuming the conversion will work.
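One rough way to check searchability is to extract text from each page and see whether anything comes back. The sketch below assumes the per-page text has already been pulled out by some PDF library (for example, pypdf's `PdfReader(path).pages[i].extract_text()`); the function names and the `min_chars` threshold of 20 are illustrative choices, not a standard.

```python
def classify_pages(page_texts, min_chars=20):
    """Label each page 'text' or 'needs-ocr' by its count of non-space characters."""
    labels = []
    for text in page_texts:
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        labels.append("text" if visible >= min_chars else "needs-ocr")
    return labels

def is_searchable(page_texts, min_chars=20):
    """True only if there is at least one page and every page yielded real text."""
    labels = classify_pages(page_texts, min_chars)
    return bool(labels) and all(label == "text" for label in labels)

# Hypothetical extraction results for a three-page file.
pages = [
    "Quarterly memo: results were better than expected.",
    "",          # scanned page: nothing extracted
    "  \n 3 ",   # only a stray page number
]
print(classify_pages(pages))  # ['text', 'needs-ocr', 'needs-ocr']
print(is_searchable(pages))   # False
```

Mixed files are common: a text cover page followed by scanned attachments. Classifying per page lets a converter send only the image pages to OCR.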
A worked extraction.
Take a two-page memo with one heading, two paragraphs, and a bulleted list. The converter groups glyphs into lines (similar Y coordinates), groups lines into paragraphs (similar line spacing, no large vertical gap), and detects the heading (larger font size, bolder weight). Bullets are tricky: they are usually detected heuristically, by a "•", a "—", or a digit followed by a period at the start of the line. The output is a .docx with three paragraph styles (heading, body, list-item) and the text flowed sensibly.
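The two grouping steps can be sketched like this. The `Word` type, the tolerance `y_tol`, and the `gap_factor` multiplier are assumptions chosen for the example; real converters tune such thresholds per document.

```python
from typing import NamedTuple

class Word(NamedTuple):
    x: float
    y: float      # baseline; larger means higher on the page
    size: float   # font size in points
    text: str

def group_lines(words, y_tol=2.0):
    """Cluster words whose baselines lie within y_tol points of the line's first word."""
    lines = []
    for w in sorted(words, key=lambda w: -w.y):
        if lines and abs(lines[-1][0].y - w.y) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [sorted(line, key=lambda w: w.x) for line in lines]

def group_paragraphs(lines, gap_factor=1.5):
    """Start a new paragraph when the baseline gap exceeds gap_factor * font size."""
    paragraphs = []
    prev = None
    for line in lines:
        first = line[0]
        text = " ".join(w.text for w in line)
        if prev is not None and (prev.y - first.y) <= gap_factor * first.size:
            paragraphs[-1].append(text)
        else:
            paragraphs.append([text])
        prev = first
    return [" ".join(p) for p in paragraphs]

words = [
    Word(72, 700, 12, "The"), Word(95, 700, 12, "budget"),
    Word(72, 686, 12, "is"), Word(85, 686.5, 12, "approved."),
    Word(72, 650, 12, "Next"), Word(100, 650, 12, "steps"),
]
print(group_paragraphs(group_lines(words)))
# ['The budget is approved.', 'Next steps']
```

The small tolerance on Y matters because baselines on the same visual line often differ by a fraction of a point, as with "approved." above.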
Detected structure = font size, weight, line spacing → three paragraph styles
Properties of the glyphs inform structural inference: larger + bold → heading; "•" prefix → list item.
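The style rules above can be written down as a small classifier. This is a sketch under assumptions: each line arrives as a hypothetical `(text, font_size, is_bold)` tuple, the body size is taken to be the most common font size, and the bullet pattern covers only the markers mentioned above.

```python
import re
from collections import Counter

# A leading bullet (U+2022), dash (U+2014 or hyphen), or "digits." followed by whitespace.
BULLET = re.compile(r"^\s*(?:[\u2022\u2014-]|\d+\.)\s+")

def classify(lines):
    """Map (text, font_size, is_bold) tuples to (style, text) pairs."""
    if not lines:
        return []
    # Assume the most frequent font size on the page is the body size.
    body_size = Counter(size for _, size, _ in lines).most_common(1)[0][0]
    styled = []
    for text, size, bold in lines:
        if BULLET.match(text):
            styled.append(("list-item", BULLET.sub("", text, count=1)))
        elif size > body_size and bold:
            styled.append(("heading", text))
        else:
            styled.append(("body", text))
    return styled

memo = [
    ("Q3 Budget Memo", 16, True),
    ("The budget is approved.", 11, False),
    ("Spending rose in two areas.", 11, False),
    ("\u2022 Travel", 11, False),
    ("2. Equipment", 11, False),
]
for style, text in classify(memo):
    print(f"{style:10} {text}")
```

The heuristic misfires in predictable ways: a body sentence that begins "2019. The year..." would be taken for a numbered list item, which is the sort of error that has to be cleaned up by hand after conversion.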
What never round-trips.
Form fields, comments and other annotations, embedded fonts that aren't installed on your machine, locked digital signatures, and encrypted content streams all either drop out or arrive in degraded form. The output .docx loses information; if you need to edit the document faithfully, the right tool is the original Word source (if you can get it) or a dedicated PDF editor (Acrobat Pro, Foxit) that operates on the PDF in place.
When this is the right tool.
For a PDF you need to substantially rewrite — a contract you're editing, a chapter you're translating, a memo you're updating — converting to Word and editing there is usually faster than living inside the PDF. The conversion's imperfections are fixable in five minutes; the rewrite would take an hour either way. For a PDF you just need to skim, paste, or quote from, plain-text extraction is enough.