Does it work on scanned PDFs?

No — scanned PDFs contain images of text, not selectable text. Use an OCR tool first to convert image-text into selectable text.

How do I know if my PDF has selectable text?

Open it in any PDF viewer and try to select text. If you can copy individual words, our tool will extract them. If you can only select the whole page as a single image, it's a scan.

Are my PDFs uploaded?

No — extraction happens entirely in your browser via PDF.js.

Is the layout preserved?

Approximately — line breaks and column order are best-effort. Complex layouts (multi-column papers, footnotes) may need cleanup.

Free PDF Text Extractor · AnytimeConvert

Understanding PDFs

A page tree, with strings attached.

What's actually inside a PDF, and why some operations are near-instant while others repaint every page from scratch.

What a PDF actually is.

A PDF file is a small typesetting program. Each page is a stream of drawing commands (set a font, move the pen, render a glyph, draw a line) interpreted by the reader at display or print time. The body of the file holds a tree of objects (pages, fonts, images, metadata); an offset table near the end records where each object starts, and a trailer after it points back to the root of the tree. Adobe published the format in 1993, ISO standardised it in 2008 as ISO 32000-1, and the core model has changed little since.
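To make the file layout concrete: a classic PDF ends with the keyword `startxref`, the byte offset of the offset table, and the marker `%%EOF`, which is why a reader starts at the end of the file. The sketch below reads that offset. `findXrefOffset` is a hypothetical helper written for illustration; it assumes the classic ending and ignores incremental updates and cross-reference streams.

```typescript
// Reads the offset-table (cross-reference) position from the tail of a PDF.
// Hypothetical helper: handles only the classic "startxref <offset> %%EOF"
// ending and does not follow incremental updates.
function findXrefOffset(bytes: Uint8Array): number | null {
  // The pointer lives at the very end, so only the last 1024 bytes are decoded.
  const tailStart = Math.max(0, bytes.length - 1024);
  let tail = "";
  for (let i = tailStart; i < bytes.length; i++) {
    tail += String.fromCharCode(bytes[i]);
  }
  // Use the last occurrence in case the keyword appears earlier in the tail.
  const keywordIndex = tail.lastIndexOf("startxref");
  if (keywordIndex === -1) return null;
  const match = /^startxref\s+(\d+)\s+%%EOF/.exec(tail.slice(keywordIndex));
  return match ? Number(match[1]) : null;
}
```

A reader would then seek to the returned offset, load the table, and follow the trailer to the root object.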

page = a stream of drawing commands

Merge and split — almost free.

Combining two PDFs doesn't re-render anything. The merger reads each page object, copies it verbatim into a new file, rebuilds the page tree, and rewrites the trailer. Splitting is the same thing in reverse — pick a page range, copy those objects, drop the rest. Both operations are bytes-in, bytes-out: the original drawing instructions survive untouched, so the output looks pixel-identical to the source.
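As a toy model of why merging and splitting are cheap, the sketch below treats a document as a page tree whose entries reference page objects. Every type and function name is hypothetical, and a real tool also has to renumber objects and carry over shared resources such as fonts; the point is only that the drawing-command bytes are passed along, never executed.

```typescript
// Toy model: page objects are carried over by reference into a new page
// tree. Their drawing commands are never executed or altered.
interface PageObject {
  mediaBox: [number, number, number, number]; // [x0, y0, x1, y1] in points
  contents: Uint8Array; // the page's drawing-command stream, left untouched
}

interface PageTree {
  kids: PageObject[];
  count: number;
}

// Merge: collect every page in order and rebuild the tree around them.
function mergePageTrees(...sources: PageTree[]): PageTree {
  const kids: PageObject[] = [];
  for (const source of sources) {
    for (const page of source.kids) kids.push(page);
  }
  return { kids, count: kids.length };
}

// Split: keep a 1-based inclusive page range and drop the rest.
function splitPageTree(source: PageTree, first: number, last: number): PageTree {
  const kids = source.kids.slice(first - 1, last);
  return { kids, count: kids.length };
}
```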

PDF → images: rendering.

Turning a PDF into PNGs means actually executing the drawing program for each page and capturing the result as pixels. That's the renderer's job, the same work your viewer does every time you open the file. The output resolution determines the fidelity: 1× scale at 96 DPI looks like screen output; 2× or 3× gives crisp print-quality renders. Be careful with large multi-page PDFs at high scales: a 100-page Letter-size brief at 300 DPI (roughly 3×) is about 8.4 megapixels per page, which adds up to more than 3 GB of uncompressed pixel data before PNG encoding.
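The arithmetic behind that warning is easy to check before rendering. The sketch below relies on the fact that PDF page sizes are measured in points (1/72 inch); the function name and the 4-bytes-per-pixel RGBA figure are assumptions for this example, and PNG files on disk will be smaller than the raw figure.

```typescript
// PDF page sizes are expressed in points; 1 point = 1/72 inch.
const POINTS_PER_INCH = 72;

interface RenderEstimate {
  widthPx: number;
  heightPx: number;
  rawBytes: number; // uncompressed RGBA, assuming 4 bytes per pixel
}

// Estimates the pixel dimensions and raw memory of one rendered page.
function estimateRender(widthPt: number, heightPt: number, dpi: number): RenderEstimate {
  const widthPx = Math.round((widthPt / POINTS_PER_INCH) * dpi);
  const heightPx = Math.round((heightPt / POINTS_PER_INCH) * dpi);
  return { widthPx, heightPx, rawBytes: widthPx * heightPx * 4 };
}

// US Letter is 612 x 792 points (8.5 x 11 inches).
const letterAt300 = estimateRender(612, 792, 300);
// letterAt300 is 2550 x 3300 pixels, 33,660,000 raw bytes per page.
```

Multiplying `rawBytes` by the page count gives a rough upper bound on what the renderer has to hold or stream; for 100 Letter pages at 300 DPI that is about 3.4 GB.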

PDF → text: only if it's there.

A "digital" PDF — exported from Word, generated by a server, typeset by InDesign — embeds the actual text inside the drawing commands. Extracting it is a matter of reading the stream. A "scanned" PDF, on the other hand, is just a picture of text, with no characters embedded. Extraction fails on it because the characters were never there. The fix is OCR — running a recogniser over the rendered pages, producing text after the fact. The two cases look identical in a viewer; only the underlying structure tells them apart.
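The difference shows up directly in the drawing commands. In the sketch below, the "digital" page shows a string with the `Tj` (show text) operator, while the "scanned" page only paints an image with `Do`. `extractShownText` is a hypothetical helper with loud simplifications: it expects an already-decompressed stream, ignores `TJ` arrays and hex strings, and treats string bytes as plain text instead of mapping them through the font's encoding, which a real extractor such as PDF.js has to do.

```typescript
// Pulls literal strings shown by the Tj operator out of a decoded content
// stream. Simplifications for this sketch: no TJ arrays, no hex strings,
// no nested parentheses, no font-encoding lookup.
function extractShownText(contentStream: string): string[] {
  const shown: string[] = [];
  // A literal string "( ... )" followed by the Tj (show text) operator.
  const pattern = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(contentStream)) !== null) {
    // Undo backslash escapes for parentheses and the backslash itself.
    shown.push(match[1].replace(/\\([()\\])/g, "$1"));
  }
  return shown;
}

// Text page: begin text, pick font, move the pen, show a string, end text.
const digitalPage = "BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET";
// Scan-like page: scale the coordinate system and paint one image.
const scannedPage = "q 612 0 0 792 0 0 cm /Im0 Do Q";
```

Running the helper on `digitalPage` yields one string, and on `scannedPage` it yields none, which is the situation where OCR is the only route to text.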

Compressing a PDF.

Most PDF size comes from embedded images. The standard compression recipe: re-encode embedded images at a lower quality (or smaller resolution), strip metadata, drop unused fonts, and recompress the streams at a higher Deflate (zlib) level. Done right, a 50 MB PDF can shrink to 10 MB without visibly changing the document. Done wrong, the text turns soft, which means the tool rasterised whole pages and treated the text as an image. Tools that target images and leave text alone produce the best results.

Page sizes and orientation.

PDF pages can be any size: A4, Letter, custom dimensions, even mixed within the same file. A common problem after merging: page sizes mismatch and the reader auto-fits each one, which makes the document feel uneven. The fix is to settle on a target size before merging and resize or letterbox each input to it. The same applies to orientation: a sideways page among portrait ones is jarring, so set the page's rotation entry (/Rotate) before saving instead of re-rendering the page.
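Letterboxing comes down to one uniform scale factor plus centring offsets. The sketch below computes them for a source page placed on a target page; the type and function names are hypothetical, sizes are in points, and the A4 figures are rounded to whole points.

```typescript
interface PageSize {
  width: number;  // in points
  height: number; // in points
}

interface Placement {
  scale: number;   // uniform scale applied to the source page
  offsetX: number; // left margin on the target page, in points
  offsetY: number; // bottom margin on the target page, in points
}

// Fits a source page inside a target page without distortion, centring it
// and leaving blank bands on whichever axis has spare room.
function letterbox(source: PageSize, target: PageSize): Placement {
  const scale = Math.min(target.width / source.width, target.height / source.height);
  return {
    scale,
    offsetX: (target.width - source.width * scale) / 2,
    offsetY: (target.height - source.height * scale) / 2,
  };
}

const A4: PageSize = { width: 595, height: 842 }; // rounded to whole points
const LETTER: PageSize = { width: 612, height: 792 };
const letterOnA4 = letterbox(LETTER, A4);
```

Taking the smaller of the two ratios is what keeps the whole page visible; the larger ratio would fill the target but crop the content.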
