Understanding PDFs
A page tree, with strings attached.
What's actually inside a PDF, and why some operations are near-instant while others repaint every page from scratch.
What a PDF actually is.
A PDF file is a small typesetting program. Each page is a stream of drawing commands — set a font, move the pen, render a glyph, draw a line — interpreted by the reader at display or print time. There's a tree of objects (pages, fonts, images, metadata) in the body of the file, a cross-reference table of byte offsets near the end, and a trailer at the very end that points back at both the table and the root of the object tree. Adobe published the format in 1993, ISO standardised it as ISO 32000-1 in 2008, and the core model has changed little since.
page = a stream of drawing commands
Merge and split — almost free.
Combining two PDFs doesn't re-render anything. The merger reads each page object, copies it into a new file (renumbering object references along the way), rebuilds the page tree, and rewrites the cross-reference table and trailer. Splitting is the same thing in reverse — pick a page range, copy those objects, drop the rest. Both operations are bytes-in, bytes-out: the drawing instructions inside each content stream survive untouched, so the output looks pixel-identical to the source.
PDF → images: rendering.
Turning a PDF into PNGs means actually executing the drawing program for each page and capturing the result as pixels. That's the renderer's job — what your viewer does every time you open the file. The output resolution determines the fidelity: 1× scale at 96 DPI looks like screen output; 2× or 3× gives crisp print-quality renders. Be careful with large multi-page PDFs at high scales: a 100-page brief rendered at 300 DPI easily produces gigabytes of PNGs.
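The size warning is easy to quantify. A back-of-envelope sketch, assuming US-Letter pages and 24-bit colour (PNG compression shrinks the raw figure, but the order of magnitude stands):

```python
def raw_render_bytes(pages: int, dpi: int,
                     width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Uncompressed RGB pixel bytes for rendering `pages` pages at `dpi`."""
    w_px = round(width_in * dpi)   # 8.5 in x 300 dpi = 2550 px
    h_px = round(height_in * dpi)  # 11  in x 300 dpi = 3300 px
    return pages * w_px * h_px * 3  # 3 bytes per RGB pixel


# A 100-page brief at 300 DPI: ~2.5 GB of raw pixels before PNG compression.
print(raw_render_bytes(100, 300) / 1e9)  # → 2.5245
```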
PDF → text: only if it's there.
A "digital" PDF — exported from Word, generated by a server, typeset by InDesign — embeds the actual text inside the drawing commands. Extracting it is a matter of reading the stream. A "scanned" PDF, on the other hand, is just a picture of text, with no characters embedded. Extraction fails on it because the characters were never there. The fix is OCR — running a recogniser over the rendered pages, producing text after the fact. The two cases look identical in a viewer; only the underlying structure tells them apart.
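The structural difference can be probed without a full parser. A rough stdlib-only sketch: inflate each FlateDecode stream and look for PDF text-showing operators (`Tj`, `TJ`). This is a heuristic, not a complete content-stream parser:

```python
import re
import zlib

# Tj and TJ are the PDF operators that paint glyphs; a file whose
# content streams contain neither has no extractable text.
TEXT_OPS = re.compile(rb"\((?:[^()\\]|\\.)*\)\s*Tj|\]\s*TJ")
STREAM = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def has_embedded_text(pdf_bytes: bytes) -> bool:
    """True if any content stream contains a text-showing operator."""
    for raw in STREAM.findall(pdf_bytes):
        try:
            data = zlib.decompress(raw)  # FlateDecode, the common filter
        except zlib.error:
            data = raw  # stream was stored uncompressed
        if TEXT_OPS.search(data):
            return True
    return False
```

On a scanned PDF this returns `False`: the streams hold image data, never a glyph-painting operator — which is exactly why extraction has nothing to extract.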
Compressing a PDF.
Most PDF size comes from embedded images. The standard compression recipe: re-encode embedded images at a lower quality (or smaller resolution), strip metadata, drop unused fonts, and recompress the content streams with a tighter zlib. Done right, a 50 MB PDF can shrink to 10 MB without visibly changing the document. Done wrong, the text turns soft because the tool rasterised whole pages and recompressed the text as an image. Tools that target images and leave text alone produce the best results.
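The "tighter zlib" step is the easiest part of the recipe to demonstrate: content streams are usually FlateDecode (zlib) data, and streams written by a producer in a hurry at a fast compression level can be re-deflated at maximum level, losslessly. A stdlib-only sketch of that one step:

```python
import zlib


def recompress_stream(flate_data: bytes) -> bytes:
    """Inflate a FlateDecode stream and re-deflate it at max compression."""
    return zlib.compress(zlib.decompress(flate_data), 9)


# Simulate a stream a producer compressed at the fastest (loosest) level.
content = b"0 0 m 100 100 l S\n" * 5000
original = zlib.compress(content, 1)
smaller = recompress_stream(original)
assert len(smaller) <= len(original)
assert zlib.decompress(smaller) == content  # lossless: same bytes out
```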
Page sizes and orientation.
PDF pages can be any size — A4, Letter, custom millimetres, even mixed within the same file. A common bug pattern after merging: page sizes mismatch and the reader auto-fits each one, which makes the document feel uneven. The fix is to settle on a target size before merging and resize/letterbox each input to it. The same applies to orientation: a sideways page among portrait ones is jarring; set the page's /Rotate entry before saving rather than re-rendering it.
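The resize/letterbox step is plain geometry: scale each page uniformly to fit the target box, then centre it. A hypothetical helper sketching the arithmetic, in points (the PDF unit, 1/72 inch):

```python
def fit_page(src_w: float, src_h: float,
             dst_w: float, dst_h: float) -> tuple[float, float, float]:
    """Uniform scale plus x/y offsets that centre src inside dst."""
    scale = min(dst_w / src_w, dst_h / src_h)  # min: fit fully, never crop
    dx = (dst_w - src_w * scale) / 2  # horizontal letterbox margin
    dy = (dst_h - src_h * scale) / 2  # vertical letterbox margin
    return scale, dx, dy


# A4 (roughly 595 x 842 pt) letterboxed onto US Letter (612 x 792 pt):
# height is the binding constraint, so the page shrinks slightly and
# picks up equal margins on the left and right.
scale, dx, dy = fit_page(595, 842, 612, 792)
```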