Understanding pitch shifting
Change the pitch, keep the timing.
Why pitch shifting is harder than playing slower, what semitones are, and the artifacts you can't fully avoid.
The naive way is wrong.
Speeding up an audio file makes it both faster and higher-pitched, like a record at the wrong RPM. Slowing it down makes it slower and lower. The two are linked by physics: pitch is frequency, frequency is cycles per second, and playing more samples per second raises both the playback speed and the pitch. To shift pitch alone (without changing speed), or speed alone (without changing pitch), you have to do real digital signal processing.
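The coupling between speed and pitch can be seen in a few lines. The sketch below is only an illustration: the helper names, the 8000 Hz sample rate, and the zero-crossing frequency estimate are assumptions chosen to keep it dependency-free, not part of any real audio tool.

```python
import math

def sine(freq_hz, seconds, sample_rate=8000):
    """Generate a pure tone as a list of float samples."""
    n = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def naive_speed_change(samples, speed):
    """Read through the samples `speed` times faster, using linear interpolation."""
    out = []
    pos = 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        out.append((1 - frac) * samples[i] + frac * samples[i + 1])
        pos += speed
    return out

def estimate_freq(samples, sample_rate=8000):
    """Rough frequency estimate: count upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)

tone = sine(440, 1.0)
fast = naive_speed_change(tone, 1.5)

# The faster version is shorter AND higher: both change by the same factor.
print(len(fast) / len(tone))   # roughly 1 / 1.5
print(estimate_freq(fast))     # roughly 440 * 1.5
```

Playing the tone 1.5 times faster leaves about two thirds of the samples and moves the tone from about 440 Hz to about 660 Hz, which is the link the paragraph above describes.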
Semitones, the unit.
Pitch shifts are measured in semitones: twelve to an octave, each a frequency ratio of 2^(1/12) ≈ 1.0595. Shifting up by 12 semitones doubles the frequency; down by 12 halves it. "Cents" are 1/100 of a semitone, used for fine-tuning. A typical speech pitch shift to disguise a voice runs ±2 to 3 semitones; ±12, a full octave, starts sounding like a different person.
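The semitone and cent arithmetic is small enough to write out. The function names below are illustrative, not taken from any library.

```python
import math

def semitones_to_ratio(semitones):
    """Frequency ratio for a shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)

def ratio_to_semitones(ratio):
    """Inverse conversion: how many semitones a frequency ratio spans."""
    return 12.0 * math.log2(ratio)

def cents_to_ratio(cents):
    """Frequency ratio for a shift in cents (1 cent = 1/100 of a semitone)."""
    return 2.0 ** (cents / 1200.0)

print(semitones_to_ratio(1))    # one semitone up
print(semitones_to_ratio(12))   # one octave up
print(semitones_to_ratio(-12))  # one octave down
print(cents_to_ratio(50))       # a quarter-tone nudge for fine-tuning
```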
How pitch-shifting works.
The phase vocoder is the classic algorithm. Split the audio into short overlapping windows, FFT each one into a frequency-domain representation, shift the frequencies by the desired ratio, adjust the phases so that neighbouring windows still line up, inverse-FFT, and overlap-add the results. The frequency-domain operation is "scale every component by the shift factor". The temporal length stays the same; the spectral content gets stretched along the frequency axis. Other approaches (PSOLA, WSOLA, granular) work on short segments of the waveform directly and make different trade-offs between complexity and artifacts.
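A minimal phase-vocoder sketch in NumPy follows. It is not a production implementation and it makes one substitution that should be stated plainly: instead of scaling frequency bins directly, it uses the common variant that time-stretches with the phase vocoder and then resamples back to the original length, which yields the same kind of result. The frame size, hop size, Hann window, and function names are all assumptions.

```python
import numpy as np

def time_stretch(x, rate, n_fft=1024, hop=256):
    """Phase-vocoder time stretch: rate < 1 lengthens, rate > 1 shortens. Pitch is kept."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Analysis: windowed FFT of each overlapping frame, shape (frames, bins)
    spec = np.array([np.fft.rfft(window * x[i * hop:i * hop + n_fft])
                     for i in range(n_frames)])
    # Phase advance each bin's centre frequency would make over one hop
    omega = 2 * np.pi * hop * np.arange(spec.shape[1]) / n_fft
    steps = np.arange(0, n_frames - 1, rate)  # fractional analysis positions
    phase = np.angle(spec[0])
    out = np.zeros(len(steps) * hop + n_fft)
    norm = np.zeros_like(out)
    for k, t in enumerate(steps):
        i = min(int(t), n_frames - 2)
        frac = t - i
        # Interpolate magnitudes between the two nearest analysis frames
        mag = (1 - frac) * np.abs(spec[i]) + frac * np.abs(spec[i + 1])
        # Synthesis: inverse FFT with the accumulated phase, then overlap-add
        frame = np.fft.irfft(mag * np.exp(1j * phase), n=n_fft) * window
        out[k * hop:k * hop + n_fft] += frame
        norm[k * hop:k * hop + n_fft] += window ** 2
        # Deviation of the measured phase advance from omega, wrapped to [-pi, pi]
        dphi = np.angle(spec[i + 1]) - np.angle(spec[i]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase = phase + omega + dphi
    # Undo the window overlap gain; the floor avoids dividing by ~0 at the edges
    return out / np.maximum(norm, 1e-3)

def pitch_shift(x, semitones, n_fft=1024, hop=256):
    """Shift pitch by `semitones` while keeping the number of samples."""
    ratio = 2.0 ** (semitones / 12.0)
    # Step 1: make the signal `ratio` times longer without changing pitch
    stretched = time_stretch(x, 1.0 / ratio, n_fft, hop)
    # Step 2: read it back `ratio` times faster (linear interpolation,
    # no anti-alias filter, which a real implementation would add)
    positions = np.arange(0, len(stretched) - 1, ratio)
    resampled = np.interp(positions, np.arange(len(stretched)), stretched)
    out = np.zeros(len(x))
    n = min(len(x), len(resampled))
    out[:n] = resampled[:n]  # zero-pad or trim to the input length
    return out
```

Calling pitch_shift(x, 12) on a one-second 440 Hz sine returns an array of the same length whose strongest frequency sits near 880 Hz. The sketch has no transient handling or phase locking, so it smears drum hits and consonants more than a polished implementation would.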
A worked shift.
Shift a vocal up by 4 semitones (frequency ratio ≈ 1.26). With FFmpeg's rubberband filter, which requires an FFmpeg build with librubberband enabled: ffmpeg -i in.wav -af "rubberband=pitch=1.26" out.wav. The output is the same length as the input and the vocal sounds noticeably higher, with a slight metallic artifact on plosives. For subtle pitch correction (±0.5 semitones), the artifact is inaudible. For dramatic shifts (±6 semitones or more), it's obvious.
Pitch ratio comes from the semitone count via the twelfth root of 2:
pitch_factor = 2^(semitones/12)
For +4 semitones: 2^(4/12) ≈ 1.2599, or about 1.26× the frequency.
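To avoid computing the ratio by hand, the conversion and the command line can be generated together. This is a small sketch: rubberband_command is a hypothetical helper, and it only builds the argument list. Running it (for example with subprocess.run) assumes an FFmpeg binary built with librubberband is on the PATH.

```python
import shlex

def rubberband_command(src, dst, semitones):
    """Build an ffmpeg argument list that shifts pitch by `semitones`."""
    # rubberband's `pitch` option takes a frequency ratio, not semitones
    pitch = 2.0 ** (semitones / 12.0)
    return ["ffmpeg", "-i", src, "-af", f"rubberband=pitch={pitch:.4f}", dst]

cmd = rubberband_command("in.wav", "out.wav", 4)
print(shlex.join(cmd))  # shell-ready form of the command
```

Passing the arguments as a list rather than a single string sidesteps shell quoting problems when file names contain spaces.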
The artifact ceiling.
Every pitch-shift algorithm has a "no surprises" range and a "noticeable degradation" range. A basic phase vocoder is typically clean within about ±2 semitones; beyond that, transients (drum hits, consonants) start sounding smeared. Specialised algorithms like Élastique Pro (used by professional DAWs) extend the clean range to ±6 or so. For dramatic effects, you accept the artifact as part of the sound: autotune, the Daft Punk vocoder, the anime "chipmunk" voice.
Pitch correction vs pitch shift.
Pitch shift: move the whole signal up or down by a fixed amount. Pitch correction (autotune): detect the pitch of each note and pull it toward the nearest note on a scale. The underlying tech is the same kind of pitch-shifting engine (for example a phase vocoder with spectral manipulation); what differs is the control signal, because autotune adds note detection and a target scale. For "I want my voice deeper", use pitch shift. For "I want my singing in tune", use pitch correction.
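The "control signal" part of pitch correction can be sketched on its own: given a detected frequency, find the nearest allowed note and the shift needed to reach it. The code below assumes the pitch detector already exists and supplies detected_hz. The A4 = 440 Hz reference, the C major scale, and the function name are illustrative assumptions, not how any particular autotune product works.

```python
import math

A4_HZ = 440.0                      # assumed tuning reference (MIDI note 69)
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}   # allowed pitch classes, with C = 0

def correction_for(detected_hz, scale=C_MAJOR):
    """Return (target_hz, shift_in_semitones) pulling a pitch onto the scale."""
    # Continuous MIDI note number of the detected pitch
    midi = 69 + 12 * math.log2(detected_hz / A4_HZ)
    base = math.floor(midi)
    # Candidate notes within about half an octave that belong to the scale
    candidates = [n for n in range(base - 6, base + 8) if n % 12 in scale]
    target = min(candidates, key=lambda n: abs(n - midi))
    target_hz = A4_HZ * 2 ** ((target - 69) / 12)
    return target_hz, target - midi

# A slightly sharp A4 gets pulled down to 440 Hz
target_hz, shift = correction_for(450.0)
print(target_hz, shift)
```

The returned shift is what a pitch shifter would then apply to that note. A plain pitch shift skips this step and uses one fixed value for the whole signal.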