Understanding sentiment analysis
A guess at how a sentence feels.
The two algorithm families behind sentiment scores, what they get right, and the three classes of failure they share.
Two algorithm families.
Lexicon-based: a dictionary maps each word to a polarity score (often normalised to -1 to +1, though the raw range varies by lexicon: AFINN uses -5 to +5, VADER -4 to +4). Sum the scores in the text, normalise by length. Fast, no training data needed, easy to inspect. The libraries: VADER, AFINN, SentiWordNet.
ML-based: a model (logistic regression, BERT, GPT) trained or fine-tuned on labelled text predicts positive/negative/neutral. Slower to run, much more accurate when the input matches the training distribution.
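The lexicon-based recipe (look up, sum, normalise by length) can be sketched in a few lines. This is a minimal illustration, not any library's implementation: `TOY_LEXICON` and `lexicon_score` are hypothetical names, and the four scores are made up for the example.

```python
import re

# Toy lexicon for illustration only; real lexicons hold thousands of entries.
TOY_LEXICON = {"great": 0.8, "good": 0.5, "bad": -0.5, "terrible": -0.8}


def lexicon_score(text: str, lexicon: dict[str, float] = TOY_LEXICON) -> float:
    """Sum per-word polarity and normalise by the number of tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    total = sum(lexicon.get(tok, 0.0) for tok in tokens)  # unknown words count as 0
    return total / len(tokens)


if __name__ == "__main__":
    print(lexicon_score("The food was great"))        # positive
    print(lexicon_score("The service was terrible"))  # negative
```

Every step is inspectable: to see why a text scored the way it did, print the tokens that hit the lexicon.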
VADER is the workhorse.
VADER (Valence Aware Dictionary and sEntiment Reasoner, 2014) is the de facto lightweight sentiment analyser. It handles social-media features that generic lexicons miss: ALL CAPS amplifies a word's score (the reference implementation adds a fixed increment rather than applying a multiplier), repeated "!" amplifies, "very" or "really" boost, a preceding "not" flips and dampens polarity, and emojis carry built-in scores. The output is four numbers: positive, negative, and neutral (proportions of the text that sum to 1), plus a compound score (the summed valence, normalised to the range -1 to +1). Most apps just use the compound.
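A rough sketch of how rules of this kind combine into a compound score. It is a toy re-implementation, not the vaderSentiment API. The constants and the two lexicon entries are assumptions modelled on the reference source as best recalled, and punctuation and emoji handling are left out.

```python
import math
import re

# Assumed constants, modelled on the reference implementation; illustrative only.
BOOST = 0.293       # added for an intensifier such as "very"
CAPS_BOOST = 0.733  # added when the word is written in ALL CAPS
NEGATION = -0.74    # multiplier applied when a negator precedes the word
ALPHA = 15          # normalisation constant for the compound score

TOY_LEXICON = {"great": 3.1, "terrible": -2.1}  # assumed valences on a -4..+4 scale
BOOSTERS = {"very", "really"}
NEGATORS = {"not", "never"}


def word_valence(tokens: list[str], i: int) -> float:
    """Valence of tokens[i] after caps, booster and negation adjustments."""
    word = tokens[i]
    valence = TOY_LEXICON.get(word.lower(), 0.0)
    if valence == 0.0:
        return 0.0
    sign = 1 if valence > 0 else -1
    if word.isupper():
        valence += sign * CAPS_BOOST  # push further from zero
    prev = tokens[i - 1].lower() if i > 0 else ""
    if prev in BOOSTERS:
        valence += sign * BOOST
    if prev in NEGATORS:
        valence *= NEGATION  # flips the sign and shrinks the magnitude
    return valence


def compound(text: str) -> float:
    """Sum adjusted valences, then squash the total into (-1, +1)."""
    tokens = re.findall(r"[A-Za-z']+", text)
    total = sum(word_valence(tokens, i) for i in range(len(tokens)))
    return total / math.sqrt(total * total + ALPHA)


if __name__ == "__main__":
    for sample in ["great", "very great", "GREAT", "not great"]:
        print(sample, round(compound(sample), 3))
```

The squashing step is why the compound never reaches exactly ±1: strongly worded text approaches the limits but cannot hit them.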
A worked score.
"The food was great but the service was terrible." VADER scores: positive 0.31 ("great"), negative 0.28 ("terrible"), neutral 0.41, compound ≈ 0.01 (nearly neutral). The compound captures the mixed reality. Most sentiment tools fail here — they pick a winner ("great" outweighs "terrible" → "positive") when the truth is "this customer has mixed feelings, drill down".
Mixed sentiment: positive and negative tokens in one sentence. pos 0.31 + neg 0.28 → compound ≈ 0.01, which is neutral overall, but interesting. The compound score captures the mix; class labels don't.
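One way to keep "mixed" from collapsing into "neutral" is to check the positive and negative scores before looking at the compound. A minimal sketch, assuming a VADER-style score dict: the `classify` helper and its `mixed_floor` threshold are hypothetical, and the ±0.05 neutral band follows a commonly used convention for compound scores.

```python
def classify(scores: dict[str, float],
             mixed_floor: float = 0.2,
             neutral_band: float = 0.05) -> str:
    """Label a VADER-style score dict, keeping 'mixed' separate from 'neutral'."""
    # Both polarities present in strength: report the conflict, not the sum.
    if scores["pos"] >= mixed_floor and scores["neg"] >= mixed_floor:
        return "mixed"
    if scores["compound"] >= neutral_band:
        return "positive"
    if scores["compound"] <= -neutral_band:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    # The illustrative figures from the worked example above.
    example = {"pos": 0.31, "neg": 0.28, "neu": 0.41, "compound": 0.01}
    print(classify(example))  # mixed
```

A bland sentence with no polar words would score near zero on the compound too, but it lands in "neutral" because neither polarity clears the floor.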
The three failure modes.
Sarcasm: "Oh great, another Monday." A lexicon says positive ("great"); the meaning is negative. Neither lexicons nor small ML models reliably detect it. Negation scope: "I don't think this is bad". The "n't" is attached to "think", several words away from "bad", yet it still reverses the judgement; the sentence is mildly positive. Negation-flipping that only inspects the words right next to "bad" gets this wrong. Domain shift: a model trained on IMDB movie reviews doesn't generalise to medical reviews, financial news, or product feedback without retraining.
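The first two failures are easy to reproduce with a deliberately naive scorer. The sketch below is a toy, not any library's behaviour: `naive_score`, the two-word lexicon, and the adjacent-word negation rule are all assumptions chosen to expose the problem.

```python
import re

TOY_LEXICON = {"great": 0.8, "bad": -0.6}  # made-up scores for illustration
NEGATORS = {"not", "don't", "never"}


def naive_score(text: str) -> float:
    """Sum word polarities, flipping a word only if a negator directly precedes it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, tok in enumerate(tokens):
        polarity = TOY_LEXICON.get(tok, 0.0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        total += polarity
    return total


if __name__ == "__main__":
    # Sarcasm: scored positive, although the meaning is negative.
    print(naive_score("Oh great, another Monday."))
    # Negation scope: scored negative, because "don't" is too far from "bad".
    print(naive_score("I don't think this is bad"))
    # Adjacent negation is the one case the simple rule does handle.
    print(naive_score("This is not bad"))
```

Widening the negation window fixes the second example and breaks others; the rule has no notion of which word the negation actually governs.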
What it's actually good for.
Aggregating sentiment across many short messages — tweet streams, customer reviews, support tickets. Individual scores are noisy; averages are informative ("our NPS-equivalent score from chat dropped 0.2 last week"). For individual decisions ("flag this comment as toxic"), the false-positive rate is high; pair with human review. For trend analysis at scale, sentiment is a fine signal.
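Aggregation is a group-and-average step. A small sketch with the standard library: the `weekly_means` helper, the week labels, and the scores are all hypothetical sample data.

```python
from collections import defaultdict
from statistics import mean


def weekly_means(scored_messages):
    """Average per-message scores by week. Input: iterable of (week_label, score)."""
    buckets = defaultdict(list)
    for week, score in scored_messages:
        buckets[week].append(score)
    return {week: mean(scores) for week, scores in sorted(buckets.items())}


if __name__ == "__main__":
    # Made-up compound scores: individually noisy, informative once averaged.
    sample = [
        ("2024-W01", 0.6), ("2024-W01", -0.1), ("2024-W01", 0.4),
        ("2024-W02", 0.3), ("2024-W02", -0.4), ("2024-W02", 0.2),
    ]
    means = weekly_means(sample)
    drop = means["2024-W01"] - means["2024-W02"]
    print(round(drop, 2))  # 0.27
```

With only three messages per week the averages are still noisy; the trend becomes trustworthy as the per-bucket count grows.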
LLM-based sentiment is the new bar.
Modern small LLMs (Llama, Mistral, Phi) handle sarcasm, negation scope, and domain shift much better than VADER, at roughly 100-1000× the compute cost. For batch processing of millions of messages, VADER is the right choice. For per-message high-stakes decisions (legal compliance, brand safety), an LLM call with a structured prompt is now affordable and more accurate. The cost-accuracy curve has moved; pick the point on it that matches your use case.
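A "structured prompt" here means asking for a fixed output shape and rejecting anything else. A hedged sketch of the two ends of that contract: `build_prompt`, `parse_reply`, the label set, and the JSON shape are hypothetical choices, and the model call itself is left out because it depends on the provider.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral", "mixed"}


def build_prompt(message: str) -> str:
    """Build a prompt that pins the reply to a small JSON object."""
    return (
        "Classify the sentiment of the message below. "
        "Account for sarcasm and negation.\n"
        'Reply with JSON only: {"label": "positive|negative|neutral|mixed", '
        '"confidence": <number between 0 and 1>}\n'
        f"Message: {json.dumps(message)}"  # JSON-quote the message to delimit it
    )


def parse_reply(reply: str) -> dict:
    """Validate the model's reply; raise ValueError if it breaks the contract."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError("reply is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    label = data.get("label")
    confidence = data.get("confidence")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    return {"label": label, "confidence": float(confidence)}


if __name__ == "__main__":
    print(build_prompt("Oh great, another Monday."))
    # A reply the model might plausibly send back, hard-coded for the example.
    print(parse_reply('{"label": "negative", "confidence": 0.9}'))
```

For high-stakes use, a reply that fails validation is a natural trigger for a retry or a hand-off to human review.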