Understanding plagiarism detection
N-gram fingerprints, side by side.
How a comparison-based plagiarism check works, the shingling algorithm behind it, and what it can and can't catch.
Shingling — the core idea.
Split each text into overlapping n-grams of (say) 5 consecutive words. "The quick brown fox jumps over the lazy dog" with n=5 gives "the quick brown fox jumps", "quick brown fox jumps over", "brown fox jumps over the", and so on, five shingles in total from nine words. Each n-gram is a fingerprint of a 5-word span. Two texts share a fingerprint whenever they share a 5-word span. The proportion of n-grams the two texts have in common is the similarity score.
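A minimal sketch of the shingling step in Python. The function name `shingles` and the normalisation choices (lowercasing, dropping punctuation) are assumptions made for illustration; a real tool may tokenise differently.

```python
import re


def shingles(text: str, n: int = 5) -> set:
    """Return the set of overlapping n-word shingles in `text`."""
    # Assumption: lowercase and keep only letters, digits and apostrophes,
    # so that "Fox," and "fox" produce the same token.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # A text shorter than n words yields an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


example = shingles("The quick brown fox jumps over the lazy dog")
print(len(example))  # 5 shingles from 9 words
```

Storing shingles in a set discards repeats, which is what the set-based similarity score expects.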
Jaccard similarity.
The standard score: |A ∩ B| / |A ∪ B|, the number of shared n-grams divided by the size of the union. The score ranges from 0 (no shared n-grams) to 1 (identical shingle sets). For two normal essays on the same topic, expect 0.05 to 0.15 (some shared phrases for the subject matter). For copied text, expect 0.4 and up. The threshold for "this is plagiarism" depends on context: a textbook quotation rightly scores high, and that is a citation, not plagiarism.
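The score can be sketched in a few lines of Python. The function name `jaccard` and the choice to return 0.0 for two empty sets are assumptions of this example, not part of the definition.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two shingle sets."""
    union = a | b
    if not union:
        # Assumption: two empty texts are scored 0.0 rather than raising.
        return 0.0
    return len(a & b) / len(union)


a = {"the quick brown fox jumps", "quick brown fox jumps over", "brown fox jumps over the"}
b = {"quick brown fox jumps over", "brown fox jumps over the", "fox jumps over the lazy"}
print(jaccard(a, b))  # 2 shared / 4 in the union = 0.5
```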
A worked comparison.
Take two 500-word texts. A 500-word text yields at most 500 − 5 + 1 = 496 distinct 5-grams. Text A's shingle set has 496 unique 5-grams. Text B's has 489, so a few of its 5-grams repeat. The intersection: 287 shingles. The union has 496 + 489 − 287 = 698 shingles, so Jaccard = 287 / 698 ≈ 0.41. That's high enough to be suspicious. To drill in, highlight the matching n-grams in both texts: if they cluster in long contiguous spans, someone copy-pasted; if they're scattered across stock phrases, the overlap is more likely coincidental shared idiom.
Jaccard on 5-gram shingles
|A ∩ B| / |A ∪ B|
Set intersection over set union.
287 / (496 + 489 − 287) ≈ 0.41
= Suspicious — drill in
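The drill-in step, checking whether matches cluster into long contiguous spans, might look like the sketch below. The helper name `matching_runs` and its return format (start and exclusive end word indices in text A) are hypothetical, chosen for illustration.

```python
def matching_runs(words_a, shingles_b, n=5):
    """Return (start, end) word spans of A covered by consecutive matching shingles.

    `end` is exclusive, so words_a[start:end] is the matched passage.
    """
    runs = []
    start = None
    last = max(len(words_a) - n + 1, 0)  # number of shingles in A
    for i in range(last):
        hit = tuple(words_a[i:i + n]) in shingles_b
        if hit and start is None:
            start = i                         # a new run begins here
        elif not hit and start is not None:
            runs.append((start, i - 1 + n))   # the run ended at shingle i - 1
            start = None
    if start is not None:
        runs.append((start, last - 1 + n))    # run reaches the end of A
    return runs


words_a = "one two three four five six seven eight nine ten".split()
words_b = "x one two three four five six seven y z".split()
shingles_b = {tuple(words_b[i:i + 5]) for i in range(len(words_b) - 5 + 1)}
runs = matching_runs(words_a, shingles_b)
print(runs)  # [(0, 7)]: the first seven words of A appear verbatim in B
```

A few long runs point towards copy-paste; many short, scattered runs are more consistent with shared stock phrases.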
The three failure modes.
Paraphrase: copied content with synonyms swapped breaks shingle matching but is still plagiarism. Modern tools add semantic similarity (embedding distance) on top of shingling to catch this.
Translation: running plagiarised content through a machine translator such as Google Translate breaks shingle matching entirely; catching it needs cross-language embedding models.
Citation: a quoted passage scores high but isn't plagiarism. Tools that flag it without context produce false-positive noise.
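To see how quickly paraphrase defeats shingle matching, here is a small illustration in Python. The two sentences are invented for this example, and the simple whitespace tokeniser is an assumption.

```python
def shingles(text, n=5):
    """Set of overlapping n-word shingles, using a plain whitespace split."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


original = "the committee decided to postpone the vote until the next meeting"
paraphrase = "the committee chose to postpone the vote until the following meeting"

a, b = shingles(original), shingles(paraphrase)
score = len(a & b) / len(a | b)
print(len(a & b), round(score, 2))  # 2 shared shingles, score 0.17
```

Swapping two words out of eleven changes five of the seven shingles, because every swapped word sits inside up to five overlapping 5-grams.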
The AI-detection problem.
The challenge since 2023: LLM-generated text doesn't shingle-match any source. It's "original" in the literal sense, because the words weren't copied from anywhere. AI-detection tools (GPTZero, Turnitin's AI detector) look for stylistic regularities, such as perplexity scores, sentence-length distributions, and vocabulary choices that humans don't typically use, and report a probability that the text is AI-generated. Accuracy varies; false-positive rates on second-language writers are particularly bad. Treat AI-detection scores with much more scepticism than plagiarism scores.
What this tool is for.
A two-text comparison plagiarism check is right for "did this student copy from this source?", a specific question with a defined answer. It's not right for "is this text original to the entire internet?"; that needs a web-scale comparison against a corpus (Turnitin, Copyscape, Grammarly). For local comparisons (assignment vs source, draft vs final, manuscript vs retraction database), the two-text shingle approach is exactly what you need.