Is my text checked against the internet?

No. This tool performs a local comparison between the two specific text blocks you provide rather than scanning a global database.

What is n-gram fingerprinting?

It is a method that breaks text into small, overlapping sequences of words to find exact matches. This allows the tool to detect shared phrasing even if the surrounding context has been modified.

Does my sensitive data leave my computer?

No. All text analysis and comparison logic runs entirely within your browser, so your documents are never transmitted or stored.

What should I do if a high match is found?

Review the highlighted sections. Either cite the overlapping passages properly or rewrite them so the content is your own.

Free Plagiarism Checker (local)

Understanding plagiarism detection

N-gram fingerprints, side by side.

How a comparison-based plagiarism check works, the shingling algorithm behind it, and what it can and can't catch.

Shingling — the core idea.

Split each text into overlapping n-grams of (say) 5 consecutive words. "The quick brown fox jumps over the lazy dog" with n=5 gives "the quick brown fox jumps", "quick brown fox jumps over", "brown fox jumps over the", and so on. Each n-gram is a fingerprint of a 5-word span. Two texts share fingerprints if they share any 5-word spans. The proportion of shared n-grams is a similarity score.
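The splitting step can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name `shingles` and the normalisation (lowercasing, keeping only runs of letters, digits and apostrophes) are assumptions made for the example.

```python
import re

def shingles(text: str, n: int = 5) -> set[str]:
    """Return the set of overlapping n-word shingles found in text."""
    # Assumed normalisation: lowercase, then keep runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # A text shorter than n words yields an empty set.
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

example = shingles("The quick brown fox jumps over the lazy dog")
for s in sorted(example):
    print(s)
```

A 9-word sentence yields 9 − 5 + 1 = 5 shingles. Storing them in a set means a repeated phrase counts only once, which is what the similarity score below expects.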

Jaccard similarity.

The standard score is |A ∩ B| / |A ∪ B|: the number of shared n-grams divided by the size of the union. It ranges from 0 (no overlap) to 1 (identical shingle sets). For two normal essays on the same topic, expect roughly 0.05 to 0.15, because the subject matter forces some shared phrases. For copied text, expect 0.4 and up. The threshold for "this is plagiarism" depends on context: a properly cited textbook quotation rightly scores high, and that is citation, not plagiarism.
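As a rough sketch, the score is a one-liner over two sets. The name `jaccard` and the choice to return 0.0 when both sets are empty are assumptions for this example; other conventions for the empty case exist.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets share nothing
    return len(a & b) / len(union)

# Toy sets standing in for shingle sets: 2 shared items, 4 items in total.
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
```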

A worked comparison.

Two 500-word texts. Text A's shingle set has 496 unique 5-grams. Text B's has 489. The intersection: 287 shingles. Jaccard = 287 / (496 + 489 − 287) = 287 / 698 ≈ 0.41. That's high enough to be suspicious. Drill in: highlight the matching n-grams in both texts. If they cluster in long contiguous spans, someone copy-pasted; if they're scattered across stock phrases, it's coincidental shared idiom.
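The arithmetic and the drill-in step can both be sketched in Python. The helper names `tokenize` and `matching_runs` are hypothetical, and merging overlapping matches into word-index ranges is one possible way to show clustering, not necessarily how this tool highlights text.

```python
import re

# The worked numbers: union size by inclusion-exclusion, then the ratio.
size_a, size_b, shared = 496, 489, 287
union = size_a + size_b - shared
score = shared / union
print(union, round(score, 2))

def tokenize(text: str) -> list[str]:
    # Assumed normalisation: lowercase words, punctuation dropped.
    return re.findall(r"[a-z0-9']+", text.lower())

def matching_runs(text_a: str, text_b: str, n: int = 5) -> list[tuple[int, int]]:
    """Word-index ranges (start, end-exclusive) of text_a covered by shingles shared with text_b."""
    a, b = tokenize(text_a), tokenize(text_b)
    b_shingles = {tuple(b[i:i + n]) for i in range(len(b) - n + 1)}
    runs: list[tuple[int, int]] = []
    for i in range(len(a) - n + 1):
        if tuple(a[i:i + n]) in b_shingles:
            if runs and i <= runs[-1][1]:
                # Overlaps or touches the previous run: extend it.
                runs[-1] = (runs[-1][0], max(runs[-1][1], i + n))
            else:
                runs.append((i, i + n))
    return runs

text_a = "We begin with the quick brown fox jumps over the lazy dog and then stop"
text_b = "As noted, the quick brown fox jumps over the lazy dog is a pangram"
runs = matching_runs(text_a, text_b)
print(runs)
```

One long run points to a copied passage; many short runs of exactly n words scattered through the text point to shared stock phrases.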

Jaccard on 5-gram shingles: |A ∩ B| / |A ∪ B| (set intersection over set union).

287 / (496 + 489 − 287) ≈ 0.41: suspicious, drill in.

The three failure modes.

Paraphrase: copied content with synonyms swapped in breaks shingle matching but is still plagiarism. Modern tools add semantic similarity (embedding distance) on top of shingling to catch this.

Translation: running plagiarised content through Google Translate breaks shingle matching entirely. Catching it requires cross-language embedding models.

Citation: a quoted passage scores high but isn't plagiarism. Tools that flag it without context produce false-positive noise.
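The paraphrase failure is easy to reproduce. In this sketch the sentences are invented for illustration and the helper names are hypothetical: swapping just two words in a 13-word sentence changes every 5-gram that contains either word, so only two of nine shingles survive.

```python
import re

def shingles(text: str, n: int = 5) -> set[str]:
    # Assumed normalisation: lowercase words, punctuation dropped.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

original = ("The committee decided to postpone the annual budget review "
            "until the following quarter")
# Same sentence with two synonyms swapped in: postpone -> delay, annual -> yearly.
paraphrase = ("The committee decided to delay the yearly budget review "
              "until the following quarter")

a, b = shingles(original), shingles(paraphrase)
shared = a & b
score = jaccard(a, b)
print(len(a), len(b), len(shared), score)
```

The score falls from 1.0 for a verbatim copy to 2 / 16 = 0.125, inside the range expected for unrelated essays on the same topic, even though a human reader would spot the copy immediately.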

The AI-detection problem.

The challenge since 2023: LLM-generated text doesn't shingle-match any source. It's "original" in the literal sense, because the words weren't copied from anywhere. AI-detection tools (GPTZero, Turnitin's AI detector) look for stylistic regularities such as perplexity scores, sentence-length distributions, and vocabulary choices that humans don't typically make, and report a probability that the text is AI-generated. Accuracy varies; false-positive rates on second-language writers are particularly bad. Treat AI-detection scores with much more scepticism than plagiarism scores.

What this tool is for.

A two-text comparison is right for "did this student copy from this source?", a specific question with a defined answer. It's not right for "is this text original to the entire internet?", which needs a web-scale comparison against a corpus (Turnitin, Copyscape, Grammarly). For local comparisons (assignment vs source, draft vs final, manuscript vs retraction database), the two-text shingle approach is exactly what you need.
