duplicate-detection

star 2

Use when user needs to find duplicate or similar content. Triggers on: duplicate, deduplication, plagiarism detection, similar content, near-duplicate, similarity detection, content dedup, find copies.

zilliztech By zilliztech schedule Updated 2/4/2026

name: duplicate-detection description: "Use when user needs to find duplicate or similar content. Triggers on: duplicate, deduplication, plagiarism detection, similar content, near-duplicate, similarity detection, content dedup, find copies."

Duplicate Detection

Batch detection of duplicate or highly similar content using vector similarity.

When to Activate

Activate this skill when:

  • User needs to find duplicates in a dataset
  • User mentions "deduplication", "plagiarism", "similar content"
  • User wants to clean data by removing near-duplicates
  • User needs to merge similar items (FAQ, tickets, products)

Do NOT activate when:

  • User needs general semantic search → use semantic-search
  • User needs to group by topic → use clustering
  • User needs recommendation → use rec-system

Interactive Flow

Step 1: Understand Duplicate Type

"What type of duplicates are you looking for?"

A) Exact duplicates (100% identical)

  • Copy-paste content
  • File deduplication
  • Use hash-based detection (fast)

B) Near-duplicates (semantically similar)

  • Paraphrased content
  • Rewritten articles
  • Use vector similarity (accurate)

C) Both

  • Hash for exact, vector for near
  • Most comprehensive

Which do you need? (A/B/C)

Step 2: Determine Threshold

"How similar should content be to be considered duplicate?"

Threshold Interpretation Use Case
0.95+ Near identical Strict deduplication
0.85-0.95 Very similar Plagiarism detection
0.75-0.85 Related FAQ merging
0.65-0.75 Loosely related Topic clustering

What threshold fits your needs?

Step 3: Confirm Configuration

"Based on your requirements:

  • Method: Hash + Vector (comprehensive)
  • Similarity threshold: 0.90
  • Embedding model: BGE-large

Proceed? (yes / adjust [what])"

Core Concepts

Mental Model: Finding Twins

Think of duplicate detection as finding twins in a crowd:

  • Identical twins (exact duplicates) → Same fingerprint (hash)
  • Fraternal twins (near-duplicates) → Similar appearance (vectors)
┌─────────────────────────────────────────────────────────┐
│                  Duplicate Detection                     │
│                                                          │
│  New Content: "Machine learning requires lots of data"   │
│                         │                                │
│          ┌──────────────┴──────────────┐                │
│          │                             │                │
│          ▼                             ▼                │
│  ┌───────────────┐           ┌───────────────┐         │
│  │   Hash Check  │           │ Vector Search │         │
│  │   (Exact)     │           │  (Semantic)   │         │
│  └───────┬───────┘           └───────┬───────┘         │
│          │                           │                  │
│          ▼                           ▼                  │
│  "Content hash matches      "Similar to: 'ML needs     │
│   doc_123" → EXACT DUP       big datasets' (0.92)"     │
│                              → NEAR DUPLICATE           │
│                                                          │
│  Result: {"is_duplicate": true, "type": "near",         │
│           "similarity": 0.92, "match": "doc_123"}       │
└─────────────────────────────────────────────────────────┘

Hash vs Vector Detection

Method Speed Finds Misses
Hash ★★★★★ Exact copies Paraphrases
Vector ★★★ Semantic similarity Different meanings same words
Both ★★★ Comprehensive Best coverage

Implementation

from pymilvus import MilvusClient, DataType
from sentence_transformers import SentenceTransformer
import hashlib

class DuplicateDetector:
    def __init__(self, uri: str = "./milvus.db", threshold: float = 0.9):
        self.client = MilvusClient(uri=uri)
        self.model = SentenceTransformer('BAAI/bge-large-en-v1.5')
        self.threshold = threshold
        self.collection_name = "duplicate_detection"
        self._init_collection()

    def _init_collection(self):
        if self.client.has_collection(self.collection_name):
            return

        schema = self.client.create_schema()
        schema.add_field("id", DataType.VARCHAR, is_primary=True, max_length=64)
        schema.add_field("content_hash", DataType.VARCHAR, max_length=64)
        schema.add_field("content", DataType.VARCHAR, max_length=65535)
        schema.add_field("source", DataType.VARCHAR, max_length=512)
        schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=1024)

        index_params = self.client.prepare_index_params()
        index_params.add_index(field_name="embedding", index_type="AUTOINDEX", metric_type="COSINE")
        index_params.add_index(field_name="content_hash", index_type="TRIE")

        self.client.create_collection(
            collection_name=self.collection_name,
            schema=schema,
            index_params=index_params
        )

    def _hash_content(self, content: str) -> str:
        """Calculate content hash (normalized)"""
        normalized = ''.join(content.lower().split())
        return hashlib.md5(normalized.encode()).hexdigest()

    def check_duplicate(self, content: str, source: str = "") -> dict:
        """Check if content is duplicate"""
        content_hash = self._hash_content(content)

        # 1. Exact match (hash)
        exact_match = self.client.query(
            collection_name=self.collection_name,
            filter=f'content_hash == "{content_hash}"',
            output_fields=["id", "source"],
            limit=1
        )
        if exact_match:
            return {
                "is_duplicate": True,
                "type": "exact",
                "match_id": exact_match[0]["id"],
                "match_source": exact_match[0]["source"],
                "similarity": 1.0
            }

        # 2. Semantic similarity (vector)
        embedding = self.model.encode(content).tolist()
        results = self.client.search(
            collection_name=self.collection_name,
            data=[embedding],
            limit=1,
            output_fields=["id", "content", "source"]
        )

        if results[0] and results[0][0]["distance"] >= self.threshold:
            return {
                "is_duplicate": True,
                "type": "near",
                "match_id": results[0][0]["entity"]["id"],
                "match_source": results[0][0]["entity"]["source"],
                "similarity": results[0][0]["distance"],
                "match_preview": results[0][0]["entity"]["content"][:200] + "..."
            }

        return {
            "is_duplicate": False,
            "similarity": results[0][0]["distance"] if results[0] else 0
        }

    def add_content(self, content_id: str, content: str, source: str = ""):
        """Add content to library"""
        content_hash = self._hash_content(content)
        embedding = self.model.encode(content).tolist()

        self.client.insert(
            collection_name=self.collection_name,
            data=[{
                "id": content_id,
                "content_hash": content_hash,
                "content": content,
                "source": source,
                "embedding": embedding
            }]
        )

    def batch_dedup(self, items: list) -> dict:
        """Batch check and deduplicate
        items: [{"id": "...", "content": "...", "source": "..."}]
        Returns: {"unique": [...], "duplicates": [...]}
        """
        unique = []
        duplicates = []

        for item in items:
            result = self.check_duplicate(item["content"], item.get("source", ""))
            result["id"] = item["id"]

            if result["is_duplicate"]:
                duplicates.append(result)
            else:
                unique.append(item)
                self.add_content(item["id"], item["content"], item.get("source", ""))

        return {
            "unique": unique,
            "duplicates": duplicates,
            "unique_count": len(unique),
            "duplicate_count": len(duplicates),
            "duplicate_ratio": len(duplicates) / len(items) if items else 0
        }

# Usage
detector = DuplicateDetector(threshold=0.85)

# Check single content
result = detector.check_duplicate("Machine learning requires a lot of data")
if result["is_duplicate"]:
    print(f"Duplicate found! Similarity: {result['similarity']:.2f}")
else:
    detector.add_content("doc001", "Machine learning requires a lot of data", "blog.md")

# Batch deduplication
results = detector.batch_dedup([
    {"id": "1", "content": "Python is a programming language", "source": "a.txt"},
    {"id": "2", "content": "Python is a coding language", "source": "b.txt"},  # Near-duplicate
    {"id": "3", "content": "The weather is nice today", "source": "c.txt"},
])
print(f"Unique: {results['unique_count']}, Duplicates: {results['duplicate_count']}")

Threshold Selection Guide

Use Case Threshold Rationale
Exact dedup (data cleaning) 0.95+ Only near-identical
Plagiarism detection 0.85-0.90 Allow rewording
FAQ merging 0.80-0.85 Same question, different words
Similar content grouping 0.70-0.80 Topically related

Common Pitfalls

❌ Pitfall 1: Threshold Too Low

Problem: Everything is marked as duplicate

Why: Low threshold catches even loosely related content

Fix: Start with 0.90 and adjust down if needed

❌ Pitfall 2: Hash Only Detection

Problem: Paraphrased duplicates not detected

Why: Hash only catches exact matches

Fix: Always use vector similarity for semantic duplicates

❌ Pitfall 3: Not Normalizing Before Hash

Problem: Same content with different whitespace not detected

Why: Hash is sensitive to exact characters

Fix: Normalize content before hashing

normalized = ''.join(content.lower().split())

❌ Pitfall 4: Processing Order Matters

Problem: Different "original" detected depending on order

Why: First item becomes the reference

Fix: Process by timestamp (oldest first) or have clear policy

When to Level Up

Need Upgrade To
Group similar content clustering
Plagiarism with source matching verticals/plagiarism.md
Large-scale batch processing Add core:ray

References

  • Plagiarism detection: verticals/plagiarism.md
  • Content deduplication: verticals/content-dedup.md
  • FAQ merging: verticals/faq-merge.md
Install via CLI
npx skills add https://github.com/zilliztech/milvus-marketplace --skill duplicate-detection
Repository Details
star Stars 2
call_split Forks 2
navigation Branch main
article Path SKILL.md
More from Creator