embedding-models-and-domain-adaptation - SKILL.md Agent Skill

name: embedding-models-and-domain-adaptation description: > Choose, train, and adapt text embedding models for production. Covers SBERT vs cross-encoder tradeoffs, contrastive learning, domain adaptation via continued pretraining, few-shot classification with SetFit, and hard-negative mining. Use when building semantic search, RAG retrieval, classification with limited labels, when generic embeddings underperform on your domain, or when picking between cross-encoder and bi-encoder architectures. Triggers: "choose an embedding model", "domain-adapt our embeddings", "few-shot classification", "SetFit for X", "cross-encoder vs bi-encoder", "embedding fine-tuning", "hard negatives". Produces concrete embedding strategy.

Embedding Models and Domain Adaptation

You design and adapt embedding models for production retrieval, classification, and similarity tasks. Off-the-shelf embeddings work for general tasks but underperform on specialized domains. This skill covers when and how to adapt.

When to Use This Skill

Building semantic search or RAG and choosing the embedding model
Off-the-shelf embeddings underperform on your domain (legal, medical, internal docs)
Few-shot classification with limited labeled data
Choosing between cross-encoder and bi-encoder architectures
Setting up contrastive training for custom embeddings

Bi-encoder vs Cross-encoder: Architectural Choice

The fundamental tradeoff in retrieval / similarity systems.

Architecture	How it works	Pros	Cons
Bi-encoder (SBERT)	Two identical BERT models; produces independent embeddings; compare via dot product	Fast at scale; embeddings can be precomputed; supports vector search	Less accurate than cross-encoder for complex similarity
Cross-encoder	Pair of sentences fed jointly to one Transformer; outputs single similarity score	Highest accuracy; captures interactions between sentences	Slow; can't precompute; doesn't scale to large corpora

When to Use Which

Need to search/retrieve over a corpus of N documents?
├── N < 1000 and latency is not critical → Cross-encoder works
├── N > 1000, need fast retrieval → Bi-encoder (SBERT)
└── Want best of both → Bi-encoder for first-pass retrieval + Cross-encoder for reranking top-k

Production default: Hybrid. Bi-encoder retrieves top-50; cross-encoder reranks to top-5. This pattern dominates in production search.

Sentence-Transformers (SBERT): The Production Standard

SBERT solves the cross-encoder cost problem by using mean pooling over the final BERT layer. This produces fixed-size embeddings that scale.

Mean pooling averages the token embeddings to get a single vector
Siamese network structure trains both encoders with shared weights
Contrastive learning is the typical training objective

The sentence-transformers library is the production-grade implementation.

Contrastive Learning: How Embeddings Get Trained

Train the model on pairs of similar and dissimilar examples. The model learns to map similar pairs close together in embedding space, dissimilar pairs far apart.

Components

Component	What it is	Why it matters
Positive pair	Two semantically similar texts	Anchors what "similar" means in your domain
Negative pair	Two semantically different texts	Anchors what "different" means
Hard negatives	Negatives that look similar but are semantically different	Forces the model to learn fine-grained distinctions
Easy negatives	Negatives that are obviously different	Cheap to mine but limit ceiling

Hard Negatives Are the Lever

Easy negatives produce mediocre models. Hard negatives produce strong models.

Hard negative mining strategies:

BM25 top-K (textually similar but wrong answer)
Same topic, different intent
Adversarial generation via LLM
Existing retrieval system's top wrong results

Rule: If your retrieval is mediocre, the issue is usually negatives, not architecture.

Domain Adaptation: When Generic Embeddings Fail

Generic embedding models (OpenAI ada, Cohere, Voyage) work for general text but underperform on domain-specific corpora (legal, medical, internal documentation, non-English).

Two-Step Adaptation

Step 1: Continued Pretraining (MLM)
  - Take pretrained BERT
  - Continue training on YOUR domain text using masked language modeling
  - Updates subword representations to your vocabulary
  - Output: domain-aware base model

Step 2: Fine-tuning with Contrastive Objective
  - Use Step 1 model as starting point
  - Train on (positive, negative) pairs from your domain
  - Output: domain-adapted embedding model

Lighter-Weight Alternatives

Method	Best when	Limitation
TSDAE (Transformer-based Sequential Denoising Auto-Encoder)	Unlabeled domain text only; no pairs available	Quality below contrastive but better than generic
SimCSE	Similar to TSDAE but uses different augmentation	Comparable quality
Synthetic pair generation	Use LLM to generate positive/negative pairs from your corpus	Quality depends on LLM and prompts

SetFit: Few-Shot Classification

When you have limited labeled data (e.g., 32 examples per class) but need a classifier, SetFit is the right tool.

How SetFit Works

Take a pretrained sentence-transformer
Generate positive/negative pairs from your few labeled examples (e.g., same-class = positive)
Fine-tune the embedding model with contrastive learning on those pairs
Train a simple classifier head (logistic regression) on the resulting embeddings
Result: high-accuracy classifier from <50 labels

When to use SetFit: Few-shot classification (32-100 examples per class). Outperforms full fine-tuning of a classifier in this regime.

When NOT to use SetFit: You have plenty of labels (>1000 per class). Full fine-tuning of a smaller model wins on cost and speed.

Principles

Hybrid retrieval is the default. Bi-encoder for scale + cross-encoder for precision.
Hard negatives matter more than architecture choice. Spending time on negative mining beats picking a fancier model.
Domain adaptation is high-leverage. Generic embeddings underperform on specialized text. Continued pretraining + fine-tuning closes the gap.
Contrastive learning is the workhorse. Most modern embedding training uses some form of it.
Few-shot wins with the right framework. SetFit can match thousand-label classifiers using <100 labels.
Mean pooling > pooling alternatives in most cases. Stick with sentence-transformers conventions.

Anti-Patterns to Avoid

Cross-encoder for Large-Scale Search

Looks like: Using cross-encoder to score every (query, doc) pair in a 10K corpus.

Why it fails: O(N) at query time. Latency unbearable.

The fix: Bi-encoder for first-pass retrieval, cross-encoder for reranking top-K.

Easy Negatives Only

Looks like: Sampling negatives randomly from the corpus.

Why it fails: Random negatives are usually obvious; model doesn't learn fine distinctions.

The fix: Mine hard negatives via BM25, existing retrieval mistakes, or adversarial generation.

Skipping Domain Adaptation on Specialized Corpora

Looks like: Using OpenAI ada embeddings for medical document retrieval.

Why it fails: Generic model misses domain terminology; retrieval recall is poor.

The fix: Continued pretraining + contrastive fine-tuning on your corpus.

Full Fine-tuning When Few-Shot Works

Looks like: Labeling 5,000 examples to fine-tune a classifier when SetFit could work with 50.

Why it fails: Wasted labeling cost; longer iteration cycle.

The fix: Try SetFit first with <100 labels per class. Scale up only if quality is insufficient.

Decision Rules

Condition	Action
Need fast retrieval over large corpus	Bi-encoder (SBERT)
Need highest similarity precision, scale is small	Cross-encoder
Want both	Bi-encoder retrieval + cross-encoder reranking
Generic embeddings underperform on your domain	Continued pretraining + contrastive fine-tuning
Have unlabeled domain text only	TSDAE or SimCSE
Need classifier with <100 labels per class	SetFit
Have abundant labels	Full fine-tuning of smaller classifier
Retrieval quality is mediocre	Improve negatives before changing architecture

Worked Example: Internal Documentation Search

Goal: Build search over 50,000 internal engineering docs (RFCs, postmortems, runbooks).

Decision	Choice	Rationale
Architecture	Hybrid: BM25 + bi-encoder for retrieval, cross-encoder for reranking	Internal jargon needs BM25; semantic match needs embeddings; reranker improves precision@5
Embedding model	Off-the-shelf SBERT, then domain-adapt via continued pretraining + contrastive fine-tuning	Generic underperforms; domain has specific terminology
Continued pretraining	All 50K docs via MLM, 1 epoch	Updates subword reps to engineering vocabulary
Contrastive training pairs	Mined from search logs (clicked = positive, top-K not-clicked = hard negative)	Free signal from real users
Cross-encoder for reranking	ms-marco-MiniLM-L-12-v2 base, fine-tune on same pair distribution	Reranker captures fine-grained intent matching

Lesson: No single model is optimal. The architecture is a hybrid; domain adaptation is a multi-step process; negative mining matters more than model choice.

Gotchas

Embedding dimensionality vs latency. Higher dim (1536+) = better quality, more storage cost, slower search. Test dim reduction (e.g., 768→256 with PCA) for cost-quality tradeoff.
Batch size in contrastive training. Larger batches → better hard negatives within batch → better model. GPU memory is the constraint.
MLM masking rate. 15% is the BERT default; some domains benefit from higher rates.
Reranker latency. Cross-encoders add 100-500ms. Budget for it.
Cosine vs dot product similarity. Convention matters; mismatched normalization breaks downstream tools.

Source: Hands-On Large Language Models by Jay Alammar and Maarten Grootendorst, embedding chapters.