opensearch-function-scoring-algorithms - SKILL.md Agent Skill

name: opensearch-function-scoring-algorithms
description: Search relevance and ranking on OpenSearch/Elasticsearch for a two-sided marketplace: candidate retrieval (hybrid BM25 + kNN, RRF, two-tower EBR), base relevance (BM25F, multi_match, LambdaMART), quality signals (Wilson lower bound, Bayesian average, rank_feature saturation/sigmoid), personalization (listing/user/session embeddings), spatial/temporal decay (gauss/exp), marketplace balance (conversion-weighted ranking, supply fairness, Pareto multi-objective), bias correction (IPS, click models, Thompson sampling), empirical evaluation (judgment sets, NDCG, ablation, A/B sizing, CUPED, regression suites), and diversity (MMR, DPP, max-per-host). Triggers on function_score, rank_feature, script_score, kNN, hybrid query, learning-to-rank, two-sided ranking, exposure fairness, NDCG, A/B testing, judgment set construction, ranking ablation, or "why is my OpenSearch ranking bad". Applies to Elasticsearch too: the function_score, rank_feature, and script_score DSL is shared, though kNN and hybrid-query syntax differs between the two engines.

Marketplace-Research OpenSearch Function Scoring Best Practices

A reference distillation of research-backed algorithms for ranking in two-sided marketplaces (Airbnb, Uber Eats, DoorDash, Etsy, eBay, Booking.com) implemented on OpenSearch or Elasticsearch. It contains 56 rules across 9 categories, prioritised by cascade effect in the search ranking pipeline. Each rule explains the WHY (the cascade or the bias it corrects), shows incorrect-vs-correct code (OpenSearch JSON queries, Painless scripts, Python pre-processing, evaluation methodology), and links to the canonical source: KDD/SIGIR/WSDM papers, the OpenSearch documentation, and the engineering blogs of the marketplaces that proved these patterns at scale.

When to Apply

Reach for this skill when:

Designing a new marketplace search system on OpenSearch or Elasticsearch from scratch
Tuning function_score / rank_feature / script_score queries that aren't moving the needle
Setting up hybrid retrieval (BM25 + dense vectors) with Reciprocal Rank Fusion
Choosing between HNSW and IVF for billion-scale ANN indexes
Adding personalization via listing/user embeddings or two-tower architectures
Correcting position bias in click logs before retraining an LTR model
Designing exposure-fairness or new-listing cold-start exposure allocation
Composing decay functions (gauss / exp / linear) over geo + date + freshness
Diversifying the top window with MMR, DPP, or per-host caps
Debugging "why does my top-10 show 8 listings from one host?" or "why does ranking favor popular incumbents?"
Building offline evaluation infrastructure — graded judgment sets, NDCG@k pipelines, ablation studies, regression query suites
Designing A/B tests for ranking changes — MDE / power / sample-size pre-computation, CUPED variance reduction, online-offline correlation calibration
Attributing lift to specific scoring components — "did my new bias-correction help, or was it the embeddings, or both?"

The rules apply to any OpenSearch/Elasticsearch-backed marketplace search regardless of vertical — accommodation, food delivery, restaurants, services, jobs, secondhand goods, real estate. Triggers include "marketplace ranking", "search relevance", "function_score", "rank_feature", "script_score", "kNN", "hybrid search", "RRF", "learning to rank", "embedding-based retrieval", "two-tower", "position bias", "MMR", "supply fairness", "Pareto multi-objective", "NDCG", "judgment set", "ablation study", "CUPED", "A/B sample size", "ranking eval", and "why are my search results bad".

The Search Ranking Lifecycle

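Categories are derived from the marketplace search ranking pipeline. Earlier stages cascade: a miss in recall (stage 1) cannot be repaired by any downstream boost, and a wrong base relevance score multiplies through every function score. Bracketed numbers mark pipeline position, which differs from the priority order in the next section (there, bias correction and evaluation rank ahead of diversity):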

Query → [1] Recall → [2] Base Relevance → [3] Quality Signals → [4] Personalization
      → [5] Geo/Time Decay → [6] Marketplace Balance → [7] Diversity Re-rank → Results
                                                            ↑
                                          [8] Bias Correction (applied across all stages
                                                       and into training)
                                                            ↑
                                          [9] Evaluation & Measurement (the meta-layer:
                                                       judgment sets, NDCG, ablation, A/B
                                                       sizing, CUPED — without these you
                                                       can't tell if any rule helped)

Rule Categories by Priority

| Priority | Category | Impact | Prefix | Rules |
|---|---|---|---|---|
| 1 | Candidate Retrieval & Recall | CRITICAL | `recall-` | 6 |
| 2 | Base Relevance & Field Scoring | CRITICAL | `rel-` | 7 |
| 3 | Quality Signals & Confidence Bounds | HIGH | `qual-` | 6 |
| 4 | Personalization & Embeddings | HIGH | `pers-` | 7 |
| 5 | Spatial & Temporal Decay | HIGH | `decay-` | 5 |
| 6 | Two-Sided Marketplace Balance | HIGH | `market-` | 7 |
| 7 | Bias Correction & Online Learning | HIGH | `bias-` | 6 |
| 8 | Evaluation & Measurement | HIGH | `eval-` | 7 |
| 9 | Diversity & Re-ranking | MEDIUM-HIGH | `div-` | 5 |

Quick Reference

1. Candidate Retrieval & Recall (CRITICAL)

recall-hybrid-rrf — Use Hybrid BM25 + kNN with Reciprocal Rank Fusion
recall-two-tower-ebr — Use Two-Tower Architecture for Embedding-Based Retrieval
recall-prefilter-knn — Apply Pre-Filter to kNN with Hard Constraints
recall-hnsw-vs-ivf — Choose HNSW for Latency, IVF for Memory at Scale
recall-multi-stage — Split Retrieval into Cheap Recall and Expensive Re-rank
recall-query-expansion — Apply Synonym Expansion at Index Time for Recall, Query Time for Precision
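
As a rough illustration of recall-hybrid-rrf, the fusion step itself can be sketched in a few lines of Python. This is a client-side sketch, not the engine's built-in implementation; the function name `rrf_fuse`, the sample doc ids, and the constant `k=60` (a value commonly used in the RRF literature) are assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; tie-break on doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a BM25 query and a kNN query.
bm25_hits = ["a", "b", "c", "d"]
knn_hits = ["c", "a", "e"]
fused = rrf_fuse([bm25_hits, knn_hits])
print(fused)  # ['a', 'c', 'b', 'e', 'd']
```

Documents that appear in both lists ("a" and "c") rise above documents that rank well in only one, which is the property that makes RRF robust to the incomparable score scales of BM25 and vector similarity.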

2. Base Relevance & Field Scoring (CRITICAL)

rel-bm25f-field-weights — Tune BM25F Field Weights Before k1/b
rel-multi-match-strategy — Pick multi_match Type by Query Shape, Not by Default
rel-bm25-k1-b-tuning — Tune BM25 k1 and b Per-Field for Short Marketplace Documents
rel-listwise-loss — Prefer Listwise (LambdaMART) over Pairwise (RankNet) LTR Loss
rel-script-score-over-function-score — Use script_score Query, Not function_score, for Composition
rel-rescore-over-bool-should — Use rescore Phase for Heavy Scoring, Not bool/should at Retrieval
rel-avoid-boost-inflation — Avoid Field-Boost Inflation Above ~10x
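
To make the k1/b discussion in rel-bm25-k1-b-tuning concrete, here is a sketch of a single-term BM25 score in the classic Okapi form. It is an approximation for building intuition: Lucene's scorer differs in constant factors, and the function name and the sample statistics are assumptions.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Classic Okapi BM25 contribution of one query term to one document."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly document length is normalised (0 = not at all).
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated term occurrences saturate.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Same term frequency, different field lengths (hypothetical corpus stats).
short_title = bm25_term_score(tf=1, doc_len=10, avg_doc_len=20, n_docs=1000, doc_freq=50)
long_title = bm25_term_score(tf=1, doc_len=40, avg_doc_len=20, n_docs=1000, doc_freq=50)
print(short_title > long_title)  # True: with b > 0 the shorter field scores higher
```

Setting `b=0` removes the length effect entirely, which is one reason per-field tuning matters for short marketplace titles.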

3. Quality Signals & Confidence Bounds (HIGH)

qual-wilson-lower-bound — Sort by Wilson Lower Bound, Not Average Rating
qual-bayesian-average — Use Bayesian Average for Star Ratings with Low Sample Sizes
qual-rank-feature-saturation — Saturate Popularity Counts with rank_feature.saturation
qual-rank-feature-sigmoid — Apply Sigmoid Modifier for Bounded Ratio Signals
qual-log1p-vs-saturation — Choose log1p over Saturation for Long-Tail Signal Preservation
qual-completeness-score — Score Listing Completeness as a Quality Signal
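
The two confidence-aware rating signals named above (qual-wilson-lower-bound and qual-bayesian-average) can be sketched as index-time pre-processing in Python. The function names, the 95% z-value, and the prior parameters are assumptions chosen for illustration; the formulas are the standard Wilson score interval and a prior-weighted mean.

```python
import math


def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for a positive-review ratio."""
    if total == 0:
        return 0.0
    p = positive / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * total)) / total)
    return (centre - margin) / denom


def bayesian_average(rating_sum, rating_count, prior_mean, prior_weight):
    """Shrink a listing's mean rating toward a prior mean (e.g. the site-wide mean)."""
    return (prior_weight * prior_mean + rating_sum) / (prior_weight + rating_count)


# A perfect score on 5 reviews ranks below 95% positive on 100 reviews.
few = wilson_lower_bound(5, 5)
many = wilson_lower_bound(95, 100)
print(few < many)  # True

# Two 5-star ratings barely move a listing off an assumed 4.0 prior.
print(round(bayesian_average(10.0, 2, prior_mean=4.0, prior_weight=10), 3))  # 4.167
```

Either value would typically be written to a numeric field at index time so the query only reads it.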

4. Personalization & Embeddings (HIGH)

pers-listing-embeddings — Train Listing Embeddings from Booking-Session Co-occurrence
pers-type-embeddings-cold-start — Use Type Embeddings for Cold-Start Users and Listings
pers-real-time-session-vector — Update Session Vector in Real-Time from Click Events
pers-multi-modal-embeddings — Use Multi-Modal Embeddings (Text + Image) for Recall
pers-cross-encoder-rerank — Apply Cross-Encoder Re-rank on Top-50 for Personalization
pers-tower-split-offline-online — Split Item Tower Offline, Query Tower Online
pers-contextual-features — Inject Contextual Features into script_score

5. Spatial & Temporal Decay (HIGH)

decay-gauss-geo — Use Gauss Decay for Geo Distance, Not Linear
decay-exp-freshness — Use Exp Decay for Time Freshness, Gauss for Date Proximity
decay-scale-calibration — Calibrate Decay Scale to the 0.5-Score Distance Target
decay-offset-noise — Add Offset to Decay Functions for Noisy Sparse Fields
decay-multi-field-composition — Compose Multi-Field Decay with Explicit Weights
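
For calibrating scale (decay-scale-calibration), it helps to reproduce the decay curves outside the cluster. The sketch below follows the gauss and exp formulas as given in the function_score documentation; the Python function names and the sample distances are assumptions.

```python
import math


def gauss_decay(distance, scale, offset=0.0, decay=0.5):
    """Gaussian decay: score equals `decay` at offset + scale from the origin."""
    d = max(0.0, abs(distance) - offset)
    sigma_sq = -(scale ** 2) / (2 * math.log(decay))
    return math.exp(-(d ** 2) / (2 * sigma_sq))


def exp_decay(distance, scale, offset=0.0, decay=0.5):
    """Exponential decay: score equals `decay` at offset + scale from the origin."""
    d = max(0.0, abs(distance) - offset)
    lam = math.log(decay) / scale
    return math.exp(lam * d)


# With scale = 5 (km, say) and decay = 0.5, both curves hit 0.5 at 5 km,
# but gauss stays flatter near the origin and drops faster beyond the scale.
for km in (1, 2.5, 5, 10):
    print(km, round(gauss_decay(km, scale=5), 3), round(exp_decay(km, scale=5), 3))
```

Plotting these against observed booking distances is one way to pick a scale so that the 0.5 point lands where conversions actually halve.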

6. Two-Sided Marketplace Balance (HIGH)

market-conversion-weighted-ranking — Weight Ranking by Conversion Rate, Not Click-Through Rate
market-cold-start-exploration — Boost Cold-Start Listings with Bounded Exposure Allocation
market-supply-fairness-lorenz — Monitor Supply-Side Fairness with Lorenz/Gini Metrics
market-host-quality-signals — Separate Host-Quality and Listing-Quality Signals
market-inventory-health — Penalize Listings with Low Inventory Health
market-pareto-multi-objective — Optimize Multi-Objective Ranking with Pareto-Aware Weights
market-price-relevance — Score Price Relevance with Soft Bands, Not Hard Filters
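
The Gini metric mentioned in market-supply-fairness-lorenz can be computed from per-host impression counts with a short helper. This is a minimal sketch using the standard sorted-rank formula; the function name and the sample exposure vectors are assumptions.

```python
def gini(exposures):
    """Gini coefficient of non-negative exposure counts (0 = equal, near 1 = concentrated)."""
    xs = sorted(exposures)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i starting at 1.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


print(gini([10, 10, 10, 10]))  # 0.0: every host gets equal exposure
print(gini([0, 0, 0, 100]))    # 0.75: one host takes everything (max is (n-1)/n)
```

Tracking this value per query segment over time shows whether a ranking change concentrated impressions on incumbents.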

7. Bias Correction & Online Learning (HIGH)

bias-position-ips — Correct Position Bias with Inverse Propensity Scoring
bias-click-models — Estimate Click Propensities with PBM, Cascade, or DBN
bias-thompson-sampling — Explore Ranking Alternatives with Thompson Sampling
bias-counterfactual-eval — Validate Ranking Changes with Counterfactual Evaluation
bias-interleaved-evaluation — Use Interleaved Evaluation for Low-Traffic Ranking Comparisons
bias-popularity-debiasing — Subsample Popular Items in Embedding Training Negatives
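
As an illustration of bias-thompson-sampling, the core selection step can be sketched with Beta posteriors over a binary reward such as "booked or not". The arm names, the uniform Beta(1, 1) prior, and the function name are assumptions; a production system would also need logging and guardrails that are out of scope here.

```python
import random


def thompson_pick(arms, rng):
    """Pick the arm whose sampled conversion rate is highest.

    arms maps an arm name to (successes, failures); each arm gets a
    Beta(successes + 1, failures + 1) posterior under a uniform prior.
    """
    samples = {
        name: rng.betavariate(successes + 1, failures + 1)
        for name, (successes, failures) in arms.items()
    }
    return max(samples, key=samples.get)


rng = random.Random(0)
# Hypothetical ranking variants with observed outcomes.
arms = {"current_ranker": (120, 880), "new_ranker": (15, 85)}
picks = [thompson_pick(arms, rng) for _ in range(1000)]
# The less-observed arm still receives traffic because its posterior is wide.
print(picks.count("new_ranker") > 0)
```

Exploration shrinks on its own as evidence accumulates, since a narrow posterior rarely produces an outlying sample.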

8. Evaluation & Measurement (HIGH)

eval-graded-judgment-set — Build a Graded Judgment Set for Offline Evaluation
eval-ndcg-primary-metric — Use NDCG@k as the Primary Offline Ranking Metric
eval-online-offline-correlation — Validate Online-Offline Metric Correlation Before Trusting Offline Scores
eval-ablation-attribution — Run Ablation Studies to Attribute Lift to Specific Components
eval-ab-sample-size-mde — Calculate A/B Sample Size from MDE Before Running
eval-cuped-variance-reduction — Apply CUPED to Halve A/B Sample Size with Pre-Experiment Covariates
eval-regression-query-suite — Maintain a Regression Query Suite for Silent Quality Drops
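
For eval-ndcg-primary-metric, a minimal NDCG@k sketch is shown below. It assumes graded judgments on an integer scale and the exponential gain variant (2^grade - 1); the function names and the sample grades are assumptions, and the ideal ranking is built from all judged documents for the query rather than only the retrieved ones.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded results."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, judged_gains, k):
    """NDCG@k: DCG of the returned order divided by DCG of the ideal order."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


judged = [3, 2, 1, 0]            # grades of every judged listing for one query
print(ndcg_at_k([3, 2, 1, 0], judged, k=4))      # 1.0 for the ideal order
print(ndcg_at_k([0, 1, 2, 3], judged, k=4) < 1)  # True for the reversed order
```

Averaging this value over the judgment set gives the single offline number that the ablation and regression-suite rules compare across ranking variants.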

9. Diversity & Re-ranking (MEDIUM-HIGH)

div-mmr-rerank — Apply MMR Rerank for Top-Window Diversity
div-max-per-host — Cap Impressions Per Host with Max-Per-Group Constraint
div-category-diversity — Diversify Categories Hierarchically in the Top Window
div-dpp-quality-diversity — Use Determinantal Point Processes for Joint Quality and Diversity
div-window-penalty — Apply Window-Based Diversity Penalty in Rescore
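
The greedy selection behind div-mmr-rerank can be sketched as follows. This is a client-side sketch over an already-retrieved top window; the function name, the `lam` trade-off value, and the same-host similarity function are assumptions (an embedding cosine similarity would be the more usual choice).

```python
def mmr_rerank(candidates, relevance, similarity, lam=0.7, top_n=10):
    """Greedy Maximal Marginal Relevance over a candidate window.

    Each step picks the candidate maximising
    lam * relevance - (1 - lam) * max similarity to anything already selected.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:
        def mmr_score(c):
            max_sim = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * max_sim

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical listings: the first letter is the host.
relevance = {"a1": 1.0, "a2": 0.95, "b1": 0.8}
same_host = lambda x, y: 1.0 if x[0] == y[0] else 0.0
print(mmr_rerank(["a1", "a2", "b1"], relevance, same_host))  # ['a1', 'b1', 'a2']
```

The second listing from host "a" is pushed below a less relevant listing from host "b", which is the behaviour the top-window diversity rules aim for.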

How to Use

For a focused question ("which decay function for geo distance?"), jump directly to the relevant rule (decay-gauss-geo) — each rule is self-contained with the WHY, OpenSearch query/Painless code, and the canonical source citation.

For a full ranking system review, work the categories top-to-bottom. The cascade ordering matters: get recall right first (no boost recovers a missed candidate), then base relevance (under the default multiply boost mode it is the multiplicand of every function score), then quality / personalization / decay / marketplace balance / bias correction in that order. Diversity is the last re-rank step over a well-ordered top window.
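
To show where base relevance enters as the multiplicand, here is a sketch of a function_score request body built as a Python dict. The field names (`title`, `description`, `location`, `review_count`), the query text, the coordinates, and all weights are hypothetical; only the DSL keys themselves are taken from the function_score query syntax.

```python
import json

search_body = {
    "query": {
        "function_score": {
            # Base relevance: this score is multiplied by the combined functions.
            "query": {
                "multi_match": {
                    "query": "beach house",
                    "fields": ["title^3", "description"],
                }
            },
            "functions": [
                {   # Geo decay: 0.5 at 5 km past a 500 m offset (assumed values).
                    "gauss": {
                        "location": {
                            "origin": {"lat": 38.72, "lon": -9.14},
                            "scale": "5km",
                            "offset": "500m",
                            "decay": 0.5,
                        }
                    },
                    "weight": 2,
                },
                {   # Dampened popularity signal.
                    "field_value_factor": {
                        "field": "review_count",
                        "modifier": "log1p",
                        "missing": 0,
                    }
                },
            ],
            "score_mode": "sum",        # how the function values combine
            "boost_mode": "multiply",   # how they combine with the query score
        }
    }
}

print(json.dumps(search_body, indent=2))
```

Because `boost_mode` is `multiply`, a listing with a near-zero base score stays near zero no matter how strong its decay or popularity values are, which is the cascade the ordering above protects against.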

For correcting bias before retraining, start with bias-position-ips and bias-click-models. Applying IPS to position-confounded click data is typically the highest-leverage change for a marketplace that retrains LTR models on logged clicks, because raw clicks over-reward whatever the previous ranker placed at the top.
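
A minimal sketch of the IPS re-weighting step follows. It assumes examination propensities per position have already been estimated (for example with a click model); the function name, the log layout, the propensity values, and the clipping threshold are all assumptions for illustration.

```python
def ips_weighted_labels(click_log, propensity, clip=10.0):
    """Turn logged clicks into propensity-weighted training labels.

    click_log: iterable of (doc_id, position, clicked) tuples.
    propensity: maps position -> estimated probability the position was examined.
    clip: upper bound on a single weight, to limit variance from rare positions.
    """
    weights = {}
    for doc_id, position, clicked in click_log:
        if clicked:
            w = min(1.0 / propensity[position], clip)
            weights[doc_id] = weights.get(doc_id, 0.0) + w
    return weights


# Hypothetical log: a click at position 5 counts five times a click at position 1.
log = [("x", 1, True), ("y", 5, True), ("y", 5, False)]
examined = {1: 1.0, 5: 0.2}
print(ips_weighted_labels(log, examined))  # {'x': 1.0, 'y': 5.0}
```

Clipping trades a small amount of bias for a large reduction in variance when deep positions have very low propensities.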

For testing multiple algorithms together and validating empirically, start with eval-graded-judgment-set (build the foundation), eval-ndcg-primary-metric (pick the metric), then eval-ablation-attribution (attribute lift to specific components). Pair with eval-online-offline-correlation to verify your offline metric predicts online behavior, eval-ab-sample-size-mde + eval-cuped-variance-reduction for disciplined A/B testing, and eval-regression-query-suite to catch silent quality drops on named queries.
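
The A/B sizing and CUPED steps named above can be sketched with the standard library. The sample-size function uses the usual two-proportion normal approximation, and the CUPED function applies the adjustment y - theta * (x - mean(x)) with theta = cov(x, y) / var(x). Function names, default alpha and power, and the sample data are assumptions. Note that CUPED reduces variance by a factor of 1 - rho^2, so the "halve" in the rule title corresponds to a pre/post correlation of roughly 0.71.

```python
import math
from statistics import NormalDist, mean, pvariance


def sample_size_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.8):
    """Users needed per arm to detect a relative lift in a conversion rate."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def cuped_adjust(y, x):
    """Remove the part of metric y explained by pre-experiment covariate x."""
    mx, my = mean(x), mean(y)
    var_x = sum((a - mx) ** 2 for a in x)
    if var_x == 0:
        return list(y)
    theta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / var_x
    return [b - theta * (a - mx) for a, b in zip(x, y)]


# Halving the detectable lift roughly quadruples the required sample.
print(sample_size_per_arm(0.05, 0.10), sample_size_per_arm(0.05, 0.05))

# Hypothetical per-user data: x = pre-period bookings, y = in-experiment bookings.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
adjusted = cuped_adjust(y, x)
print(pvariance(adjusted) < pvariance(y))  # True: same mean, lower variance
```

Running the sample-size calculation before launch, on the CUPED-adjusted variance, gives the test duration that the eval-ab-sample-size-mde rule asks for.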

To cite research for a design decision, use the canonical reference at the end of each rule: a KDD/SIGIR/WSDM paper, the relevant engineering blog (Airbnb, Pinterest, DoorDash, Etsy, Just Eat Takeaway, Thumbtack), or the OpenSearch documentation page.

Read the section definitions (references/_sections.md) for the cascade-impact rationale behind the category ordering, or the rule template (assets/templates/_template.md) when adding a new rule.

Reference Files

| File | Description |
|---|---|
| references/_sections.md | Category definitions and ordering by cascade impact |
| AGENTS.md | Compact TOC navigation (auto-built; do not edit by hand) |
| assets/templates/_template.md | Template for authoring new rules |
| metadata.json | Version and authoritative reference URLs |