name: opensearch-function-scoring-algorithms
description: Search relevance and ranking on OpenSearch/Elasticsearch for a two-sided marketplace, covering candidate retrieval (hybrid BM25 + kNN, RRF, two-tower EBR), base relevance (BM25F, multi_match, LambdaMART), quality signals (Wilson lower bound, Bayesian average, rank_feature saturation/sigmoid), personalization (listing/user/session embeddings), spatial/temporal decay (gauss/exp), marketplace balance (conversion-weighted ranking, supply fairness, Pareto multi-objective), bias correction (IPS, click models, Thompson sampling), empirical evaluation (judgment sets, NDCG, ablation, A/B sizing, CUPED, regression suites), and diversity (MMR, DPP, max-per-host). Triggers on function_score, rank_feature, script_score, kNN, hybrid query, learning-to-rank, two-sided ranking, exposure fairness, NDCG, A/B testing, judgment set construction, ranking ablation, or "why is my OpenSearch ranking bad". Applies to Elasticsearch too; the core scoring APIs (function_score, rank_feature, script_score) are shared.
Marketplace-Research OpenSearch Function Scoring Best Practices
A reference distillation of research-backed algorithms for ranking in two-sided marketplaces (Airbnb, Uber Eats, DoorDash, Etsy, eBay, Booking.com) implemented on OpenSearch or Elasticsearch. Contains 56 rules across 9 categories, prioritised by cascade effect in the search ranking pipeline. Each rule explains the WHY (the cascade or the bias it corrects), shows incorrect-vs-correct code (OpenSearch JSON queries, Painless scripts, Python pre-processing, evaluation methodology), and links to the canonical source — KDD/SIGIR/WSDM papers, the OpenSearch documentation, and the engineering blogs of the marketplaces that proved these patterns at scale.
When to Apply
Reach for this skill when:
- Designing a new marketplace search system on OpenSearch or Elasticsearch from scratch
- Tuning function_score / rank_feature / script_score queries that aren't moving the needle
- Setting up hybrid retrieval (BM25 + dense vectors) with Reciprocal Rank Fusion
- Choosing between HNSW and IVF for billion-scale ANN indexes
- Adding personalization via listing/user embeddings or two-tower architectures
- Correcting position bias in click logs before retraining an LTR model
- Designing exposure-fairness or new-listing cold-start exposure allocation
- Composing decay functions (gauss / exp / linear) over geo + date + freshness
- Diversifying the top window with MMR, DPP, or per-host caps
- Debugging "why does my top-10 show 8 listings from one host?" or "why does ranking favor popular incumbents?"
- Building offline evaluation infrastructure — graded judgment sets, NDCG@k pipelines, ablation studies, regression query suites
- Designing A/B tests for ranking changes — MDE / power / sample-size pre-computation, CUPED variance reduction, online-offline correlation calibration
- Attributing lift to specific scoring components — "did my new bias-correction help, or was it the embeddings, or both?"
The rules apply to any OpenSearch/Elasticsearch-backed marketplace search regardless of vertical — accommodation, food delivery, restaurants, services, jobs, secondhand goods, real estate. Triggers include "marketplace ranking", "search relevance", "function_score", "rank_feature", "script_score", "kNN", "hybrid search", "RRF", "learning to rank", "embedding-based retrieval", "two-tower", "position bias", "MMR", "supply fairness", "Pareto multi-objective", "NDCG", "judgment set", "ablation study", "CUPED", "A/B sample size", "ranking eval", and "why are my search results bad".
The Search Ranking Lifecycle
Categories are derived from the marketplace search ranking pipeline. Earlier stages cascade — a miss in recall (stage 1) cannot be repaired by any downstream boost, and a wrong base relevance multiplies through every functional score:
    Query → [1] Recall → [2] Base Relevance → [3] Quality Signals → [4] Personalization
          → [5] Geo/Time Decay → [6] Marketplace Balance → [7] Diversity Re-rank → Results
                                        ↑
                        [8] Bias Correction (applied across all stages
                            and into training)
                                        ↑
                        [9] Evaluation & Measurement (the meta-layer:
                            judgment sets, NDCG, ablation, A/B
                            sizing, CUPED; without these you
                            can't tell if any rule helped)
Rule Categories by Priority
| Priority | Category | Impact | Prefix | Rules |
|---|---|---|---|---|
| 1 | Candidate Retrieval & Recall | CRITICAL | recall- | 6 |
| 2 | Base Relevance & Field Scoring | CRITICAL | rel- | 7 |
| 3 | Quality Signals & Confidence Bounds | HIGH | qual- | 6 |
| 4 | Personalization & Embeddings | HIGH | pers- | 7 |
| 5 | Spatial & Temporal Decay | HIGH | decay- | 5 |
| 6 | Two-Sided Marketplace Balance | HIGH | market- | 7 |
| 7 | Bias Correction & Online Learning | HIGH | bias- | 6 |
| 8 | Evaluation & Measurement | HIGH | eval- | 7 |
| 9 | Diversity & Re-ranking | MEDIUM-HIGH | div- | 5 |
Quick Reference
1. Candidate Retrieval & Recall (CRITICAL)
- recall-hybrid-rrf: Use Hybrid BM25 + kNN with Reciprocal Rank Fusion
- recall-two-tower-ebr: Use Two-Tower Architecture for Embedding-Based Retrieval
- recall-prefilter-knn: Apply Pre-Filter to kNN with Hard Constraints
- recall-hnsw-vs-ivf: Choose HNSW for Latency, IVF for Memory at Scale
- recall-multi-stage: Split Retrieval into Cheap Recall and Expensive Re-rank
- recall-query-expansion: Apply Synonym Expansion at Index Time for Recall, Query Time for Precision
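As a rough illustration of what recall-hybrid-rrf fuses, the sketch below computes Reciprocal Rank Fusion in plain Python. The function name and the sample doc IDs are hypothetical; `k=60` is the constant commonly used in the RRF literature and is treated here as an assumption, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs (best first) into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only ranks are used, so BM25 and kNN scores
    never need to be put on a common scale.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; tie-break on doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from a lexical and a vector retriever.
bm25_hits = ["a", "b", "c", "d"]
knn_hits = ["c", "a", "e"]
fused = reciprocal_rank_fusion([bm25_hits, knn_hits])
```

Documents found by both retrievers ("a" and "c") rise above those found by only one, which is the behaviour the rule relies on.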
2. Base Relevance & Field Scoring (CRITICAL)
- rel-bm25f-field-weights: Tune BM25F Field Weights Before k1/b
- rel-multi-match-strategy: Pick multi_match Type by Query Shape, Not by Default
- rel-bm25-k1-b-tuning: Tune BM25 k1 and b Per-Field for Short Marketplace Documents
- rel-listwise-loss: Prefer Listwise (LambdaMART) over Pairwise (RankNet) LTR Loss
- rel-script-score-over-function-score: Use script_score Query, Not function_score, for Composition
- rel-rescore-over-bool-should: Use rescore Phase for Heavy Scoring, Not bool/should at Retrieval
- rel-avoid-boost-inflation: Avoid Field-Boost Inflation Above ~10x
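To see why k1 and b matter for short documents (rel-bm25-k1-b-tuning), here is a minimal sketch of the textbook BM25 score for a single term. The function name and the sample numbers are hypothetical, and the engine's actual implementation may differ by a constant factor; treat this as a way to reason about the parameters, not as a reproduction of the engine's scores.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document.

    k1 controls how quickly repeated occurrences saturate;
    b controls how strongly long documents are penalised (0 disables it).
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Same term frequency, different field lengths (hypothetical corpus stats).
short_title = bm25_term_score(tf=1, doc_len=5, avg_doc_len=20, n_docs=1000, doc_freq=10)
long_title = bm25_term_score(tf=1, doc_len=50, avg_doc_len=20, n_docs=1000, doc_freq=10)

# With b=0 the length of the field no longer affects the score.
no_norm_short = bm25_term_score(1, 5, 20, 1000, 10, b=0.0)
no_norm_long = bm25_term_score(1, 50, 20, 1000, 10, b=0.0)
```

With the default b, the short title outscores the long one for the same match; setting b to 0 removes that effect, which is the kind of trade-off to evaluate per field.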
3. Quality Signals & Confidence Bounds (HIGH)
- qual-wilson-lower-bound: Sort by Wilson Lower Bound, Not Average Rating
- qual-bayesian-average: Use Bayesian Average for Star Ratings with Low Sample Sizes
- qual-rank-feature-saturation: Saturate Popularity Counts with rank_feature.saturation
- qual-rank-feature-sigmoid: Apply Sigmoid Modifier for Bounded Ratio Signals
- qual-log1p-vs-saturation: Choose log1p over Saturation for Long-Tail Signal Preservation
- qual-completeness-score: Score Listing Completeness as a Quality Signal
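The confidence-bound rules above can be sketched as Python pre-processing that runs before indexing. The function names, the prior values and the sample counts are hypothetical; the formulas are the standard Wilson score interval, a prior-weighted mean, and the `S / (S + pivot)` saturation shape.

```python
import math

def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for a positive-review rate."""
    if total == 0:
        return 0.0
    p = positive / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * total)) / total)
    return (centre - margin) / denom

def bayesian_average(rating_sum, n, prior_mean, prior_weight):
    """Shrink a listing's mean rating toward a prior when n is small."""
    return (prior_weight * prior_mean + rating_sum) / (prior_weight + n)

def saturation(value, pivot):
    """Saturation curve: reaches 0.5 at value == pivot and approaches 1."""
    return value / (value + pivot)

one_review = wilson_lower_bound(positive=1, total=1)       # 100% positive, n=1
many_reviews = wilson_lower_bound(positive=95, total=100)  # 95% positive, n=100
new_listing = bayesian_average(rating_sum=5.0, n=1, prior_mean=4.0, prior_weight=10)
```

A single five-star review ranks well below 95 positives out of 100 under the Wilson bound, and the Bayesian average keeps the one-review listing close to the assumed prior mean of 4.0.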
4. Personalization & Embeddings (HIGH)
- pers-listing-embeddings: Train Listing Embeddings from Booking-Session Co-occurrence
- pers-type-embeddings-cold-start: Use Type Embeddings for Cold-Start Users and Listings
- pers-real-time-session-vector: Update Session Vector in Real-Time from Click Events
- pers-multi-modal-embeddings: Use Multi-Modal Embeddings (Text + Image) for Recall
- pers-cross-encoder-rerank: Apply Cross-Encoder Re-rank on Top-50 for Personalization
- pers-tower-split-offline-online: Split Item Tower Offline, Query Tower Online
- pers-contextual-features: Inject Contextual Features into script_score
5. Spatial & Temporal Decay (HIGH)
- decay-gauss-geo: Use Gauss Decay for Geo Distance, Not Linear
- decay-exp-freshness: Use Exp Decay for Time Freshness, Gauss for Date Proximity
- decay-scale-calibration: Calibrate Decay Scale to the 0.5-Score Distance Target
- decay-offset-noise: Add Offset to Decay Functions for Noisy Sparse Fields
- decay-multi-field-composition: Compose Multi-Field Decay with Explicit Weights
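For calibrating `scale`, `decay` and `offset` offline, the decay curves can be reproduced in a few lines of Python. This is a sketch based on the formulas documented for the function_score decay functions; the function names and the 5 km example are hypothetical.

```python
import math

def gauss_decay(distance, scale, decay=0.5, offset=0.0):
    """Gaussian decay: score equals `decay` at `offset + scale` from the origin."""
    d = max(0.0, abs(distance) - offset)
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-d ** 2 / (2.0 * sigma_sq))

def exp_decay(distance, scale, decay=0.5, offset=0.0):
    """Exponential decay: score equals `decay` at `offset + scale` from the origin."""
    d = max(0.0, abs(distance) - offset)
    lam = math.log(decay) / scale
    return math.exp(lam * d)

# Hypothetical geo example: scale = 5 km, so both curves cross 0.5 at 5 km.
near_gauss, near_exp = gauss_decay(2.5, 5.0), exp_decay(2.5, 5.0)
far_gauss, far_exp = gauss_decay(10.0, 5.0), exp_decay(10.0, 5.0)
```

Both curves pass through 0.5 at the scale distance, but the Gaussian stays flatter close to the origin and falls off faster beyond it, while the exponential drops immediately and keeps a longer tail.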
6. Two-Sided Marketplace Balance (HIGH)
- market-conversion-weighted-ranking: Weight Ranking by Conversion Rate, Not Click-Through Rate
- market-cold-start-exploration: Boost Cold-Start Listings with Bounded Exposure Allocation
- market-supply-fairness-lorenz: Monitor Supply-Side Fairness with Lorenz/Gini Metrics
- market-host-quality-signals: Separate Host-Quality and Listing-Quality Signals
- market-inventory-health: Penalize Listings with Low Inventory Health
- market-pareto-multi-objective: Optimize Multi-Objective Ranking with Pareto-Aware Weights
- market-price-relevance: Score Price Relevance with Soft Bands, Not Hard Filters
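One way to monitor the supply-side fairness that market-supply-fairness-lorenz refers to is the Gini coefficient of per-host impressions. The sketch below is a minimal version; the function name and the impression counts are hypothetical, and how exposure is counted (raw impressions, position-weighted impressions) is left as an assumption.

```python
def gini(exposures):
    """Gini coefficient of non-negative exposure counts.

    0.0 means every host gets the same exposure; values near 1.0 mean
    exposure is concentrated on a few hosts.
    """
    xs = sorted(exposures)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, for ascending x_i, i = 1..n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

even_exposure = gini([100, 100, 100, 100])
winner_takes_all = gini([0, 0, 0, 400])
```

Tracking this value per market and per week gives a single number to alert on when a ranking change shifts exposure toward incumbents.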
7. Bias Correction & Online Learning (HIGH)
- bias-position-ips: Correct Position Bias with Inverse Propensity Scoring
- bias-click-models: Estimate Click Propensities with PBM, Cascade, or DBN
- bias-thompson-sampling: Explore Ranking Alternatives with Thompson Sampling
- bias-counterfactual-eval: Validate Ranking Changes with Counterfactual Evaluation
- bias-interleaved-evaluation: Use Interleaved Evaluation for Low-Traffic Ranking Comparisons
- bias-popularity-debiasing: Subsample Popular Items in Embedding Training Negatives
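The core of bias-position-ips can be sketched as a re-weighting pass over logged clicks: each click counts for the inverse of the probability that its position was examined. The propensity table, the clipping threshold and the log rows below are hypothetical; in practice the propensities would come from a click model or a randomisation experiment.

```python
def ips_weighted_clicks(impressions, propensity_by_position, max_weight=10.0):
    """Sum inverse-propensity-weighted clicks per listing.

    impressions: iterable of (listing_id, position, clicked) tuples.
    max_weight clips very large weights to limit variance (an assumption;
    clipping trades a little bias for stability).
    """
    totals = {}
    for listing_id, position, clicked in impressions:
        if not clicked:
            continue
        weight = min(1.0 / propensity_by_position[position], max_weight)
        totals[listing_id] = totals.get(listing_id, 0.0) + weight
    return totals

# Hypothetical examination propensities per rank position.
propensity = {1: 1.0, 2: 0.5, 5: 0.05}
click_log = [
    ("a", 1, True),
    ("b", 2, True),
    ("b", 2, False),
    ("c", 5, True),
]
weighted = ips_weighted_clicks(click_log, propensity)
```

A click at position 5 is worth far more than a click at position 1, which is the correction that stops an LTR model from simply relearning the old ranking.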
8. Evaluation & Measurement (HIGH)
- eval-graded-judgment-set: Build a Graded Judgment Set for Offline Evaluation
- eval-ndcg-primary-metric: Use NDCG@k as the Primary Offline Ranking Metric
- eval-online-offline-correlation: Validate Online-Offline Metric Correlation Before Trusting Offline Scores
- eval-ablation-attribution: Run Ablation Studies to Attribute Lift to Specific Components
- eval-ab-sample-size-mde: Calculate A/B Sample Size from MDE Before Running
- eval-cuped-variance-reduction: Apply CUPED to Halve A/B Sample Size with Pre-Experiment Covariates
- eval-regression-query-suite: Maintain a Regression Query Suite for Silent Quality Drops
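A minimal NDCG@k implementation, as a sketch of the metric named in eval-ndcg-primary-metric. It assumes graded judgments on an integer scale (0 = irrelevant), the exponential gain `2**grade - 1`, and that the list passed in covers every judged document for the query in ranked order; the function names are hypothetical.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k graded results."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg_at_k(grades, k):
    """DCG normalised by the DCG of the ideal (descending-grade) ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1, 0], k=4)   # already in ideal order
swapped = ndcg_at_k([0, 3], k=2)         # best result pushed to rank 2
```

Averaging `ndcg_at_k` over all queries in the judgment set gives the per-variant number that an ablation study compares.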
9. Diversity & Re-ranking (MEDIUM-HIGH)
- div-mmr-rerank: Apply MMR Rerank for Top-Window Diversity
- div-max-per-host: Cap Impressions Per Host with Max-Per-Group Constraint
- div-category-diversity: Diversify Categories Hierarchically in the Top Window
- div-dpp-quality-diversity: Use Determinantal Point Processes for Joint Quality and Diversity
- div-window-penalty: Apply Window-Based Diversity Penalty in Rescore
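The greedy selection behind div-mmr-rerank can be sketched as follows. The function names, the `lam` trade-off value and the same-host similarity are hypothetical; a real deployment would more likely use embedding cosine similarity and run this over the top window returned by the engine.

```python
def mmr_rerank(candidates, similarity, lam=0.7, top_n=10):
    """Greedy Maximal Marginal Relevance over (doc_id, relevance) pairs.

    At each step, pick the document maximising
    lam * relevance - (1 - lam) * max similarity to anything already selected.
    """
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        def mmr_score(doc_id):
            redundancy = max((similarity(doc_id, s) for s in selected), default=0.0)
            return lam * remaining[doc_id] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected

def same_host(a, b):
    """Toy similarity: 1.0 when two listing IDs share a host prefix letter."""
    return 1.0 if a[0] == b[0] else 0.0

# Three listings from host "a" and one from host "b", relevance in [0, 1].
results = [("a1", 0.9), ("a2", 0.85), ("b1", 0.8), ("a3", 0.75)]
reranked = mmr_rerank(results, same_host, lam=0.7, top_n=4)
```

The listing from host "b" moves up to second place even though two host "a" listings have higher raw relevance, which addresses the "8 listings from one host" symptom without a hard cap.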
How to Use
For a focused question ("which decay function for geo distance?"), jump directly to the relevant rule (decay-gauss-geo) — each rule is self-contained with the WHY, OpenSearch query/Painless code, and the canonical source citation.
For a full ranking system review, work the categories top-to-bottom. The cascade ordering is real: get recall right first (no boost recovers a missed candidate), then base relevance (it's the multiplicand of every functional score), then quality / personalization / decay / marketplace balance / bias correction in that order. Diversity is the last re-rank step over a well-ordered top window.
For correcting bias before retraining, start with bias-position-ips and bias-click-models — applying IPS to position-confounded click data is the single highest-leverage change for any marketplace that retrains LTR models on logged clicks.
For testing multiple algorithms together and validating empirically, start with eval-graded-judgment-set (build the foundation), eval-ndcg-primary-metric (pick the metric), then eval-ablation-attribution (attribute lift to specific components). Pair with eval-online-offline-correlation to verify your offline metric predicts online behavior, eval-ab-sample-size-mde + eval-cuped-variance-reduction for disciplined A/B testing, and eval-regression-query-suite to catch silent quality drops on named queries.
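The two A/B rules named above can be sketched with the standard library. This assumes a two-sided test on a difference in means with equal arm sizes and a known standard deviation; the function names and sample values are hypothetical, and in a real experiment the CUPED coefficient would be estimated on pooled data from both arms.

```python
import math
from statistics import NormalDist, mean

def sample_size_per_arm(sigma, mde, alpha=0.05, power=0.8):
    """Users needed per arm to detect an absolute difference `mde` in a mean."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / mde) ** 2)

def cuped_adjust(y, x):
    """Remove the part of metric y explained by pre-experiment covariate x.

    y_adj = y - theta * (x - mean(x)), with theta = cov(x, y) / var(x).
    The mean of y is preserved; its variance shrinks by a factor (1 - rho^2).
    """
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    theta = cov / var_x if var_x > 0 else 0.0
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

n_needed = sample_size_per_arm(sigma=1.0, mde=0.1)

# Toy data where the pre-period metric fully explains the in-experiment metric.
pre_period = [1, 2, 3, 4]
in_experiment = [2, 4, 6, 8]
adjusted = cuped_adjust(in_experiment, pre_period)
```

Because the required sample size scales with variance, halving it through CUPED needs a covariate whose correlation with the metric is roughly 0.71 (so that 1 - rho^2 is about 0.5); weaker covariates give proportionally smaller savings.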
For research-citing a design decision, every rule ends with the canonical reference — KDD/SIGIR/WSDM papers, the relevant engineering blog (Airbnb, Pinterest, DoorDash, Etsy, Just Eat Takeaway, Thumbtack), or the OpenSearch documentation page.
Read the section definitions (references/_sections.md) for the cascade-impact rationale behind the category ordering, or the rule template (assets/templates/_template.md) when adding a new rule.
Reference Files
| File | Description |
|---|---|
| references/_sections.md | Category definitions and ordering by cascade impact |
| AGENTS.md | Compact TOC navigation (auto-built; do not edit by hand) |
| assets/templates/_template.md | Template for authoring new rules |
| metadata.json | Version and authoritative reference URLs |