name: mteb-leaderboard
description: Guidance for querying ML model leaderboards and benchmarks (MTEB, HuggingFace, embedding benchmarks). This skill applies when tasks involve finding top-performing models on specific benchmarks, compar
MTEB Leaderboard Query Skill
This skill provides guidance for accurately querying machine learning model leaderboards and benchmarks, particularly the Massive Text Embedding Benchmark (MTEB) and related embedding leaderboards.
When to Use This Skill
- Finding top-performing models on specific benchmarks (MTEB, Scandinavian Embedding Benchmark, etc.)
- Answering questions about current leaderboard standings
- Comparing model performance across different benchmarks
- Tasks with specific temporal requirements (e.g., "as of August 2025")
Core Approach
Step 1: Identify Authoritative Data Sources
Before searching for results, establish which sources contain authoritative, current data:
Primary Sources (prefer these):
- Official leaderboard websites (e.g., mteb-leaderboard on HuggingFace Spaces)
- GitHub repositories with raw benchmark data
- API endpoints or JSON data files from leaderboard maintainers
Secondary Sources (use with caution):
- Academic papers (often outdated by publication time)
- Blog posts and articles (may reference outdated results)
- News articles about benchmark results
Step 2: Verify Temporal Alignment
When a task specifies a time constraint (e.g., "as of August 2025"):
- Check source publication/update dates: academic papers are typically 6-18 months behind the current leaderboard state
- Look for "last updated" timestamps on leaderboard pages
- Never assume paper results reflect current standings without verification
- Be explicit about temporal gaps: using data from June 2024 to answer a question about August 2025 leaves a 14-month gap, which likely invalidates the data
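The gap arithmetic above can be made explicit with a small helper. This is a minimal sketch; the function names and the `max_gap_months` threshold are illustrative assumptions, not rules defined by any leaderboard.

```python
from datetime import date


def month_gap(data_date: date, target_date: date) -> int:
    """Calendar months from the data snapshot to the target date (day of month ignored)."""
    return (target_date.year - data_date.year) * 12 + (target_date.month - data_date.month)


def is_temporally_aligned(data_date: date, target_date: date, max_gap_months: int = 2) -> bool:
    """True if the data is no newer than the target month and at most max_gap_months older.

    max_gap_months is an assumed tolerance; choose one that fits the task.
    """
    return 0 <= month_gap(data_date, target_date) <= max_gap_months


# The example from the text: June 2024 data used for an August 2025 question.
gap = month_gap(date(2024, 6, 1), date(2025, 8, 1))
print(gap)  # 14
print(is_temporally_aligned(date(2024, 6, 1), date(2025, 8, 1)))  # False
```

Stating the computed gap alongside the answer makes it obvious when a source is too old to support the claim.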
Step 3: Access Live Leaderboard Data
When web pages don't render properly (interactive charts, JavaScript-heavy pages):
Look for raw data endpoints:
- Check for /api/ or /data/ endpoints
- Search for JSON files in the page source
- Look for GitHub repositories backing the leaderboard
Try alternative access methods:
- HuggingFace Spaces often have Gradio APIs
- Many leaderboards publish CSV/JSON exports
- Check GitHub issues/discussions for data access tips
Search for data repositories:
site:github.com [leaderboard name] results json
site:huggingface.co [benchmark name] leaderboard
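When a page fetch returns only an HTML shell, scanning the source for quoted strings that look like data endpoints is one way to apply the hints above. The following is a rough sketch using only the standard library; the regular expression and the sample HTML are illustrative assumptions, and real pages may load their data in ways this heuristic misses.

```python
import re

# Quoted strings containing /api/, /data/, .json, or .csv (a heuristic, not a parser).
DATA_URL_PATTERN = re.compile(
    r"""["']([^"'\s<>]*(?:/api/|/data/|\.json|\.csv)[^"'\s<>]*)["']""",
    re.IGNORECASE,
)


def find_data_endpoints(page_source: str) -> list[str]:
    """Return candidate data URLs found in the page source, in order, without duplicates."""
    found: list[str] = []
    for match in DATA_URL_PATTERN.finditer(page_source):
        url = match.group(1)
        if url not in found:
            found.append(url)
    return found


# Hypothetical page source for illustration only.
sample = (
    '<script src="/static/app.js"></script>'
    '<a href="/data/results.json">download</a>'
    '<script>fetch("/api/leaderboard")</script>'
)
print(find_data_endpoints(sample))
```

Each candidate still needs to be fetched and inspected; a match only shows that the string looks like a data path.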
Step 4: Validate Model Eligibility
Do not make assumptions about which models "count" on a leaderboard:
- Check official leaderboard criteria: some leaderboards include API models, some don't
- Verify the answer format requirements against actual leaderboard entries
- Do not exclude models based on assumptions about what can be represented in a given format
- Consider all model types: open-source, API-based, fine-tuned variants
Verification Strategies
Cross-Reference Multiple Sources
- Compare results from at least 2-3 independent sources
- If sources disagree, prioritize the most recent authoritative source
- Document discrepancies and their potential causes
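The cross-referencing rule can be sketched as a small reconciliation step. The `SourceReport` type, its fields, and the placeholder model names are hypothetical and exist only for illustration; the ordering (authoritative first, then most recent) follows the guidance above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SourceReport:
    name: str            # label for the source (hypothetical)
    top_model: str       # model this source reports as top-ranked
    as_of: date          # date the source's data reflects
    authoritative: bool  # True for official leaderboard or maintainer data


def reconcile(reports: list[SourceReport]) -> tuple[SourceReport, list[str]]:
    """Pick the most recent authoritative report and list the sources that disagree with it."""
    if not reports:
        raise ValueError("at least one source report is required")
    # Tuples compare element-wise: authoritative sources win, then the later date.
    best = max(reports, key=lambda r: (r.authoritative, r.as_of))
    discrepancies = [
        f"{r.name} reports {r.top_model} as of {r.as_of.isoformat()}"
        for r in reports
        if r.top_model != best.top_model
    ]
    return best, discrepancies


reports = [
    SourceReport("paper", "model-a", date(2024, 6, 1), authoritative=False),
    SourceReport("official leaderboard", "model-b", date(2025, 8, 1), authoritative=True),
    SourceReport("blog post", "model-b", date(2025, 7, 1), authoritative=False),
]
best, notes = reconcile(reports)
print(best.top_model)  # model-b
print(notes)           # ['paper reports model-a as of 2024-06-01']
```

Keeping the discrepancy notes makes it easier to explain in the final answer why one source was preferred.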
Sanity Check Results
- Verify the model actually appears on the leaderboard
- Confirm the model name/organization format matches the source
- Check if the model was released before the specified date
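These three checks can be expressed as one function that returns a list of problems, with an empty list meaning the candidate passed. This is a sketch under assumptions: the function name, the case-insensitive fallback for detecting format mismatches, and the example names are all illustrative.

```python
from datetime import date


def sanity_check(
    candidate: str,
    leaderboard_names: list[str],
    release_date: date,
    cutoff: date,
) -> list[str]:
    """Return a list of problems with the candidate answer; empty means all checks passed."""
    problems: list[str] = []
    if candidate not in leaderboard_names:
        # Distinguish a formatting mismatch from a model that is missing entirely.
        by_lower = {name.lower(): name for name in leaderboard_names}
        listed = by_lower.get(candidate.lower())
        if listed is not None:
            problems.append(f"name format differs: leaderboard lists '{listed}'")
        else:
            problems.append("model does not appear on the leaderboard")
    if release_date > cutoff:
        problems.append("model was released after the cutoff date")
    return problems


# Hypothetical leaderboard entries for illustration only.
names = ["Example-Org/Example-Model", "other-org/other-model"]
print(sanity_check("example-org/example-model", names, date(2025, 5, 1), date(2025, 8, 31)))
```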
Test Alternative Access Methods
When primary access fails:
- Try the Wayback Machine for historical snapshots
- Search for leaderboard maintainer announcements
- Look for community discussions about recent changes
- Check if there's a programmatic API
Common Pitfalls to Avoid
1. Relying on Outdated Academic Papers
Academic papers have publication delays of 3-12 months. A paper published in June 2024 contains data from early 2024 at best. Never use paper results for questions about current standings.
2. Giving Up When Web Scraping Fails
Interactive leaderboards often don't render in simple web fetches. Always try:
- Looking for underlying data files
- Checking GitHub repositories
- Finding API endpoints
- Searching for data exports
3. Making Assumptions About Model Format
Do not assume API models (OpenAI, Cohere, etc.) cannot be valid answers. Check the actual task requirements and leaderboard contents.
4. Premature Conclusion Without Verification
Before writing a final answer:
- Verify the model appears on the actual leaderboard
- Confirm the ranking is current
- Check that the model meets all task requirements
5. Ignoring Temporal Requirements
If a task asks about a specific date, ensure data sources reflect that timeframe. A 14-month gap between data and required date is unacceptable.
Systematic Search Strategy
When searching for leaderboard information:
Start broad, then narrow:
[benchmark name] leaderboard 2025
[benchmark name] top models current
site:huggingface.co [benchmark name]
Search for raw data:
[benchmark name] results github
[benchmark name] json data
[benchmark name] api
Search for recent updates:
[benchmark name] new top model [current year]
[benchmark name] leaderboard update
Avoid repetitive similar queries: if a query pattern isn't working after 2-3 attempts, change the approach rather than making minor variations.
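The search strategy can be sketched as grouped query templates with a cap on attempts per group, so that a failing pattern is abandoned instead of being varied endlessly. The template names, the `build_queries` helper, and the default cap of 3 are illustrative assumptions.

```python
# Query templates grouped by the three search patterns described above.
QUERY_TEMPLATES: dict[str, list[str]] = {
    "broad": [
        "{name} leaderboard {year}",
        "{name} top models current",
        "site:huggingface.co {name}",
    ],
    "raw_data": [
        "{name} results github",
        "{name} json data",
        "{name} api",
    ],
    "recent": [
        "{name} new top model {year}",
        "{name} leaderboard update",
    ],
}


def build_queries(name: str, year: int, max_per_pattern: int = 3) -> list[tuple[str, str]]:
    """Return (pattern, query) pairs, using at most max_per_pattern queries per pattern."""
    queries: list[tuple[str, str]] = []
    for pattern, templates in QUERY_TEMPLATES.items():
        for template in templates[:max_per_pattern]:
            queries.append((pattern, template.format(name=name, year=year)))
    return queries


for pattern, query in build_queries("MTEB", 2025):
    print(f"{pattern}: {query}")
```

Capping each group keeps the search moving to a different approach once a pattern has been tried a few times.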
Output Checklist
Before submitting an answer, verify:
- Data source is current (not outdated paper)
- Model appears on the actual leaderboard
- Temporal requirements are met
- Model format matches requirements
- No unvalidated assumptions were made
- Answer was cross-referenced where possible
Overview
This skill provides practical guidance for querying machine learning model leaderboards and benchmarks such as MTEB, HuggingFace leaderboards, and embedding benchmarks. It focuses on finding top-performing models, comparing standings across sources, and handling time-sensitive queries while avoiding common pitfalls with outdated or interactive data sources.
How this skill works
First, identify authoritative sources (official leaderboard pages, GitHub data, or API/JSON endpoints). Then verify temporal alignment by checking last-updated timestamps and cross-referencing multiple sources. When pages are interactive or JavaScript-driven, locate raw data endpoints, CSV/JSON exports, or repository artifacts to extract canonical results. Finally, validate model eligibility and record any discrepancies.
When to use it
- Finding the highest-ranked models on MTEB or other embedding leaderboards
- Answering questions about leaderboard standings as of a specific date
- Comparing model performance across multiple benchmarks
- Retrieving live or recently updated leaderboard data
- Resolving conflicts between paper results and live leaderboards
Best practices
- Prefer primary sources: official leaderboard pages, APIs, or GitHub-hosted raw data
- Always check 'last updated' timestamps and be explicit about temporal gaps
- Cross-reference at least 2-3 independent sources and prioritize the most recent authoritative one
- Search for raw JSON/CSV endpoints or GitHub repositories before relying on rendered pages
- Verify model eligibility and exact name formatting on the actual leaderboard
Example use cases
- User asks: 'Which model ranked first on MTEB as of August 2025?' Locate the leaderboard snapshot or API data with that date and report the top model plus the data source
- Comparing two models across MTEB and a Scandinavian embedding benchmark to identify consistent top performers
- Extracting CSV/JSON from a HuggingFace Spaces backend when the web UI is interactive or fails to render
- Validating whether an API-based model qualifies on a benchmark that lists only open-source entries
FAQ
What if the leaderboard page is interactive and fetches data via JavaScript?
Look for underlying API endpoints, JSON/CSV exports, or the project's GitHub repo. Many Spaces or Gradio apps expose programmatic endpoints.
Can I use academic papers to answer current leaderboard queries?
Not reliably. Papers are often mon