name: bigdata-skill
description: >-
Pull Bigdata.com (RavenPack) financial and news data via the official
bigdata-client SDK and /v1/* REST endpoints — structured financials,
prices, analyst estimates, daily entity-sentiment series, annotated chunk
search, screener — when the Bigdata MCP returns only pre-synthesized tearsheets
but you need the machine-readable substrate. Use when the user mentions
Bigdata.com, RavenPack, a bd_v2_ key, the bigdata MCP, rp_entity_id,
chunk/query_unit cost, or wants structured financials, fundamentals, prices,
sentiment, or annotated news.
Bigdata.com SDK + REST Toolkit
Get the structured substrate the Bigdata.com MCP server doesn't hand over. The
MCP returns clean prose and pre-synthesized tearsheets, but its search tool
gives chunks with no per-chunk sentiment or entity spans, and its tearsheets
give aggregate values — not the fiscal-period time series, universe screener, or
per-field JSON you'd build a pipeline on. The official bigdata-client SDK, plus
a thin REST passthrough (same backend, same JWT), reaches the official
/v1/* endpoints that hold it. This skill bundles a toolkit that does exactly
that, already debugged and already cost-guarded, so you don't re-pay the
discovery cost.
The core problem this solves (read this first)
The Bigdata MCP server answers "what's the sentiment around NVIDIA?" with a readable paragraph or a pre-synthesized tearsheet — genuinely useful for a chat turn. But the moment you need the machine-readable substrate to build a pipeline on, the MCP doesn't hand it over:
- its search tool returns chunks with text + relevance only — no per-chunk sentiment number, no entity character spans;
- its tearsheets give aggregate values (a single sentiment score, a summary of estimates) — not a fiscal-period time series you can compute on, a universe screener, or per-field JSON.
The fix is a general pattern, not a Bigdata trick:
When an MCP data source returns only synthesized output but you need the structured fields underneath, drop to the vendor SDK or REST. MCP optimizes for a chat turn, not a pipeline.
Crucially, for Bigdata these structured fields are official, publicly
documented REST endpoints (docs.bigdata.com/api-reference/...), not a hidden
backend — and Bigdata is sunsetting the SDK (EOL 2026-12-31) in favour of this
REST API, so the REST layer here is the forward-compatible path, not a hack.
The SDK (bigdata_client.Bigdata) covers search + knowledge-graph; bd._api.http
reaches every /v1/* endpoint the SDK never wrapped. The bundled
bigdata_toolkit packages both behind one BigdataClient.
When to use this skill
Trigger on any of these, in any language:
- The user is using Bigdata.com / RavenPack and the MCP result feels thin — "where's the sentiment score?", "I need entity-level data", "the calendar".
- They want forward / structured financials for a ticker: analyst estimates, earnings or event calendar, earnings surprise, analyst ratings, price targets, a company screener / universe.
- They want annotated news chunks with numeric sentiment + entity spans, or a sentiment time series / co-mention graph.
- They mention a bd_v2_ API key, rp_entity_id, query_unit / chunk cost, bigdata-client, or "the bigdata MCP isn't enough".
- They're building an investment-research dataset and need a reusable, cost-aware data-pull layer rather than one-off MCP calls.
Setup (one time)
1 — API key (never hardcode it). The client fail-fasts if it's missing:
export BIGDATA_API_KEY=bd_v2_xxxxxxxx
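A minimal sketch of the fail-fast behaviour described in step 1, assuming only what the text states: the key is read from BIGDATA_API_KEY and there is no fallback. The helper name load_api_key and the error wording are hypothetical; the bundled BigdataClient may implement its check differently.

```python
import os


def load_api_key(environ=None):
    """Return the API key from BIGDATA_API_KEY or raise immediately.

    Hypothetical helper for illustration; not the toolkit's actual code.
    """
    env = os.environ if environ is None else environ
    key = env.get("BIGDATA_API_KEY", "").strip()
    if not key:
        # Fail fast: no hardcoded or plaintext fallback.
        raise RuntimeError("BIGDATA_API_KEY is not set; export it before running")
    return key
```

Passing a mapping instead of relying on os.environ keeps the check easy to exercise in a test without touching the real environment.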
2 — An isolated Python env with the official SDK. The bundled toolkit
imports bigdata_client; install it once:
uv venv .venv --python 3.12
uv pip install --python .venv/bin/python bigdata-client
# Behind a slow/blocked PyPI (e.g. mainland China) add a mirror, and unset any
# outbound proxy for the install step so uv reaches the index directly:
# --index-url https://pypi.tuna.tsinghua.edu.cn/simple
3 — Outbound proxy (only if your network needs one to reach
api.bigdata.com). Two equivalent options — the official SDK accepts both: an
env var, or BigdataClient(proxy=...) in code. The env var is simplest:
export HTTPS_PROXY=http://<host>:<port> # plus WSS_PROXY for chat/WebSocket
If a proxy does TLS interception (self-signed CA) and you hit SSL handshake
errors, the official fix is BigdataClient(verify_ssl="<proxy-CA>.pem") — not
blind retries.
4 — Make the bundled package importable by putting this skill's scripts/
on PYTHONPATH (or sys.path.insert(0, "<this-skill>/scripts")).
Smoke-test the whole path (entity resolve + quota are free; --with-search
adds one ~1 query_unit chunk search):
BIGDATA_API_KEY=bd_v2_xxx PYTHONPATH=scripts .venv/bin/python scripts/probe_example.py
Quickstart
import sys
sys.path.insert(0, "<this-skill>/scripts") # so `import bigdata_toolkit` resolves
from bigdata_toolkit import (
BigdataClient, EntityResolver, AnnotatedSearcher,
StructuredDataREST, CostTracker, CostModel, rc, # rc = SSL-retry wrapper
)
c = BigdataClient() # SDK + REST escape hatch, one object
er = EntityResolver(c)
nvda = rc(lambda: er.resolve_id("NVIDIA", country="US")) # -> 'E09E2B' (rp_entity_id is the gateway key)
# --- Structured financials the MCP does NOT expose (REST escape hatch) ---
rest = StructuredDataREST(c)
est = rc(lambda: rest.analyst_estimates(nvda, period="quarter", limit=5)) # forward consensus
surp = rc(lambda: rest.latest_surprise(nvda)) # last EPS/revenue surprise
cal = rc(lambda: rest.events_calendar(nvda, categories=["earnings-call"],
start_date="2026-06-01", end_date="2026-12-31"))
# --- Annotated chunks the MCP STRIPS: sentiment + entity spans (cost-guarded) ---
s = AnnotatedSearcher(c)
docs = rc(lambda: s.search_entity(nvda, keyword="data center", chunk_limit=10))
# each chunk dict: {"sentiment": float, "entities": [{"key": rp_id, "start", "end"}], "text", ...}
# --- Always know your spend (chunk-billed; see Cost discipline) ---
ct = CostTracker(c); ct.snapshot()
# ... run a batch ...
print(ct.delta()) # {'delta_chunks':..., 'delta_query_units':..., 'usd_fast':...}
Wrap every network call in rc(lambda: ...) — a first-handshake SSL: UNEXPECTED_EOF is common and the SDK's internal retry doesn't cover it.
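The internals of rc() are not shown here, so the following is only a sketch of the general technique it stands for: retry a callable when the TLS handshake or the connection fails, and re-raise once the attempts run out. The name retry_call and the parameters attempts, base_delay, and sleep are assumptions made for this example.

```python
import ssl
import time


def retry_call(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); retry on SSL or connection errors with linear backoff.

    Hypothetical stand-in for the toolkit's rc(); the real wrapper may differ.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except (ssl.SSLError, ConnectionError):
            if attempt == attempts:
                raise  # out of retries: surface the original error
            sleep(base_delay * attempt)


# Usage: a fake call that fails once with an SSL EOF, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ssl.SSLEOFError("UNEXPECTED_EOF")
    return "ok"


result = retry_call(flaky, sleep=lambda _s: None)  # skip real sleeping in the demo
```

Taking a zero-argument callable is what makes the rc(lambda: ...) call shape necessary: the wrapper has to be able to invoke the request again from scratch.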
Routing — which capability answers the question
| The user wants… | Use | Module |
|---|---|---|
| Company name / ISIN / CUSIP / SEDOL → rp_entity_id | EntityResolver.resolve_id / .resolve_by_isin | kg.py (SDK) |
| Forward analyst consensus (revenue/EPS by fiscal period) | StructuredDataREST.analyst_estimates | rest_ext.py |
| Latest earnings surprise (actual vs estimate) | .latest_surprise | rest_ext.py |
| Upcoming earnings / event calendar (one name or whole market) | .events_calendar | rest_ext.py |
| Analyst ratings / price-target consensus | .analyst_ratings / .price_target | rest_ext.py |
| Full financial statements (income / balance / cash-flow, multi-year) | .income_statement / .balance_sheet / .cash_flow_statement | rest_ext.py |
| TTM valuation metrics & ratios (EV/EBITDA, ROE, P/E, margins) | .key_metrics_ttm / .company_ratios_ttm | rest_ext.py |
| Company profile (CEO, sector, employees, IPO date) | .company_profile | rest_ext.py |
| Daily OHLC prices / dividend history | .daily_prices / .dividends | rest_ext.py |
| Revenue by geography / product segment | .revenue_geographic_segments / .revenue_product_segments | rest_ext.py |
| Daily entity-sentiment time series (don't self-aggregate from chunks!) | .entity_sentiment | rest_ext.py |
| Co-mention graph (supply-chain / competitor / customer; ⚠️ chunk-billed) | .connected_entities | rest_ext.py |
| Build a universe by market-cap / sector / country | .company_screener | rest_ext.py |
| News/filing/transcript chunks with sentiment + entity spans | AnnotatedSearcher.search_entity | search.py (SDK) |
| Bulk-pull many searches 50% cheaper (portfolio backfill) | BatchSearch (create→upload→poll→download) | rest_ext.py |
| Track / forecast quota spend before a backfill | CostTracker / CostModel | cost.py |
| Hit an endpoint the toolkit hasn't wrapped yet | client.http.post("v1/<resource>/query", body) | client.py |
income/balance/cash-flow/daily-prices/dividends/revenue-segments return {fields, values}: wrap them in fields_values_to_records() to get [{field: value}]. The *_ttm / company_profile endpoints are already flat. All structured endpoints above are free (0 chunks) except connected_entities and AnnotatedSearcher (chunk-billed).
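The reshaping from {fields, values} to a list of records can be sketched as follows. This assumes the payload is a dict with a "fields" list of column names and a "values" list of rows in the same column order; that shape is inferred from the wording above, not confirmed, and the function name carries a _sketch suffix to keep it distinct from the toolkit's own fields_values_to_records().

```python
def fields_values_to_records_sketch(payload):
    """Zip a {"fields": [...], "values": [[...], ...]} payload into row dicts.

    Assumed shape for illustration; check the real response before relying on it.
    """
    fields = payload["fields"]
    records = []
    for row in payload["values"]:
        if len(row) != len(fields):
            raise ValueError(f"row has {len(row)} values, expected {len(fields)}")
        records.append(dict(zip(fields, row)))
    return records


# Made-up sample values, only to show the reshaping.
sample = {
    "fields": ["date", "close"],
    "values": [["2026-01-02", 101.5], ["2026-01-05", 103.0]],
}
records = fields_values_to_records_sketch(sample)
```

The length check is a design choice for the sketch: a ragged row raises instead of being silently truncated by zip().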
The two data faces (do NOT say "Bigdata fails for Chinese / A-shares")
This split is the most important non-obvious conclusion — state it precisely:
| Face | Path | A-share / Chinese verdict |
|---|---|---|
| Structured financial (estimates, calendar, surprise, ratings, target, screener, financials, prices, dividends, revenue segments, daily entity-sentiment) | REST (rest_ext.py) | Works, via rp_entity_id resolved from the English name or ISIN (not the Chinese name). Data is fresh. Minor holes (some A-share price-targets return the entity with no numeric target). The daily entity_sentiment series lives here and works for any resolvable entity; it is not the dead end below. |
| Unstructured Chinese NLP (Chinese-news entity detection, per-chunk Chinese sentiment) | SDK search (search.py) | Dead end. This is a data-source-level gap, not an SDK bug: Chinese entity detection ≈ 0, per-chunk CJK sentiment is a doc-level inherited value, and language mislabels Chinese filings as English. Pair Bigdata with a China-domestic source for Chinese-language chunk content; use Bigdata for the structured face (incl. aggregate entity_sentiment) + ISIN/KG crosswalk + English-language chunk sentiment. |
Cost discipline
1 query_unit = 10 chunks (official). Only chunk-search is billed — the
structured /v1/* endpoints (estimates, financials, prices, calendar, surprise,
ratings, the sentiment time series, screener…) are free (0 chunks,
contract-tested). connected_entities (co-mentions) and AnnotatedSearcher
are chunk-billed.
Three levers when you do pay for chunks:
- ChunkLimit, never a bare int. Search.run(int) is a document limit billed by the full chunk page; ChunkLimit(n) bills per chunk. AnnotatedSearcher.search forces ChunkLimit for you. (We observed roughly a 52x gap once: a single measured data point, not stated in the official docs; treat the exact multiple as indicative. The rule "use ChunkLimit" holds regardless, because max_chunks is the official billing unit.)
- Rerank bills only the returned chunks (official): pass a rerank_threshold to recall broadly but pay only for the high-relevance hits.
- Batch search is 50% cheaper ($0.0075 vs $0.015 / qu): use BatchSearch for a large multi-query backfill.
Use CostModel to veto an over-budget job before running it, and
CostTracker.snapshot() / delta() to measure real spend. Full accounting →
references/cost_accounting.md.
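The arithmetic behind a budget veto can be sketched with the figures quoted in this section (10 chunks per query_unit, $0.015 per query_unit for standard search, $0.0075 for batch). The function names are hypothetical and this is not the bundled CostModel; it also assumes partial query_units are billed pro rata, which the text does not state.

```python
CHUNKS_PER_QUERY_UNIT = 10   # from the text: 1 query_unit = 10 chunks
USD_PER_QU_FAST = 0.015      # from the text: standard search price per query_unit
USD_PER_QU_BATCH = 0.0075    # from the text: batch search price per query_unit


def estimate_usd(chunks, batch=False):
    """Estimate spend for a number of billed chunks.

    Assumes fractional query_units are billed pro rata; the rounding rule is
    not stated in the text, so treat the result as an approximation.
    """
    if chunks < 0:
        raise ValueError("chunks must be non-negative")
    query_units = chunks / CHUNKS_PER_QUERY_UNIT
    rate = USD_PER_QU_BATCH if batch else USD_PER_QU_FAST
    return query_units * rate


def check_budget(chunks, budget_usd, batch=False):
    """Raise before running if the estimate exceeds the budget (a simple veto)."""
    cost = estimate_usd(chunks, batch=batch)
    if cost > budget_usd:
        raise RuntimeError(f"estimated ${cost:.2f} exceeds budget ${budget_usd:.2f}")
    return cost


fast_cost = estimate_usd(1000)               # 100 query_units at the standard rate
batch_cost = estimate_usd(1000, batch=True)  # same volume at the batch rate
```

Under these assumptions a 1,000-chunk pull comes to about $1.50 at the standard rate and about $0.75 through batch search, which is the 50% saving mentioned above.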
Known pitfalls (already solved — don't re-debug these)
Each cost real debugging time and is fixed or guarded in the toolkit. Full
reproductions and fixes in references/known_pitfalls.md:
- First-handshake SSL: UNEXPECTED_EOF → wrap calls in rc(); the SDK's urllib3 retry only covers HTTP status, not the SSL EOF.
- All(entity, Keyword(kw)) raises TypeError → combine with the & operator (entity & Keyword(kw)); All takes a single iterable. (Fixed in AnnotatedSearcher.entity_query.)
- The 52x doc-limit billing trap → always ChunkLimit, never a bare int.
- Closure capture in loops → bind loop vars: rc(lambda q=q, dr=dr: ...).
- analyst_estimates(period="quarter") 400s above limit≈20.
- company_screener filters must nest under "filters"; flat top-level keys don't 400, they're silently dropped → unfiltered universe.
- Document.reporting_period is always None (the SDK model drops a field present on the REST wire) → fetch_reporting_period_raw.
What this skill will not do
- Never hardcode an API key. BigdataClient reads BIGDATA_API_KEY and fail-fasts if absent; there is no plaintext fallback (that is exactly the pattern secret scanners catch).
- Only ever reads; never writes or uploads. Every method is a read-only query (uploads is NotImplementedError in API-key mode anyway), so the toolkit can't mutate your account or push data anywhere.
- Never invent an endpoint or a schema. Every signature here is runtime L4-verified or marked L3 (doc-confirmed, not yet run); see references/verified_api_signatures.md. For a new endpoint, confirm the path via docs.bigdata.com/llms.txt rather than guessing.
File layout
bigdata-skill/
├── SKILL.md # this file — routing + setup + quickstart
├── scripts/
│ ├── bigdata_toolkit/ # the verified, cost-guarded package
│ │ ├── client.py # BigdataClient: SDK (.bd) + REST escape hatch (.http/.conn)
│ │ ├── kg.py # EntityResolver: name/ISIN/CUSIP/SEDOL → rp_entity_id
│ │ ├── search.py # AnnotatedSearcher: chunks + sentiment + entity spans (SDK)
│ │ ├── rest_ext.py # StructuredDataREST (estimates/financials/prices/dividends/sentiment/co-mentions/screener) + BatchSearch + fields_values_to_records — official REST
│ │ ├── cost.py # CostTracker + CostModel: chunk billing + budget veto
│ │ └── retry.py # rc(): SSL/transient-error retry passthrough
│ └── probe_example.py # runnable end-to-end smoke test
└── references/
├── escape_hatch_architecture.md # WHY the MCP is lossy; bd._api.http mechanism; adding endpoints
├── verified_api_signatures.md # L4/L3-verified signatures + the two data faces, with evidence
├── cost_accounting.md # chunk billing, the 52x trap, CostModel/CostTracker, budgeting
└── known_pitfalls.md # every pitfall above, with reproduction + fix
References
| Read when you need to… | File |
|---|---|
| Understand why the MCP is insufficient and how the REST escape hatch works (and how to wrap a new /v1/* endpoint) | references/escape_hatch_architecture.md |
| Look up an exact verified method signature + its verification level | references/verified_api_signatures.md |
| Budget a backfill or debug a surprise quota burn | references/cost_accounting.md |
| Diagnose an error you hit while pulling data | references/known_pitfalls.md |