name: sql-workflow description: "Use this skill before writing any SQL query. Covers: output shape inference (cardinality clues from the question), efficient schema exploration, iterative CTE-based query building, structured verification loop (row count, NULL audit, fan-out check, sample inspection), error recovery protocol, saving output to result.sql and result.csv, turn budget management, and common benchmark traps." type: skill
SQL Workflow Skill
0. Load Knowledge Base Context FIRST
Project-specific conventions, decisions, and quirks live in the Knowledge Base. Always consult before exploring the schema.
Step 0a — get_knowledge
Call once at the start of every task with a 1-line task_description. Returns the always-loaded baseline (org/project understanding + conventions) plus up to 5 task-relevant decisions/debugging/quirks. Treat the returned ## title blocks as authoritative for naming, joins, and known traps.
Step 0b — search_knowledge(query=...)
Call when you hit something unexpected — column meaning unclear, ambiguous join, surprising row count. Pass a 2–4 word query. It is a pure read — no side effects.
Step 0c — propose_knowledge
Call ONLY after you have completed work and verified a finding. Use it to record:
category="decisions"for choices made (auto-accepted).category="debugging"for root-cause traps you hit and resolved (auto-accepted).category="quirks"(scope=connection) for connector/dialect oddities (auto-accepted).- Do NOT propose
understanding— humans only. Do NOT proposeconventionsordomain-rulesas part of automated runs unless explicitly asked (these queue for human review). - Title must be a slug (
^[a-z0-9-]+$, ≤120 chars). Body is markdown. - On duplicate-title: re-call with
overwrite=trueonly if the prior doc is genuinely outdated.
What NOT to do
- Do not paste raw KB text back into SQL comments — reference the doc title instead.
- Do not call
propose_knowledgemid-exploration — only after success.
1. Schema Exploration — Do This First
Before writing any SQL, load KB context (Phase 0 above) then understand the data:
- Read local schema files first (if schema/ directory exists in workdir):
schema/DDL.csv— all CREATE TABLE statements (if it exists)schema/{table_name}.json— column names, types, descriptions, sample values Reading these files costs zero tool calls and gives you table structure + sample data. Only call MCP tools for information not in the local files (e.g., row counts, live data exploration).
- Call
list_tablesto get all schemas and tables — only if no local schema files exist or you need row counts. - Call
describe_tableon the tables that seem relevant to the question (only if JSON files lack detail) - Call
explore_columnon categorical columns to see distinct values (for filtering/grouping) - Call
find_join_pathif you need to join tables and the relationship is unclear
Stop exploring after 3-5 tool calls. Write SQL based on what you've found.
2. Output Shape Inference — Before Writing SQL
Read the task question carefully for cardinality clues:
- "for each X" → GROUP BY X, one output row per X
- "top N" / "top 5" → LIMIT N or QUALIFY RANK() <= N
- "total / sum / average" → single row aggregate
- "list all" → detail rows, no aggregation
- "how many" → COUNT, result is 1 row 1 column
Write a comment at the top of your SQL:
-- EXPECTED: <row count estimate> rows because <reason from question>
Critical checks:
- If the question asks for a single number, the result MUST be 1 row × 1 column
- If the question says "how many", verify the CSV has exactly 1 row with a COUNT value
- If "top N" appears in the question, verify the CSV has at most N rows
3. Iterative Query Building — Build Bottom-Up
Do NOT write a 50-line query and run it all at once:
- Write the innermost subquery or first CTE first
- Run it standalone with
query_database— verify row count and sample values - Add the next CTE, verify again
- Continue until the full query is built
Example incremental pattern:
-- Step 1: verify source
SELECT COUNT(*) FROM orders WHERE status = 'completed';
-- Step 2: verify join partner cardinality
SELECT COUNT(*), COUNT(DISTINCT customer_id) FROM orders;
-- Step 3: build first CTE, verify
WITH order_totals AS (
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
)
SELECT COUNT(*), COUNT(DISTINCT customer_id) FROM order_totals;
-- Step 4: add final aggregation
4. Execution and Structured Verification
mcp__signalpilot__query_database
connection_name="<task_connection_name>"
sql="SELECT ..."
After executing, run these checks IN ORDER before saving:
- Row count sanity: Does 0 rows make sense? Does 1M rows make sense for a "top 10" question?
- Column count: Does the result have the right number of columns for the question?
- NULL audit: For each key column — unexpected NULLs indicate wrong JOINs:
SELECT COUNT(*) - COUNT(col) AS nulls FROM (your_query) t - Sample inspection: Look at 5 rows — are values in expected ranges? Do string columns have meaningful values (not join keys)?
- Fan-out check: If JOINing, compare
COUNT(*)vsCOUNT(DISTINCT primary_key):
If they differ, you have duplicate rows from a fan-out JOIN.SELECT COUNT(*) AS total_rows, COUNT(DISTINCT <pk>) AS unique_keys FROM (your_query) t; - Re-read the question: Does your output actually answer what was asked?
5. Error Recovery Protocol
- Syntax error: Use
validate_sqlbeforequery_databaseto catch errors without burning a query turn - Wrong results: Do NOT just re-run the same query. Diagnose: which JOIN is wrong? Which filter is too aggressive?
- Zero rows: Binary-search your WHERE conditions — remove them one at a time to find the culprit:
SELECT COUNT(*) FROM table WHERE cond_1; -- still same? keep it SELECT COUNT(*) FROM table WHERE cond_1 AND cond_2; -- drops? cond_2 is culprit - Too many rows: Check for fan-out (duplicate join keys) or missing GROUP BY
- CTE debugging: Use
debug_cte_queryto run each CTE independently and find which step breaks
6. Saving Output
Once you have the correct result:
Write final SQL to
result.sql:Write tool: path="result.sql", content="<your SQL query>"Write the result as CSV to
result.csv:Write tool: path="result.csv", content="col1,col2,...\nval1,val2,..."- Always include a header row with column names
- Use comma as delimiter
- Quote string values that contain commas or newlines
7. Turn Budget Management
- First 3 turns: Schema exploration only (
schema_overview,describe_tableon 2-3 tables,explore_columnon key categorical columns). STOP exploring. - Turns 4 through (N-3): Write query iteratively — execute and verify each step.
- Last 3 turns: Finalize
result.sqlandresult.csv. If you have a working query, SAVE IT NOW — do not keep iterating.
If your query works and passes all verification checks, SAVE IMMEDIATELY — do not continue exploring "just in case".
8. Common Benchmark Traps
- Rounding: Do NOT round unless the question explicitly asks for rounded values. The evaluator uses tolerance-based comparison — full precision is always safer.
- Column naming: Match the question's phrasing exactly. If the question says "total revenue", name the column
total_revenue, notsum_revenueorrevenue_total. - CSV format: No trailing newline, no BOM, comma delimiter, double-quote strings containing commas.
- Empty result: If the correct answer is 0 or empty, write a CSV with just the header row (or header + "0").
- Date/time format in CSV: Use ISO 8601 (
YYYY-MM-DD) unless the question specifies otherwise. - String case in CSV: Preserve the case from the database — do not uppercase/lowercase unless the question explicitly asks.
- Fan-out from JOINs: Always check
COUNT(*) vs COUNT(DISTINCT key)after every JOIN - Wrong NULL handling: Use
IS NULL/IS NOT NULL, not= NULL - Date format mismatch: Check the actual format stored in the column with
explore_column - Case sensitivity: Use the correct case-insensitive function for your backend
- Interpretation errors: Before saving, re-read the original question. Verify:
- Filter conditions match domain values (check with explore_column if unsure)
- "Excluding X" means the right thing (NOT IN vs EXCEPT vs WHERE NOT)
- Metrics match domain definitions (e.g., "scored points" in F1 = points > 0, not just participated)