name: query
description: >
Run SQL queries against registered tables or ad-hoc against files using
datafusion-cli. Accepts raw SQL or natural language questions. Supports
Parquet, CSV, JSON, and Arrow IPC files.
argument-hint: [--file path]
allowed-tools: Bash
You are helping the user query data using Apache DataFusion.
Input: $@
Follow these steps in order.
Step 1 — Resolve state and determine the mode
Look for an existing state file:
STATE_DIR=""
test -f .datafusion-skills/state.sql && STATE_DIR=".datafusion-skills"
PROJECT_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo "$PWD")"
PROJECT_ID="$(echo "$PROJECT_ROOT" | tr '/' '-')"
test -f "$HOME/.datafusion-skills/$PROJECT_ID/state.sql" && STATE_DIR="$HOME/.datafusion-skills/$PROJECT_ID"
If found, verify the state file is valid:
datafusion-cli --file "$STATE_DIR/state.sql" -c "SELECT 1;"
Now determine the mode:
- Ad-hoc mode if: the
--fileflag is present, or the SQL references file paths (e.g.FROM 'data.parquet'), orSTATE_DIRis empty. - Session mode if:
STATE_DIRis set and the input references table names, is natural language, or is SQL without file references.
If no state file exists and no file is referenced, fall back to ad-hoc mode.
Step 2 — Check datafusion-cli is installed
command -v datafusion-cli
If not found, delegate to /datafusion-skills:install-datafusion and then continue.
Step 3 — Generate SQL if needed
If the input is natural language (not valid SQL), generate SQL.
In session mode, first retrieve the schema to inform query generation:
datafusion-cli --file "$STATE_DIR/state.sql" -c "SHOW TABLES;"
Then for relevant tables:
datafusion-cli --file "$STATE_DIR/state.sql" -c "DESCRIBE <table_name>;"
DataFusion SQL reference
When generating SQL, keep these DataFusion-specific features in mind:
- File queries:
SELECT * FROM 'path/to/file.parquet'works for Parquet files directly - CREATE EXTERNAL TABLE: Register any file as a table
CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION 'data.parquet'; CREATE EXTERNAL TABLE t STORED AS CSV LOCATION 'data.csv' OPTIONS ('has_header' 'true'); CREATE EXTERNAL TABLE t STORED AS JSON LOCATION 'data.json'; CREATE EXTERNAL TABLE t STORED AS ARROW LOCATION 'data.arrow'; CREATE EXTERNAL TABLE t STORED AS AVRO LOCATION 'data.avro'; - COPY TO: Export query results to files
COPY (SELECT * FROM t WHERE x > 10) TO 'output.parquet' STORED AS PARQUET; - Window functions: Full support for
ROW_NUMBER,RANK,LAG,LEAD, etc. - CTEs:
WITHclauses are fully supported - UNNEST: Expand arrays into rows
- Aggregate functions:
COUNT,SUM,AVG,MIN,MAX,MEDIAN,APPROX_PERCENTILE_CONT, etc. - String functions:
LENGTH,LOWER,UPPER,TRIM,SUBSTR,REGEXP_MATCH, etc. - Date/time:
NOW(),DATE_TRUNC,DATE_PART,INTERVAL,TO_TIMESTAMP, etc. - EXPLAIN: Use
EXPLAINorEXPLAIN ANALYZEto see query plans
Step 4 — Estimate result size
Before executing, check if the query might produce too many rows.
If the query has no LIMIT and might return a large result, wrap it:
SELECT COUNT(*) AS total_rows FROM (<original_query>);
If total_rows > 1000, add LIMIT 100 and inform the user.
Step 5 — Execute the query
Important: datafusion-cli -c only accepts one SQL statement per flag. Use multiple
-c flags for multiple statements, or write a .sql file and use --file.
Session mode:
datafusion-cli --file "$STATE_DIR/state.sql" -c "<SQL_QUERY>"
Ad-hoc mode with file:
DataFusion v44+ can query Parquet, CSV, and JSON files directly by path:
datafusion-cli -c "SELECT * FROM '<file_path>' LIMIT 10;"
For Arrow IPC or Avro, create a temporary external table using a .sql file:
cat > /tmp/_df_query.sql << SQL
CREATE EXTERNAL TABLE _tmp STORED AS ARROW LOCATION '<file_path>';
<SQL_QUERY using _tmp>;
SQL
datafusion-cli --file /tmp/_df_query.sql
Ad-hoc mode without file:
datafusion-cli -c "<SQL_QUERY>"
Step 6 — Handle errors
datafusion-cli: command not found→ invoke/datafusion-skills:install-datafusionand retry- File not found → check the path, suggest using absolute path
- Schema error → run
DESCRIBEon the table/file and show the user available columns - Unsupported file format → suggest creating an external table with explicit format
- Persistent or unclear error → use
/datafusion-skills:datafusion-docs <error keywords>to search documentation
Step 7 — Present results
Present the query results clearly. If the result is a table, format it nicely.
Suggest next steps if appropriate:
- For exploratory queries → suggest more specific filters or aggregations
- For ad-hoc file queries → suggest
/datafusion-skills:create-tableto register the file for repeated use - For complex pipelines → suggest
/datafusion-skills:materialized-viewto persist results
Cross-skill integration
- Session state: If a
state.sqlexists, tables registered by/datafusion-skills:create-tableare automatically available - Error troubleshooting: Use
/datafusion-skills:datafusion-docsfor persistent errors - Performance: Suggest
/datafusion-skills:explain-planfor slow queries