query - SKILL.md Agent Skill

name: query description: > Run SQL queries against registered tables or ad-hoc against files using datafusion-cli. Accepts raw SQL or natural language questions. Supports Parquet, CSV, JSON, and Arrow IPC files. argument-hint: [--file path] allowed-tools: Bash

You are helping the user query data using Apache DataFusion.

Input: $@

Follow these steps in order.

Step 1 — Resolve state and determine the mode

Look for an existing state file:

STATE_DIR=""
test -f .datafusion-skills/state.sql && STATE_DIR=".datafusion-skills"
PROJECT_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo "$PWD")"
PROJECT_ID="$(echo "$PROJECT_ROOT" | tr '/' '-')"
test -f "$HOME/.datafusion-skills/$PROJECT_ID/state.sql" && STATE_DIR="$HOME/.datafusion-skills/$PROJECT_ID"

If found, verify the state file is valid:

datafusion-cli --file "$STATE_DIR/state.sql" -c "SELECT 1;"

Now determine the mode:

Ad-hoc mode if: the --file flag is present, or the SQL references file paths (e.g. FROM 'data.parquet'), or STATE_DIR is empty.
Session mode if: STATE_DIR is set and the input references table names, is natural language, or is SQL without file references.

If no state file exists and no file is referenced, fall back to ad-hoc mode.

Step 2 — Check datafusion-cli is installed

command -v datafusion-cli

If not found, delegate to /datafusion-skills:install-datafusion and then continue.

Step 3 — Generate SQL if needed

If the input is natural language (not valid SQL), generate SQL.

In session mode, first retrieve the schema to inform query generation:

datafusion-cli --file "$STATE_DIR/state.sql" -c "SHOW TABLES;"

Then for relevant tables:

datafusion-cli --file "$STATE_DIR/state.sql" -c "DESCRIBE <table_name>;"

DataFusion SQL reference

When generating SQL, keep these DataFusion-specific features in mind:

File queries: SELECT * FROM 'path/to/file.parquet' works for Parquet files directly

CREATE EXTERNAL TABLE: Register any file as a table

CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION 'data.parquet';
CREATE EXTERNAL TABLE t STORED AS CSV LOCATION 'data.csv' OPTIONS ('has_header' 'true');
CREATE EXTERNAL TABLE t STORED AS JSON LOCATION 'data.json';
CREATE EXTERNAL TABLE t STORED AS ARROW LOCATION 'data.arrow';
CREATE EXTERNAL TABLE t STORED AS AVRO LOCATION 'data.avro';

COPY TO: Export query results to files

COPY (SELECT * FROM t WHERE x > 10) TO 'output.parquet' STORED AS PARQUET;

Window functions: Full support for ROW_NUMBER, RANK, LAG, LEAD, etc.
CTEs: WITH clauses are fully supported
UNNEST: Expand arrays into rows
Aggregate functions: COUNT, SUM, AVG, MIN, MAX, MEDIAN, APPROX_PERCENTILE_CONT, etc.
String functions: LENGTH, LOWER, UPPER, TRIM, SUBSTR, REGEXP_MATCH, etc.
Date/time: NOW(), DATE_TRUNC, DATE_PART, INTERVAL, TO_TIMESTAMP, etc.
EXPLAIN: Use EXPLAIN or EXPLAIN ANALYZE to see query plans

Step 4 — Estimate result size

Before executing, check if the query might produce too many rows.

If the query has no LIMIT and might return a large result, wrap it:

SELECT COUNT(*) AS total_rows FROM (<original_query>);

If total_rows > 1000, add LIMIT 100 and inform the user.

Step 5 — Execute the query

Important: datafusion-cli -c only accepts one SQL statement per flag. Use multiple -c flags for multiple statements, or write a .sql file and use --file.

Session mode:

datafusion-cli --file "$STATE_DIR/state.sql" -c "<SQL_QUERY>"

Ad-hoc mode with file:

DataFusion v44+ can query Parquet, CSV, and JSON files directly by path:

datafusion-cli -c "SELECT * FROM '<file_path>' LIMIT 10;"

For Arrow IPC or Avro, create a temporary external table using a .sql file:

cat > /tmp/_df_query.sql << SQL
CREATE EXTERNAL TABLE _tmp STORED AS ARROW LOCATION '<file_path>';
<SQL_QUERY using _tmp>;
SQL
datafusion-cli --file /tmp/_df_query.sql

Ad-hoc mode without file:

datafusion-cli -c "<SQL_QUERY>"

Step 6 — Handle errors

datafusion-cli: command not found → invoke /datafusion-skills:install-datafusion and retry
File not found → check the path, suggest using absolute path
Schema error → run DESCRIBE on the table/file and show the user available columns
Unsupported file format → suggest creating an external table with explicit format
Persistent or unclear error → use /datafusion-skills:datafusion-docs <error keywords> to search documentation

Step 7 — Present results

Present the query results clearly. If the result is a table, format it nicely.

Suggest next steps if appropriate:

For exploratory queries → suggest more specific filters or aggregations
For ad-hoc file queries → suggest /datafusion-skills:create-table to register the file for repeated use
For complex pipelines → suggest /datafusion-skills:materialized-view to persist results

Cross-skill integration

Session state: If a state.sql exists, tables registered by /datafusion-skills:create-table are automatically available
Error troubleshooting: Use /datafusion-skills:datafusion-docs for persistent errors
Performance: Suggest /datafusion-skills:explain-plan for slow queries