query

star 13

Run SQL queries against registered tables or ad-hoc against files using datafusion-cli. Accepts raw SQL or natural language questions. Supports Parquet, CSV, JSON, and Arrow IPC files.

datafusion-contrib By datafusion-contrib schedule Updated 3/21/2026

name: query description: > Run SQL queries against registered tables or ad-hoc against files using datafusion-cli. Accepts raw SQL or natural language questions. Supports Parquet, CSV, JSON, and Arrow IPC files. argument-hint: [--file path] allowed-tools: Bash

You are helping the user query data using Apache DataFusion.

Input: $@

Follow these steps in order.

Step 1 — Resolve state and determine the mode

Look for an existing state file:

STATE_DIR=""
test -f .datafusion-skills/state.sql && STATE_DIR=".datafusion-skills"
PROJECT_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo "$PWD")"
PROJECT_ID="$(echo "$PROJECT_ROOT" | tr '/' '-')"
test -f "$HOME/.datafusion-skills/$PROJECT_ID/state.sql" && STATE_DIR="$HOME/.datafusion-skills/$PROJECT_ID"

If found, verify the state file is valid:

datafusion-cli --file "$STATE_DIR/state.sql" -c "SELECT 1;"

Now determine the mode:

  • Ad-hoc mode if: the --file flag is present, or the SQL references file paths (e.g. FROM 'data.parquet'), or STATE_DIR is empty.
  • Session mode if: STATE_DIR is set and the input references table names, is natural language, or is SQL without file references.

If no state file exists and no file is referenced, fall back to ad-hoc mode.

Step 2 — Check datafusion-cli is installed

command -v datafusion-cli

If not found, delegate to /datafusion-skills:install-datafusion and then continue.

Step 3 — Generate SQL if needed

If the input is natural language (not valid SQL), generate SQL.

In session mode, first retrieve the schema to inform query generation:

datafusion-cli --file "$STATE_DIR/state.sql" -c "SHOW TABLES;"

Then for relevant tables:

datafusion-cli --file "$STATE_DIR/state.sql" -c "DESCRIBE <table_name>;"

DataFusion SQL reference

When generating SQL, keep these DataFusion-specific features in mind:

  • File queries: SELECT * FROM 'path/to/file.parquet' works for Parquet files directly
  • CREATE EXTERNAL TABLE: Register any file as a table
    CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION 'data.parquet';
    CREATE EXTERNAL TABLE t STORED AS CSV LOCATION 'data.csv' OPTIONS ('has_header' 'true');
    CREATE EXTERNAL TABLE t STORED AS JSON LOCATION 'data.json';
    CREATE EXTERNAL TABLE t STORED AS ARROW LOCATION 'data.arrow';
    CREATE EXTERNAL TABLE t STORED AS AVRO LOCATION 'data.avro';
    
  • COPY TO: Export query results to files
    COPY (SELECT * FROM t WHERE x > 10) TO 'output.parquet' STORED AS PARQUET;
    
  • Window functions: Full support for ROW_NUMBER, RANK, LAG, LEAD, etc.
  • CTEs: WITH clauses are fully supported
  • UNNEST: Expand arrays into rows
  • Aggregate functions: COUNT, SUM, AVG, MIN, MAX, MEDIAN, APPROX_PERCENTILE_CONT, etc.
  • String functions: LENGTH, LOWER, UPPER, TRIM, SUBSTR, REGEXP_MATCH, etc.
  • Date/time: NOW(), DATE_TRUNC, DATE_PART, INTERVAL, TO_TIMESTAMP, etc.
  • EXPLAIN: Use EXPLAIN or EXPLAIN ANALYZE to see query plans

Step 4 — Estimate result size

Before executing, check if the query might produce too many rows.

If the query has no LIMIT and might return a large result, wrap it:

SELECT COUNT(*) AS total_rows FROM (<original_query>);

If total_rows > 1000, add LIMIT 100 and inform the user.

Step 5 — Execute the query

Important: datafusion-cli -c only accepts one SQL statement per flag. Use multiple -c flags for multiple statements, or write a .sql file and use --file.

Session mode:

datafusion-cli --file "$STATE_DIR/state.sql" -c "<SQL_QUERY>"

Ad-hoc mode with file:

DataFusion v44+ can query Parquet, CSV, and JSON files directly by path:

datafusion-cli -c "SELECT * FROM '<file_path>' LIMIT 10;"

For Arrow IPC or Avro, create a temporary external table using a .sql file:

cat > /tmp/_df_query.sql << SQL
CREATE EXTERNAL TABLE _tmp STORED AS ARROW LOCATION '<file_path>';
<SQL_QUERY using _tmp>;
SQL
datafusion-cli --file /tmp/_df_query.sql

Ad-hoc mode without file:

datafusion-cli -c "<SQL_QUERY>"

Step 6 — Handle errors

  • datafusion-cli: command not found → invoke /datafusion-skills:install-datafusion and retry
  • File not found → check the path, suggest using absolute path
  • Schema error → run DESCRIBE on the table/file and show the user available columns
  • Unsupported file format → suggest creating an external table with explicit format
  • Persistent or unclear error → use /datafusion-skills:datafusion-docs <error keywords> to search documentation

Step 7 — Present results

Present the query results clearly. If the result is a table, format it nicely.

Suggest next steps if appropriate:

  • For exploratory queries → suggest more specific filters or aggregations
  • For ad-hoc file queries → suggest /datafusion-skills:create-table to register the file for repeated use
  • For complex pipelines → suggest /datafusion-skills:materialized-view to persist results

Cross-skill integration

  • Session state: If a state.sql exists, tables registered by /datafusion-skills:create-table are automatically available
  • Error troubleshooting: Use /datafusion-skills:datafusion-docs for persistent errors
  • Performance: Suggest /datafusion-skills:explain-plan for slow queries
Install via CLI
npx skills add https://github.com/datafusion-contrib/datafusion-skills --skill query
Repository Details
star Stars 13
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator
datafusion-contrib
datafusion-contrib Explore all skills →