read-file


Read and explore data files (Parquet, CSV, JSON, Arrow IPC, Avro) locally or from S3/GCS. Auto-detects format by extension. Uses datafusion-cli for schema inspection and data preview.

By datafusion-contrib. Updated 3/21/2026.

name: read-file
description: >
  Read and explore data files (Parquet, CSV, JSON, Arrow IPC, Avro) locally or from S3/GCS. Auto-detects format by extension. Uses datafusion-cli for schema inspection and data preview.
argument-hint: [question about the data]
allowed-tools: Bash

You are helping the user read and analyze a data file using Apache DataFusion.

Filename given: $0
Question: ${1:-describe the data}

Follow these steps in order, stopping and reporting clearly if any step fails.

Step 1 — Classify and resolve the path

Determine whether the input is local or remote:

  • S3 URI (s3://...) → remote
  • GCS URI (gs://...) → remote
  • HTTPS/HTTP URL → remote (DataFusion supports HTTP via object_store)
  • Otherwise → local file
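The classification above can be sketched as a small shell function. This is only an illustration: `classify_path` is a hypothetical helper name, not part of the skill or of DataFusion, and it looks only at the URI scheme.

```shell
# Hypothetical helper: prints "remote" or "local" for a given path or URI.
# It inspects the scheme prefix only and does not check that the target exists.
classify_path() {
  case "$1" in
    s3://*|gs://*|http://*|https://*) echo "remote" ;;
    *) echo "local" ;;
  esac
}

classify_path "s3://my-bucket/data.parquet"   # prints: remote
classify_path "data/events.csv"               # prints: local
```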

Local files

Search for the file under the current directory, skipping anything inside .git:

find "$PWD" -name "$0" -not -path '*/.git/*' 2>/dev/null
  • Zero results → tell the user the file was not found and stop.
  • More than one result → list all matches, ask the user to re-run with a fuller path, and stop.
  • Exactly one result → use that full path (RESOLVED_PATH).
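The three outcomes above could be handled along these lines. This is a minimal sketch: `resolve_local` and its return codes are assumptions made for illustration, and it assumes file names do not contain newlines.

```shell
# Hypothetical helper: resolves a bare file name under a root directory.
# Prints the single match and returns 0; returns 1 for zero matches, 2 for several.
resolve_local() {
  root="$1"
  name="$2"
  matches=$(find "$root" -name "$name" -not -path '*/.git/*' 2>/dev/null)
  if [ -z "$matches" ]; then
    echo "File not found: $name" >&2
    return 1
  fi
  # Count matches line by line; tr strips the padding some wc implementations add.
  count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
  if [ "$count" -gt 1 ]; then
    echo "Multiple matches; re-run with a fuller path:" >&2
    printf '%s\n' "$matches" >&2
    return 2
  fi
  printf '%s\n' "$matches"
}

# Example (not executed here):
#   RESOLVED_PATH=$(resolve_local "$PWD" "events.parquet") || exit 1
```

Returning distinct codes lets the caller tell "not found" apart from "ambiguous" without parsing the message text.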

Remote files

Use the URI/URL as-is for RESOLVED_PATH.

For S3 access, DataFusion uses environment variables:

  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
  • Or AWS_PROFILE for profile-based credentials

Check if credentials are available:

test -n "$AWS_ACCESS_KEY_ID" || test -n "$AWS_PROFILE" || test -f "$HOME/.aws/credentials"

If not available, inform the user they need to configure AWS credentials.

Step 2 — Check datafusion-cli is installed

command -v datafusion-cli

If not found, delegate to /datafusion-skills:install-datafusion and then continue.

Step 3 — Detect file format and read

Detect format from extension:

| Extension | Format | DataFusion support |
|---|---|---|
| .parquet, .pq | Parquet | Direct query: SELECT * FROM 'file.parquet' |
| .csv, .tsv, .txt | CSV | Direct query: SELECT * FROM 'file.csv' |
| .json, .jsonl, .ndjson | JSON | Direct query: SELECT * FROM 'file.json' |
| .arrow, .ipc, .feather | Arrow IPC | CREATE EXTERNAL TABLE with STORED AS ARROW |
| .avro | Avro | CREATE EXTERNAL TABLE with STORED AS AVRO |
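The extension mapping can be expressed as a `case` statement. In this sketch, `detect_format` is a hypothetical helper, and lowercasing the name first (so that `.CSV` matches `.csv`) is an assumption that the table itself does not state.

```shell
# Hypothetical helper: maps a file name or URI to a format keyword by extension.
detect_format() {
  # Assumption: treat extensions case-insensitively.
  lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$lower" in
    *.parquet|*.pq)          echo "parquet" ;;
    *.csv|*.tsv|*.txt)       echo "csv" ;;
    *.json|*.jsonl|*.ndjson) echo "json" ;;
    *.arrow|*.ipc|*.feather) echo "arrow" ;;
    *.avro)                  echo "avro" ;;
    *)                       echo "unknown" ;;
  esac
}

detect_format "s3://my-bucket/events.ndjson"   # prints: json
detect_format "notes.md"                       # prints: unknown
```

Only the final extension is examined, so a compressed name such as `data.csv.gz` falls through to `unknown` here.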

Important: datafusion-cli -c only accepts one SQL statement per flag. Use multiple -c flags for multiple statements, or write a .sql file and use --file.

For Parquet, CSV, and JSON files (direct query):

DataFusion v44+ supports direct queries on Parquet, CSV, and JSON files by path:

datafusion-cli -c "DESCRIBE 'RESOLVED_PATH';"
datafusion-cli -c "SELECT COUNT(*) AS row_count FROM 'RESOLVED_PATH';"
datafusion-cli -c "SELECT * FROM 'RESOLVED_PATH' LIMIT 10;"

For CSV files with non-standard delimiters or no header, fall back to CREATE EXTERNAL TABLE using a .sql file:

cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS CSV LOCATION 'RESOLVED_PATH' OPTIONS ('has_header' 'false', 'delimiter' '\t');
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql

For Arrow IPC files:

cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS ARROW LOCATION 'RESOLVED_PATH';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql

For Avro files:

cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS AVRO LOCATION 'RESOLVED_PATH';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql
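The Arrow IPC and Avro scripts differ only in the STORED AS keyword, so they could be generated by one function. This is a sketch under stated assumptions: `write_preview_sql` is a hypothetical name, and the path is assumed to contain no single quotes, which would break the SQL string literal.

```shell
# Hypothetical helper: writes a preview script for formats that need
# CREATE EXTERNAL TABLE.
#   $1 = format keyword (ARROW or AVRO)
#   $2 = resolved path or URI (assumed to contain no single quotes)
#   $3 = output .sql file
write_preview_sql() {
  cat > "$3" <<SQL
CREATE EXTERNAL TABLE _preview STORED AS $1 LOCATION '$2';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
}

# Example (not executed here):
#   write_preview_sql AVRO "$RESOLVED_PATH" /tmp/_df_preview.sql
#   datafusion-cli --file /tmp/_df_preview.sql
```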

Unknown format

If the extension doesn't match any known format:

  1. Try Parquet first (most common in data engineering)
  2. If that fails, try CSV with auto-detection
  3. If both fail, report the error and suggest that the user specify the format

Step 4 — Handle errors

  • datafusion-cli: command not found → invoke /datafusion-skills:install-datafusion and retry
  • File not found → double-check the path and suggest using an absolute path
  • Parse error on CSV → try different options: OPTIONS ('has_header' 'false'), or OPTIONS ('delimiter' '\t') for TSV
  • S3 access denied → remind the user to configure AWS credentials
  • Persistent error → use /datafusion-skills:datafusion-docs <error keywords> for help

Step 5 — Answer the question

Using the schema, row count, and sample rows gathered above, answer:

${1:-describe the data: summarize column types, row count, and any notable patterns.}

Be concise but thorough — mention:

  • Number of columns and their types
  • Row count
  • Any notable patterns in the sample (nulls, date ranges, value distributions)

Step 6 — Suggest next steps

After answering, suggest relevant follow-ups:

To query this data further — filter, aggregate, join — use /datafusion-skills:query.

If the file is useful for repeated access:

To register this as a persistent table, run /datafusion-skills:create-table RESOLVED_PATH.

If the data is large and the user might want to materialize a summary:

To persist a summary as a Parquet file, try /datafusion-skills:materialized-view.

Keep suggestions brief and show them only once.

Cross-skill integration

  • Query follow-ups: Suggest /datafusion-skills:query for further exploration
  • Table registration: Suggest /datafusion-skills:create-table for persistent access
  • Error troubleshooting: Use /datafusion-skills:datafusion-docs for unclear errors

Install via CLI

npx skills add https://github.com/datafusion-contrib/datafusion-skills --skill read-file