name: read-file
description: >
  Read and explore data files (Parquet, CSV, JSON, Arrow IPC, Avro) locally
  or from S3/GCS. Auto-detects format by extension. Uses datafusion-cli for
  schema inspection and data preview.
argument-hint: <filename> [question about the data]
allowed-tools: Bash
You are helping the user read and analyze a data file using Apache DataFusion.
Filename given: $0
Question: ${1:-describe the data}
Follow these steps in order, stopping and reporting clearly if any step fails.
Step 1 — Classify and resolve the path
Determine whether the input is local or remote:
- S3 URI (s3://...) → remote
- GCS URI (gs://...) → remote
- HTTPS/HTTP URL → remote (DataFusion supports HTTP via object_store)
- Otherwise → local file
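The classification rule above can be sketched as a small shell function. The name `classify_path` is hypothetical (it is not part of DataFusion), and the sketch only inspects the URI scheme; it does not check that the target exists.

```shell
# classify_path: hypothetical helper, prints "remote" or "local"
# based only on the scheme prefix of its first argument.
classify_path() {
  case "$1" in
    s3://*|gs://*|http://*|https://*) echo "remote" ;;
    *)                                echo "local" ;;
  esac
}

classify_path "s3://example-bucket/events.parquet"
classify_path "data/events.parquet"
```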
Local files
find "$PWD" -name "$0" -not -path '*/.git/*' 2>/dev/null
- Zero results → tell the user the file was not found and stop.
- More than one result → list all matches, ask the user to re-run with a fuller path, and stop.
- Exactly one result → use that full path (RESOLVED_PATH).
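The three outcomes above (no match, several matches, exactly one) could be handled along these lines. This is a minimal sketch: `resolve_local` and its return codes are hypothetical, and it assumes file names do not contain newlines.

```shell
# resolve_local: hypothetical wrapper around the find command above.
#   $1 = file name to look for, $2 = search root (defaults to $PWD)
# Prints the single match on stdout and returns 0.
# Returns 1 when nothing matches, 2 when the name is ambiguous.
resolve_local() {
  name=$1
  root=${2:-$PWD}
  matches=$(find "$root" -name "$name" -not -path '*/.git/*' 2>/dev/null)
  if [ -z "$matches" ]; then
    echo "File not found: $name" >&2
    return 1
  fi
  # Count result lines; tr strips the padding some wc implementations add.
  count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
  if [ "$count" -gt 1 ]; then
    echo "Multiple matches; re-run with a fuller path:" >&2
    printf '%s\n' "$matches" >&2
    return 2
  fi
  printf '%s\n' "$matches"
}
```

A caller might then write `RESOLVED_PATH=$(resolve_local "events.parquet")` and stop if the status is non-zero. Note that `find -name` matches base names only, so an input that already contains a directory component would need `-path` instead.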
Remote files
Use the URI/URL as-is for RESOLVED_PATH.
For S3 access, DataFusion uses environment variables:
- AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
- Or AWS_PROFILE for profile-based credentials
Check if credentials are available:
test -n "$AWS_ACCESS_KEY_ID" || test -n "$AWS_PROFILE" || test -f "$HOME/.aws/credentials"
If not available, inform the user they need to configure AWS credentials.
Step 2 — Check datafusion-cli is installed
command -v datafusion-cli
If not found, delegate to /datafusion-skills:install-datafusion and then continue.
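As a sketch of how this check might branch, the snippet below wraps `command -v` in a hypothetical `require_tool` helper and prints a reminder when the binary is missing. The wording of the messages is illustrative only.

```shell
# require_tool: hypothetical helper, returns 0 if the named command is on PATH.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "$1 not found on PATH" >&2
  return 1
}

if require_tool datafusion-cli; then
  # Show which version will be used for the steps that follow.
  datafusion-cli --version
else
  echo "Next: run /datafusion-skills:install-datafusion, then continue." >&2
fi
```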
Step 3 — Detect file format and read
Detect format from extension:
| Extension | Format | DataFusion support |
|---|---|---|
| .parquet, .pq | Parquet | Direct query: SELECT * FROM 'file.parquet' |
| .csv, .tsv, .txt | CSV | Direct query: SELECT * FROM 'file.csv' |
| .json, .jsonl, .ndjson | JSON | Direct query: SELECT * FROM 'file.json' |
| .arrow, .ipc, .feather | Arrow IPC | CREATE EXTERNAL TABLE with STORED AS ARROW |
| .avro | Avro | CREATE EXTERNAL TABLE with STORED AS AVRO |
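The extension table can be expressed as a lookup function. This is a sketch under assumptions: `detect_format` and its output labels are hypothetical, and lowercasing the extension (so that `.PARQUET` also matches) is an addition not stated in the table.

```shell
# detect_format: hypothetical helper mapping a path's extension to a format label.
detect_format() {
  # Take the text after the last dot and lowercase it.
  # A name with no dot, or a compressed suffix such as .gz, falls through to "unknown".
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    parquet|pq)         echo "parquet" ;;
    csv|tsv|txt)        echo "csv" ;;
    json|jsonl|ndjson)  echo "json" ;;
    arrow|ipc|feather)  echo "arrow" ;;
    avro)               echo "avro" ;;
    *)                  echo "unknown" ;;
  esac
}
```

The first three labels correspond to the direct-query path, while "arrow" and "avro" indicate that a CREATE EXTERNAL TABLE statement is needed.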
Important: datafusion-cli -c only accepts one SQL statement per flag. Use multiple
-c flags for multiple statements, or write a .sql file and use --file.
For Parquet, CSV, and JSON files (direct query):
DataFusion v44+ supports direct queries on Parquet, CSV, and JSON files by path:
datafusion-cli -c "DESCRIBE 'RESOLVED_PATH';"
datafusion-cli -c "SELECT COUNT(*) AS row_count FROM 'RESOLVED_PATH';"
datafusion-cli -c "SELECT * FROM 'RESOLVED_PATH' LIMIT 10;"
For CSV files with non-standard delimiters or no header, fall back to CREATE EXTERNAL TABLE
using a .sql file:
cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS CSV LOCATION 'RESOLVED_PATH' OPTIONS ('has_header' 'false', 'delimiter' '\t');
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql
For Arrow IPC files:
cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS ARROW LOCATION 'RESOLVED_PATH';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql
For Avro files:
cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS AVRO LOCATION 'RESOLVED_PATH';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql
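Because the heredoc delimiter in the templates above is quoted ('SQL'), the shell expands nothing inside them, so RESOLVED_PATH is a literal placeholder that has to be replaced with the real path before running. One way to do that is sketched below; `write_preview_sql` is a hypothetical helper, and it covers only the templates without an OPTIONS clause.

```shell
# write_preview_sql: hypothetical helper.
#   $1 = resolved path or URI, $2 = format keyword (for example ARROW or AVRO)
# Writes the preview statements to a fresh file and prints that file's name.
write_preview_sql() {
  # Double any single quote so the path remains a valid SQL string literal.
  escaped=$(printf '%s' "$1" | sed "s/'/''/g")
  sql_dir=$(mktemp -d)
  sql_file="$sql_dir/preview.sql"
  # Unquoted delimiter: the shell substitutes $2 and $escaped inside the heredoc.
  cat > "$sql_file" <<SQL
CREATE EXTERNAL TABLE _preview STORED AS $2 LOCATION '$escaped';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
  printf '%s\n' "$sql_file"
}
```

The generated file can then be passed to the CLI, for example `datafusion-cli --file "$(write_preview_sql "$RESOLVED_PATH" ARROW)"`. Using `mktemp -d` instead of a fixed /tmp name is a design choice that avoids collisions between concurrent runs.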
Unknown format
If the extension doesn't match any known format:
- Try Parquet first (most common in data engineering)
- Then try CSV with auto-detection
- Report the error and suggest the user specify the format
Step 4 — Handle errors
- datafusion-cli: command not found → invoke /datafusion-skills:install-datafusion and retry
- File not found → double-check the path, suggest using an absolute path
- Parse error on CSV → try different options: OPTIONS ('has_header' 'false'), or OPTIONS ('delimiter' '\t') for TSV
- S3 access denied → remind the user to configure AWS credentials
- Persistent error → use /datafusion-skills:datafusion-docs <error keywords> for help
Step 5 — Answer the question
Using the schema, row count, and sample rows gathered above, answer:
${1:-describe the data: summarize column types, row count, and any notable patterns.}
Be concise but thorough — mention:
- Number of columns and their types
- Row count
- Any notable patterns in the sample (nulls, date ranges, value distributions)
Step 6 — Suggest next steps
After answering, suggest relevant follow-ups:
To query this data further — filter, aggregate, join — use
/datafusion-skills:query.
If the file is useful for repeated access:
To register this as a persistent table, run
/datafusion-skills:create-table RESOLVED_PATH.
If the data is large and the user might want to materialize a summary:
To persist a summary as a Parquet file, try
/datafusion-skills:materialized-view.
Keep suggestions brief and show them only once.
Cross-skill integration
- Query follow-ups: Suggest /datafusion-skills:query for further exploration
- Table registration: Suggest /datafusion-skills:create-table for persistent access
- Error troubleshooting: Use /datafusion-skills:datafusion-docs for unclear errors