read-file - SKILL.md Agent Skill

name: read-file
description: >
  Read and explore data files (Parquet, CSV, JSON, Arrow IPC, Avro) locally or from S3/GCS.
  Auto-detects format by extension. Uses datafusion-cli for schema inspection and data preview.
argument-hint: [question about the data]
allowed-tools: Bash

You are helping the user read and analyze a data file using Apache DataFusion.

Filename given: $0
Question: ${1:-describe the data}

Follow these steps in order, stopping and reporting clearly if any step fails.

Step 1 — Classify and resolve the path

Determine whether the input is local or remote:

S3 URI (s3://...) → remote
GCS URI (gs://...) → remote
HTTPS/HTTP URL → remote (DataFusion supports HTTP via object_store)
Otherwise → local file
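The classification above can be sketched as a single `case` statement. This is a minimal illustration, not part of the skill itself: the function name `classify_path` and the bucket name are hypothetical, and scheme matching is case-sensitive.

```shell
# Classify an input as "remote" or "local" from its URI scheme.
classify_path() {
  case "$1" in
    s3://*|gs://*|http://*|https://*) echo remote ;;
    *) echo local ;;
  esac
}

classify_path "s3://my-bucket/events.parquet"   # prints: remote
classify_path "data/events.parquet"             # prints: local
```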

Local files

find "$PWD" -name "$0" -not -path '*/.git/*' 2>/dev/null

Zero results → tell the user the file was not found and stop.
More than one result → list all matches, ask the user to re-run with a fuller path, and stop.
Exactly one result → use that full path (RESOLVED_PATH).
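One way to apply the three outcomes above is to wrap the `find` call in a small function. The sketch below is an assumption-laden illustration: `resolve_local` is a hypothetical name, the optional second argument (search root) is added only to make it easy to try out, and file names are assumed to contain no newlines.

```shell
# Resolve a bare file name under a root directory (default: $PWD).
# Prints the single match and returns 0; otherwise explains why and returns 1.
resolve_local() {
  local name=$1
  local root=${2:-$PWD}
  local matches count
  matches=$(find "$root" -name "$name" -not -path '*/.git/*' 2>/dev/null)
  if [ -z "$matches" ]; then
    echo "File not found: $name" >&2
    return 1
  fi
  # Count matches, one per line; tr strips the padding some wc versions add.
  count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
  if [ "$count" -gt 1 ]; then
    echo "Multiple matches; re-run with a fuller path:" >&2
    printf '%s\n' "$matches" >&2
    return 1
  fi
  printf '%s\n' "$matches"
}
```

With this in place, `RESOLVED_PATH=$(resolve_local "$0") || exit 1` would stop on the zero-match and multi-match cases.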

Remote files

Use the URI/URL as-is for RESOLVED_PATH.

For S3 access, DataFusion uses environment variables:

AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
Or AWS_PROFILE for profile-based credentials

Check if credentials are available:

test -n "$AWS_ACCESS_KEY_ID" || test -n "$AWS_PROFILE" || test -f "$HOME/.aws/credentials"

If not available, inform the user they need to configure AWS credentials.

Step 2 — Check datafusion-cli is installed

command -v datafusion-cli

If not found, delegate to /datafusion-skills:install-datafusion and then continue.

Step 3 — Detect file format and read

Detect format from extension:

Extension	Format	DataFusion support
`.parquet`, `.pq`	Parquet	Direct query: `SELECT * FROM 'file.parquet'`
`.csv`, `.tsv`, `.txt`	CSV	Direct query: `SELECT * FROM 'file.csv'`
`.json`, `.jsonl`, `.ndjson`	JSON	Direct query: `SELECT * FROM 'file.json'`
`.arrow`, `.ipc`, `.feather`	Arrow IPC	`CREATE EXTERNAL TABLE` with `STORED AS ARROW`
`.avro`	Avro	`CREATE EXTERNAL TABLE` with `STORED AS AVRO`
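The extension table can be expressed as a lookup function. This is a sketch only: `detect_format` is a hypothetical helper, and it treats everything after the last dot as the extension, so a name without a dot falls through to `unknown` in the usual case.

```shell
# Map a path's extension to one of the formats in the table above.
detect_format() {
  local ext
  # Lowercase the text after the last dot so ".PARQUET" also matches.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    parquet|pq)         echo parquet ;;
    csv|tsv|txt)        echo csv ;;
    json|jsonl|ndjson)  echo json ;;
    arrow|ipc|feather)  echo arrow ;;
    avro)               echo avro ;;
    *)                  echo unknown ;;
  esac
}

detect_format "events.PQ"      # prints: parquet
detect_format "logs.ndjson"    # prints: json
```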

Important: datafusion-cli -c only accepts one SQL statement per flag. Use multiple -c flags for multiple statements, or write a .sql file and use --file.

For Parquet, CSV, and JSON files (direct query):

DataFusion v44+ supports direct queries on Parquet, CSV, and JSON files by path:

datafusion-cli -c "DESCRIBE 'RESOLVED_PATH';"

datafusion-cli -c "SELECT COUNT(*) AS row_count FROM 'RESOLVED_PATH';"

datafusion-cli -c "SELECT * FROM 'RESOLVED_PATH' LIMIT 10;"
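Because each `-c` flag carries one statement, the three commands above could also be issued in a single invocation. The following is a sketch under that assumption; `preview_direct` and the `DF_BIN` override are hypothetical names introduced here (the override only exists so the function can be dry-run with a stand-in command).

```shell
# Run the three preview statements in one datafusion-cli invocation.
# DF_BIN is a hypothetical override; it defaults to datafusion-cli.
preview_direct() {
  local path=$1
  "${DF_BIN:-datafusion-cli}" \
    -c "DESCRIBE '$path';" \
    -c "SELECT COUNT(*) AS row_count FROM '$path';" \
    -c "SELECT * FROM '$path' LIMIT 10;"
}
```

Calling `preview_direct "$RESOLVED_PATH"` would then produce the schema, row count, and sample in one pass. Paths containing a single quote would need escaping first.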

For CSV files with non-standard delimiters or no header, fall back to CREATE EXTERNAL TABLE using a .sql file, keeping only the OPTIONS that apply (the example assumes a tab-delimited file without a header):

cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS CSV LOCATION 'RESOLVED_PATH' OPTIONS ('has_header' 'false', 'delimiter' '\t');
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql

For Arrow IPC files:

cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS ARROW LOCATION 'RESOLVED_PATH';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql

For Avro files:

cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS AVRO LOCATION 'RESOLVED_PATH';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql
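The CREATE EXTERNAL TABLE scripts above differ only in the STORED AS keyword, so they could be generated by one helper. This is a sketch, not the skill's prescribed method: `preview_sql` is a hypothetical name, and the quote-doubling step is an added safeguard for paths that contain a single quote.

```shell
# Print the preview script for a format that needs CREATE EXTERNAL TABLE.
# $1 = resolved path, $2 = STORED AS keyword (CSV, ARROW, or AVRO).
preview_sql() {
  local escaped
  # Double any single quotes so the path stays a valid SQL string literal.
  escaped=$(printf '%s' "$1" | sed "s/'/''/g")
  cat <<SQL
CREATE EXTERNAL TABLE _preview STORED AS $2 LOCATION '$escaped';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
}
```

A typical use would be `preview_sql "$RESOLVED_PATH" ARROW > /tmp/_df_preview.sql` followed by `datafusion-cli --file /tmp/_df_preview.sql`.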

Unknown format

If the extension doesn't match any known format:

Try Parquet first (most common in data engineering).
If that fails, try CSV with auto-detection.
If both fail, report the error and ask the user to specify the format.
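The fallback order above might be sketched as a loop over candidate formats. Several things here are assumptions rather than documented behavior: that datafusion-cli exits with a non-zero status when a `-c` statement fails, plus the hypothetical names `probe_format`, `DF_BIN`, and `_probe`.

```shell
# Probe an unrecognized file: Parquet first, then CSV.
# Assumes datafusion-cli exits non-zero when a -c statement fails.
# DF_BIN (command override) and the _probe table name are hypothetical.
probe_format() {
  local fmt
  for fmt in PARQUET CSV; do
    if "${DF_BIN:-datafusion-cli}" \
        -c "CREATE EXTERNAL TABLE _probe STORED AS $fmt LOCATION '$1';" \
        -c "SELECT * FROM _probe LIMIT 1;" >/dev/null 2>&1; then
      echo "$fmt"
      return 0
    fi
  done
  echo "Could not read $1 as Parquet or CSV; please specify the format." >&2
  return 1
}
```

If the exit-status assumption does not hold for the installed version, inspecting the captured stderr for an error message would be the alternative.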

Step 4 — Handle errors

datafusion-cli: command not found → invoke /datafusion-skills:install-datafusion and retry
File not found → double-check the path and suggest using an absolute path
Parse error on CSV → try different options: OPTIONS ('has_header' 'false'), or OPTIONS ('delimiter' '\t') for TSV
S3 access denied → remind user to configure AWS credentials
Persistent error → use /datafusion-skills:datafusion-docs <error keywords> for help

Step 5 — Answer the question

Using the schema, row count, and sample rows gathered above, answer:

${1:-describe the data: summarize column types, row count, and any notable patterns.}

Be concise but thorough — mention:

Number of columns and their types
Row count
Any notable patterns in the sample (nulls, date ranges, value distributions)
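When ten sample rows are too few to judge nulls or ranges, a per-column aggregate can back up the summary. The sketch below only builds the SQL text; `column_profile_sql` and the column name in the usage note are hypothetical, it targets files that support direct queries by path, and MIN/MAX are assumed to be meaningful for the column's type.

```shell
# Print a query that counts nulls and finds the min/max of one column.
# $1 = resolved path, $2 = column name (double-quoted to preserve case).
column_profile_sql() {
  cat <<SQL
SELECT COUNT(*) - COUNT("$2") AS null_count,
       MIN("$2") AS min_value,
       MAX("$2") AS max_value
FROM '$1';
SQL
}
```

For example, `datafusion-cli -c "$(column_profile_sql "$RESOLVED_PATH" event_time)"` would report the null count and value range for a column named `event_time`.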

Step 6 — Suggest next steps

After answering, suggest relevant follow-ups:

To query this data further — filter, aggregate, join — use /datafusion-skills:query.

If the file is useful for repeated access:

To register this as a persistent table, run /datafusion-skills:create-table RESOLVED_PATH.

If the data is large and the user might want to materialize a summary:

To persist a summary as a Parquet file, try /datafusion-skills:materialized-view.

Keep suggestions brief and show them only once.

Cross-skill integration

Query follow-ups: Suggest /datafusion-skills:query for further exploration
Table registration: Suggest /datafusion-skills:create-table for persistent access
Error troubleshooting: Use /datafusion-skills:datafusion-docs for unclear errors