cloudflare-r2-logpush-agent - SKILL.md Agent Skill

name: cloudflare-r2-logpush-agent description: "Build an end-to-end Cloudflare HTTP log analytics pipeline in Snowflake with a Cortex Agent. Ingests Logpush data from R2 into a parsed view, semantic view, and queryable agent. Use when: cloudflare logs, R2 logpush, http log analytics, web traffic agent, cloudflare to snowflake. Triggers: cloudflare, logpush, R2, http logs, web analytics agent."

Cloudflare R2 Logpush to Cortex Agent

Build a complete pipeline: Cloudflare Logpush -> R2 -> Snowflake -> Semantic View -> Cortex Agent.

The result is a natural-language queryable agent over your Cloudflare HTTP request logs covering traffic, errors, geography, bots, cache, security/WAF, origin performance, and more.

When NOT to Use

DNS, firewall, or non-HTTP Cloudflare datasets (this skill covers HTTP requests only)
Real-time sub-minute streaming analytics (Logpush is batch-oriented)
Request payload/header inspection (Cloudflare logs scalar metadata, not bodies)

Prerequisites

Active Snowflake connection with ACCOUNTADMIN (or a role with CREATE DATABASE, CREATE AGENT)
Cloudflare Logpush configured to push HTTP request logs to an R2 bucket
R2 API credentials (access key + secret key) for that bucket
A warehouse for queries

Stopping Points

✋ Step 1: Confirm configuration before creating any objects
✋ Step 3: Confirm data loaded and sample rows look correct
✋ Step 5: Confirm semantic view has expected facts/dimensions/metrics
✋ Step 6: Confirm agent returns a meaningful response with data
✋ Step 7: Confirm grants work and scheduled task (if created) is RESUMED

Workflow

Step 1: Gather Configuration

Goal: Collect all parameters needed for the pipeline.

Ask the user for:

R2 credentials:
- R2 endpoint (e.g., abc123.r2.cloudflarestorage.com) — no https:// prefix
- R2 bucket name (e.g., cloudflare-managed-xxxxx)
- R2 access key ID
- R2 secret access key
- Path prefix in the bucket (e.g., 20250101/ or empty for root)
Snowflake naming:
- Database name (e.g., CF_LOGS_DB)
- Schema name (default: HTTP_LOGS)
- Agent name (e.g., MY_SITE_AGENT)
- Warehouse (e.g., COMPUTE_WH)
- Role for agent access grants (e.g., PUBLIC)
Site info:
- Zone name / domain (e.g., example.com) — used in agent instructions

Derive object names from user input:

Raw table: <AGENT_NAME>_HTTP_RAW
Parsed view: <AGENT_NAME>_HTTP_REQUESTS
Semantic view: <AGENT_NAME>_HTTP_ANALYTICS_SEMANTIC
Stage: <AGENT_NAME>_HTTP_STAGE
File format: <AGENT_NAME>_HTTP_JSON_FF

⚠️ STOP: Confirm all configuration values with the user before proceeding.

Step 2: Create Infrastructure

Goal: Create database, schema, file format, external stage, and raw table.

Load references/sql-templates.md for all SQL templates.

Actions:

Run templates 1-4 (database, schema, file format, stage, raw table) with substituted values.
Verify the stage connects to R2:
```
LIST @{{DATABASE}}.{{SCHEMA}}.{{STAGE_NAME}};
```
This should return .gz files. If it fails with access errors, re-check R2 credentials and endpoint.

Troubleshooting:

"Access Denied" → Wrong R2 access key, secret key, or bucket name
"Could not connect" → Wrong endpoint (should be bare hostname, no https://)
No files listed → Wrong path prefix, or Logpush hasn't pushed any files yet

Step 3: Load Data

Goal: Copy NDJSON gzip files from R2 into the raw VARIANT table.

Actions:

Run template 5 (COPY INTO). This is idempotent — Snowflake tracks loaded files for 64 days, so re-running only loads new files.

Verify row count:

SELECT COUNT(*) AS row_count FROM {{DATABASE}}.{{SCHEMA}}.{{RAW_TABLE}};

Spot-check one row:
```
SELECT RAW FROM {{DATABASE}}.{{SCHEMA}}.{{RAW_TABLE}} LIMIT 1;
```
Confirm it contains Cloudflare HTTP request fields (ClientRequestHost, EdgeResponseStatus, etc.).

⚠️ STOP: Confirm row count looks reasonable and sample data is correct.

Step 4: Create Parsed View

Goal: Create a view that extracts all ~75 scalar Cloudflare HTTP request fields from the raw VARIANT into typed columns, plus computed flags.

Actions:

Run template 6 (parsed view). This uses a CTE pattern to avoid SQL column alias referencing issues.

Quick verify:

SELECT COUNT(*) FROM {{DATABASE}}.{{SCHEMA}}.{{VIEW_NAME}};
SELECT * FROM {{DATABASE}}.{{SCHEMA}}.{{VIEW_NAME}} LIMIT 5;

Fields covered (all scalar fields from Cloudflare's HTTP requests dataset):

Timestamps: EDGE_START_TS, EDGE_END_TS, EDGE_HOUR
Request: HOST, URI, PATH, METHOD, PROTOCOL, SCHEME, REFERER, REQUEST_BYTES
Client: CLIENT_IP, COUNTRY, CITY, REGION_CODE, DEVICE_TYPE, USER_AGENT, CLIENT_ASN
Edge response: STATUS, EDGE_RESPONSE_BYTES, TTFB_MS, CONTENT_TYPE, EDGE_COLO_CODE
Cache: CACHE_STATUS, CACHE_RESPONSE_BYTES, CACHE_RESERVE_USED
Origin: ORIGIN_IP, ORIGIN_STATUS, ORIGIN_RESPONSE_TIME_MS, ORIGIN_DNS_MS, ORIGIN_TCP/TLS_HANDSHAKE_MS
Bot detection: BOT_SCORE, BOT_SCORE_SRC, VERIFIED_BOT_CATEGORY, JA3_HASH, JA4
Security/WAF: SECURITY_ACTION, WAF_ATTACK_SCORE, WAF_SQLI_SCORE, WAF_XSS_SCORE, WAF_RCE_SCORE
Workers: WORKER_SCRIPT_NAME, WORKER_CPU_TIME_US, WORKER_WALL_TIME_US
Computed flags: IS_402, IS_4XX, IS_5XX, IS_BOT (score < 30), IS_CACHE_HIT, REQUEST_COUNT (always 1)

Step 5: Create Semantic View

Goal: Create a semantic view with facts, dimensions, and pre-built metrics for the Cortex Agent.

Actions:

Run template 7 (semantic view). Contains:
- 28 facts — numeric columns for aggregation (bytes, latencies, scores, flags)
- 42 dimensions — categorical columns for filtering/grouping
- 16 metrics — pre-built aggregates (total_requests, rate_402, rate_4xx, rate_5xx, bot_rate, cache_hit_rate, avg_ttfb_ms, etc.)
- AI_SQL_GENERATION hint — guides the agent on field semantics (score ranges, status codes, etc.)

Verify:

DESCRIBE SEMANTIC VIEW {{DATABASE}}.{{SCHEMA}}.{{SEMANTIC_VIEW}};

⚠️ STOP: Confirm semantic view has the expected facts/dimensions/metrics.

Step 6: Create Cortex Agent

Goal: Create a Cortex Agent backed by the semantic view.

Actions:

Create the agent using template 8–9 (combined). Run the CREATE AGENT ... FROM SPECIFICATION $$...$$ statement with all placeholders replaced. Use USE ROLE ACCOUNTADMIN (or a role with CREATE AGENT privilege).
Set execution_environment via REST API using template 10.

KNOWN ISSUE: CREATE AGENT FROM SPECIFICATION silently drops execution_environment from tool_resources. Without it, the agent returns error 399504: "missing execution environment." The REST API PUT is the reliable workaround.

The execution_environment must use this nested format:
```
{"type": "warehouse", "warehouse": "COMPUTE_WH"}
```

Test the agent:

SELECT SNOWFLAKE.CORTEX.AGENT('{{DATABASE}}.{{SCHEMA}}.{{AGENT_NAME}}',
  PARSE_JSON('{"messages": [{"role": "user", "content": [{"type": "text", "text": "How many total requests are in the logs?"}]}]}')
);

If this returns error 399504, the execution_environment was not set — re-run template 10.

⚠️ STOP: Confirm agent returns a meaningful response with data.

Step 7: Grant Access and Optional Scheduled Refresh

Goal: Make the agent accessible and optionally set up recurring data loads.

Actions:

Grant access using template 11. Run all GRANT statements for the target role.
(Optional) Create a weekly refresh task using template 12. This runs COPY INTO on a schedule (default: Sundays 6 AM UTC) to keep data current. Because COPY INTO tracks loaded files for 64 days, it only appends new files — no duplicates.
Final verification — ask the agent a question that exercises multiple field categories:
- "What percentage of traffic is from bots?"
- "What is the cache hit rate by country?"
- "Show me the top 10 paths by request count"

⚠️ STOP: Confirm grants work and scheduled task (if created) is RESUMED.

Output

After completing all steps, the user has:

Raw table — VARIANT storage of all Cloudflare HTTP request JSON
Parsed view — ~75 typed columns covering every scalar field plus computed flags
Semantic view — 28 facts, 42 dimensions, 16 metrics with AI hints
Cortex Agent — natural-language queryable over the semantic view
Grants — agent accessible to the specified role
(Optional) Scheduled task — weekly data refresh from R2
Cleanup template — template 13 for tearing down all objects if needed

Gotchas

Stage URL protocol: Uses s3compat:// (not s3://) for Cloudflare R2
execution_environment bug: Must be set via REST API after CREATE AGENT — SQL silently drops it (see template 10)
BOT_SCORE is inverted: 0-99 where lower = more likely bot (< 30 = bot)
WAF scores are inverted: 1-99 where lower = higher attack likelihood