optimizing-clickhouse-and-hogql-queries - SKILL.md Agent Skill

name: optimizing-clickhouse-and-hogql-queries description: Workflow for optimizing ClickHouse and HogQL queries. Use when a HogQL query, query runner, insight, or report is too slow; when a hand-written ClickHouse query (via `sync_execute` or in a migration) is too slow; when ClickHouse times out or hits memory limits; when investigating a slow `system.query_log` row; or when reviewing a proposed HogQL printer change for performance. Covers extracting the ClickHouse SQL (for HogQL queries), common smells (`FROM ... FINAL`, `JSONExtract` over properties, missing skip indexes, self-joins, CTE blow-up), measuring against a real cluster, and applying the fix at the right layer (printer, query runner, or ClickHouse migration). Does NOT cover Postgres / Django ORM / app-database queries; those need pganalyze and the Postgres section of `query-performance-optimization.md`, not this skill.

Optimizing ClickHouse and HogQL queries

Scope: this skill optimizes ClickHouse queries and HogQL queries (which compile to ClickHouse). It does not optimize Postgres / Django ORM / app-database queries. If the slow query you're holding is a Model.objects.filter(...) or any other call against the app DB, stop and use pganalyze + the Postgres section of query-performance-optimization.md instead. Step 0 below has the full triage.

The best way to optimize a HogQL query is to start with the ClickHouse SQL it produces, optimize that, then translate the change back into the HogQL query, the query runner, the HogQL printer, or a ClickHouse migration. Skip ahead to the SQL; reasoning about HogQL alone hides what ClickHouse will actually execute.

This skill assumes you already know how to write HogQL. For writing new ClickHouse-backed queries from scratch, use /writing-clickhouse-queries first. For migration mechanics, use /clickhouse-migrations.

Optimizing every query a team owns

Sometimes the job isn't one slow query — it's "optimize all the ClickHouse/HogQL queries owned by team X." Before diving into individual queries, build the full inventory first, or you'll optimize a subset and miss the rest.

Find the team's owned code — products/*/product.yaml is the source of truth, CODEOWNERS is the backup. Check the two places in this order:

products/*/product.yaml (start here). Each product declares its owning team(s) under owners: as a bare slug — the team's CODEOWNERS handle minus the @PostHog/ prefix (conversations, logs, team-signals, …). Grep every products/*/product.yaml for the team's slug; each hit means the team owns all of products/<name>/**. This is exactly what the reviewer auto-assigner uses, and one team often owns several products — don't stop at the first match. A team whose code all lives under products/ may have nothing in CODEOWNERS, and that's correct, not a gap.
.github/CODEOWNERS-soft (backup, for paths outside products/). Grep for the team's handle (@PostHog/team-surveys, @PostHog/conversations, …). This file covers what product.yaml can't: shared backend (posthog/tasks/, posthog/hogql/database/schema/, posthog/temporal/), frontend scenes not yet moved into a product, and sub-folder overrides. (.github/CODEOWNERS — the hard, blocking file — is mostly infrastructure and rarely names product paths, but a grep there is cheap.)

If the team name you were given doesn't resolve, try both the bare and team--prefixed slug (conversations vs team-conversations) — the convention isn't uniform.

Verify the CODEOWNERS-soft paths exist — they drift; product.yaml doesn't. product.yaml ownership is self-consistent: it sits inside the directory it owns, so products/<name>/** always exists. The staleness is in CODEOWNERS-soft, where a product that moved under products/ often leaves its old paths behind. Check each CODEOWNERS-soft path on disk; for any that's gone, the code most likely relocated into a products/<name>/ the team already owns via product.yaml — flag the stale entry to the operator so they can fix CODEOWNERS-soft, don't silently substitute and move on. A stale path you skip is a query you never optimized.

Search both backend AND frontend — owned paths include both. A team's ownership almost always spans frontend/src/... as well as backend Python. The majority of ClickHouse/HogQL queries are written in Python (query runners, execute_hogql_query, raw sync_execute), but not all of them — plenty of products still build HogQL client-side in kea logics / React / TypeScript and POST it to the /query endpoint. If you only search backend paths and backend idioms, you'll miss these entirely (this is a real, recurring miss). When scoping a team, either search every owned path — frontend included — with both backend and frontend query idioms, or tell the operator up front that you're covering backend only and ask whether they want frontend too. Don't let an unstated "queries live in the backend" assumption narrow the search silently.

Frontend HogQL doesn't look like the backend patterns in Step 0. Grep the team's frontend/ paths for these too:

api.queryHogQL(...), HogQLQueryString, the hogql`...` tagged template
NodeKind.HogQLQuery / kind: 'HogQLQuery' objects with a query: string
structured query nodes that compile to ClickHouse: DataTableNode, EventsQuery, TrendsQuery/InsightVizNode, and PropertyFilterType.HogQL expressions inside them
string literals with SELECT ... FROM events, or product-specific markers (event names like 'survey sent', property keys like $survey_id)

The same logical query is sometimes implemented twice — once in a backend query runner / endpoint and once as frontend-built HogQL (often a stalled frontend→backend migration). Treat both copies as in-scope, and note the duplication: a printer- or function-level fix on the backend won't reach a hand-built frontend string emitting the same SQL.

Step 0: confirm you're at the right layer

Before walking through the workflow, check that the slow query in front of you actually goes to ClickHouse via HogQL. The fastest way is to look at how the query is built:

What you see	Where it goes	What to use
`execute_hogql_query(...)`, `HogQLQuery`, a `*QueryRunner` subclass, an insight, a HogQL `.ambr` snapshot	ClickHouse via the HogQL printer	This skill
`sync_execute(...)`, `client.execute(...)`, hand-written `SELECT ... FROM events`, ClickHouse migration	ClickHouse directly (no HogQL)	This skill (steps 2-5; skip step 1)
`Model.objects.filter(...)`, `Model.objects.raw(...)`, a Django queryset, a `RawSQL` over the app DB	Postgres via Django ORM	Not this skill. Read `docs/published/handbook/engineering/databases/query-performance-optimization.md` (`## PostgreSQL` section), and use pganalyze / AWS RDS Performance Insights for production diagnostics
`personhog_client.`, `get_personhog_client()`, `get_person_by_` helpers	personhog (gRPC, Postgres-backed)	Not this skill. See `posthog/personhog_client/README.md`

If the file you were pointed at is a coordinator, orchestrator, Celery task, Temporal workflow, or management command, the actual ClickHouse-touching query is often one layer further in, in the activity / child workflow / inner function the coordinator dispatches. Follow the dispatch (e.g. the class passed to start_child_workflow, the activity decorated with @temporalio.activity.defn, the function called inside apply_async) until you find the layer that builds the HogQL query. Some files mix both (Postgres queries to pick work + HogQL queries to do it); treat them as two separate optimization tasks if both are slow.

If the slow query turns out to be raw ClickHouse SQL embedded in production code (Python f-strings, string SQL passed to sync_execute, client.execute, client.read_query, etc., not the output of the HogQL printer), flag this to the user up front, then continue with the optimization. HogQL queries get materialized-column substitution, property-group dispatch, lazy joins, team-id guards, and a pile of other optimizations automatically through the printer; raw SQL has to reimplement each of those or live without them. The structural fix is usually to express the query in HogQL and let the printer handle these consistently, but that's a larger change; walk through the rest of this skill normally so the user has both options (local fix now, HogQL move later, or both).

Single-team vs multi-team is the deciding factor for whether raw SQL is even excusable. execute_hogql_query is team-scoped — it always runs against exactly one team and injects the team_id guard for you. So:

A raw query scoped to one team should almost always be HogQL. If the SQL has (or should have) a team_id = X filter and reads a single team's data, there is no reason it's hand-written — it's leaving materialized-column substitution, lazy joins, and the team-id guard on the table. Treat raw single-team sync_execute as a smell in its own right and recommend the HogQL move, not just a local tweak. The clearest tell is a query that hand-rolls a materialized-column lookup (e.g. calling get_materialized_column_for_property(...) with a JSONExtract fallback) — that is the printer's job, reimplemented by hand because the query never goes through it.
A query that legitimately spans multiple teams is exempt. Cross-team / global jobs — periodic enrichment, billing rollups, "find every team where X" scans with no team_id filter or a team_id IN (...) over many teams — can't go through execute_hogql_query, since it scopes to a single team. These are a reasonable place for raw sync_execute. Don't push them toward HogQL; just optimize the raw SQL in place (materialized columns, sort-key prefix, etc.). When such a query does its own materialized-column lookup, that's expected, not a smell.

So when you find raw SQL, first ask "is this one team or many?" One team → recommend HogQL. Many teams → keep it raw and optimize the SQL directly.

INSERT is not an escape hatch from HogQL — the single-team rule still applies to the read half. HogQL has no INSERT statement (intentional design choice), so only the envelope has to be hand-built. The recommended pattern is to construct the SELECT in HogQL, print it, and concatenate it into INSERT INTO <table> <printed_select>. The read half still gets materialization, lazy joins, team-id guards, and everything else the printer does; only the INSERT wrapper is a hand-built string. So a single-team INSERT ... SELECT whose SELECT is a hand-written string (raw column lists, JSONExtract over properties, a manual team_id filter) is the same smell as any other single-team raw query: flag it, and recommend moving the SELECT half to a HogQL-printed query while keeping the INSERT INTO <table> wrapper raw. The fact that the surrounding statement is an INSERT does not make the read exempt. The only genuinely raw-SQL-fine cases are: INSERT ... VALUES of explicit rows assembled in Python (no SELECT to print), multi-team INSERT ... SELECT (same exemption as any multi-team query), ClickHouse migrations, and one-shot operational scripts.

Background: read these once

Start with the handbook context for the conceptual model (these files are the source the public posthog.com/handbook pages are built from):

query-performance-optimization.md covers how to find and fix slow queries across both stores.
hogql-python.md explains the printer pipeline and how to drive HogQL programmatically.
clickhouse-queries-new-products.md is the authoritative guide for designing tables and writing query runners; useful even when you're tuning existing ones.

Then read the table schema files so you know what columns, sort keys, partition keys, and skip indexes already exist. Do not read them line-by-line; skim for ORDER BY, PARTITION BY, INDEX, and materialized column declarations.

Events: posthog/models/event/sql.py
Sessions (v3): posthog/models/raw_sessions/sessions_v3.py (v2 still exists at sessions_v2.py)
Persons: posthog/models/person/sql.py
Person overrides: posthog/models/person_overrides/sql.py
Cohorts (ClickHouse cohortpeople, the materialized membership table): posthog/models/cohort/sql.py. The Postgres Cohort model that stores the cohort definition lives at posthog/models/cohort/cohort.py; queries against that one are a Postgres concern, see step 0.

If the query you're optimizing hits a table not in this list (e.g. precalculated_events, app_metrics2, session_replay_events, log_entries, heatmaps, a product-specific table), find its schema under posthog/models/*/sql.py or in the posthog/clickhouse/migrations/ that created it. Same drill: skim for ORDER BY, PARTITION BY, INDEX, and materialized columns.

And the HogQL side, so you know where to make a change:

HogQL query entry point: posthog/hogql/query.py
ClickHouse printer: posthog/hogql/printer/clickhouse.py
Base printer (shared visit methods): posthog/hogql/printer/base.py
HogQL → SQL helpers: posthog/hogql/printer/utils.py
HogQL function registry: posthog/hogql/functions/ (aggregations are in aggregations.py)
HogQL database schema mappings: posthog/hogql/database/schema/

The materialization system (how property access gets rewritten away from JSONExtract automatically):

Materialized columns registry: ee/clickhouse/materialized_columns/columns.py. get_materialized_columns(table) / get_enabled_materialized_columns(table) return the (property_name, table_column) → MaterializedColumn map for a given table, cached for 15 minutes against the connected ClickHouse.
Property groups (different column strategy, same registry shape): posthog/clickhouse/property_groups.py.
Printer swap point: _get_materialized_property_source_for_property_type() and visit_property_type() in posthog/hogql/printer/base.py (around lines 1260 and 1354). When the printer visits a property access, it asks the registry what's materialized for the current ClickHouse and emits the best form available (direct column read, property group lookup, or fall back to JSONExtract).
ClickHouse-specific dispatch: posthog/hogql/printer/clickhouse.py _get_materialized_property_source_for_property_type override (around line 412).

This is the mechanism behind the test-vs-prod caveat in the JSON smell below: the printer's lookup runs against whatever ClickHouse it's connected to. The test fixture has a sparse set, prod has a dense set, and the same HogQL prints to different SQL in each. By default, assume property access in HogQL queries gets materialized. The exception is hand-written SQL strings in product code that never go through the printer (e.g. temporal activities, migrations, sync_execute callers) and have to do their own materialized-column lookup.

The cluster topology (shards, replicas, ingestion vs data nodes) is in posthog/clickhouse/migrations/CLAUDE.md. Read that before proposing any migration that has to land safely across nodes.

Step 1: get the ClickHouse SQL

Already have the SQL? If you're optimizing a hand-written ClickHouse query (a string passed to sync_execute, client.execute, client.read_query, or sitting inside a migration / activity), you already have the SQL in front of you. Skip to step 2.

For HogQL queries, three ways to get from HogQL to executable ClickHouse SQL; pick whichever is cheapest for the situation:

From Python, call execute_hogql_query() in posthog/hogql/query.py and read response.clickhouse. If you only want the printed SQL without executing, call prepare_and_print_ast(..., dialect="clickhouse") from posthog/hogql/printer/utils.py. For an AST you've already prepared, print_prepared_ast is the lower-level entry point.

From snapshot tests, the .ambr files under posthog/hogql_queries/test/__snapshots__/ (and equivalent test dirs in each product) contain the ClickHouse SQL generated for representative inputs. Search for a snapshot that resembles your shape and read the printed query directly.

From production, the slowest real example is usually more informative than anything you can synthesize. Use /query-clickhouse-via-metabase to hit clusterAllReplicas(posthog, system, query_log) (or posthog.query_log_archive for anything older than ~4 hours; the archive table holds ~22 days). Filter for is_initial_query to avoid the per-shard duplicates ClickHouse logs for every query, plus the usual type = 'QueryFinish' and query_duration_ms > <threshold> filters. query_log_archive has typed lc_* columns (lower-cardinality strings), so prefer it when you can.

Step 2: scan for the common smells

Before reaching for tools, eyeball the SQL for the patterns that account for most slow ClickHouse queries.

The smells below are the view from the SQL: shapes that are bad on sight. When you instead have a specific slow query (usually pulled from production) and need to work backwards from its runtime cost to the cause, references/investigation-playbook.md is the deep dive: pulling the full query, reading bytes vs CPU vs duration, the fuller list of runtime causes (high-cardinality breakdowns, function-wrapped sort keys that defeat granule pruning, ratio-metric double scans), tracing a query back to the product code that issued it, and using EXPLAIN to confirm a hypothesis.

`FROM <table> FINAL`

FROM person FINAL, FROM groups FINAL, FROM cohortpeople FINAL, or any other FINAL on a ReplacingMergeTree / CollapsingMergeTree / AggregatingMergeTree table forces ClickHouse to run an on-the-fly merge across every part it reads, deduplicating to the latest version per sort-key row. It defeats parallel reads, blows up memory, and scales badly with part count. On large tables (person, anything sharded) it is rarely the right answer.

Common rewrites:

argMax per row. Replace SELECT properties FROM person FINAL WHERE team_id = X AND id IN (...) with SELECT argMax(properties, version) FROM person WHERE team_id = X AND id IN (...) GROUP BY id. You get the latest properties without the merge.
LIMIT 1 BY with ORDER BY version DESC. Useful when you want a row per group and the table has a monotonically increasing version column.
Filter before FINAL. If you genuinely need FINAL (rare), make sure the WHERE clause is selective enough on the sort-key prefix that ClickHouse only has to FINAL a small slice of parts.

Worth a mention specific to PostHog: per CLAUDE.md, new code that needs person/group data should go through personhog (get_personhog_client), not raw ClickHouse queries against person / groups. If you find yourself optimizing a raw FROM person FINAL in new code, the right fix is often to move to personhog rather than tune the query.

JSON operations on properties

Any JSONExtractString(properties, ...), JSONExtractFloat(properties, ...), JSONHas(properties, ...), or similar against the raw properties / person_properties / group_properties column is a huge smell. It means ClickHouse has to parse the JSON blob at query time for every row it reads. This holds for both event and person property blobs: reading either as raw JSON can be up to ~100x slower than reading a directly materialized (mat_* / dmat_*) column, and ~10x slower than a property group read.

For any query that goes through the HogQL printer, the fix is mechanical and unconditional: replace every hand-written JSONExtract*(properties, 'X') with HogQL property access properties.X (wrap in toFloat(...) / toInt(...) when you need a non-string type). Convert all of them — do not stop to work out which properties are materialized. The printer path is most HogQL: backend parse_select / execute_hogql_query / *QueryRunner queries, and frontend api.queryHogQL / hogql`...` strings (they POST to /query, which runs the same printer). When the printer visits properties.X it does the materialization lookup against the live ClickHouse and emits the best available form — a directly materialized column, a property group read, a DMAT slot, or a JSONExtract fallback when nothing is materialized. So properties.X is never worse than the hand-written JSONExtract: worst case the printer emits the same JSONExtract; best case you get the materialized fast path, both now and automatically in the future when the column later gets materialized.

Do not try to determine which properties are materialized and convert only those. Reading migration files, the materialized-columns registry, or a DESCRIBE to decide which JSONExtracts are "safe" to convert is the printer's job reimplemented by hand, and it reaches the wrong answer:

Materialization is frequently not created by a migration at all, so scanning migrations misses most of it.
Property groups cover properties that have no dedicated materialized column.
The materialized set differs per environment and changes over time, so any answer you compute is a snapshot that goes stale.

Converting only the subset you could confirm leaves the query inconsistent (some properties as properties.X, sibling properties left as JSONExtract, sometimes forced into a hogql.raw() conditional to keep one of each) for zero benefit, and silently skips every property you couldn't find evidence for. Convert all of them and let the printer decide at print time.

The one exception is raw SQL that never goes through the printer — multi-team sync_execute queries, ClickHouse migrations, and temporal activities that build query strings by hand (see Step 0). There is no printer to do the lookup (and properties.X HogQL syntax isn't available), so those queries do have to reference the materialized column directly. That hand-rolled lookup is correct there, and only there.

For the raw-SQL exception, and as background, we have three materialization strategies. You do not need to consult these to convert a printer-path query:

Directly materialized columns: posthog/clickhouse/materialized_columns.py
Property groups: posthog/clickhouse/property_groups.py
Dynamic materialized columns (DMAT slots, recent): posthog/models/dmat_slot_assignments/ and EVENTS_TABLE_DYNAMICALLY_MATERIALIZED_COLUMNS() in posthog/models/event/sql.py

We are also experimenting with the new ClickHouse JSON data type. Check recent migrations under posthog/clickhouse/migrations/ for the current state.

In test code, if you need a property materialized for the duration of a test (e.g. to assert the printer emits a column read or that a skip index is used), wrap the block in the materialized() context manager from posthog/test/base.py (search for def materialized). It supports create_minmax_index, create_bloom_filter_index, and the lower-case variants.

A JSONExtract you see in a .ambr snapshot or other printed SQL — where the source query already uses properties.X — is just the printer's fallback because the test fixture lacks the materialized column prod has. There is nothing to change in the source, and it is not evidence of a production problem: prod may well emit a materialized column read for the same query. Don't "fix" the snapshot, and don't treat it as a perf bug.

Primary key and skip indexes

Look at the ORDER BY of the table the query reads from, and check the WHERE clause covers a prefix. The events table sort key is (team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid)), so any non-trivial events query should filter on both timestamp and event unless there is a documented reason not to (e.g. cohort calculation that legitimately needs all events).

For skip indexes, the test helpers get_index_from_explain and get_indexes_from_explain in posthog/test/base.py run EXPLAIN PLAN indexes=1, json=1 against the generated SQL and let you assert that a specific index is being considered. Add one of these to a test when you depend on a skip index for performance, otherwise a future change to the printer can silently undo your optimization.

When you find the index isn't being used, common causes are: a nullIf or similar wrapping the materialized column that hides it from the planner, a comparison being printed against a stringified NULL, or the materialized column not existing in the test fixtures (use materialized(..., create_minmax_index=True) etc.).

Self-joins on events

Joining the events table to itself (or any large table to itself) is almost always wrong. Two passes over events is twice the work; with a join predicate you also lose the primary key ordering.

Rewrite to one pass plus conditional aggregation: sumIf(amount, event = 'purchase'), uniqIf(distinct_id, event = 'pageview'), uniqMapIf(properties.utm, 1, event = 'session_start'), etc. If you need correlated rows (e.g. "first event in session before a conversion"), arrayFilter / arrayFirst / window functions over an ordered groupArray are usually faster than a self-join.

If HogQL doesn't expose the conditional aggregation function you need, add it to posthog/hogql/functions/aggregations.py.

CTEs

ClickHouse CTEs (the WITH name AS (SELECT ...) form, not the scalar WITH x AS 1 form) are inlined into the query, not materialized. A CTE referenced twice is executed twice. A CTE referenced inside another CTE that is itself referenced N times multiplies out. This is the single most common cause of "the planner is doing something weird".

Until materialized CTEs ship in our ClickHouse version (check the latest CH release notes for WITH ... AS MATERIALIZED), the workaround is the same as the self-join case: rewrite to a single pass with conditional aggregation, or materialize the intermediate result yourself via a subquery in FROM that ClickHouse is more likely to execute once.

Step 3: run EXPLAIN

ClickHouse EXPLAIN works on a dev instance even without representative data, because most of the planner output (indexes considered, query tree, pipeline) does not need rows to exist. Useful flavors:

EXPLAIN PLAN indexes=1, actions=1, json=1 SELECT ... for primary key and skip index use
EXPLAIN QUERY TREE SELECT ... for the post-analyzer logical tree
EXPLAIN PIPELINE SELECT ... for the processor-level pipeline
EXPLAIN ESTIMATE SELECT ... for per-part row/mark estimates
EXPLAIN SYNTAX SELECT ... for the normalized SQL after parsing

See the ClickHouse EXPLAIN docs for the full option matrix. For the hypothesis-testing technique — EXPLAINing the suspect query and a fixed variant side by side and diffing Granules, ReadType, and Prewhere-vs-primary-key — plus which variants do a small metadata read rather than being entirely free, see references/investigation-playbook.md.

Step 4: measure for real

EXPLAIN tells you what the planner intends. To know whether your rewrite is actually faster, you need to run both versions against representative data and compare.

Local ClickHouse

hogli dev:demo-data (or python manage.py generate_demo_data --help for the underlying flags) seeds the dev ClickHouse with a synthetic dataset. hogli db:ch opens a clickhouse-client against it. Local data is small enough that wall-clock measurements are dominated by noise, so use it for correctness and for EXPLAIN; trust the Test Cluster for timing.

You can experiment locally with new skip indexes, materialized columns, or other schema changes, since local ClickHouse is a single node. Ask the user before adding anything, and remember production is multi-node, so any structural change has to round-trip through /clickhouse-migrations before it lands. After adding a skip index, ALTER TABLE ... MATERIALIZE INDEX ... to build it over existing data.

A useful local proxy for query work is the bytes-read figure (SELECT ... FORMAT JSON includes it, as does system.query_log locally), which is much less noisy than elapsed time.

Test Cluster

The Test Cluster is a Metabase-fronted ClickHouse with a snapshot of team 2's data, read-only, with no noisy-neighbor interference from production. It is the right place for actual timing measurements. Go through /query-clickhouse-via-metabase to authenticate and submit queries.

Because it only has team 2 data, you will need to rewrite the production query before running it. At minimum, swap the team_id and pick a date range that overlaps the snapshot. If the original query depends on a custom property that team 2 doesn't use, or a PostHog feature that team 2 doesn't have configured, you may need to substitute or skip that branch; this is a judgement call.

Apply the cluster's materialized columns before measuring. If the query you're porting came from a .ambr snapshot or a local test, it almost certainly references properties via JSONExtract because the test fixture lacks the materialized columns prod has. The Test Cluster mirrors prod's schema, so DESCRIBE <table> will list real pmat_* (events) or pmat_* / mat_* (persons, groups) columns. Swap your JSONExtract(properties, 'X', ...) calls for the corresponding materialized column reference before timing the query. Skipping this step means you're measuring a query shape that the printer would never actually emit in production, and your numbers won't transfer.

For measurement, set SETTINGS use_uncompressed_cache=0 (mirrors what ee/benchmarks/measure.sh does) and take the median of 5 runs. Pull the actual numbers from system.query_log on the Test Cluster, not from how long the Metabase request took, since the API path adds a fixed floor and per-request jitter that swamps the metric you care about. query_log also gives you read_rows, read_bytes, memory_usage, and the ProfileEvents map, all of which are more diagnostic than wall time when you're comparing two versions.

Before suggesting an optimization, measure it on the Test Cluster. If you're proposing a rewrite (dropping FINAL, swapping a CTE for conditional aggregation, materializing a column, restructuring a join), run both the original and your candidate against the same team-2-adapted query and report the before/after query_duration_ms, read_bytes, and memory_usage from system.query_log. A suggestion without numbers is a guess. If you couldn't measure (Test Cluster unavailable, query doesn't adapt cleanly to team 2's data, the change is a schema-only optimization the read-only cluster can't host), say so explicitly when you make the suggestion rather than implying the change will be faster in production.

The Test Cluster is read-only, so you cannot try schema changes there. For those, prototype locally, then write the migration and have it reviewed.

Query performance autoresearch (the powertool)

For hard cases, tools/query-performance-ai/query_performance_ai/ wraps pi-autoresearch in a coordinator that hands the LLM a query and asks it to optimize against the Test Cluster in a loop. Setup is non-trivial (Docker sandbox per query, ANTHROPIC_API_KEY, Metabase database IDs), so ask the user to run the setup themselves:

Coordinator and setup: tools/query-performance-ai/README.md
Coordinator entry point: tools/query-performance-ai/query_performance_ai/orchestrator/coordinator.py
The campaign skill the sandboxed agent runs: tools/query-performance-ai/query_performance_ai/sandboxed_autoresearch_agent/pi_plugin/skills/clickhouse-autoresearch-campaign/SKILL.md
Orchestration contract: tools/query-performance-ai/query_performance_ai/sandboxed_autoresearch_agent/pi_plugin/skills/clickhouse-autoresearch-campaign/orchestration.md

I am not entirely sure of the exact commands needed beyond the snippets in the README. Ask the user; the setup also takes effort on their side.

Step 5: apply the optimization

Once you have a faster ClickHouse SQL, you need to make HogQL emit it. Pick the lowest-blast-radius layer that gets the job done:

HogQL query / query runner change is the cheapest. If the rewrite can be expressed as a different HogQL query (different aggregation, different join order, swapping a CTE for conditional aggregation), change the query runner under posthog/hogql_queries/ or products/*/backend/. Snapshot the new ClickHouse output via the relevant .ambr test.

New HogQL function if the rewrite needs a conditional aggregation or other ClickHouse function HogQL doesn't expose yet. Add it to posthog/hogql/functions/aggregations.py (or the appropriate file under posthog/hogql/functions/) with HogQLFunctionMeta(name, min_args, max_args, aggregate=True) and the query runner can use it like any other function.

HogQL printer change when the optimization is a SQL-level rewrite the printer should apply automatically. The ClickHouse printer at posthog/hogql/printer/clickhouse.py already does several of these; _get_optimized_materialized_column_equals_operation (around line 574) is a good template for a comparison-rewrite optimization. Add a snapshot test and a get_index_from_explain assertion so the optimization can't silently regress.

ClickHouse migration for schema changes (new skip index, new materialized column, projection, table engine change). Use /clickhouse-migrations for the patterns. Remember production is multi-shard, multi-replica with separate data and ingestion node roles, so node_roles=[...], sharded=True, and is_alter_on_replicated_table=True matter; never use ON CLUSTER.

posthog/clickhouse/migrations/0250_property_values_lowercase_text_index.py is a clean example of adding a skip index on a replicated table and materializing it.

A note on team-specific heuristics

Some rewrites help one team and hurt another. A funnels optimization we tried was great when there was a large drop-off between the first and second step (small intermediate set, cheap to enumerate), but slower when the first step matched almost every event (huge set, expensive to enumerate).

When you notice this kind of asymmetry, suggest a heuristic to the user rather than implementing it yourself: count the events for each step in the relevant time window, and only apply the optimization when the ratio is favorable. The shape of the heuristic depends on the rewrite and the team, so this is a design decision for the user, not something to commit speculatively.

Test discipline

Whenever you change a printer rule, a query runner, or add a HogQL function, snapshot the generated ClickHouse SQL in .ambr and add an EXPLAIN-based assertion if the optimization depends on a specific index or rewrite. A passing-after-fix test isn't proof the test would have failed before; flip the change off briefly to confirm the test was actually exercising your code path.

Learnings log

references/learnings.md collects case studies and surprising findings from past optimization sessions, especially ones where the rule-of-thumb in this skill turned out to be wrong or needed nuance. Read it before relying heavily on one of the smell descriptions, and append a new entry when you measure something worth remembering.

Do not paste customer data into entries. This file is checked into the public OSS repo. No raw person / group / distinct_id values, no custom property names or values, no team or org names, no row samples, no precise operational scale (exact row counts, customer-specific durations). Use placeholders (<bound_uuid>, <custom_property>, <team_id>) or describe the shape (a 1M-person slice, tens of millions of rows). PostHog's own team 2 is fine to name as the canonical Test Cluster target; redact other team IDs.