altinity-expert-clickhouse-errors

star 38

Investigate ClickHouse query failures, exceptions, crashes, and error patterns. Use for error analysis and failure investigation.

ntk148v By ntk148v schedule Updated 1/20/2026

name: altinity-expert-clickhouse-errors description: Investigate ClickHouse query failures, exceptions, crashes, and error patterns. Use for error analysis and failure investigation.

Error Investigation and Exception Analysis

Investigate query failures, exceptions, crashes, and error patterns.


Quick Diagnostics

1. Recent Errors Summary

select
    toStartOfHour(event_time) as hour,
    count() as error_count,
    uniq(exception_code) as unique_errors,
    uniq(user) as users_affected
from system.query_log
where type like 'Exception%'
  and event_date = today()
group by hour
order by hour desc

2. Error Distribution by Code

select
    exception_code,
    count() as occurrences,
    any(substring(exception, 1, 100)) as example_message,
    min(event_time) as first_seen,
    max(event_time) as last_seen
from system.query_log
where type like 'Exception%'
  and event_date >= today() - 1
group by exception_code
order by occurrences desc
limit 30

3. Recent Exceptions Detail

select
    event_time,
    user,
    exception_code,
    substring(exception, 1, 200) as exception,
    query_kind,
    substring(query, 1, 100) as query_preview
from system.query_log
where type like 'Exception%'
  and event_date = today()
order by event_time desc
limit 50

Common Error Analysis

Memory Errors (Code 241)

select
    event_time,
    user,
    formatReadableSize(memory_usage) as memory_at_failure,
    formatReadableSize(read_bytes) as read_bytes,
    substring(exception, 1, 150) as exception,
    substring(query, 1, 80) as query_preview
from system.query_log
where type like 'Exception%'
  and exception_code = 241  -- MEMORY_LIMIT_EXCEEDED
  and event_date >= today() - 1
order by event_time desc
limit 30

Solutions: See altinity-expert-clickhouse-memory for memory optimization.

Too Many Parts (Code 252)

select
    event_time,
    user,
    arrayStringConcat(tables, ', ') as tables,
    substring(exception, 1, 150) as exception
from system.query_log
where type like 'Exception%'
  and exception_code = 252  -- TOO_MANY_PARTS
  and event_date >= today() - 1
order by event_time desc
limit 30

Solutions: See altinity-expert-clickhouse-merges and altinity-expert-clickhouse-ingestion for part management.

Timeout Errors (Code 159)

select
    event_time,
    user,
    query_duration_ms,
    formatReadableSize(read_bytes) as read_bytes,
    substring(query, 1, 100) as query_preview
from system.query_log
where type like 'Exception%'
  and exception_code = 159  -- TIMEOUT_EXCEEDED
  and event_date >= today() - 1
order by event_time desc
limit 30

Table/Column Not Found (Codes 60, 16)

select
    event_time,
    user,
    exception_code,
    substring(exception, 1, 150) as exception,
    substring(query, 1, 100) as query_preview
from system.query_log
where type like 'Exception%'
  and exception_code in (60, 16)  -- TABLE_DOESNT_EXIST, NO_SUCH_COLUMN_IN_TABLE
  and event_date = today()
order by event_time desc
limit 30

Crash Analysis

Recent Crashes

select
    event_time,
    signal,
    thread_id,
    query_id,
    substring(trace_full, 1, 500) as stack_trace,
    substring(query, 1, 100) as query_preview
from system.crash_log
where event_time > now() - interval 7 day
order by event_time desc
limit 20

Crash Summary

select
    toDate(event_time) as day,
    count() as crashes,
    groupUniqArray(signal) as signals
from system.crash_log
where event_time > now() - interval 30 day
group by day
order by day desc

Error Patterns by User/Client

Errors by User

select
    user,
    count() as errors,
    groupUniqArray(exception_code) as error_codes,
    max(event_time) as last_error
from system.query_log
where type like 'Exception%'
  and event_date = today()
group by user
order by errors desc

Errors by Client

select
    client_hostname,
    client_name,
    count() as errors,
    groupUniqArray(exception_code) as error_codes
from system.query_log
where type like 'Exception%'
  and event_date = today()
group by client_hostname, client_name
order by errors desc
limit 20

Error Log from text_log

Critical/Error Level Messages

select
    event_time,
    level,
    logger_name,
    substring(message, 1, 200) as message
from system.text_log
where level in ('Fatal', 'Critical', 'Error')
  and event_time > now() - interval 1 hour
order by event_time desc
limit 50

Errors by Component

select
    logger_name,
    count() as errors,
    max(event_time) as last_seen
from system.text_log
where level in ('Fatal', 'Critical', 'Error')
  and event_time > now() - interval 24 hour
group by logger_name
order by errors desc
limit 30

System Warnings

select message as warning
from system.warnings

Error Code Reference

Code Name Common Cause
60 TABLE_DOESNT_EXIST Wrong table name or database
62 SYNTAX_ERROR Invalid SQL
159 TIMEOUT_EXCEEDED Query too slow
241 MEMORY_LIMIT_EXCEEDED Query uses too much RAM
252 TOO_MANY_PARTS Insert too fast, merges behind
319 UNKNOWN_PACKET Network/client issues
341 UNFINISHED Operation interrupted
16 NO_SUCH_COLUMN Column doesn't exist
36 CANNOT_READ_ALL_DATA Corruption or network
164 READONLY Replica in readonly mode
242 TABLE_IS_READ_ONLY Table locked
243 TABLE_IS_DROPPED Concurrent DROP
254 RECEIVED_ERROR_FROM_REMOTE Distributed query failure

Distributed Query Errors

select
    event_time,
    initial_query_id,
    exception_code,
    substring(exception, 1, 150) as exception,
    substring(query, 1, 80) as query_preview
from system.query_log
where type like 'Exception%'
  and event_date = today()
  and (exception_code = 254 or query ilike '%distributed%')
order by event_time desc
limit 30

Error Rate Analysis

Error Rate Over Time

select
    toStartOfFiveMinutes(event_time) as ts,
    count() as total_queries,
    countIf(type like 'Exception%') as errors,
    round(100.0 * countIf(type like 'Exception%') / count(), 2) as error_rate_pct
from system.query_log
where event_time > now() - interval 6 hour
group by ts
order by ts desc

Error Rate by Query Type

select
    query_kind,
    count() as total,
    countIf(type like 'Exception%') as errors,
    round(100.0 * countIf(type like 'Exception%') / count(), 2) as error_rate_pct
from system.query_log
where event_date = today()
group by query_kind
order by errors desc

Ad-Hoc Query Guidelines

Required Safeguards

-- Time bound
where event_date >= today() - 1
-- or
where event_time > now() - interval 1 hour

-- Limit results
limit 100

Key Tables

  • system.query_log (type like 'Exception%') - Query failures
  • system.text_log (level in Error/Critical/Fatal) - Server errors
  • system.crash_log - Server crashes
  • system.warnings - Active warnings

Cross-Module Triggers

Finding Load Module Reason
Memory errors altinity-expert-clickhouse-memory Memory analysis
TOO_MANY_PARTS altinity-expert-clickhouse-merges, altinity-expert-clickhouse-ingestion Part management
Replication errors altinity-expert-clickhouse-replication Replica status
Distributed errors altinity-expert-clickhouse-replication Cluster health
Unknown errors altinity-expert-clickhouse-text-log Deep log analysis
Crashes altinity-expert-clickhouse-text-log Stack trace analysis

Alerting Recommendations

Set alerts for:

  • Error rate > 5%
  • Memory errors > 10/hour
  • Any crashes
  • Readonly replica errors
  • TOO_MANY_PARTS errors
Install via CLI
npx skills add https://github.com/ntk148v/clicklens --skill altinity-expert-clickhouse-errors
Repository Details
star Stars 38
call_split Forks 6
navigation Branch main
article Path SKILL.md
More from Creator