name: redis-cache-strategy
description: >
  Redis caching strategy designer and reviewer. ALWAYS use when designing, reviewing, or troubleshooting Redis caching layers: cache pattern selection (cache-aside, write-through, write-behind), TTL strategy, cache stampede/penetration/avalanche prevention, hot key handling, cache-DB consistency, distributed locking, key naming, and degradation design. Use even for "just add a cache" requests: cache invalidation is one of the two hard problems in computer science, and a naive implementation creates subtle consistency bugs that surface only under load.
Redis Cache Strategy Review
Quick Reference
| If you need to… | Go to |
|---|---|
| Understand what this skill covers | §1 Scope |
| Check mandatory prerequisites | §2 Mandatory Gates |
| Choose review depth | §3 Depth Selection |
| Handle incomplete context | §4 Degradation Modes |
| Evaluate cache design item by item | §5 Cache Strategy Checklist |
| Choose the right cache pattern | §6 Pattern Selection |
| Avoid common caching mistakes | §7 Anti-Examples |
| Score the review result | §8 Scorecard |
| Format review output | §9 Output Contract |
| Deep-dive cache patterns | references/cache-patterns.md |
| Understand failure mode defenses | references/cache-failure-modes.md |
§1 Scope
In scope — Redis caching strategy for production backend services:
- Cache pattern selection (cache-aside, write-through, write-behind, dual-write debounce)
- Key naming conventions and namespace design
- TTL strategy (expiration, jitter, eviction policy alignment)
- Cache failure modes (stampede/penetration/avalanche) and defenses
- Hot key detection and mitigation (singleflight, local cache, sharding)
- Cache-DB consistency design and staleness SLA
- Distributed locking patterns (SETNX, Redlock, lock timeout)
- Cache warmup and cold-start strategies
- Degradation design (cache-down fallback)
Out of scope — delegate to dedicated skills:
- Redis cluster topology, persistence (RDB/AOF), replication config → redis-best-practise
- Application code changes → go-code-reviewer or language-specific reviewer
- Security hardening, ACL, TLS → redis-best-practise
§2 Mandatory Gates
Execute gates sequentially. Each gate has a STOP condition.
Gate 1: Context Collection
| Item | Why it matters | If unknown |
|---|---|---|
| Redis version (6.x / 7.x) | Feature availability (e.g., client-side caching in 6.0+) | Assume 6.0 |
| Deployment mode (standalone / sentinel / cluster) | Affects key distribution, Lua atomicity scope, lock patterns | Assume standalone |
| maxmemory + eviction policy | Determines what happens when cache is full | Ask; critical for correctness |
| Cache role in architecture | Primary cache? L1/L2? Read-through proxy? | Must clarify before design |
| Data source type | SQL DB / NoSQL / external API — affects consistency patterns | Must clarify |
| Read:write ratio | Drives pattern selection (read-heavy → cache-aside; write-heavy → write-behind) | Assume read-heavy |
| Consistency requirement | Eventual (seconds)? Strong? Best-effort? | Must clarify |
| Peak QPS on cached entities | Determines stampede/hot-key risk | Assume high if unknown |
STOP: Cannot determine what the cache is caching (no data source, no access pattern). Clarify before proceeding.
PROCEED: At least data source, cache role, and consistency requirement are known or assumed.
Gate 2: Scope Classification
| Mode | Trigger | Output |
|---|---|---|
| review | User provides existing caching code/config | Safety analysis with findings |
| design | User describes what they want to cache | Complete cache strategy proposal |
| troubleshoot | User reports cache-related issues (stale data, stampede, latency) | Root cause + fix plan |
STOP: Request is not cache-related (e.g., Redis Streams pipeline, pub/sub messaging). Redirect to redis-best-practise.
PROCEED: Caching intent confirmed.
Gate 3: Risk Classification
| Risk | Definition | Required action |
|---|---|---|
| SAFE | Standard cache-aside with TTL, read-heavy workload | Standard review |
| WARN | Distributed lock usage, write-behind pattern, multi-service cache sharing | Off-peak rollout + monitoring |
| UNSAFE | Cache as sole data source (no DB backing), or cache-DB consistency SLA < 1s | Architecture review + fallback design mandatory |
STOP: Any UNSAFE item without fallback design.
PROCEED: Every cache component has risk level and mitigation.
Gate 4: Output Completeness
Before delivering output, verify that all §9 Output Contract sections are present. §9.9 Uncovered Risks must never be empty.
§3 Depth Selection
| Depth | When to use | Gates | References to load |
|---|---|---|---|
| Lite | Single key TTL/pattern review, ≤3 cached entities | 1–4 | None |
| Standard | Full cache layer design (pattern + consistency + failure modes) | 1–4 | cache-patterns.md |
| Deep | Multi-service cache architecture, hot key analysis, consistency SLA | 1–4 | Both reference files |
Force Standard or higher when any signal appears: write-behind or write-through pattern, distributed lock, multi-service shared cache, consistency SLA < 5s, cache as authoritative store for any data, hot key with >10K QPS.
§4 Degradation Modes
When context is incomplete, degrade gracefully — never fabricate assumptions about consistency requirements.
| Available context | Mode | What you can do | What you cannot do |
|---|---|---|---|
| Full (version, mode, eviction, source, consistency SLA) | Full | Complete strategy with quantified staleness | — |
| Source + consistency known, infra unknown | Degraded | Pattern selection + consistency design; flag infra unknowns | Eviction/memory recommendations |
| Only code snippets, no architecture context | Minimal | Static review of caching patterns in code | Full strategy design |
| No code (greenfield design request) | Planning | Propose cache strategy from requirements | Review existing implementation |
Hard rule: Never claim a caching strategy is "consistent" without defining the staleness window. In Degraded/Minimal mode, flag "consistency SLA undefined" in §9.9.
§5 Cache Strategy Checklist
Execute every item. Mark PASS / WARN / FAIL with evidence.
5.1 Pattern Selection
1. Cache pattern identified and justified: which pattern (cache-aside / write-through / write-behind / dual-write debounce) is used and why? The pattern must match the read:write ratio and consistency requirement. When uncertain → load references/cache-patterns.md.
2. Source of truth explicitly defined: is the database or the cache the authoritative source? Ambiguity here is the #1 cause of data inconsistency bugs. Rule: the database is almost always the source of truth; the cache is a derived, disposable copy.
3. Invalidation strategy defined: how and when is stale cache data removed? Options: TTL-based expiration, explicit invalidation on write, event-driven invalidation (CDC/pub-sub). At least one must be active.
5.2 Key Design & TTL
4. Key naming follows namespace convention: {service}:{entity}:{id} or {tenant}:{domain}:{version}:{id}. Keys must be deterministic, greppable, and avoid collisions. No bare numeric IDs.
5. TTL is set with jitter: every cached key must have a TTL. Add random jitter (±10-20%) to prevent synchronized expiration (cache avalanche). No immortal keys unless explicitly justified.
6. Key and value size bounded: keys < 1KB, values < 10KB as default guidance. Large values should use Hash fields or compression. Check with redis-cli --bigkeys.
7. Eviction policy matches access pattern: allkeys-lru for general caching, volatile-lru for mixed TTL/permanent keys, allkeys-lfu for frequency-based (Redis 4.0+). Mismatched policy causes unpredictable evictions.
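The key naming and jitter rules above can be sketched as two small helpers. This is a minimal illustration, not a prescribed implementation: the names CacheKey and JitteredTTL and the 15% fraction used in main are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// CacheKey builds a namespaced key in the {service}:{entity}:{id} form.
func CacheKey(service, entity, id string) string {
	return service + ":" + entity + ":" + id
}

// JitteredTTL scales base by a random factor in [1-fraction, 1+fraction),
// so keys written together do not all expire in the same instant.
func JitteredTTL(base time.Duration, fraction float64) time.Duration {
	// rand.Float64 returns a value in [0, 1); map it to [-fraction, +fraction).
	delta := (rand.Float64()*2 - 1) * fraction
	return time.Duration(float64(base) * (1 + delta))
}

func main() {
	key := CacheKey("orders", "user", "123")
	ttl := JitteredTTL(30*time.Minute, 0.15) // 30 minutes, plus or minus 15%
	fmt.Println(key, ttl)
}
```

A symmetric, proportional jitter keeps the average TTL equal to the base value, which makes the documented staleness window easier to reason about than a purely additive offset.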
5.3 Failure Mode Defense
8. Stampede (thundering herd) protection: when a hot key expires, hundreds of concurrent requests hit the database simultaneously. Defense: singleflight/mutex pattern (only one goroutine/thread fetches, others wait), or stale-while-revalidate.
9. Penetration protection: requests for non-existent IDs bypass cache and always hit DB. Defense: cache null/empty results with short TTL (30-60s), or Bloom filter at cache layer.
10. Avalanche protection: mass key expiration at the same time overwhelms DB. Defense: TTL jitter (item 5), multi-level cache (L1 local + L2 Redis), circuit breaker on DB calls.
11. Hot key mitigation: single key receiving disproportionate traffic. Defense: local in-process cache (L1), key sharding (key:{hash%N}), or read replicas. Detect with redis-cli --hotkeys (Redis 4.0+ LFU mode).
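As an illustration of the penetration defense, the sketch below caches a null marker for IDs that do not exist, so repeated lookups stop reaching the database. Everything here is an assumption made for the example: memCache is an in-memory stand-in for Redis (not goroutine-safe), nullMarker is a hypothetical sentinel value, and the 30-second and 10-minute TTLs are placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrNotFound is returned when the ID does not exist in the data source.
var ErrNotFound = errors.New("not found")

// nullMarker is a hypothetical sentinel stored in place of a missing record.
const nullMarker = "\x00null"

type entry struct {
	val     string
	expires time.Time
}

// memCache is an in-memory stand-in for Redis (illustration only).
type memCache struct{ m map[string]entry }

func (c *memCache) Get(k string) (string, bool) {
	e, ok := c.m[k]
	if !ok || time.Now().After(e.expires) {
		return "", false
	}
	return e.val, true
}

func (c *memCache) Set(k, v string, ttl time.Duration) {
	c.m[k] = entry{val: v, expires: time.Now().Add(ttl)}
}

// Lookup reads through the cache and counts how often the "database" is hit.
type Lookup struct {
	cache   *memCache
	rows    map[string]string // stand-in for the database
	DBCalls int
}

func NewLookup(rows map[string]string) *Lookup {
	return &Lookup{cache: &memCache{m: map[string]entry{}}, rows: rows}
}

func (l *Lookup) load(id string) (string, error) {
	l.DBCalls++
	v, ok := l.rows[id]
	if !ok {
		return "", ErrNotFound
	}
	return v, nil
}

// Get returns the cached value, a cached "not found", or falls back to the DB.
func (l *Lookup) Get(id string) (string, error) {
	key := "svc:user:" + id
	if v, ok := l.cache.Get(key); ok {
		if v == nullMarker {
			return "", ErrNotFound // negative hit: the DB is not touched
		}
		return v, nil
	}
	v, err := l.load(id)
	if errors.Is(err, ErrNotFound) {
		l.cache.Set(key, nullMarker, 30*time.Second) // short TTL for misses
		return "", ErrNotFound
	}
	if err != nil {
		return "", err
	}
	l.cache.Set(key, v, 10*time.Minute)
	return v, nil
}

func main() {
	l := NewLookup(map[string]string{"1": "alice"})
	for i := 0; i < 3; i++ {
		_, err := l.Get("999")
		fmt.Println(err)
	}
	fmt.Println("DB calls:", l.DBCalls)
}
```

The short TTL on the marker bounds how long a newly created record can appear missing; that window counts toward the staleness budget and should be documented alongside it.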
5.4 Consistency & Operations
12. Staleness window quantified: define in seconds/minutes how stale cached data can be. This is a business decision, not a technical default. Document it and monitor actual staleness.
13. Distributed lock bounded: if using Redis locks (SET with the NX and EX options, issued as one atomic command rather than SETNX followed by EXPIRE), ensure: (a) lock has TTL to prevent deadlock, (b) lock value is unique token for safe release, (c) release uses Lua CAS to prevent releasing someone else's lock. Consider whether the lock actually needs to be distributed.
14. Cache-down degradation path: what happens when Redis is unreachable? Options: serve stale from local cache, bypass to DB directly (with rate limiting), return degraded response. "Service crashes" is not an acceptable answer.
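One way the cache-down path could look is sketched below: a cache error that is not a plain miss triggers a bypass to the database, guarded by a bounded number of slots. Note the assumptions: Reader, fetchFunc, and the three error values are hypothetical names, and a concurrency cap is used here as a simple stand-in for rate limiting.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrCacheMiss = errors.New("cache miss")
	ErrCacheDown = errors.New("cache unreachable")
	ErrShed      = errors.New("shed: too many DB fallbacks in flight")
)

// fetchFunc is a hypothetical accessor for either the cache or the database.
type fetchFunc func(key string) (string, error)

// Reader serves reads and degrades to the DB when the cache is unavailable.
type Reader struct {
	cacheGet fetchFunc
	dbGet    fetchFunc
	slots    chan struct{} // bounds concurrent DB reads on the fallback path
}

func NewReader(cacheGet, dbGet fetchFunc, maxFallback int) *Reader {
	return &Reader{cacheGet: cacheGet, dbGet: dbGet, slots: make(chan struct{}, maxFallback)}
}

func (r *Reader) Get(key string) (string, error) {
	v, err := r.cacheGet(key)
	if err == nil {
		return v, nil
	}
	if errors.Is(err, ErrCacheMiss) {
		return r.dbGet(key) // normal miss path (cache fill omitted for brevity)
	}
	// Cache is unavailable: bypass to the DB only if a fallback slot is free.
	select {
	case r.slots <- struct{}{}:
		defer func() { <-r.slots }()
		return r.dbGet(key)
	default:
		return "", ErrShed // caller should return a degraded response
	}
}

func main() {
	down := func(string) (string, error) { return "", ErrCacheDown }
	db := func(key string) (string, error) { return "row-for-" + key, nil }
	v, err := NewReader(down, db, 8).Get("svc:user:1")
	fmt.Println(v, err)
}
```

Shedding with an explicit error keeps the database from inheriting the full cached read load during an outage; the caller decides whether to return a degraded response or serve stale local data.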
§6 Pattern Selection (Standard + Deep)
Quick decision guide — for full patterns load references/cache-patterns.md.
| Scenario | Recommended Pattern | Why |
|---|---|---|
| Read-heavy, moderate staleness OK | Cache-Aside | Simplest; app controls both read and invalidation |
| Read-heavy, immediate freshness needed | Write-Through | Cache updated synchronously on every write |
| Write-heavy, async durability acceptable | Write-Behind | Defers DB writes; highest throughput but data loss risk |
| Hot key with concurrent updates | Dual-Write Debounce | Absorbs race windows via delayed second invalidation |
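For the first row of the table, a cache-aside read and write path might look roughly like the following. This is a sketch under stated assumptions: Repo and its two maps are in-memory stand-ins for Redis and the database, TTLs are omitted, and the write path uses invalidate-on-write (delete after the DB update) rather than updating the cached value.

```go
package main

import "fmt"

// Repo pairs a cache with its source of truth (both are in-memory stand-ins).
type Repo struct {
	cache   map[string]string // stand-in for Redis
	db      map[string]string // stand-in for the database (source of truth)
	DBReads int
}

func NewRepo() *Repo {
	return &Repo{cache: map[string]string{}, db: map[string]string{}}
}

func cacheKey(id string) string { return "svc:user:" + id }

// Read checks the cache first and fills it from the DB on a miss.
func (r *Repo) Read(id string) (string, bool) {
	if v, ok := r.cache[cacheKey(id)]; ok {
		return v, true
	}
	r.DBReads++
	v, ok := r.db[id]
	if ok {
		r.cache[cacheKey(id)] = v // a real fill would also set a jittered TTL
	}
	return v, ok
}

// Write updates the DB first, then invalidates the cached copy.
func (r *Repo) Write(id, val string) {
	r.db[id] = val
	delete(r.cache, cacheKey(id))
}

func main() {
	r := NewRepo()
	r.Write("1", "alice")
	v, _ := r.Read("1") // miss: loads from DB and fills the cache
	fmt.Println(v, r.DBReads)
	r.Write("1", "alice-v2") // invalidates the cached copy
	v, _ = r.Read("1")
	fmt.Println(v, r.DBReads)
}
```

Deleting instead of overwriting the cached value avoids one class of race where two concurrent writers leave the cache holding the older value, at the cost of one extra miss per write.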
Cache warmup strategies (for cold start)
- Lazy warmup: first request populates cache (accept initial latency spike)
- Eager warmup: pre-populate on deploy via batch scan of hot entities
- Gradual warmup: route increasing traffic percentage through cache layer (canary)
§7 Anti-Examples
AE-1: Immortal cache key — no TTL set
// WRONG: key lives forever; stale data never expires
rdb.Set(ctx, "user:123", userData, 0) // 0 = no expiration
// RIGHT: always set TTL with jitter
ttl := 30*time.Minute + time.Duration(rand.Intn(300))*time.Second
rdb.Set(ctx, "user:123", userData, ttl)
AE-2: Write-behind without durable queue
// WRONG: write to Redis, async goroutine writes DB — if process crashes, data lost
rdb.Set(ctx, key, value, ttl)
go func() { db.Save(value) }() // fire-and-forget = data loss risk
// RIGHT: use durable queue (Kafka, Redis Stream with ACK) between cache and DB
AE-3: Cache-aside without stampede protection
// WRONG: 1000 concurrent requests all miss cache, all query DB simultaneously
val, err := rdb.Get(ctx, key).Result()
if err == redis.Nil {
    val, err = db.Query(id) // 1000 goroutines hit DB at once
    rdb.Set(ctx, key, val, ttl)
}
// RIGHT: use singleflight to deduplicate concurrent cache fills
v, err, _ := sfGroup.Do(key, func() (interface{}, error) {
    val, err := db.Query(id) // only one goroutine per key reaches the DB
    if err != nil {
        return nil, err
    }
    return val, rdb.Set(ctx, key, val, ttl).Err() // fill the cache inside the flight
})
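To show what the deduplication does under the hood, here is a standard-library-only sketch of the same idea. It is a simplified stand-in for golang.org/x/sync/singleflight, not that package's implementation; Group, memCache, GetOrLoad, and countLoads are hypothetical names, and memCache replaces Redis with a mutex-guarded map.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// call tracks one in-flight load for a key.
type call struct {
	wg  sync.WaitGroup
	val string
	err error
}

// Group deduplicates concurrent calls that share the same key.
type Group struct {
	mu sync.Mutex
	m  map[string]*call
}

// Do runs fn once per key at a time; concurrent callers wait and share the result.
func (g *Group) Do(key string, fn func() (string, error)) (string, error) {
	g.mu.Lock()
	if g.m == nil {
		g.m = make(map[string]*call)
	}
	if c, ok := g.m[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // another goroutine is already loading this key
		return c.val, c.err
	}
	c := &call{}
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()

	g.mu.Lock()
	delete(g.m, key)
	g.mu.Unlock()
	c.wg.Done()
	return c.val, c.err
}

// memCache is a mutex-guarded map standing in for Redis.
type memCache struct {
	mu sync.Mutex
	m  map[string]string
}

func (c *memCache) Get(k string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.m[k]
	return v, ok
}

func (c *memCache) Set(k, v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[k] = v
}

// GetOrLoad is cache-aside with deduplicated fills.
func GetOrLoad(c *memCache, g *Group, key string, load func() (string, error)) (string, error) {
	if v, ok := c.Get(key); ok {
		return v, nil
	}
	return g.Do(key, func() (string, error) {
		// Re-check: an earlier flight may have filled the cache already.
		if v, ok := c.Get(key); ok {
			return v, nil
		}
		v, err := load()
		if err != nil {
			return "", err
		}
		c.Set(key, v)
		return v, nil
	})
}

// countLoads runs n concurrent readers on a cold key and reports DB loads.
func countLoads(n int) int {
	c := &memCache{m: map[string]string{}}
	g := &Group{}
	var loads int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			GetOrLoad(c, g, "svc:user:123", func() (string, error) {
				atomic.AddInt32(&loads, 1)
				return "alice", nil
			})
		}()
	}
	wg.Wait()
	return int(atomic.LoadInt32(&loads))
}

func main() {
	fmt.Println("DB loads for 1000 concurrent readers:", countLoads(1000))
}
```

The second cache check inside the flight matters: a reader that missed the cache just before an earlier flight finished would otherwise start a fresh flight and query the database again.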
AE-4: KEYS command for batch invalidation
// WRONG: KEYS blocks Redis for the entire scan — O(N) on all keys
keys, _ := rdb.Keys(ctx, "user:*").Result()
rdb.Del(ctx, keys...)
// RIGHT: use SCAN with bounded cursor iteration, or structured invalidation
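The SCAN-based alternative can be sketched as a bounded cursor loop. The keyspace interface below is a hypothetical abstraction introduced for this example; with go-redis, thin adapters over rdb.Scan(ctx, cursor, match, count).Result() and rdb.Del(ctx, keys...) would be expected to satisfy it. fakeRedis is an in-memory test double, not a model of Redis internals.

```go
package main

import (
	"fmt"
	"path"
)

// keyspace is the hypothetical subset of a Redis client this sketch needs.
type keyspace interface {
	Scan(cursor uint64, match string, count int64) (keys []string, next uint64, err error)
	Del(keys ...string) error
}

// DeleteByPattern walks the keyspace in batches and deletes matching keys.
// It returns the number of keys deleted. Cursor 0 from Scan ends the walk.
func DeleteByPattern(ks keyspace, pattern string, batch int64) (int, error) {
	var cursor uint64
	deleted := 0
	for {
		keys, next, err := ks.Scan(cursor, pattern, batch)
		if err != nil {
			return deleted, err
		}
		if len(keys) > 0 {
			if err := ks.Del(keys...); err != nil {
				return deleted, err
			}
			deleted += len(keys)
		}
		if next == 0 {
			return deleted, nil
		}
		cursor = next
	}
}

// fakeRedis pages over a fixed key order; deleted keys are skipped.
type fakeRedis struct {
	order []string
	live  map[string]bool
}

func newFake(keys ...string) *fakeRedis {
	f := &fakeRedis{live: map[string]bool{}}
	for _, k := range keys {
		f.order = append(f.order, k)
		f.live[k] = true
	}
	return f
}

func (f *fakeRedis) Scan(cursor uint64, match string, count int64) ([]string, uint64, error) {
	end := cursor + uint64(count)
	if end > uint64(len(f.order)) {
		end = uint64(len(f.order))
	}
	var out []string
	for _, k := range f.order[cursor:end] {
		if ok, _ := path.Match(match, k); ok && f.live[k] {
			out = append(out, k)
		}
	}
	if end == uint64(len(f.order)) {
		end = 0 // iteration complete
	}
	return out, end, nil
}

func (f *fakeRedis) Del(keys ...string) error {
	for _, k := range keys {
		delete(f.live, k)
	}
	return nil
}

func main() {
	f := newFake("user:1", "order:9", "user:2", "user:3")
	n, err := DeleteByPattern(f, "user:*", 2)
	fmt.Println(n, err, len(f.live))
}
```

Each SCAN call does a bounded amount of work, so other commands interleave between batches. Where possible, structured invalidation (for example a version segment in the key) avoids scanning at all.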
AE-5: Distributed lock without TTL or safe release
// WRONG: lock has no TTL — if holder crashes, lock is held forever (deadlock)
rdb.SetNX(ctx, "lock:order:123", "1", 0)
// Also WRONG: releasing without checking ownership
rdb.Del(ctx, "lock:order:123") // may delete someone else's lock
// RIGHT: TTL + unique token + Lua CAS release
token := uuid.New().String()
rdb.SetNX(ctx, "lock:order:123", token, 10*time.Second)
// Release with Lua: if redis.call('get',KEYS[1])==ARGV[1] then return redis.call('del',KEYS[1]) end
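The acquire and release rules can be exercised without a Redis server using the sketch below. fakeLocks is an in-memory stand-in that mimics SET with NX plus a TTL and the compare-and-delete script; NewToken, demoLock, and the key name are assumptions for illustration. The releaseScript constant shows the Lua body with an explicit return 0 for the non-owner case.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// releaseScript deletes the lock only if the caller's token still owns it.
const releaseScript = `
if redis.call('get', KEYS[1]) == ARGV[1] then
  return redis.call('del', KEYS[1])
end
return 0`

type lockEntry struct {
	token   string
	expires time.Time
}

// fakeLocks mimics SET key token NX with a TTL, and the release script.
type fakeLocks struct{ m map[string]lockEntry }

// SetNX stores the token only if no unexpired lock exists for key.
func (f *fakeLocks) SetNX(key, token string, ttl time.Duration) bool {
	if e, ok := f.m[key]; ok && time.Now().Before(e.expires) {
		return false
	}
	f.m[key] = lockEntry{token: token, expires: time.Now().Add(ttl)}
	return true
}

// CompareAndDelete mirrors releaseScript: delete only when the token matches.
func (f *fakeLocks) CompareAndDelete(key, token string) bool {
	e, ok := f.m[key]
	if !ok || !time.Now().Before(e.expires) || e.token != token {
		return false
	}
	delete(f.m, key)
	return true
}

// NewToken returns a random 128-bit token, hex encoded.
func NewToken() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// demoLock walks through acquire, contention, wrong release, release, reacquire.
func demoLock() (first, second, wrongRelease, rightRelease, reacquire bool) {
	locks := &fakeLocks{m: map[string]lockEntry{}}
	const key = "lock:order:123"
	token := NewToken()
	first = locks.SetNX(key, token, 10*time.Second)
	second = locks.SetNX(key, NewToken(), 10*time.Second)
	wrongRelease = locks.CompareAndDelete(key, "someone-else")
	rightRelease = locks.CompareAndDelete(key, token)
	reacquire = locks.SetNX(key, NewToken(), 10*time.Second)
	return
}

func main() {
	fmt.Println(demoLock())
	fmt.Println(releaseScript)
}
```

The TTL must exceed the longest expected critical section; if the holder outlives the TTL, another client can acquire the lock and the token check only prevents the late holder from deleting the new owner's lock, not from doing overlapping work.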
AE-6: Cache issue reported as business logic bug
-- WRONG: "Bug: user sees old profile after update"
-- This is a cache staleness issue, not a logic bug. Check invalidation strategy.
-- RIGHT: report as "Cache consistency: stale read after write — invalidation delay"
Extended anti-examples (AE-7 through AE-13) in references/cache-anti-examples.md.
§8 Cache Strategy Scorecard
Critical — any FAIL means overall FAIL
- Cache-DB consistency strategy explicitly defined (not "write both and hope")
- TTL set on all cached keys with jitter (no immortal keys without justification)
- Cache-down degradation path exists (Redis unavailable ≠ service down)
Standard — 4 of 5 must pass
- Cache pattern matches business scenario (not blindly cache-aside for everything)
- Stampede protection for hot keys (singleflight / mutex / stale-while-revalidate)
- Penetration protection (null-value caching or bloom filter)
- Key naming follows {namespace}:{entity}:{id} convention
- Distributed locks have TTL and safe CAS release
Hygiene — 3 of 4 must pass
- Cache hit rate monitoring configured
- Eviction policy matches data access pattern (LRU/LFU/volatile)
- Key and value sizes within bounds (<1KB key, <10KB value)
- Warmup strategy defined for cold start / deployment
Verdict: X/12; Critical: Y/3; Standard: Z/5; Hygiene: W/4.
PASS requires: Critical 3/3 AND Standard ≥4/5 AND Hygiene ≥3/4.
§9 Output Contract
Every cache strategy review MUST produce these sections. Write "N/A — [reason]" if inapplicable.
### 9.1 Context Gate
| Item | Value | Source |
### 9.2 Depth & Mode
[Lite/Standard/Deep] × [review/design/troubleshoot] — [rationale]
### 9.3 Risk Assessment
| Component | Pattern | Risk | Notes |
### 9.4 Strategy Design (Standard/Deep; "N/A — Lite" for Lite)
- Pattern selection + justification
- Consistency model + staleness SLA
- Failure mode defenses
### 9.5 Implementation (key schema, TTL config, code patterns)
### 9.6 Validation Plan
- Cache hit rate target
- Staleness measurement
- Failure injection tests (Redis down, hot key, mass expiry)
### 9.7 Degradation Plan (what happens when cache fails)
### 9.8 Monitoring & Alerts
- Hit rate, latency, eviction rate, big key detection
### 9.9 Uncovered Risks (MANDATORY — never empty)
| Area | Reason | Impact | Follow-up |
Volume rules:
- FAIL findings: always fully detailed with fix
- WARN findings: up to 10; overflow to §9.9
- PASS: summary only
- §9.9 minimum: document all assumptions (especially consistency SLA if undefined)
Scorecard summary (append after §9.9):
Scorecard: X/12 — Critical Y/3, Standard Z/5, Hygiene W/4 — PASS/FAIL
Data basis: [full context | degraded | minimal | planning]
§10 Reference Loading Guide
| Condition | Load |
|---|---|
| Standard or Deep depth | references/cache-patterns.md |
| Deep depth, or stampede/penetration/avalanche signals | references/cache-failure-modes.md |
| Extended anti-example matching | references/cache-anti-examples.md |