name: performance-optimization description: | Optimizes application performance with the MEASURE → IDENTIFY → FIX → VERIFY → GUARD workflow. Covers frontend Core Web Vitals (LCP ≤ 2.5s, INP ≤ 200ms, CLS ≤ 0.1), bundle budgets, and React re-render hygiene, PLUS backend latency budgets, Postgres/MySQL EXPLAIN ANALYZE workflows, N+1 query detection, connection-pool sizing, and async profiling with py-spy / cProfile. Use when performance requirements exist, when Core Web Vitals or p95 latency miss thresholds, or when profiling reveals bottlenecks that need fixing. category: Performance tags: [performance, web-vitals, lcp, inp, cls, profiling, n+1, postgres, mysql, py-spy, cprofile, react] triggers: - "performance" - "optimize" - "slow endpoint" - "Core Web Vitals" - "LCP" - "INP" - "CLS" - "p95 latency" - "EXPLAIN ANALYZE" - "N+1 query" - "connection pool" - "py-spy" - "bundle size" user_invocable: true related: - tech-debt-analysis - bug-scrub
Performance Optimization
Overview
Measure before optimizing. Performance work without measurement is guessing — and guessing leads to premature optimization that adds complexity without improving what matters. Profile first, identify the actual bottleneck, fix it, measure again. Optimize only what measurements prove matters.
When to Use
- Performance requirements exist in the spec (load-time budgets, response-time SLAs)
- Users or monitoring report slow behavior
- Core Web Vitals scores are below thresholds
- p95 latency on an endpoint exceeds its budget
- You suspect a recent change introduced a regression
- Building features that handle large datasets or high traffic
When NOT to use: Don't optimize before you have evidence of a problem. Premature optimization adds complexity that costs more than the performance it gains.
The Optimization Workflow
1. MEASURE → Establish baseline with real data (synthetic + RUM, or APM + DB log)
2. IDENTIFY → Find the actual bottleneck (not the assumed one)
3. FIX → Address the specific bottleneck; one variable at a time
4. VERIFY → Measure again; confirm the improvement is real and worth the cost
5. GUARD → Add monitoring or a regression test so it stays fixed
Skipping any step turns the work into theater. GUARD is the most commonly skipped step and the reason regressions reappear within a quarter.
Frontend: Core Web Vitals Targets
| Metric | Good | Needs Improvement | Poor |
|---|---|---|---|
| LCP (Largest Contentful Paint) | ≤ 2.5s | ≤ 4.0s | > 4.0s |
| INP (Interaction to Next Paint) | ≤ 200ms | ≤ 500ms | > 500ms |
| CLS (Cumulative Layout Shift) | ≤ 0.1 | ≤ 0.25 | > 0.25 |
For the deeper frontend AND backend checklist (TTFB, bundle budgets, image formats, font strategy, pool sizing), see references/performance-checklist.md.
Backend: Latency Budgets
Set p95 budgets per endpoint based on criticality. Echo references/performance-checklist.md:
| Endpoint criticality | p50 | p95 | p99 |
|---|---|---|---|
| Critical user-path (login, checkout, primary read) | ≤ 100ms | ≤ 300ms | ≤ 1s |
| Standard read | ≤ 200ms | ≤ 500ms | ≤ 1.5s |
| Standard write | ≤ 300ms | ≤ 800ms | ≤ 2s |
| Background / batch | ≤ 1s | ≤ 5s | ≤ 30s |
Budgets without alerts are wishes. Alert at 80% of the budget so you find regressions before users do.
Step 1 — Measure
Frontend
Two complementary approaches; use both:
- Synthetic (Lighthouse, DevTools Performance tab): controlled conditions, reproducible. Best for CI regression detection and isolating specific issues.
- RUM (
web-vitalslibrary, CrUX): real users in real conditions. Required to validate that a fix actually improved user experience.
import { onLCP, onINP, onCLS } from 'web-vitals';
onLCP(({ value }) => sendToAnalytics('lcp', value));
onINP(({ value }) => sendToAnalytics('inp', value));
onCLS(({ value }) => sendToAnalytics('cls', value));
Backend
- Application Performance Monitoring (APM) with per-endpoint p50/p95/p99
- Database query log with timing and
auto_explainfor slow queries - Distributed tracing across services (OpenTelemetry)
console.time('db-query');
const result = await db.query(/* ... */);
console.timeEnd('db-query');
EXPLAIN ANALYZE workflow
Postgres:
-- Use BUFFERS to see cache vs disk reads, ANALYZE to actually execute
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT t.id, t.title, u.name AS owner_name
FROM tasks t
JOIN users u ON u.id = t.owner_id
WHERE t.status = 'in_progress'
ORDER BY t.created_at DESC
LIMIT 20;
What to look for:
Seq Scanon a large table with aWHEREclause → missing index.Rows Removed by Filter>> rows returned → the index isn't selective enough; consider a partial or composite index.Buffers: shared read=Nwith high N → cold cache; benchmark twice (warm vs cold).-> SortwithSort Method: external merge→ the sort spilled to disk; raisework_memor add an index that returns rows pre-sorted.
MySQL:
EXPLAIN FORMAT=JSON
SELECT t.id, t.title, u.name AS owner_name
FROM tasks t JOIN users u ON u.id = t.owner_id
WHERE t.status = 'in_progress'
ORDER BY t.created_at DESC
LIMIT 20;
The JSON output exposes cost_info, used_key, rows_examined_per_scan, and attached_condition. A type: ALL access type is the equivalent of Postgres's Seq Scan.
Step 2 — Identify the Bottleneck
Use the symptom to decide what to measure first.
What is slow?
├── First page load
│ ├── Large bundle? ──────► Measure bundle size, check code splitting
│ ├── Slow server response? ─► Measure TTFB in DevTools Network waterfall
│ │ ├── DNS long? ──────► dns-prefetch / preconnect
│ │ ├── TCP/TLS long? ──► HTTP/2, edge deployment, keep-alive
│ │ └── Waiting (server) long? ─► Profile backend, check queries and caching
│ └── Render-blocking resources? ─► Check waterfall for CSS/JS blocking
├── Interaction feels sluggish
│ ├── UI freezes on click? ─► Profile main thread; long tasks (>50ms)
│ ├── Form input lag? ────► Check re-renders, controlled-component overhead
│ └── Animation jank? ────► Check layout thrashing, forced reflows
├── Page after navigation
│ ├── Data loading? ──────► API response times, request waterfalls
│ └── Client rendering? ──► Component render time, N+1 fetches
└── Backend / API
├── Single endpoint slow? ─► Profile DB queries, check indexes
├── All endpoints slow? ──► Connection pool, memory, CPU
└── Intermittent slowness?─► Lock contention, GC pauses, external deps
Common bottlenecks by category
Frontend:
| Symptom | Likely Cause | Investigation |
|---|---|---|
| Slow LCP | Large images, render-blocking resources, slow server | Check network waterfall, image sizes |
| High CLS | Images without dimensions, late-loading content, font shifts | Check layout-shift attribution in DevTools |
| Poor INP | Heavy JavaScript on main thread, large DOM updates | Check long tasks in Performance trace |
| Slow initial load | Large bundle, many network requests | Check bundle size, code splitting |
Backend:
| Symptom | Likely Cause | Investigation |
|---|---|---|
| Slow API responses | N+1 queries, missing indexes, unoptimized SQL | Database query log + EXPLAIN ANALYZE |
| Memory growth | Leaked references, unbounded caches, large payloads | Heap snapshot / tracemalloc |
| CPU spikes | Synchronous heavy computation, regex backtracking, JSON serialization of huge objects | py-spy top / cProfile |
| High latency under load | Connection-pool exhaustion, GIL contention, thread starvation | APM saturation metrics, py-spy dump |
| Intermittent slowness | Lock contention, GC pauses, slow external deps | Tracing across calls; histogram, not just average |
Step 3 — Fix Common Anti-Patterns
N+1 Queries (the single most common backend bug)
TypeScript / Prisma:
// BAD: N+1 — one query per task for the owner
const tasks = await db.tasks.findMany();
for (const task of tasks) {
task.owner = await db.users.findUnique({ where: { id: task.ownerId } });
}
// GOOD: Single query with join/include
const tasks = await db.tasks.findMany({ include: { owner: true } });
Python / SQLAlchemy:
# BAD: N+1 — accessing task.owner triggers a per-row SELECT
tasks = session.execute(select(Task)).scalars().all()
for task in tasks:
print(task.owner.name) # one SELECT per task
# GOOD: eager-load with joinedload
from sqlalchemy.orm import joinedload
tasks = (
session.execute(select(Task).options(joinedload(Task.owner)))
.scalars()
.unique()
.all()
)
Three patterns for fixing N+1, in order of preference:
SELECT ... IN (...)batch — when you have a list of foreign keys and want one extra query that fetches them all:SELECT * FROM users WHERE id IN (1, 2, 3, 4, 5);JOIN — when you want a single round trip and the cardinality blow-up is acceptable (one-to-one or one-to-few):
SELECT t.*, u.name AS owner_name FROM tasks t JOIN users u ON u.id = t.owner_id;DataLoader / batch-load pattern — when fetching is interleaved with rendering (GraphQL resolvers, async workflows). Coalesce all
getUser(id)calls in one tick into a singlegetUsers([ids])call.
Missing or wrong indexes
-- Run EXPLAIN ANALYZE; if you see Seq Scan + WHERE on `status`:
CREATE INDEX CONCURRENTLY idx_tasks_status_created_at
ON tasks (status, created_at DESC);
-- Composite order matters: equality column first, then sort/range column.
CONCURRENTLY lets the index build without locking writes; use it on production-sized tables.
Connection-pool sizing
A pool too small queues requests; too large overloads the database. The classic formula:
pool_size = ((core_count * 2) + effective_spindle_count)
For Postgres on modern SSDs, effective_spindle_count = 0..1. So a 4-core app server typically wants pool_size ≈ 8-10. Multiply by application instances and verify the total stays under the database's max_connections minus headroom for admin sessions and background jobs.
In Python with SQLAlchemy:
from sqlalchemy import create_engine
engine = create_engine(
DATABASE_URL,
pool_size=10, # steady-state connections
max_overflow=5, # burst capacity
pool_timeout=3, # fail fast under saturation, don't queue forever
pool_pre_ping=True, # detect dead connections after network blips
)
Async profiling (Python)
py-spy — sampling profiler, no code changes, attaches to a running process:
# Live top, like `top` but for Python frames
py-spy top --pid <PID>
# Record a flame graph for 30 seconds
py-spy record -o profile.svg --pid <PID> --duration 30
# One-shot dump of every thread's current stack — great for "stuck" processes
py-spy dump --pid <PID>
cProfile — deterministic profiler, in-process:
import cProfile, pstats
with cProfile.Profile() as prof:
run_workload()
stats = pstats.Stats(prof).sort_stats("cumulative")
stats.print_stats(30)
asyncio.gather for batching independent awaits:
# BAD: serial — total latency = sum of all calls
user = await fetch_user(uid)
prefs = await fetch_prefs(uid)
flags = await fetch_flags(uid)
# GOOD: parallel — total latency = max of all calls
user, prefs, flags = await asyncio.gather(
fetch_user(uid),
fetch_prefs(uid),
fetch_flags(uid),
)
asyncio.gather is the async equivalent of fixing N+1 at the HTTP-call layer: replace serial round trips with one concurrent batch.
Unbounded data fetching
# BAD: pulls every row into memory
tasks = session.execute(select(Task)).scalars().all()
# GOOD: paginate (server-side LIMIT)
tasks = (
session.execute(
select(Task).order_by(Task.created_at.desc()).limit(20).offset((page - 1) * 20)
)
.scalars()
.all()
)
For very large datasets, prefer cursor pagination (WHERE created_at < :cursor LIMIT 20) over OFFSET — OFFSET 100000 makes Postgres scan and discard 100k rows.
Image optimization (frontend LCP fix)
<!-- Hero / LCP image: art direction + resolution switching, high priority -->
<picture>
<source
media="(max-width: 767px)"
srcset="/hero-mobile-400.avif 400w, /hero-mobile-800.avif 800w"
sizes="100vw"
width="800" height="1000"
type="image/avif"
/>
<source
srcset="/hero-800.avif 800w, /hero-1200.avif 1200w, /hero-1600.avif 1600w"
sizes="(max-width: 1200px) 100vw, 1200px"
width="1200" height="600"
type="image/avif"
/>
<img
src="/hero-desktop.jpg"
width="1200" height="600"
fetchpriority="high"
alt="Hero image description"
/>
</picture>
<!-- Below-the-fold: lazy + async decoding -->
<img
src="/content.webp"
width="800" height="400"
loading="lazy" decoding="async"
alt="Content image description"
/>
Unnecessary re-renders (React)
// BAD: new object identity every render — children re-render
function TaskList() {
return <TaskFilters options={{ sortBy: 'date', order: 'desc' }} />;
}
// GOOD: stable reference
const DEFAULT_OPTIONS = { sortBy: 'date', order: 'desc' } as const;
function TaskList() {
return <TaskFilters options={DEFAULT_OPTIONS} />;
}
// React.memo for expensive children that get the same props
const TaskItem = React.memo(function TaskItem({ task }: Props) {
return <div>{/* expensive render */}</div>;
});
// useMemo for expensive derived computations
function TaskStats({ tasks }: Props) {
const stats = useMemo(() => calculateStats(tasks), [tasks]);
return <div>{stats.completed} / {stats.total}</div>;
}
Don't sprinkle React.memo and useMemo everywhere — that's its own bug source. Profile first; memoize only on hot paths the profiler points at.
Bundle splitting
// Modern bundlers (Vite, webpack 5+) tree-shake named imports automatically
// when the dependency ships ESM and is marked sideEffects: false. Profile
// before changing import styles — the real gains come from splitting.
const ChartLibrary = lazy(() => import('./ChartLibrary'));
const SettingsPage = lazy(() => import('./pages/Settings'));
function App() {
return (
<Suspense fallback={<Spinner />}>
<SettingsPage />
</Suspense>
);
}
Caching
// In-memory cache with TTL for frequently-read, rarely-changed data
const CACHE_TTL = 5 * 60 * 1000;
let cachedConfig: AppConfig | null = null;
let cacheExpiry = 0;
async function getAppConfig(): Promise<AppConfig> {
if (cachedConfig && Date.now() < cacheExpiry) return cachedConfig;
cachedConfig = await db.config.findFirst();
cacheExpiry = Date.now() + CACHE_TTL;
return cachedConfig;
}
// HTTP caching headers for static assets
app.use(
'/static',
express.static('public', {
maxAge: '1y',
immutable: true, // requires content-hashed filenames
})
);
Performance Budget
Set budgets and enforce them in CI:
JavaScript bundle: < 200KB gzipped (initial load)
CSS: < 50KB gzipped
Images: < 200KB per image (above the fold)
Fonts: < 100KB total
API response: < 200ms p95 (critical paths)
Time to Interactive: < 3.5s on 4G
Lighthouse Performance score: ≥ 90
npx bundlesize --config bundlesize.config.json
npx lhci autorun
For the full per-category checklist (frontend Core Web Vitals, backend latency, database, caching, monitoring), see references/performance-checklist.md.
Common Rationalizations
| Rationalization | Why it's wrong |
|---|---|
| "We'll optimize later" | Performance debt compounds: every new feature builds on the slow foundation. Fix obvious anti-patterns now (N+1, missing indexes, unbounded fetches); defer micro-optimizations. |
| "It's fast on my machine" | Your machine isn't the user's. Profile on representative hardware and networks; target p95, not p50. |
| "This optimization is obvious" | If you didn't measure, you don't know. Half of "obvious" optimizations make things slower (e.g. caching cold paths, memoizing trivial renders). |
| "Users won't notice 100ms" | Research shows 100ms delays measurably impact conversion. INP > 200ms is classified "Needs Improvement" by the standard. |
| "The framework handles performance" | Frameworks prevent some issues but can't fix N+1 queries, oversized bundles, missing indexes, or your custom hot path. |
| "I'll just add an index" | Adding indexes blindly slows writes and bloats storage. Run EXPLAIN ANALYZE first; index based on actual query plans. |
| "Async means it's parallel" | await in a loop is still serial. Use asyncio.gather / Promise.all for independent calls. |
| "Bigger pool = better throughput" | Beyond the formula, larger pools cause database thrashing and lock contention. Size, alert, and verify under load. |
Red Flags
- Optimization landed without before/after numbers in the PR description.
- N+1 query patterns in new data-fetching code (look for
for ... awaitnear a DB call, or ORM lazy-load access in a loop). - List endpoints without pagination, or pagination params that silently default to "all rows".
- Images without
width/height/loading/decodingattributes. - Bundle size grew without a budget check or reviewer comment.
- New endpoint added with no APM dashboard or alert.
React.memooruseMemoapplied to every component (cargo-cult memoization is its own perf bug).- Database query plans (
EXPLAIN ANALYZE) not attached to PRs that change SQL or add indexes. pool_timeoutset to a high value to "hide" pool exhaustion instead of fixing it.- Synchronous I/O on the event loop in async Python (a
requests.getinside anasync def).
Verification
- Before- and after-measurements exist with specific numbers (e.g. "p95 dropped from 740ms → 180ms over 1h baseline window").
- The specific bottleneck was identified and named — not "made it faster" but "removed N+1 in
list_tasksby switching tojoinedload(Task.owner)". - For frontend changes: Core Web Vitals (LCP, INP, CLS) are within "Good" thresholds in both synthetic and RUM samples.
- For backend changes: p95 latency for the affected endpoint sits inside its budget, and
EXPLAIN ANALYZEoutput is attached for any SQL change. - Bundle size delta is reported and within budget; if it grew, the growth is justified by a feature.
- No N+1 patterns introduced — show one query log line covering the new code path with
count = 1(orcount = 1 + 1for a batched fetch). - A regression guard is in place: a CI bundle-size check, a Lighthouse-CI run, an APM alert at 80% of budget, or a load test that runs in CI.
- Existing tests still pass; the optimization didn't change observable behavior.