matlab-optimize-performance - SKILL.md Agent Skill

name: matlab-optimize-performance
description: "Read BEFORE optimizing any MATLAB code for speed. Without this workflow, agents commonly optimize the wrong target, fabricate speedup claims without measurement, or introduce regressions. Guides the 7-step workflow: baseline, profile, identify, optimize, measure, verify, report."
license: MathWorks BSD-3-Clause
metadata:
  author: MathWorks
  version: "1.0"

MATLAB Performance Optimization Workflow

Systematic 7-step workflow for finding and fixing performance bottlenecks in MATLAB code.

When to Use

User asks to speed up or optimize MATLAB code
User wants to find why their MATLAB code is slow
User has a function or script that takes too long to run
User asks to benchmark or time MATLAB code
User wants to compare performance before and after a change
User asks about MATLAB performance best practices

When NOT to Use

Optimizing Simulink model simulation speed (use Simulink Profiler)
The bottleneck is in compiled C/MEX code that can't be changed at the M-code level
The performance issue is purely I/O-bound (file reads, network, database)
User wants to write performance tests (use the writing-matlab-perf-tests skill)

The 7-Step Workflow

Step 1: Establish Baseline

Measure current performance so you have a number to improve against.

For a single function:

f = @() targetFunction(input1, input2);
baseline = timeit(f);
fprintf('Baseline: %.4f s\n', baseline);

For GPU code:

f = @() gpuFunction(gpuInput);
baseline = gputimeit(f);
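
When gputimeit does not fit (for example, a multi-step GPU workflow), a manual tic/toc measurement needs explicit synchronization, because GPU operations can return before they finish. The following is a minimal sketch, assuming Parallel Computing Toolbox and a supported GPU are available; the matrix product is a hypothetical stand-in for gpuFunction:

```matlab
dev = gpuDevice;                  % handle to the currently selected GPU
A = rand(2000, 'gpuArray');       % hypothetical input, created on the GPU

B = A * A;                        % warmup run
wait(dev);                        % make sure the warmup has finished

t = tic;
B = A * A;                        % timed run (the launch is asynchronous)
wait(dev);                        % block until the GPU has finished
elapsed = toc(t);
fprintf('GPU baseline: %.4f s\n', elapsed);
```

Without the second wait, toc would likely measure only the time to queue the work, not the time to complete it.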

For a script or multi-step workflow:

% Warmup run (first call includes JIT compilation)
myWorkflow(inputs);

% Timed run
tic;
myWorkflow(inputs);
baseline = toc;
fprintf('Baseline: %.4f s\n', baseline);

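timeit is preferred because it handles warmup automatically and returns the median of several timed runs. It takes a function handle with no input arguments, which is why the inputs are captured in the anonymous function above.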
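
When tic/toc is the only option, a single timed run is fragile. One way to make it more robust is to repeat the run and take the median, which is roughly what timeit does internally. This is a sketch only: the workload handle is a hypothetical stand-in for myWorkflow, and the run count of 5 is an arbitrary assumption.

```matlab
workload = @() sort(rand(1e5, 1));   % hypothetical stand-in for myWorkflow(inputs)

workload();                          % warmup run, not timed

nRuns = 5;                           % assumed number of repetitions
times = zeros(1, nRuns);             % preallocate the timing samples
for k = 1:nRuns
    t = tic;
    workload();
    times(k) = toc(t);
end

baseline = median(times);            % median is less sensitive to outliers than mean
fprintf('Baseline: %.4f s (min %.4f, max %.4f)\n', baseline, min(times), max(times));
```

The min/max spread gives a rough sense of how noisy the measurement is on the current machine.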

Step 2: Profile and Analyze

Find where the time is actually spent. Do NOT guess — always profile.

profile on;
targetFunction(input1, input2);
profile off;
profile viewer;

Reading profiler results:

Function summary — shows total time and self-time per function. Total time includes everything the function calls; self-time counts only the function's own lines and excludes its callees. Start with the highest self-time.
Per-line detail — click a function name to see time spent on each line. This reveals the exact bottleneck lines.
Call count — functions called thousands/millions of times are prime optimization targets.

Tips:

Run the profiled code multiple times (in a loop) if it's very fast, so the accumulated time rises well above the profiler's timer resolution and overhead
Look at self-time, not total time, to find the true bottleneck
Drill into functions — the summary page only tells part of the story
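
When the interactive viewer is not practical (for example, in a batch session), similar numbers can be read from profile('info'). The sketch below assumes each FunctionTable entry exposes FunctionName, NumCalls, TotalTime, and a Children struct array that has its own TotalTime field; self-time is derived here as total time minus time attributed to children. The work handle is a hypothetical stand-in for targetFunction.

```matlab
work = @(n) sort(cumsum(rand(n)), 2);   % hypothetical stand-in for targetFunction

profile on;
for k = 1:20                            % repeat so very fast code accumulates time
    work(200);
end
profile off;

stats = profile('info');
ft = stats.FunctionTable;               % one entry per profiled function
nFcn = numel(ft);

% Derive self-time: total time minus the time spent in direct callees
selfTime = zeros(nFcn, 1);
for k = 1:nFcn
    childTime = 0;
    if ~isempty(ft(k).Children)
        childTime = sum([ft(k).Children.TotalTime]);
    end
    selfTime(k) = ft(k).TotalTime - childTime;
end

% Print up to five entries with the highest self-time
[~, order] = sort(selfTime, 'descend');
for k = order(1:min(5, nFcn)).'
    fprintf('%-40s calls=%6d total=%.4f s self=%.4f s\n', ...
        ft(k).FunctionName, ft(k).NumCalls, ft(k).TotalTime, selfTime(k));
end
```

If the field names differ in a given release, the viewer remains the authoritative way to read the results.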

Step 3: Identify Optimization Opportunities

Based on profiling results, identify which patterns apply. Read references/optimization-patterns.md for the full catalog.

High-impact patterns:

| Pattern | Typical Speedup | Look For |
| --- | --- | --- |
| Vectorization | 2–200x | Loops doing element-wise math on arrays |
| Preallocation | 2–100x | Arrays growing inside loops (`x = [x; newRow]`) |
| Unnecessary recomputation | 2–50x | Same expensive expression computed multiple times |
| `discretize`/`histcounts` | 2–50x | Loops binning or classifying data |
| Persistent caching | 1.5–95x | Repeated `load()` or expensive object creation |
| Logical indexing | 1.2–5x | Using `find()` just to index into an array |
| `arguments` block | 1.1–1.8x | Functions using `inputParser` |
| Algebraic simplification | 1.5–3x | Redundant `sqrt`, `abs`, or matrix ops |
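
To make a few of these patterns concrete, here is a small sketch of preallocation, vectorization, and logical indexing applied to the same toy computation. The data and sizes are hypothetical, and the snippet shows only the code shapes to look for; it makes no claim about the speedup on any particular machine.

```matlab
n = 1e4;

% Growing array: every append may reallocate and copy
grown = [];
for k = 1:n
    grown(end+1) = k^2; %#ok<AGROW>
end

% Preallocated: the array is sized once before the loop
pre = zeros(1, n);
for k = 1:n
    pre(k) = k^2;
end

% Vectorized: no loop at all
vec = (1:n).^2;

% Logical indexing: the mask indexes directly, find() is not needed
data = randn(1, n);
viaFind = data(find(data > 0)); %#ok<FNDSB>
viaMask = data(data > 0);
```

All three variants of the squares computation produce the same array, which is what Step 6 later checks for real code.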

Before optimizing, verify the target is worth it:

Is self-time > 10% of total? If not, optimizing it won't matter much.
Is it called in a tight loop? High call count × small time = big total.
Is it M-code or a built-in? You can't make a built-in faster, but you can often call it fewer times (e.g., pass a matrix to filtfilt/filter instead of looping over columns).
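
As an illustration of calling a built-in fewer times, the sketch below filters every column of a matrix with one call instead of one call per column. It relies on filter operating along the first non-singleton dimension, so a matrix input is processed column by column. The signal and the moving-average coefficients are hypothetical.

```matlab
X = randn(1000, 50);          % hypothetical data: 50 signals stored as columns
b = ones(1, 5) / 5;           % 5-point moving-average numerator
a = 1;                        % FIR filter, so the denominator is 1

% Before: one filter call per column
Yloop = zeros(size(X));
for c = 1:size(X, 2)
    Yloop(:, c) = filter(b, a, X(:, c));
end

% After: a single call filters all columns
Ymat = filter(b, a, X);

maxErr = max(abs(Yloop(:) - Ymat(:)));
fprintf('Max difference between loop and matrix call: %.2e\n', maxErr);
```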

Step 4: Implement Optimizations

Apply the patterns identified in Step 3. See references/optimization-patterns.md for the full catalog with before/after code examples.

General principles:

Start with the highest-impact pattern from profiling
Move invariant work out of loops (object creation, option parsing, constant expressions)
Replace element-wise loops with array operations where possible
Use purpose-built functions (discretize, cumsum, hypot) instead of hand-written equivalents
For large data, batch the vectorization to control memory (see Pattern 9 in catalog)
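
The last principle can be sketched as follows: rather than building one huge intermediate array, process the data in blocks so each intermediate stays small. This is an illustrative example under assumed sizes, not the catalog's Pattern 9 itself; pts, refs, and chunkSize are hypothetical names, and the code relies on implicit expansion (R2016b or later).

```matlab
n = 1000;                     % hypothetical number of query points
chunkSize = 256;              % assumed block size; tune to available memory
pts = rand(n, 3);
refs = rand(50, 3);

refsPerm = permute(refs, [3 1 2]);          % 1-by-50-by-3
nearest = zeros(n, 1);                      % preallocate the result

for first = 1:chunkSize:n
    last = min(first + chunkSize - 1, n);
    block = permute(pts(first:last, :), [1 3 2]);   % m-by-1-by-3
    d = sqrt(sum((block - refsPerm).^2, 3));        % m-by-50 distances
    nearest(first:last) = min(d, [], 2);            % closest reference per point
end
```

Each iteration allocates at most a chunkSize-by-50-by-3 intermediate, whereas the fully vectorized version would allocate n-by-50-by-3 at once.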

Example — move invariant work out of loops:

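% Before: repeated expensive setup
result = zeros(1, n);  % preallocate; assumes each solve returns a scalar
for i = 1:n
    opts = optimoptions('fminunc', 'Display', 'off');  % rebuilt on every iteration
    result(i) = fminunc(@(x) cost(x, data(i)), x0, opts);
end

% After: setup once
result = zeros(1, n);  % preallocate; assumes each solve returns a scalar
opts = optimoptions('fminunc', 'Display', 'off');      % built once, reused
for i = 1:n
    result(i) = fminunc(@(x) cost(x, data(i)), x0, opts);
end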

Step 5: Measure Optimized Performance

Re-measure using the same method as Step 1:

f = @() optimizedFunction(input1, input2);
optimized = timeit(f);
speedup = baseline / optimized;
fprintf('Optimized: %.4f s (%.2fx speedup)\n', optimized, speedup);

A speedup of 1.2x or more is considered significant. Below that, measurement noise makes it hard to be confident the change helped.
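
One way to judge whether a measured speedup clears the noise is to repeat both measurements and compare the run-to-run spread with the gap between the two versions. The sketch below is a rough heuristic, not a statistical test; slowFcn and fastFcn are hypothetical stand-ins for the original and optimized code, and the repetition count is an assumption.

```matlab
slowFcn = @() sum(arrayfun(@(v) v^2, 1:2e4));   % hypothetical baseline
fastFcn = @() sum((1:2e4).^2);                  % hypothetical optimized version

nRep = 5;                                       % assumed number of repetitions
tBase = zeros(1, nRep);
tOpt = zeros(1, nRep);
for k = 1:nRep
    tBase(k) = timeit(slowFcn);
    tOpt(k) = timeit(fastFcn);
end

speedup = median(tBase) / median(tOpt);
spreadBase = (max(tBase) - min(tBase)) / median(tBase);   % relative run-to-run spread
spreadOpt = (max(tOpt) - min(tOpt)) / median(tOpt);

fprintf('Speedup: %.2fx\n', speedup);
fprintf('Relative spread: baseline %.1f%%, optimized %.1f%%\n', ...
    100 * spreadBase, 100 * spreadOpt);
```

If the relative spread is comparable to the claimed improvement, the result is probably not trustworthy yet.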

Step 6: Verify Correctness

Every optimization must produce the same results as the original:

original = originalFunction(input1, input2);
fast = optimizedFunction(input1, input2);

% Numeric comparison (allows floating-point tolerance)
maxErr = max(abs(original(:) - fast(:)));
fprintf('Max error: %.2e\n', maxErr);
assert(maxErr < 1e-10, 'Results differ beyond tolerance!');

For non-numeric outputs:

assert(isequal(original, fast), 'Results differ!');

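If results differ slightly due to floating-point reordering (e.g., summing in a different order), that's usually acceptable. Document the expected tolerance, and scale it to the magnitude of the outputs: a fixed absolute tolerance such as 1e-10 is too strict for large values and too loose for very small ones.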
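
A comparison that scales with the data could look like the sketch below. It checks sizes and NaN positions first, then applies a combined absolute and relative tolerance. The tolerance values and the two example outputs are assumptions chosen for illustration.

```matlab
original = cumsum(0.1 * ones(1, 1000));   % hypothetical reference output
fast = 0.1 * (1:1000);                    % hypothetical optimized output

relTol = 1e-10;                           % assumed relative tolerance
absTol = 1e-12;                           % assumed absolute floor for values near zero

% Structural checks before comparing values
assert(isequal(size(original), size(fast)), 'Output sizes differ!');
assert(isequal(isnan(original), isnan(fast)), 'NaN positions differ!');

% Element-wise check against a tolerance that scales with magnitude
absErr = abs(original - fast);
withinTol = absErr <= max(absTol, relTol * abs(original));
withinTol(isnan(original)) = true;        % matching NaNs were already verified

assert(all(withinTol(:)), 'Results differ beyond tolerance!');
fprintf('Max abs error: %.2e\n', max(absErr(:)));
```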

Step 7: Report Results

Summarize what was done and the improvement achieved:

fprintf('\n=== Performance Optimization Report ===\n');
fprintf('Target: %s\n', funcName);
fprintf('Baseline: %.4f s\n', baseline);
fprintf('Optimized: %.4f s\n', optimized);
fprintf('Speedup: %.2fx\n', speedup);
fprintf('Correctness: max error = %.2e\n', maxErr);
fprintf('Pattern applied: %s\n', patternName);

For multiple optimizations, report each speedup individually and the overall end-to-end improvement.
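
A report for several successive optimizations could be assembled as in the sketch below. The step names and timings are placeholders, not measurements; in practice each entry would come from the Step 5 measurement taken after that change.

```matlab
% Placeholder values for illustration only; replace with measured timings
stepNames = {'baseline', 'preallocation', 'vectorization'};
stepTimes = [2.0, 1.0, 0.25];             % seconds, end-to-end, after each step

% Speedup of each step relative to the previous state
stepSpeedup = [1, stepTimes(1:end-1) ./ stepTimes(2:end)];

% Overall end-to-end improvement
overall = stepTimes(1) / stepTimes(end);

fprintf('\n=== Performance Optimization Report ===\n');
for k = 1:numel(stepNames)
    fprintf('%-15s %.4f s (%.2fx vs previous)\n', ...
        stepNames{k}, stepTimes(k), stepSpeedup(k));
end
fprintf('Overall speedup: %.2fx\n', overall);
```

The individual speedups multiply to the overall figure, which makes it easy to spot a step that was reported against the wrong reference.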

Key Rules

Always profile before optimizing — never guess where the bottleneck is
One change at a time — measure after each optimization to know what helped
Verify correctness — every optimization must produce equivalent output
1.2x threshold — speedups below 1.2x are not reliably distinguishable from noise
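GPU timing — prefer gputimeit, which synchronizes for you; with tic/toc, call wait(gpuDevice) before tic and again before toc
Use timeit — it handles warmup and reports the median of repeated runs; reserve tic/toc (with a warmup call) for scripts and multi-step workflows that are awkward to wrap in a function handle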

Common Mistakes

| Mistake | Why It's Wrong | Do This Instead |
| --- | --- | --- |
| Optimizing without profiling | You'll fix the wrong thing | Profile first (Step 2) |
| Single `tic/toc` without warmup | Includes JIT compilation time | Use `timeit` or add a warmup call |
| Timing GPU code without sync | GPU ops are async; `toc` fires early | Use `gputimeit`, or call `wait(gpuDevice)` before `tic` and before `toc` |
| Growing arrays in loops | Each append can copy the entire array | Preallocate before the loop |
| Vectorizing huge arrays blindly | Intermediate arrays may exceed memory | Use chunked processing for large data |
| Reporting only subfunction speedup | Misleading if subfunction is 5% of total | Always report end-to-end timing |
| Assuming faster = correct | Bugs can make code fast (by skipping work) | Always verify results match (Step 6) |

Reference Files

references/optimization-patterns.md — Full catalog of optimization patterns with code examples and measured speedups
references/measurement-templates.md — Ready-to-use MATLAB script templates for each workflow step