name: matlab-optimize-performance description: "Read BEFORE optimizing any MATLAB code for speed. Without this workflow, agents commonly optimize the wrong target, fabricate speedup claims without measurement, or introduce regressions. Guides the 7-step workflow: baseline, profile, identify, optimize, measure, verify, report." license: MathWorks BSD-3-Clause metadata: author: MathWorks version: "1.0"
MATLAB Performance Optimization Workflow
Systematic 7-step workflow for finding and fixing performance bottlenecks in MATLAB code.
When to Use
- User asks to speed up or optimize MATLAB code
- User wants to find why their MATLAB code is slow
- User has a function or script that takes too long to run
- User asks to benchmark or time MATLAB code
- User wants to compare performance before and after a change
- User asks about MATLAB performance best practices
When NOT to Use
- Optimizing Simulink model simulation speed (use Simulink Profiler)
- The bottleneck is in compiled C/MEX code that can't be changed at the M-code level
- The performance issue is purely I/O-bound (file reads, network, database)
- User wants to write performance tests (use the
writing-matlab-perf-testsskill)
The 7-Step Workflow
Step 1: Establish Baseline
Measure current performance so you have a number to improve against.
For a single function:
f = @() targetFunction(input1, input2);
baseline = timeit(f);
fprintf('Baseline: %.4f s\n', baseline);
For GPU code:
f = @() gpuFunction(gpuInput);
baseline = gputimeit(f);
For a script or multi-step workflow:
% Warmup run (first call includes JIT compilation)
myWorkflow(inputs);
% Timed run
tic;
myWorkflow(inputs);
baseline = toc;
fprintf('Baseline: %.4f s\n', baseline);
timeit is preferred because it handles warmup and runs multiple samples automatically.
Step 2: Profile and Analyze
Find where the time is actually spent. Do NOT guess — always profile.
profile on;
targetFunction(input1, input2);
profile off;
profile viewer;
Reading profiler results:
- Function summary — shows total time and self-time per function. Self-time is time spent in that function, not its callees. Start with the highest self-time.
- Per-line detail — click a function name to see time spent on each line. This reveals the exact bottleneck lines.
- Call count — functions called thousands/millions of times are prime optimization targets.
Tips:
- Run the profiled code multiple times (in a loop) if it's very fast, so the profiler collects enough samples
- Look at self-time, not total time, to find the true bottleneck
- Drill into functions — the summary page only tells part of the story
Step 3: Identify Optimization Opportunities
Based on profiling results, identify which patterns apply. Read references/optimization-patterns.md for the full catalog.
High-impact patterns:
| Pattern | Typical Speedup | Look For |
|---|---|---|
| Vectorization | 2–200x | Loops doing element-wise math on arrays |
| Preallocation | 2–100x | Arrays growing inside loops (x = [x; newRow]) |
| Unnecessary recomputation | 2–50x | Same expensive expression computed multiple times |
discretize/histcounts |
2–50x | Loops binning or classifying data |
| Persistent caching | 1.5–95x | Repeated load() or expensive object creation |
| Logical indexing | 1.2–5x | Using find() just to index into an array |
arguments block |
1.1–1.8x | Functions using inputParser |
| Algebraic simplification | 1.5–3x | Redundant sqrt, abs, or matrix ops |
Before optimizing, verify the target is worth it:
- Is self-time > 10% of total? If not, optimizing it won't matter much.
- Is it called in a tight loop? High call count × small time = big total.
- Is it M-code or a built-in? You can't make a built-in faster, but you can often call it fewer times (e.g., pass a matrix to
filtfilt/filterinstead of looping over columns).
Step 4: Implement Optimizations
Apply the patterns identified in Step 3. See references/optimization-patterns.md for the full catalog with before/after code examples.
General principles:
- Start with the highest-impact pattern from profiling
- Move invariant work out of loops (object creation, option parsing, constant expressions)
- Replace element-wise loops with array operations where possible
- Use purpose-built functions (
discretize,cumsum,hypot) instead of hand-written equivalents - For large data, batch the vectorization to control memory (see Pattern 9 in catalog)
Example — move invariant work out of loops:
% Before: repeated expensive setup
for i = 1:n
opts = optimoptions('fminunc', 'Display', 'off');
result(i) = fminunc(@(x) cost(x, data(i)), x0, opts);
end
% After: setup once
opts = optimoptions('fminunc', 'Display', 'off');
for i = 1:n
result(i) = fminunc(@(x) cost(x, data(i)), x0, opts);
end
Step 5: Measure Optimized Performance
Re-measure using the same method as Step 1:
f = @() optimizedFunction(input1, input2);
optimized = timeit(f);
speedup = baseline / optimized;
fprintf('Optimized: %.4f s (%.2fx speedup)\n', optimized, speedup);
A speedup of 1.2x or more is considered significant. Below that, measurement noise makes it hard to be confident the change helped.
Step 6: Verify Correctness
Every optimization must produce the same results as the original:
original = originalFunction(input1, input2);
fast = optimizedFunction(input1, input2);
% Numeric comparison (allows floating-point tolerance)
maxErr = max(abs(original(:) - fast(:)));
fprintf('Max error: %.2e\n', maxErr);
assert(maxErr < 1e-10, 'Results differ beyond tolerance!');
For non-numeric outputs:
assert(isequal(original, fast), 'Results differ!');
If results differ slightly due to floating-point reordering (e.g., summing in a different order), that's usually acceptable. Document the expected tolerance.
Step 7: Report Results
Summarize what was done and the improvement achieved:
fprintf('\n=== Performance Optimization Report ===\n');
fprintf('Target: %s\n', funcName);
fprintf('Baseline: %.4f s\n', baseline);
fprintf('Optimized: %.4f s\n', optimized);
fprintf('Speedup: %.2fx\n', speedup);
fprintf('Correctness: max error = %.2e\n', maxErr);
fprintf('Pattern applied: %s\n', patternName);
For multiple optimizations, report each speedup individually and the overall end-to-end improvement.
Key Rules
- Always profile before optimizing — never guess where the bottleneck is
- One change at a time — measure after each optimization to know what helped
- Verify correctness — every optimization must produce equivalent output
- 1.2x threshold — speedups below 1.2x are not reliably distinguishable from noise
- GPU timing — always
wait(gpuDevice)before and after timing GPU code - Use
timeit— it handles warmup and averaging; avoid rawtic/tocfor benchmarks
Common Mistakes
| Mistake | Why It's Wrong | Do This Instead |
|---|---|---|
| Optimizing without profiling | You'll fix the wrong thing | Profile first (Step 2) |
Single tic/toc without warmup |
Includes JIT compilation time | Use timeit or add a warmup call |
| Timing GPU code without sync | GPU ops are async; toc fires early |
wait(gpuDevice) before and after |
| Growing arrays in loops | Each append copies the entire array | Preallocate before the loop |
| Vectorizing huge arrays blindly | May exceed memory | Use chunked processing for large data |
| Reporting only subfunction speedup | Misleading if subfunction is 5% of total | Always report end-to-end timing |
| Assuming faster = correct | Bugs can make code fast (by skipping work) | Always verify results match (Step 6) |
Reference Files
references/optimization-patterns.md— Full catalog of optimization patterns with code examples and measured speedupsreferences/measurement-templates.md— Ready-to-use MATLAB script templates for each workflow step
Copyright 2026 The MathWorks, Inc.