portbench-llm-portfolio-benchmark - SKILL.md Agent Skill

name: portbench-llm-portfolio-benchmark
description: "PortBench: correlation-aware full-pipeline benchmark for LLM-driven portfolio management. Dual-layer evaluation (static QA + dynamic allocation pipeline) with CEPS metric for compounding reasoning errors. 90% of LLM-profile combinations fail to beat equal-weight. arXiv:2605.27887"
tags: ["portfolio-management", "llm-benchmark", "finance", "quantitative-finance"]
arxiv_id: "2605.27887"

PortBench: LLM-Driven Portfolio Management Benchmark

Correlation-aware, full-pipeline benchmark for evaluating LLM performance in portfolio management. Published 2026-05-27 (arXiv:2605.27887).

Core Problem

LLMs show strong performance on financial QA tasks, but portfolio management (PM) remains poorly benchmarked. Existing benchmarks have two main gaps:

They ignore cross-asset correlations, so they cannot distinguish genuinely diversified portfolios from concentrated ones
They do not evaluate the complete PM pipeline, so the real-world decision cycle goes untested

Methodology

Benchmark Design

The benchmark covers six heterogeneous asset classes over 10 years of historical data.

Two complementary evaluation layers:

Static QA layer: 6,269 correlation-based questions across 7 task templates
Dynamic pipeline layer: a 5-stage allocation pipeline mirroring the real PM decision cycle:
- Stage 1: Market regime assessment
- Stage 2: Asset selection
- Stage 3: Weight allocation
- Stage 4: Risk constraint check
- Stage 5: Rebalancing decision
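
The five stages above can be sketched as a sequential pipeline in which each stage reads the outputs of the earlier ones. This is a minimal illustration only: the stage functions, state keys (`volatility`, `max_weight`) and thresholds are hypothetical rule-based stand-ins for LLM calls, not PortBench's actual harness.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Stage = Tuple[str, Callable[[State], State]]


def run_pipeline(stages: List[Stage], state: State) -> Tuple[State, List[str]]:
    """Run stages in order, merging each stage's output into the shared state."""
    trace: List[str] = []
    for name, stage_fn in stages:
        state = {**state, **stage_fn(state)}
        trace.append(name)
    return state, trace


# Hypothetical rule-based stand-ins for the LLM call made at each stage.
def assess_regime(s: State) -> State:
    return {"regime": "stress" if s["volatility"] > 0.30 else "normal"}


def select_assets(s: State) -> State:
    picks = ["bonds", "gold"] if s["regime"] == "stress" else ["equities", "bonds"]
    return {"assets": picks}


def allocate_weights(s: State) -> State:
    n = len(s["assets"])
    return {"weights": {asset: 1.0 / n for asset in s["assets"]}}


def check_risk(s: State) -> State:
    return {"compliant": max(s["weights"].values()) <= s["max_weight"]}


def decide_rebalance(s: State) -> State:
    return {"rebalance": s["compliant"] and s["regime"] == "stress"}


STAGES: List[Stage] = [
    ("regime_assessment", assess_regime),
    ("asset_selection", select_assets),
    ("weight_allocation", allocate_weights),
    ("risk_constraint_check", check_risk),
    ("rebalancing_decision", decide_rebalance),
]

final, trace = run_pipeline(STAGES, {"volatility": 0.35, "max_weight": 0.6})
print(trace)
print(final["weights"], final["compliant"], final["rebalance"])
```

Because every later stage consumes earlier outputs, a wrong regime call in this sketch changes the selected assets, the weights, and the rebalancing decision.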

Novel Metrics

Dual-Layer Correlation Score: Measures whether proposed portfolios exploit inter-class hedging and avoid intra-class concentration.
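
This summary does not give the exact formula for the Dual-Layer Correlation Score. As a rough sketch of the idea, and not the paper's definition, one can compute a weight-averaged pairwise correlation separately for asset pairs in different classes (inter-class) and in the same class (intra-class). The function name, the three-asset example, and the correlation values below are illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, List


def layer_correlations(
    weights: List[float], classes: List[str], corr: List[List[float]]
) -> Dict[str, float]:
    """Weight-averaged pairwise correlation, split into inter- and intra-class pairs."""
    # For each layer keep [sum of w_i * w_j * rho_ij, sum of w_i * w_j].
    sums = {"inter": [0.0, 0.0], "intra": [0.0, 0.0]}
    for i, j in combinations(range(len(weights)), 2):
        layer = "intra" if classes[i] == classes[j] else "inter"
        pair_weight = weights[i] * weights[j]
        sums[layer][0] += pair_weight * corr[i][j]
        sums[layer][1] += pair_weight
    return {k: (num / den if den > 0 else 0.0) for k, (num, den) in sums.items()}


# Hypothetical example: two equities and one bond.
classes = ["equity", "equity", "bond"]
corr = [
    [1.0, 0.8, -0.2],
    [0.8, 1.0, -0.1],
    [-0.2, -0.1, 1.0],
]
scores = layer_correlations([0.25, 0.25, 0.5], classes, corr)
print(scores)
```

Under this sketch, a low or negative inter-class value suggests the portfolio uses hedging relationships across classes, while a high intra-class value flags concentration in assets that move together.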

CEPS (Compounding Error Propagation Score): Quantifies how reasoning errors compound across pipeline stages. Critical insight: errors at early stages amplify through the decision chain.
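
The CEPS formula is not reproduced in this summary, so the following is only a toy model of the compounding effect it targets. It assumes each stage is correct with some probability given that all earlier stages were correct, and multiplies those probabilities to get the chance that the whole chain is still correct. The per-stage accuracy of 0.9 is an invented number.

```python
from typing import List


def cumulative_success(stage_accuracy: List[float]) -> List[float]:
    """Probability that every stage up to and including stage k is correct."""
    out: List[float] = []
    running = 1.0
    for acc in stage_accuracy:
        running *= acc
        out.append(running)
    return out


# Assumed: each of the five stages is right 90% of the time.
survival = cumulative_success([0.9] * 5)
for stage, p in enumerate(survival, start=1):
    print(f"after stage {stage}: {p:.4f}")
```

In this toy model, five stages that are each 90% reliable leave only about a 59% chance of an error-free chain, which is why per-stage accuracy alone can overstate end-to-end quality.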

Evaluation Protocol

10 frontier LLMs evaluated
3 historical stress regimes: market crashes, volatility spikes, liquidity crises
3 risk profiles: conservative, moderate, aggressive
Strategy robustness and investor alignment assessment

Key Findings

90% of LLM-profile combinations fail to outperform equal-weight allocation — despite strong performance on static financial QA
Procedural compliance ≠ performance — models satisfying every constraint still suffer catastrophic drawdowns under stress
Static QA performance does not transfer to portfolio decisions — the gap between knowledge and action is large
Cross-asset correlation awareness is the key differentiator — portfolios that don't account for hedging relationships underperform systematically

Reusable Skill Patterns

Pattern 1: Correlation-Aware Benchmarking

When evaluating financial AI systems:
1. Test cross-asset correlation structures, not just single-asset predictions
2. Measure hedging effectiveness and diversification quality
3. Compare against naive baselines (equal-weight, market-cap weighted)
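
Step 3 can be sketched as a direct comparison of cumulative return against a 1/N portfolio. The return figures and candidate weights below are made up for illustration, and the sketch assumes the portfolio is rebalanced to its target weights every period.

```python
from typing import List


def portfolio_returns(asset_returns: List[List[float]], weights: List[float]) -> List[float]:
    """Per-period portfolio return, rebalancing to the target weights each period."""
    return [sum(w * r for w, r in zip(weights, period)) for period in asset_returns]


def cumulative_return(returns: List[float]) -> float:
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def equal_weight(n_assets: int) -> List[float]:
    return [1.0 / n_assets] * n_assets


# Hypothetical returns for two assets over three periods.
asset_returns = [[0.10, -0.02], [-0.20, 0.01], [0.05, 0.00]]
candidate = [0.9, 0.1]  # e.g. weights proposed by a model

candidate_cum = cumulative_return(portfolio_returns(asset_returns, candidate))
baseline_cum = cumulative_return(portfolio_returns(asset_returns, equal_weight(2)))
beats_baseline = candidate_cum > baseline_cum
print(f"candidate={candidate_cum:.4f} baseline={baseline_cum:.4f} beats={beats_baseline}")
```

Here the concentrated candidate loses more than the equal-weight portfolio. A fuller comparison would also use risk-adjusted measures rather than cumulative return alone.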

Pattern 2: Pipeline-Stage Error Analysis

For multi-stage decision systems:
1. Decompose the decision pipeline into discrete stages
2. Track error propagation: CEPS measures compounding effects
3. Identify which stages contribute most to final performance degradation
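
One simple way to approach step 3, distinct from CEPS itself, is to attribute each run's final loss to the earliest stage that failed and average per stage. The stage names, correctness flags, and loss values below are hypothetical, and judging per-stage correctness is assumed to be done elsewhere.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

STAGE_NAMES = ["regime", "selection", "weights", "risk_check", "rebalance"]


def loss_by_first_failure(
    runs: List[Tuple[List[bool], float]], stage_names: List[str]
) -> Dict[str, float]:
    """Mean final loss, grouped by the earliest stage judged incorrect in each run."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for stage_ok, loss in runs:
        first_bad = next(
            (name for name, ok in zip(stage_names, stage_ok) if not ok), "none"
        )
        grouped[first_bad].append(loss)
    return {name: sum(vals) / len(vals) for name, vals in grouped.items()}


# Each run: (per-stage correctness flags, final portfolio loss). Invented data.
runs = [
    ([True, True, True, True, True], 0.01),
    ([False, True, True, True, True], 0.20),
    ([False, False, True, True, True], 0.30),
    ([True, True, False, True, True], 0.08),
]
attribution = loss_by_first_failure(runs, STAGE_NAMES)
print(attribution)
```

In this invented data, runs that first fail at the regime stage have the largest average loss, which is the pattern one would expect if early errors amplify downstream.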

Pattern 3: Stress Regime Testing

Robust financial evaluation requires:
1. Testing under multiple historical stress scenarios
2. Separate evaluation for different risk profiles
3. Measuring drawdown severity, not just returns
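
Drawdown severity is commonly measured as maximum drawdown, the largest peak-to-trough decline in portfolio value. The sketch below uses the standard definition with an invented return series; how PortBench itself computes drawdown is not stated in this summary.

```python
from typing import List


def max_drawdown(returns: List[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak value."""
    value = peak = 1.0
    worst = 0.0
    for r in returns:
        value *= 1.0 + r
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


# Hypothetical period returns: a gain, two losses, then a recovery.
returns = [0.10, -0.20, -0.10, 0.30]
mdd = max_drawdown(returns)

total = 1.0
for r in returns:
    total *= 1.0 + r
total -= 1.0

print(f"total return={total:.4f} max drawdown={mdd:.4f}")
```

The series ends with a small positive total return even though the portfolio was down 28% from its peak along the way, which a returns-only metric would hide.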

Pattern 4: Procedural vs Performance Decoupling

Constraint satisfaction does not guarantee performance:
1. Evaluate both procedural compliance AND outcome quality
2. Models can follow all rules and still underperform
3. Use outcome-based metrics in addition to checklist validation
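
The decoupling can be illustrated by scoring compliance and outcome separately. The constraint set (long-only, fully invested, a 40% position cap) and the stress returns below are assumptions chosen for the example, not PortBench's actual rules or data.

```python
from typing import List


def is_compliant(weights: List[float], max_weight: float = 0.4, tol: float = 1e-9) -> bool:
    """Checklist validation: long-only, fully invested, no position above the cap."""
    return (
        all(w >= 0.0 for w in weights)
        and abs(sum(weights) - 1.0) <= tol
        and max(weights) <= max_weight + tol
    )


def stress_loss(weights: List[float], stress_returns: List[float]) -> float:
    """Outcome metric: portfolio loss over a single stress period."""
    return -sum(w * r for w, r in zip(weights, stress_returns))


# Hypothetical portfolio of three highly correlated assets in a crash.
weights = [0.4, 0.3, 0.3]
stress_returns = [-0.35, -0.30, -0.40]

compliant = is_compliant(weights)
loss = stress_loss(weights, stress_returns)
print(f"compliant={compliant} stress loss={loss:.2f}")
```

The portfolio passes every check yet loses 35% in the stress period, so a checklist-only evaluation would report success where an outcome metric reports failure.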

Application Areas

LLM evaluation for financial tasks
Portfolio management system benchmarking
Multi-stage decision pipeline analysis
Cross-asset correlation modeling
Stress testing of AI financial advisors
Quantitative finance research

Pitfalls

LLM QA ≠ portfolio skill: Strong financial knowledge doesn't translate to good portfolio decisions
Correlation blindness: Many models ignore cross-asset relationships
Stress regime fragility: Models that perform well in normal conditions can fail catastrophically under stress
Equal-weight is a strong baseline: A portfolio strategy that cannot beat 1/N allocation adds no value over naive diversification