ai-reinforcement-learning - SKILL.md Agent Skill

name: ai-reinforcement-learning
compatibility: opencode
completeness: 95
content-types:
- code
- guidance
- config
- do-dont
description: '"Provides Reinforcement Learning for automated trading agents and policy optimization"'
license: MIT
maturity: stable
metadata:
  domain: trading
  output-format: code
  related-skills: ai-anomaly-detection, ai-explainable-ai
  role: implementation
  scope: implementation
  triggers: agents, ai reinforcement learning, ai-reinforcement-learning, automated, trading
  archetypes:
  - tactical
  anti_triggers:
  - brainstorming
  - vague ideation
  - no risk management
  response_profile:
    verbosity: low
    directive_strength: high
    abstraction_level: operational
  version: "1.0.0"

Role: Design and implement RL agents that learn optimal trading strategies through interaction with market environments

Philosophy: Trading is a sequential decision-making problem where agents learn from trial and error. Emphasize sample efficiency, stability, and generalization across market regimes.

Key Principles

Environment Design: Simulate realistic market conditions including transaction costs, slippage, and latency
State Representation: Include price features, order book dynamics, and portfolio state
Reward Engineering: Balance profit with risk constraints and transaction costs
Algorithm Selection: Prefer PPO or SAC for continuous action spaces; basic DQN supports only discrete actions and tends to train less stably
Backtesting Integration: Validate policies in simulated environments before live deployment
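
As a rough illustration of the environment-design and reward-engineering principles above, the sketch below applies a proportional fee and a fixed slippage (in basis points) to a single fill. The helper name `simulate_fill` and the default rates are assumptions made for this example and are not part of the skill's codebase.

```python
from typing import Tuple


def simulate_fill(price: float, quantity: float,
                  cost_rate: float = 0.001,
                  slippage_bps: float = 5.0) -> Tuple[float, float]:
    """Return (fill_price, cash_outflow) for a signed order quantity.

    Hypothetical helper: buys (quantity > 0) fill above the quoted price,
    sells (quantity < 0) fill below it, and fees scale with traded notional.
    """
    direction = 1.0 if quantity > 0 else -1.0 if quantity < 0 else 0.0
    fill_price = price * (1.0 + direction * slippage_bps / 10_000.0)
    fee = abs(quantity) * fill_price * cost_rate
    # Buying spends cash (positive outflow); selling returns cash (negative outflow)
    cash_outflow = quantity * fill_price + fee
    return fill_price, cash_outflow
```

Charging the agent the slipped price rather than the quoted one keeps simulated profits from overstating what a live order would achieve.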

Implementation Guidelines

Structure

Core logic: rl/trading_agent.py - Agent class with policy network
Environment: rl/market_env.py - Custom Gymnasium environment
Training: rl/train.py - Training loop with callbacks
Config: config/rl_config.yaml - Hyperparameters and paths

Patterns to Follow

Use Stable Baselines3 or RLlib for production algorithms
Implement vectorized environments for parallel training
Add monitoring callbacks for metrics and early stopping
Save both policy and environment state for reproducibility
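
The early-stopping pattern above can be sketched without any library API. The class below is a hypothetical, patience-based stopper driven by a validation metric (for example, mean out-of-sample episode return); its name and defaults are assumptions. RL libraries such as Stable Baselines3 provide their own callback classes for this purpose, which would normally be preferred in production.

```python
class EarlyStopper:
    """Hypothetical patience-based stopper driven by a validation metric."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience      # Evaluations to wait without improvement
        self.min_delta = min_delta    # Minimum gain that counts as improvement
        self.best = float("-inf")
        self.stale_evals = 0

    def update(self, metric: float) -> bool:
        """Record one evaluation result; return True when training should stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.stale_evals = 0
        else:
            self.stale_evals += 1
        return self.stale_evals >= self.patience
```

Calling `update` once per evaluation round and breaking out of the training loop when it returns True is the intended usage.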

Adherence Checklist

Before completing your task, verify:

Environment simulates realistic trading constraints (costs, limits)
State space includes both market and portfolio features
Reward function penalizes excessive trading and drawdowns
Policy uses continuous actions (position size/alpha), not discrete ones
Training includes validation in out-of-sample periods
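
One way to meet the out-of-sample item in the checklist above is a walk-forward split, where every test window lies strictly after its training window. The function name and the rolling step (one test window at a time) are assumptions chosen for this sketch.

```python
from typing import Iterator, Tuple


def walk_forward_splits(n_samples: int, train_size: int,
                        test_size: int) -> Iterator[Tuple[range, range]]:
    """Yield (train_indices, test_indices) with each test window after its train window."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # Roll forward by one test window
```

Training on each `train` range and scoring the frozen policy on the matching `test` range gives a sequence of out-of-sample results that can be compared across market regimes.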

Code Examples

Basic Trading Environment

import gymnasium as gym
import numpy as np
from typing import Dict, Tuple

class TradingEnvironment(gym.Env):
    """Simple trading environment with transaction costs."""
    
    def __init__(self, prices: np.ndarray, initial_capital: float = 10000.0):
        super().__init__()
        self.prices = prices
        self.initial_capital = initial_capital
        self.n_steps = len(prices)
        
        # Action: target position in [-1, 1] (-1 = fully short, 0 = flat, 1 = fully long)
        self.action_space = gym.spaces.Box(
            low=-1.0, high=1.0, shape=(1,), dtype=np.float32
        )
        
        # Observation: [price, position, cash, return_1d, return_5d, volatility]
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(6,), dtype=np.float32
        )
        
        self.current_step = 0
        self.position = 0.0
        self.cash = initial_capital
        self.transaction_cost = 0.001  # 0.1% per trade
        self.recent_rewards = []  # Reward history used by the risk penalty
        
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.current_step = 0
        self.position = 0.0
        self.cash = self.initial_capital
        self.recent_rewards = []
        return self._get_observation(), {}
    
    def _get_observation(self) -> np.ndarray:
        """Build state vector from market and portfolio data."""
        # Clamp the index so the terminal observation never reads past the data
        t = min(self.current_step, self.n_steps - 1)
        price = self.prices[t]
        
        # Returns fall back to 0.0 until enough history exists
        return_1d = (price - self.prices[t - 1]) / self.prices[t - 1] if t >= 1 else 0.0
        return_5d = (price - self.prices[t - 5]) / self.prices[t - 5] if t >= 5 else 0.0
        prices_window = self.prices[max(0, t - 20):t]
        volatility = np.std(np.diff(np.log(prices_window))) if len(prices_window) > 2 else 0.01
        
        return np.array([
            price / 100.0,           # Scaled price (assumes prices near 100)
            self.position,           # Current position
            self.cash / self.initial_capital,  # Normalized cash
            return_1d,               # 1-day return
            return_5d,               # 5-day return
            volatility               # Historical volatility
        ], dtype=np.float32)
    
    def step(self, action: np.ndarray) -> Tuple[np.ndarray, float, bool, bool, Dict]:
        """Rebalance to the target position and advance one time step."""
        target = float(np.clip(action, -1.0, 1.0)[0])
        
        current_price = self.prices[self.current_step]
        position_change = target - self.position
        
        # Calculate transaction costs
        cost = abs(position_change) * current_price * self.transaction_cost
        
        # Update position and cash
        self.position = target
        self.cash -= position_change * current_price + cost
        
        # Advance one bar; the episode terminates on the last available price
        self.current_step += 1
        terminated = self.current_step >= self.n_steps - 1
        next_price = self.prices[min(self.current_step, self.n_steps - 1)]
        
        # Reward: position-weighted return minus trading and risk penalties
        reward = self.position * (next_price - current_price) / current_price
        reward -= 0.01 * abs(position_change)  # Penalty for excessive trading
        self.recent_rewards.append(reward)
        if len(self.recent_rewards) > 1:
            # Risk aversion: penalize volatility of the last 20 rewards
            reward -= 0.001 * float(np.std(self.recent_rewards[-20:]))
        
        return self._get_observation(), float(reward), terminated, False, {}

PPO Trading Agent

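from typing import Dict

import numpy as np
import torch
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback
from stable_baselines3.common.env_util import make_vec_env

# TradingEnvironment is the class defined in the previous example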

class TradingAgent:
    """RL agent for trading using PPO algorithm."""
    
    def __init__(self, environment_params: Dict):
        self.env = make_vec_env(
            lambda: TradingEnvironment(**environment_params),
            n_envs=4
        )
        self.model = PPO(
            "MlpPolicy",
            self.env,
            verbose=1,
            n_steps=2048,
            batch_size=64,
            n_epochs=10,
            gamma=0.99,
            gae_lambda=0.95,
            ent_coef=0.01,
            clip_range=0.2,
            learning_rate=3e-4,
            policy_kwargs=dict(
                net_arch=[256, 128, 64],
                activation_fn=torch.nn.ReLU
            )
        )
    
    def train(self, total_timesteps: int = 100000):
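        """Train the agent with checkpointing."""
        # save_freq counts steps per environment, so divide by the number of parallel envs
        checkpoint_callback = CheckpointCallback(
            save_freq=max(10000 // self.env.num_envs, 1),
            save_path="./models/",
            name_prefix="ppo_trading_agent"
        )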
        
        self.model.learn(
            total_timesteps=total_timesteps,
            callback=checkpoint_callback
        )
    
    def predict(self, observation: np.ndarray) -> np.ndarray:
        """Get trading action for given state."""
        action, _ = self.model.predict(observation, deterministic=True)
        return action
    
    def save(self, path: str):
        """Save trained model."""
        self.model.save(path)
    
    def load(self, path: str):
        """Load trained model and reattach the environment so training can resume."""
        self.model = PPO.load(path, env=self.env)

Risk-Adjusted Reward Function

def risk_adjusted_reward(
    returns: np.ndarray,
    position: float,
    max_position: float = 1.0,
    risk_aversion: float = 0.5
) -> float:
    """Calculate reward with risk penalties."""
    # Sharpe ratio approximation (not annualized, risk-free rate assumed zero)
    if len(returns) > 1 and np.std(returns) > 0:
        sharpe = np.mean(returns) / np.std(returns)
    else:
        sharpe = 0.0
    
    # Position penalty (avoid over-concentration)
    position_penalty = risk_aversion * (abs(position) / max_position) ** 2
    
    # Drawdown penalty
    cumulative_returns = np.cumsum(returns)
    running_max = np.maximum.accumulate(cumulative_returns)
    drawdown = running_max - cumulative_returns
    max_drawdown = np.max(drawdown) if len(drawdown) > 0 else 0.0
    drawdown_penalty = risk_aversion * 0.1 * max_drawdown
    
    return sharpe - position_penalty - drawdown_penalty

def transaction_cost(prices: np.ndarray, positions: np.ndarray, rate: float = 0.001) -> float:
    """Calculate total transaction costs, assuming the portfolio starts flat."""
    position_changes = np.diff(positions, prepend=0)
    costs = np.abs(position_changes) * prices * rate
    return float(np.sum(costs))
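
To make the drawdown term above concrete, the following standalone sketch recomputes it for a short, made-up return series. The helper name `max_drawdown` is an assumption for this example; like the function above, it works on summed rather than compounded returns, which is a reasonable approximation only for small per-step returns.

```python
import numpy as np


def max_drawdown(returns: np.ndarray) -> float:
    """Largest peak-to-trough drop of the cumulative (summed) return path."""
    if len(returns) == 0:
        return 0.0
    cumulative = np.cumsum(returns)
    running_max = np.maximum.accumulate(cumulative)
    return float(np.max(running_max - cumulative))


# Made-up data: the path peaks at 0.02, falls to -0.02, then recovers
example_returns = np.array([0.02, -0.01, -0.03, 0.04])
print(max_drawdown(example_returns))  # approximately 0.04
```

With `risk_aversion=0.5`, the penalty term in `risk_adjusted_reward` for this series would be roughly 0.5 * 0.1 * 0.04 = 0.002.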

Constraints

MUST DO

Validate input feature distributions against training data baselines; flag drift exceeding 2 standard deviations
Implement model versioning with reproducibility tags — every prediction must be traceable to the exact model artifact and config
Include confidence intervals or probability estimates alongside all point predictions; never return raw scores without context
Log all model inputs, outputs, and metadata to enable post-hoc analysis of prediction failures
Implement feature computation consistently between training and inference — use the same transformation pipeline for both
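
The drift requirement above could be checked roughly as follows. This is a minimal sketch under stated assumptions: drift is measured as the shift in each feature's live mean, expressed in training standard deviations, and the function name `flag_feature_drift` is hypothetical. Other drift measures (for example, distribution-level tests) may be more appropriate in practice.

```python
import numpy as np


def flag_feature_drift(train: np.ndarray, live: np.ndarray,
                       threshold: float = 2.0) -> np.ndarray:
    """Return a boolean mask of features whose live mean drifted too far.

    Both inputs have shape (n_samples, n_features). A feature is flagged
    when its live mean differs from the training mean by more than
    `threshold` training standard deviations.
    """
    train_mean = train.mean(axis=0)
    train_std = train.std(axis=0)
    # Constant training features have zero std; fall back to 1.0 to avoid dividing by zero
    safe_std = np.where(train_std > 0, train_std, 1.0)
    z_scores = np.abs(live.mean(axis=0) - train_mean) / safe_std
    return z_scores > threshold
```

Flagged features can then be logged alongside the model version so that degraded predictions are traceable to a specific input shift.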

MUST NOT DO

Do not train models on look-ahead biased features (e.g., using future prices or events in training data)
Do not deploy a new model version without shadow-testing against the current production model first
Never retrain a model on a data window that includes regime changes without explicit regime-aware validation
Do not use accuracy as the primary metric for imbalanced datasets — use precision/recall, F1, or AUC-ROC
Avoid hardcoding feature names; load them from a schema or config file to prevent mismatches between training and inference
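
The look-ahead rule above comes down to which side of time index t a value is computed from. The sketch below contrasts a backward-looking return, which is safe as an input feature, with a forward-looking return, which is valid only as a training target. Both function names are assumptions for illustration, and `horizon` must be at least 1.

```python
import numpy as np


def lagged_returns(prices: np.ndarray, horizon: int = 1) -> np.ndarray:
    """Feature: value at t uses only prices[t] and prices[t - horizon]; early entries are NaN."""
    out = np.full(len(prices), np.nan)
    out[horizon:] = prices[horizon:] / prices[:-horizon] - 1.0
    return out


def forward_returns(prices: np.ndarray, horizon: int = 1) -> np.ndarray:
    """Label: value at t uses prices[t + horizon]; never feed this to the policy as an input."""
    out = np.full(len(prices), np.nan)
    out[:-horizon] = prices[horizon:] / prices[:-horizon] - 1.0
    return out
```

Keeping the two in separately named functions makes it harder to place a forward-looking series into the observation vector by accident.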

Live References

Authoritative documentation links for this skill's domain. The model follows markdown links at load time to resolve external references and inline content.