chaosmesh

name: chaosmesh compatibility: opencode completeness: 95 content-types:

guidance
examples
do-dont
config description: '''''Provides Chaos Mesh in Cloud-Native Engineering -u6DF7u6C8Cu5DE5u7A0B u5E73'' u53F0 for Kubernetes''''' domain: cncf license: MIT maturity: stable metadata: domain: cncf output-format: manifests role: reference scope: infrastructure triggers: chaosmesh, chaos, cloud-native, engineering archetypes:
- educational
- strategic anti_triggers:
- brainstorming
- vague ideation
- non-containerized architecture response_profile: verbosity: medium directive_strength: low abstraction_level: strategic version: "1.0.0" output-format: manifests related-skills: null role: reference scope: infrastructure triggers: chaos, chaosmesh, cloud-native, engineering version: "1.0.0"

Chaos Mesh in Cloud-Native Engineering

Category: chaos
Status: Incubating
Stars: 9,000
Last Updated: 2026-04-22
Primary Language: Go
Documentation: https://chaos-mesh.org/

Purpose and Use Cases

Chaos Mesh is a Chaos Engineering platform designed specifically for Kubernetes environments, enabling developers to simulate various failure scenarios to test system resilience.

What Problem Does It Solve?

The difficulty of testing distributed system resilience in production-like environments. It provides a declarative way to inject faults and observe system behavior to identify weaknesses.

When to Use This Project

Use Chaos Mesh when you want to test Kubernetes application resilience, simulate network failures, disk failures, pod failures, or need a comprehensive chaos engineering platform for Kubernetes.

Key Use Cases

Pod Failure Injection: Simulate pod crashes and failures
Network Chaos: Inject network delays, losses, and partitions
Disk Chaos: Simulate disk failures and I/O errors
Time Chaos: Inject time shifts for testing
Kernel Chaos: Inject kernel failures
HTTP Chaos: Inject HTTP request errors
Experiment Scheduling: Schedule chaos experiments
Workflow Orchestration: Complex chaos workflows

Architecture Design Patterns

Core Components

Chaos Dashboard: Web UI for experiment management
Controller Manager: Main controller for chaos experiments
Scheduler: Schedule chaos experiments
Chaos Daemon: Node-level chaos injection agent
API Server: REST API for chaos operations
Recorder: Record experiment results

Component Interactions

User → Dashboard: Create and manage experiments
Dashboard → API Server: REST API calls
API Server → Controller: Experiment creation
Controller → Scheduler: Schedule experiments
Scheduler → Experiments: Execute experiments
Experiments → Chaos Daemon: Inject chaos on nodes
Chaos Daemon → Container Runtime: Inject faults
Recorder → Database: Store experiment results

Data Flow Patterns

Experiment Creation: Dashboard → API → Controller → Experiment → Scheduler → Execution
Chaos Injection: Scheduler → Pod/Node → Chaos Daemon → Container Runtime → Fault Injection
Result Recording: Experiment → Recorder → Database → Dashboard display
Experiment Scheduling: Schedule request → Time-based scheduling → Execution → Cleanup

Design Principles

Kubernetes Native: Uses CRDs and Kubernetes API
Declarative: Chaos experiments as YAML manifests
Extensible: Custom chaos experiments and actions
Safe: Experiment rollback and safety controls
Observability: Experiment results and metrics
Flexible: Support for multiple chaos types

Integration Approaches

Integration with Other CNCF Projects

Kubernetes: Core platform for chaos injection
Prometheus: Chaos metrics collection
Grafana: Chaos experiment visualization
OpenTelemetry: Chaos tracing
Argo Workflows: Chaos workflow integration
Tekton: CI/CD chaos testing
Jaeger: Distributed tracing for chaos
Helm: Chaos Mesh deployment

API Patterns

Kubernetes API: CRD operations for experiments
REST API: Dashboard API
Webhook API: Experiment notifications
gRPC API: Internal service communication

Configuration Patterns

Experiment YAML: Declarative chaos experiments
Schedule YAML: Experiment scheduling configuration
Workflow YAML: Chaos workflow definitions
Controller Configuration: Controller settings

Extension Mechanisms

Custom Chaos Types: Implement custom chaos experiments
Webhooks: Custom notification hooks
Metrics Collectors: Custom metrics collection

Common Pitfalls and How to Avoid Them

Configuration Issues

Experiment YAML: Incorrect chaos experiment definitions
Role-Based Access: Missing RBAC permissions
Chaos Daemon: Daemon not running on nodes
Scheduler Configuration: Incorrect scheduling configuration
Network Policies: Blocking chaos daemon communication

Performance Issues

Experiment Overhead: High resource usage during chaos
Controller Load: High reconciliation overhead
Network Latency: Chaos impact on network
Disk I/O: Chaos impact on disk performance

Operational Challenges

Experiment Rollback: Manual rollback when needed
Experiment Safety: Unintended chaos effects
Multi-Cluster: Chaos across multiple clusters
Experiment Replay: Re-run experiments
Result Analysis: Analyze chaos results

Security Pitfalls

RBAC: Overly permissive chaos permissions
Network Isolation: Chaos affecting production networks
Experiment Scope: Too broad experiment scope
Access Control: Unauthorized chaos execution

Coding Practices

Idiomatic Configuration

Experiment CRDs: Declarative chaos experiments
Schedule CRDs: Declarative scheduling
Workflow CRDs: Chaos workflow definitions
Controller Config: Controller configuration

API Usage Patterns

kubectl apply: Create and update experiments
Dashboard API: Programmatic experiment management
Controller API: Experiment control API
Recorder API: Results access API

Observability Best Practices

Metrics: Experiment duration, success rate, failure rate
Logging: Experiment execution logs
Tracing: Chaos execution tracing
Dashboard: Visual experiment monitoring
Alerting: Experiment failure alerts

Development Workflow

Local Testing: minikube/kind for development
Debugging: Experiment logs inspection
Testing: Chaos experiment testing
CI/CD: Automated chaos testing
Tools: kubectl, chaos-dashboard, chaos-controller

Fundamentals

Essential Concepts

Experiment: Chaos experiment definition
Schedule: Scheduled experiment execution
Workflow: Complex chaos workflow
Chaos Daemon: Node-level chaos agent
Scheduler: Experiment scheduling service
Recorder: Results recording service

Terminology Glossary

Experiment: Chaos experiment definition
Schedule: Scheduled execution
Workflow: Complex chaos flow
Chaos Daemon: Node agent
Scheduler: Scheduling service
Recorder: Results storage

Data Models and Types

Experiment: Chaos experiment configuration
Schedule: Schedule configuration
Workflow: Workflow definition
Recorder: Results data
Status: Experiment status

Lifecycle Management

Experiment Lifecycle: Create → Schedule → Execute → Complete → Cleanup
Schedule Lifecycle: Schedule → Trigger → Execute → Complete
Workflow Lifecycle: Create → Run → Complete → Record

State Management

Experiment State: Current experiment status
Schedule State: Scheduled execution state
Workflow State: Workflow execution state
Recorder State: Results storage state

Scaling and Deployment Patterns

Horizontal Scaling

Controller Scaling: Multiple controller replicas
Scheduler Scaling: Multiple scheduler instances
Daemon Scaling: Chaos daemon scaling
Dashboard Scaling: Dashboard server scaling

High Availability

Controller HA: Multiple controller replicas
Scheduler HA: Scheduler cluster
Daemon HA: Daemon on all nodes
Dashboard HA: Dashboard server HA
Storage HA: etcd HA for state

Production Deployments

Controller Deployment: Production controller setup
Chaos Daemon: Node-level deployment
Network Policies: Allow chaos communication
RBAC: Proper access control
Monitoring: Experiment monitoring
Security: Security scanning

Upgrade Strategies

CRD Migration: Handle CRD schema changes
Controller Rolling Update: Zero-downtime updates
Daemon Rollout: Daemon update strategy
Dashboard Update: Dashboard upgrade

Resource Management

CPU/Memory Limits: Appropriate resource requests
Storage: Experiment state storage
Network: Daemon communication
Disk I/O: Chaos impact management

Additional Resources

Official Documentation: https://chaos-mesh.org/docs/
GitHub Repository: github.com/chaos-mesh/chaos-mesh
CNCF Project Page: cncf.io/projects/chaosmesh/
Community: Check the GitHub repository for community channels
Versioning: Refer to project's release notes for version-specific features

Troubleshooting

Common Issues

Deployment Failures
- Check pod logs for errors
- Verify configuration values
- Ensure network connectivity
Performance Issues
- Monitor resource usage
- Adjust resource limits
- Check for bottlenecks
Configuration Errors
- Validate YAML syntax
- Check required fields
- Verify environment-specific settings
Integration Problems
- Verify API compatibility
- Check dependency versions
- Review integration documentation

Getting Help

Check official documentation
Search GitHub issues
Join community channels
Review logs and metrics Content generated automatically. Verify against official documentation before production use.

Examples

Basic Configuration

# Basic configuration example
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{project_name}}-config
  namespace: default
data:
  # Configuration goes here
  config.yaml: |
    # Base configuration
    # Add your settings here

Kubernetes Deployment

# Kubernetes deployment for {{project_name}}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{project_name}}
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: {{project_name}}
  template:
    metadata:
      labels:
        app: {{project_name}}
    spec:
      containers:
      - name: {{project_name}}
        image: {{project_name}}:latest
        ports:
        - containerPort: 8080
        resources:
          limits:
            memory: "128Mi"
            cpu: "500m"

Kubernetes Service

# Kubernetes service for {{project_name}}
apiVersion: v1
kind: Service
metadata:
  name: {{project_name}}
  namespace: default
spec:
  selector:
    app: {{project_name}}
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: ClusterIP

When to Use

Use this skill when:

Integrating a CNCF project into Kubernetes infrastructure — You need to configure, deploy, or troubleshoot a cloud-native tool within a cluster
Designing cloud-native architecture — You are selecting and integrating CNCF tools to solve specific infrastructure challenges
Resolving operational issues — A CNCF component is misbehaving, underperforming, or needs configuration changes

Core Workflow

Assess Requirements — Understand the use case, scale, integration needs, and existing infrastructure. Checkpoint: Document requirements, constraints, and success criteria.
Design Architecture — Plan component interactions, data flow, and deployment strategy using cloud-native best practices. Checkpoint: Verify the architecture addresses all requirements and follows CNCF conventions.
Implement & Configure — Create manifests, configurations, and deployment scripts. Include resource limits, health checks, and observability hooks. Checkpoint: Validate all YAML against schema and test in a staging environment.
Deploy & Monitor — Apply manifests to the cluster, verify component health, and confirm observability is working. Checkpoint: Confirm all pods/services are running, probes passing, and metrics/alerts configured.