name: aws-troubleshoot
description: AWS service troubleshooting patterns. Use for EC2, ECS, Lambda, CloudWatch, RDS issues.
allowed-tools: Bash(aws *, python *)
AWS Troubleshooting Expertise
Investigation Methodology
- Identify the AWS resource/service involved
- Check resource status using describe functions
- Review CloudWatch logs for errors
- Check CloudWatch metrics for anomalies
- Analyze configuration for misconfigurations
- Synthesize findings and recommend a fix
CloudWatch Logs Strategy
Partition First (CRITICAL)
Never dump all logs. Use aggregation queries first:
# Error rate over time
filter @message like /ERROR/
| stats count(*) as errors by bin(5m)
# Top error messages
filter @message like /Exception/
| stats count(*) as occurrences by @message
| sort occurrences desc
| limit 10
# Latency percentiles
stats pct(@duration, 50) as p50, pct(@duration, 99) as p99 by bin(5m)
# Unique error types
filter @message like /ERROR/
| parse @message /(?<error_type>[\w.]+Exception)/
| stats count(*) by error_type
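The aggregation queries above can also be run from Python rather than the console. The following is a minimal sketch, assuming `logs_client` is a `boto3.client("logs")` object (it is passed in so the function itself needs no AWS imports). The function name, the polling defaults, and the `ERROR_RATE_QUERY` constant are illustrative and not part of this skill.

```python
import time
from typing import Any, Callable

# One of the aggregation queries from this section, as a single string.
ERROR_RATE_QUERY = "filter @message like /ERROR/ | stats count(*) as errors by bin(5m)"


def run_insights_query(
    logs_client: Any,
    log_group: str,
    query: str,
    start_epoch_s: int,
    end_epoch_s: int,
    poll_seconds: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Start a Logs Insights query and poll until it finishes.

    Assumes logs_client behaves like boto3.client("logs").
    """
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_epoch_s,  # start_query takes epoch seconds
        endTime=end_epoch_s,
        queryString=query,
    )
    query_id = started["queryId"]
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each result row is a list of {"field": ..., "value": ...} cells.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not complete after {max_polls} polls")
```

Returning plain dictionaries keeps the aggregated rows small enough to inspect directly, which fits the rule of never dumping raw logs first.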
Query Flow
- Statistics first: Get error counts, distributions
- Identify time windows: Find when errors spiked
- Sample from spikes: Get specific examples
- Compare to baseline: Query same period yesterday/last week
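For the baseline comparison step, the same query has to be rerun over shifted time ranges. A small sketch of how those ranges could be computed follows; `baseline_windows` is a hypothetical helper, and it assumes timezone-aware datetimes so the epoch conversion does not depend on the local timezone.

```python
from datetime import datetime, timedelta, timezone


def baseline_windows(spike_start: datetime, spike_end: datetime) -> dict:
    """Return (start, end) epoch-second pairs for the spike and two baselines."""
    def to_epoch(start: datetime, end: datetime) -> tuple:
        return int(start.timestamp()), int(end.timestamp())

    day = timedelta(days=1)
    week = timedelta(days=7)
    return {
        "spike": to_epoch(spike_start, spike_end),
        "yesterday": to_epoch(spike_start - day, spike_end - day),
        "last_week": to_epoch(spike_start - week, spike_end - week),
    }


# Example: a 30-minute spike window (the timestamps are made up).
windows = baseline_windows(
    datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 8, 12, 30, tzinfo=timezone.utc),
)
```

Each pair can be passed as the start and end time of a Logs Insights query, so the spike and its baselines are measured with identical query text.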
Service-Specific Patterns
EC2 Issues
| Symptom | First Check | Typical Cause |
|---|---|---|
| Unreachable | describe_ec2_instance | Security group, stopped, status check failed |
| Performance | get_cloudwatch_metrics (CPUUtilization) | CPU exhaustion, network saturation |
| Disk full | get_cloudwatch_metrics (DiskSpaceUtilization) | Logs, temp files |
Key CloudWatch metrics for EC2:
- CPUUtilization
- NetworkIn, NetworkOut
- DiskReadOps, DiskWriteOps
- StatusCheckFailed
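When reading these metrics, a single high datapoint matters less than a sustained run. The sketch below shows one way to find such runs, assuming datapoints shaped like the `Datapoints` list returned by CloudWatch `get_metric_statistics` (dicts with `Timestamp` and a statistic such as `Average`). The function name and the default of three consecutive points are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def sustained_breaches(datapoints, threshold, min_consecutive=3, stat="Average"):
    """Return (first, last) timestamps of runs where stat stays >= threshold."""
    # CloudWatch does not guarantee datapoint order, so sort first.
    ordered = sorted(datapoints, key=lambda p: p["Timestamp"])
    runs, current = [], []
    for point in ordered:
        if point[stat] >= threshold:
            current.append(point["Timestamp"])
            continue
        if len(current) >= min_consecutive:
            runs.append((current[0], current[-1]))
        current = []
    if len(current) >= min_consecutive:
        runs.append((current[0], current[-1]))
    return runs


# Example with made-up CPUUtilization values at 5-minute intervals.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
sample = [
    {"Timestamp": t0 + timedelta(minutes=5 * i), "Average": value}
    for i, value in enumerate([50, 95, 96, 97, 40, 99])
]
cpu_runs = sustained_breaches(sample, threshold=90)
```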
Lambda Issues
| Symptom | First Check | Typical Cause |
|---|---|---|
| Timeout | CloudWatch logs | External call slow, cold start, insufficient memory |
| Permission denied | CloudWatch logs | IAM role missing permissions |
| Memory error | CloudWatch metrics | Memory allocation too low |
| Cold starts | CloudWatch logs + metrics | Provisioned concurrency needed |
Key CloudWatch metrics for Lambda:
- Invocations
- Duration
- Errors
- Throttles
- ConcurrentExecutions
CloudWatch Insights for Lambda:
# Cold start analysis
filter @type = "REPORT"
| stats avg(@initDuration) as avg_cold_start,
count(@initDuration) as cold_starts,
count(*) as total_invocations
by bin(5m)
# Timeout analysis
filter @message like /Task timed out/
| stats count(*) by bin(5m)
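The cold start query returns counts per time bin, and the share of cold invocations is usually easier to reason about than the raw counts. A minimal sketch of that post-processing follows. It assumes each row is a dict keyed by the aliases used in the query (`cold_starts`, `total_invocations`) with string values, and that the bin column is named `bin(5m)`; the helper name is hypothetical.

```python
def cold_start_rates(rows):
    """Compute the cold start percentage for each time bin."""
    rates = []
    for row in rows:
        total = int(row["total_invocations"])
        cold = int(row["cold_starts"])
        pct = 100.0 * cold / total if total else 0.0
        rates.append({
            "bin": row.get("bin(5m)", ""),  # assumed name of the bin column
            "cold_starts": cold,
            "total": total,
            "cold_start_pct": round(pct, 1),
        })
    return rates


# Example row with made-up values.
example_rates = cold_start_rates([
    {"bin(5m)": "2024-01-01 00:00:00.000", "avg_cold_start": "412.5",
     "cold_starts": "3", "total_invocations": "60"},
])
```

A consistently high percentage during normal traffic is the case where provisioned concurrency is worth considering.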
ECS/Fargate Issues
| Symptom | First Check | Typical Cause |
|---|---|---|
| Task failed | list_ecs_tasks | Container crash, resource limits, image pull |
| Service unhealthy | list_ecs_tasks | Health check failing, target group issues |
| Slow scaling | CloudWatch metrics | Insufficient capacity, service limits |
Investigation flow:
- list_ecs_tasks: See task status and health
- Check stopped reason in task description
- Review CloudWatch logs for the task
- Check container insights metrics
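When many tasks have stopped, grouping them by stop reason shows the dominant failure mode quickly. The sketch below assumes a response shaped like the ECS `describe_tasks` output (a `tasks` list whose items carry `stoppedReason` and a `containers` list with `name` and `exitCode`). The function name is illustrative.

```python
from collections import Counter


def summarize_stopped_tasks(describe_tasks_response):
    """Count stop reasons and non-zero container exit codes across tasks."""
    reasons = Counter()
    exit_codes = Counter()
    for task in describe_tasks_response.get("tasks", []):
        reasons[task.get("stoppedReason", "unknown")] += 1
        for container in task.get("containers", []):
            code = container.get("exitCode")
            # exitCode is absent when the container never started (e.g. image pull failure).
            if code is not None and code != 0:
                exit_codes[(container.get("name", "?"), code)] += 1
    return {
        "reasons": reasons.most_common(),
        "nonzero_exits": exit_codes.most_common(),
    }


# Example with made-up tasks. Exit code 137 means the process received SIGKILL,
# which is often (not always) the out-of-memory killer.
example_summary = summarize_stopped_tasks({
    "tasks": [
        {"stoppedReason": "Essential container in task exited",
         "containers": [{"name": "app", "exitCode": 137}]},
        {"stoppedReason": "Essential container in task exited",
         "containers": [{"name": "app", "exitCode": 137}]},
        {"stoppedReason": "Task failed ELB health checks",
         "containers": [{"name": "app", "exitCode": 0}]},
    ]
})
```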
RDS Issues
| Symptom | First Check | Typical Cause |
|---|---|---|
| Connection refused | get_rds_instance_status | Security group, stopped, maintenance |
| Slow queries | CloudWatch metrics | CPU, IOPS, connections |
| Storage full | CloudWatch metrics | Data growth, logs, snapshots |
Key CloudWatch metrics for RDS:
- CPUUtilization
- DatabaseConnections
- ReadIOPS, WriteIOPS
- FreeStorageSpace
- ReadLatency, WriteLatency
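For the storage-full symptom, the trend of `FreeStorageSpace` is more useful than its current value. The following is a rough sketch of a linear projection, assuming datapoints shaped like CloudWatch `get_metric_statistics` output with `Average` in bytes. The helper name is hypothetical, and a straight-line estimate ignores bursts such as bulk loads or log rotation.

```python
from datetime import datetime, timedelta, timezone


def hours_until_storage_full(datapoints):
    """Linearly project when FreeStorageSpace reaches zero; None if not shrinking."""
    ordered = sorted(datapoints, key=lambda p: p["Timestamp"])
    if len(ordered) < 2:
        return None
    first, last = ordered[0], ordered[-1]
    elapsed_hours = (last["Timestamp"] - first["Timestamp"]).total_seconds() / 3600
    consumed = first["Average"] - last["Average"]  # bytes of free space lost
    if elapsed_hours <= 0 or consumed <= 0:
        return None  # flat or growing free space: nothing to project
    return last["Average"] / (consumed / elapsed_hours)


# Example with made-up values: 100 GiB free, then 80 GiB free ten hours later.
GIB = 2 ** 30
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
runway = hours_until_storage_full([
    {"Timestamp": t0, "Average": 100 * GIB},
    {"Timestamp": t0 + timedelta(hours=10), "Average": 80 * GIB},
])
```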
Common AWS Errors
Permission Errors
AccessDeniedException
UnauthorizedOperation
→ Check IAM role/policy attached to the service
Throttling
Throttling
Rate exceeded
TooManyRequestsException
→ Implement exponential backoff; request a service quota increase if throttling persists
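Exponential backoff can be sketched as follows. This is an illustrative retry wrapper using full jitter, not the retry logic of any particular SDK (boto3 also has configurable built-in retry modes, which may be enough on their own). It assumes exceptions expose the error code as `exc.response["Error"]["Code"]`, the way botocore's `ClientError` does; `THROTTLE_CODES` and the function names are assumptions made for this example.

```python
import random
import time

# Error codes treated as throttling in this sketch (extend as needed).
THROTTLE_CODES = {
    "Throttling",
    "ThrottlingException",
    "TooManyRequestsException",
    "RequestLimitExceeded",
}


def error_code(exc):
    """Extract an AWS-style error code from an exception, or '' if absent."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code", "")


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying throttling errors with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            throttled = error_code(exc) in THROTTLE_CODES
            if not throttled or attempt == max_attempts - 1:
                raise  # non-throttling error, or retries exhausted
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: random delay in [0, cap)
```

Jitter spreads retries from concurrent callers so they do not all hit the API again at the same instant.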
Resource Not Found
ResourceNotFoundException
NoSuchEntity
→ Verify resource name, region, account
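The three error families above can be turned into a small triage helper. This sketch matches on substrings of the error code or message, which is a deliberately loose heuristic; the rule table and function name are illustrative, and the advice strings restate the guidance in this section.

```python
# (category, substrings to look for, suggested next step)
ERROR_RULES = [
    ("permission", ("AccessDenied", "Unauthorized"),
     "Check IAM role/policy attached to the service"),
    ("throttling", ("Throttl", "TooManyRequests", "Rate exceeded"),
     "Implement exponential backoff, request limit increase"),
    ("not_found", ("NotFound", "NoSuch"),
     "Verify resource name, region, account"),
]


def classify_aws_error(text):
    """Map an error code or message to (category, suggested next step)."""
    for category, needles, advice in ERROR_RULES:
        if any(needle in text for needle in needles):
            return category, advice
    return "unknown", "Read the full error message and related CloudWatch logs"


example_category, example_advice = classify_aws_error("ResourceNotFoundException")
```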
Practical AWS CLI Commands
EC2
aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" --query 'Reservations[].Instances[].{ID:InstanceId,Type:InstanceType,State:State.Name,Name:Tags[?Key==`Name`].Value|[0]}'
aws ec2 describe-instance-status --instance-ids <id>
ECS
aws ecs list-clusters
aws ecs list-services --cluster <cluster>
aws ecs describe-services --cluster <cluster> --services <service>
aws ecs list-tasks --cluster <cluster> --service-name <service> --desired-status STOPPED
CloudWatch Logs
aws logs describe-log-groups --log-group-name-prefix /ecs/
aws logs filter-log-events --log-group-name <group> --start-time <epoch-ms> --filter-pattern "ERROR"
aws logs start-query --log-group-name <group> --start-time <epoch-seconds> --end-time <epoch-seconds> --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50'
aws logs get-query-results --query-id <query-id>
EKS
aws eks list-clusters
aws eks describe-cluster --name <cluster>
aws eks update-kubeconfig --name <cluster> --region <region>
RDS
aws rds describe-db-instances --query 'DBInstances[].{ID:DBInstanceIdentifier,Engine:Engine,Status:DBInstanceStatus,Class:DBInstanceClass}'
aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name CPUUtilization --dimensions Name=DBInstanceIdentifier,Value=<id> --start-time <iso> --end-time <iso> --period 300 --statistics Average
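The commands above take time arguments in different formats: `filter-log-events` expects epoch milliseconds, `start-query` expects epoch seconds, and `get-metric-statistics` accepts ISO 8601 timestamps. A small sketch for producing all three from one lookback window follows; `cli_time_args` is a hypothetical helper, not part of the AWS CLI.

```python
from datetime import datetime, timedelta, timezone


def cli_time_args(minutes_back, now=None):
    """Return (start, end) pairs in the formats the AWS CLI commands expect."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(minutes=minutes_back)
    iso_format = "%Y-%m-%dT%H:%M:%SZ"
    return {
        # aws logs filter-log-events --start-time / --end-time
        "epoch_ms": (int(start.timestamp() * 1000), int(now.timestamp() * 1000)),
        # aws logs start-query --start-time / --end-time
        "epoch_s": (int(start.timestamp()), int(now.timestamp())),
        # aws cloudwatch get-metric-statistics --start-time / --end-time
        "iso": (start.strftime(iso_format), now.strftime(iso_format)),
    }


# Example with a fixed "now" so the output is reproducible.
example_args = cli_time_args(30, now=datetime(2024, 1, 1, 0, 30, tzinfo=timezone.utc))
```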