aws-observability-setup

star 1

Standard patterns for CloudWatch, CloudTrail, Config, logging, and alerting

jason-c-dev By jason-c-dev schedule Updated 1/30/2026

name: aws-observability-setup description: Standard patterns for CloudWatch, CloudTrail, Config, logging, and alerting version: 1.0.0 category: aws agents: [aws-coworker-core, aws-coworker-planner, aws-coworker-observability-cost] tools: [Read, Bash]

AWS Observability Setup

Purpose

This skill provides standardized patterns for setting up AWS observability including CloudWatch metrics, logs, and alarms; CloudTrail for audit; AWS Config for compliance; and Security Hub for security posture.

When to Use

  • Setting up monitoring for new resources
  • Establishing baseline observability
  • Creating alerting strategies
  • Implementing compliance logging
  • Reviewing observability gaps

When NOT to Use

  • Application-level monitoring (APM tools)
  • Third-party monitoring integration
  • Custom metrics development (specific to apps)

Observability Stack

Component Purpose
CloudWatch Metrics Performance and operational data
CloudWatch Logs Centralized log management
CloudWatch Alarms Alerting and automated actions
CloudTrail API activity audit trail
AWS Config Configuration compliance
VPC Flow Logs Network traffic analysis
Security Hub Security posture aggregation

CloudWatch Metrics

Standard Metrics to Monitor

EC2

Metric Alarm Threshold Period
CPUUtilization > 80% 5 min
StatusCheckFailed > 0 1 min
NetworkIn/Out Anomaly detection 5 min
EBSReadOps/WriteOps Baseline + 2 std dev 5 min

RDS

Metric Alarm Threshold Period
CPUUtilization > 80% 5 min
FreeStorageSpace < 20% of total 5 min
DatabaseConnections > 80% of max 5 min
ReadLatency/WriteLatency > baseline 5 min
FreeableMemory < 10% 5 min

Lambda

Metric Alarm Threshold Period
Errors > 5% of invocations 5 min
Duration > 80% of timeout 5 min
Throttles > 0 5 min
ConcurrentExecutions > 80% of limit 5 min

ALB/NLB

Metric Alarm Threshold Period
HTTPCode_ELB_5XX_Count > 10 5 min
HTTPCode_Target_5XX_Count > 10 5 min
TargetResponseTime > 1 second 5 min
UnHealthyHostCount > 0 1 min

Enabling Detailed Monitoring

# Enable detailed monitoring for EC2
aws ec2 monitor-instances \
  --instance-ids i-xxxxxxxxx \
  --profile {profile} \
  --region {region}

# Enable enhanced monitoring for RDS
aws rds modify-db-instance \
  --db-instance-identifier {db-id} \
  --monitoring-interval 60 \
  --monitoring-role-arn arn:aws:iam::{account}:role/rds-monitoring-role \
  --profile {profile} \
  --region {region}

CloudWatch Alarms

Alarm Configuration Pattern

# CloudFormation example
Resources:
  CPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-cpu-high"
      AlarmDescription: CPU utilization exceeds 80%
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref MyInstance
      AlarmActions:
        - !Ref AlertSNSTopic
      OKActions:
        - !Ref AlertSNSTopic

CLI Alarm Creation

# Create CPU alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "prod-web-cpu-high" \
  --alarm-description "CPU utilization exceeds 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=InstanceId,Value=i-xxxxxxxxx \
  --alarm-actions arn:aws:sns:{region}:{account}:alerts \
  --profile {profile} \
  --region {region}

Alarm Naming Convention

{env}-{service}-{metric}-{condition}

Examples:
- prod-web-cpu-high
- dev-rds-storage-low
- staging-lambda-errors-high

CloudWatch Logs

Log Group Configuration

# Create log group with retention
aws logs create-log-group \
  --log-group-name /aws/lambda/{function-name} \
  --profile {profile} \
  --region {region}

aws logs put-retention-policy \
  --log-group-name /aws/lambda/{function-name} \
  --retention-in-days 30 \
  --profile {profile} \
  --region {region}

Standard Log Groups

Service Log Group Pattern Retention
Lambda /aws/lambda/{function} 30 days
ECS /ecs/{cluster}/{service} 30 days
API Gateway /aws/api-gateway/{api} 30 days
VPC Flow Logs /vpc/flow-logs/{vpc} 90 days
Application /app/{service}/{env} 30-90 days

Log Insights Queries

-- Error analysis
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

-- Lambda cold starts
fields @timestamp, @message, @duration
| filter @type = "REPORT" and @initDuration > 0
| sort @timestamp desc

-- Request latency percentiles
stats avg(@duration), pct(@duration, 95), pct(@duration, 99) by bin(5m)

CloudTrail

Trail Configuration

# Create organization trail
aws cloudtrail create-trail \
  --name org-audit-trail \
  --s3-bucket-name {audit-bucket} \
  --is-organization-trail \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --include-global-service-events \
  --profile {profile} \
  --region {region}

# Start logging
aws cloudtrail start-logging \
  --name org-audit-trail \
  --profile {profile} \
  --region {region}

CloudTrail Best Practices

## CloudTrail Configuration Checklist

- [ ] Multi-region trail enabled
- [ ] Log file validation enabled
- [ ] S3 bucket with encryption
- [ ] S3 bucket with access logging
- [ ] CloudWatch Logs integration
- [ ] Global service events included
- [ ] Data events for critical S3/Lambda (optional)

CloudTrail Event Analysis

# Query recent events via CLI
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
  --start-time 2024-01-01T00:00:00Z \
  --profile {profile} \
  --region {region}

VPC Flow Logs

Enable Flow Logs

# Create log group
aws logs create-log-group \
  --log-group-name /vpc/flow-logs/{vpc-id} \
  --profile {profile} \
  --region {region}

# Create flow log
aws ec2 create-flow-logs \
  --resource-ids vpc-xxxxxxxxx \
  --resource-type VPC \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /vpc/flow-logs/{vpc-id} \
  --deliver-logs-permission-arn arn:aws:iam::{account}:role/flow-logs-role \
  --profile {profile} \
  --region {region}

Flow Log Analysis

-- Rejected traffic
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| sort @timestamp desc
| limit 100

-- Top talkers by bytes
stats sum(bytes) as totalBytes by srcAddr
| sort totalBytes desc
| limit 20

AWS Config

Enable Config Recording

# Create config recorder
aws configservice put-configuration-recorder \
  --configuration-recorder name=default,roleARN=arn:aws:iam::{account}:role/config-role \
  --recording-group allSupported=true,includeGlobalResourceTypes=true \
  --profile {profile} \
  --region {region}

# Create delivery channel
aws configservice put-delivery-channel \
  --delivery-channel name=default,s3BucketName={config-bucket} \
  --profile {profile} \
  --region {region}

# Start recording
aws configservice start-configuration-recorder \
  --configuration-recorder-name default \
  --profile {profile} \
  --region {region}

Essential Config Rules

Rule Purpose
s3-bucket-public-read-prohibited No public S3 buckets
encrypted-volumes EBS encryption required
rds-storage-encrypted RDS encryption required
ec2-instance-managed-by-ssm Systems Manager coverage
vpc-flow-logs-enabled Flow logs required

Security Hub

Enable Security Hub

# Enable Security Hub
aws securityhub enable-security-hub \
  --enable-default-standards \
  --profile {profile} \
  --region {region}

# Enable specific standards
aws securityhub batch-enable-standards \
  --standards-subscription-requests \
    StandardsArn=arn:aws:securityhub:::ruleset/cis-aws-foundations-benchmark/v/1.2.0 \
  --profile {profile} \
  --region {region}

Security Hub Findings

# Get critical findings
aws securityhub get-findings \
  --filters '{"SeverityLabel":[{"Value":"CRITICAL","Comparison":"EQUALS"}]}' \
  --profile {profile} \
  --region {region}

Dashboard Template

Production Dashboard

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "EC2 CPU Utilization",
        "metrics": [
          ["AWS/EC2", "CPUUtilization", "InstanceId", "i-xxx"]
        ],
        "period": 300
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "ALB Request Count",
        "metrics": [
          ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/xxx"]
        ],
        "period": 60
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "RDS Connections",
        "metrics": [
          ["AWS/RDS", "DatabaseConnections", "DBInstanceIdentifier", "xxx"]
        ],
        "period": 60
      }
    }
  ]
}

Observability Checklist

## Full Observability Checklist

### CloudWatch
- [ ] Detailed monitoring enabled
- [ ] Custom metrics where needed
- [ ] Alarms for critical metrics
- [ ] Dashboard created
- [ ] Anomaly detection configured

### Logging
- [ ] CloudWatch Logs for all services
- [ ] Retention policies set
- [ ] Log Insights queries saved
- [ ] Subscription filters for alerts

### Audit
- [ ] CloudTrail enabled (multi-region)
- [ ] Log file validation on
- [ ] S3 bucket secure
- [ ] CloudWatch Logs integration

### Network
- [ ] VPC Flow Logs enabled
- [ ] Traffic analysis configured
- [ ] Rejected traffic monitored

### Security
- [ ] Security Hub enabled
- [ ] Findings reviewed regularly
- [ ] Standards enabled (CIS, etc.)

### Compliance
- [ ] AWS Config recording
- [ ] Config rules deployed
- [ ] Non-compliance alerts

Related Skills

  • aws-cli-playbook — CLI patterns for setup
  • aws-well-architected — Operational excellence pillar
  • aws-governance-guardrails — Compliance requirements
  • aws-cost-optimizer — Cost of observability
Install via CLI
npx skills add https://github.com/jason-c-dev/aws-coworker-enterprise --skill aws-observability-setup
Repository Details
star Stars 1
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator