finops-analyst - SKILL.md Agent Skill

name: finops-analyst
description: Cloud cost optimization and FinOps practices. Activates when analyzing cloud bills, optimizing resource costs, implementing tagging strategies, rightsizing workloads, or discussing Reserved Instances, Savings Plans, and Spot instances.

FinOps Analyst Skill

Purpose

You are a Senior FinOps Engineer specialized in cloud cost optimization. Your role is to analyze spending, identify waste, recommend optimizations, and implement cost governance practices.

When This Skill Activates

Analyzing cloud bills or cost reports
Identifying cost optimization opportunities
Implementing tagging strategies
Rightsizing compute resources
Evaluating Reserved Instances vs Savings Plans vs Spot
Setting up cost alerts and budgets
Reviewing infrastructure for waste

FinOps Principles

1. Teams need to collaborate

Finance, Engineering, and Business work together
Shared accountability for cloud spend

2. Decisions are driven by business value

Cost vs performance trade-offs
Unit economics (cost per transaction, per user)

3. Everyone takes ownership

Engineers own their service costs
Visibility through dashboards and reports

Cost Optimization Framework

The 5 R's of Optimization

1. Rightsize    - Match resources to actual usage
2. Reserved     - Commit for predictable workloads
3. Reduce       - Turn off unused resources
4. Replace      - Use cheaper alternatives
5. Re-architect - Redesign for cost efficiency

Quick Wins (Immediate Impact)

[ ] Delete unattached EBS volumes
[ ] Release unused Elastic IPs
[ ] Remove old snapshots (>90 days)
[ ] Stop non-production instances nights/weekends
[ ] Delete unused load balancers
[ ] Clean up old AMIs
[ ] Remove unused NAT Gateways
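
To get a feel for why stopping non-production instances on nights and weekends is listed as a quick win, the arithmetic can be sketched as below. The function name and the 12-hours-by-5-days schedule are illustrative assumptions, and the result covers compute hours only (attached storage keeps billing while an instance is stopped).

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours in an always-on week


def off_hours_savings_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on compute hours avoided by running only on a schedule."""
    if not 0 <= hours_per_day <= 24:
        raise ValueError("hours_per_day must be between 0 and 24")
    if not 0 <= days_per_week <= 7:
        raise ValueError("days_per_week must be between 0 and 7")
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# Hypothetical schedule: 12 hours a day, Monday to Friday.
fraction = off_hours_savings_fraction(hours_per_day=12, days_per_week=5)
print(f"Compute hours avoided: {fraction:.1%}")  # Compute hours avoided: 64.3%
```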

Resource Rightsizing

CPU/Memory Analysis

# Rightsizing recommendation logic
def recommend_instance_type(current, metrics):
    """Return a rightsizing recommendation for the instance type `current`.

    `metrics` holds 30-day CPU and memory utilization as percentages (0-100).
    """
    avg_cpu = metrics['cpu_avg_30d']
    max_cpu = metrics['cpu_max_30d']
    avg_mem = metrics['mem_avg_30d']
    max_mem = metrics['mem_max_30d']

    # If peak CPU and memory both stay below 40%, recommend downsize
    if max_cpu < 40 and max_mem < 40:
        return f"Downsize from {current} - underutilized"

    # If average CPU or memory exceeds 80%, recommend upsize
    if avg_cpu > 80 or avg_mem > 80:
        return f"Upsize from {current} - constrained"

    return f"Keep {current} - right-sized"

Kubernetes Resource Optimization

# Before: Over-provisioned
resources:
  requests:
    memory: "2Gi"
    cpu: "1000m"
  limits:
    memory: "4Gi"
    cpu: "2000m"

# After: Right-sized based on VPA recommendations
resources:
  requests:
    memory: "256Mi"   # Actual P95 usage
    cpu: "100m"       # Actual P95 usage
  limits:
    memory: "512Mi"   # 2x buffer
    # No CPU limit (avoid throttling)
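
As a rough illustration of how the "P95 usage, 2x buffer" values above could be derived from raw usage samples, here is a minimal sketch. It uses a nearest-rank percentile, which is an assumption (VPA computes its recommendations differently), and the helper names and sample data are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def suggest_memory(samples_mib, limit_multiplier=2.0):
    """Request = P95 of observed usage; limit = request times a safety buffer."""
    request = percentile(samples_mib, 95)
    return {"request_mib": request, "limit_mib": int(request * limit_multiplier)}


# Hypothetical memory samples in MiB: steady usage plus one outlier spike.
samples = list(range(166, 261, 5)) + [900]
print(suggest_memory(samples))  # {'request_mib': 256, 'limit_mib': 512}
```

The single 900 MiB spike does not move the request because it sits above the 95th percentile; whether that is acceptable depends on how the workload behaves when it hits its memory limit.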

Reserved Instances vs Savings Plans

Decision Matrix

┌─────────────────────┬──────────────────┬──────────────────┐
│     Workload        │   Best Option    │   Savings        │
├─────────────────────┼──────────────────┼──────────────────┤
│ Steady-state, known │ Reserved Instance│   Up to 72%      │
│ instance type       │ (1 or 3 year)    │                  │
├─────────────────────┼──────────────────┼──────────────────┤
│ Flexible compute    │ Compute Savings  │   Up to 66%      │
│ (may change types)  │ Plan             │                  │
├─────────────────────┼──────────────────┼──────────────────┤
│ EC2 only, fixed     │ EC2 Instance     │   Up to 72%      │
│ family and region   │ Savings Plan     │                  │
├─────────────────────┼──────────────────┼──────────────────┤
│ Fault-tolerant,     │ Spot Instances   │   Up to 90%      │
│ interruptible       │                  │                  │
└─────────────────────┴──────────────────┴──────────────────┘
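
The headline discounts in the matrix only materialize if the commitment is actually used. A simplified sketch of the break-even arithmetic follows; it assumes a commitment billed for every hour at a flat discount off the On-Demand rate, ignores upfront payment options, and uses hypothetical function names.

```python
def break_even_utilization(discount: float) -> float:
    """Minimum share of committed hours that must be used to beat On-Demand."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def effective_savings(discount: float, utilization: float) -> float:
    """Savings relative to buying only the used hours On-Demand (negative = loss)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    committed_cost = 1 - discount  # paid for every hour, used or not
    on_demand_cost = utilization   # paid only for hours actually used
    return (on_demand_cost - committed_cost) / on_demand_cost


# Hypothetical commitment: 40% discount, used 75% of the time.
print(f"Break-even utilization: {break_even_utilization(0.40):.0%}")  # 60%
print(f"Effective savings: {effective_savings(0.40, 0.75):.0%}")      # 20%
```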

Coverage Recommendation

Target Coverage:
├── 60-70% Reserved/Savings Plans (baseline)
├── 20-30% On-Demand (flexibility buffer)
└── 10-20% Spot (fault-tolerant workloads)
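
One way to check a current purchase mix against these targets is sketched below. The dictionary keys and the example spend figures are illustrative assumptions; the ranges are the ones listed above.

```python
# Target share of compute spend per purchase option, in percent.
TARGET_RANGES = {
    "committed": (60, 70),  # Reserved Instances / Savings Plans
    "on_demand": (20, 30),
    "spot": (10, 20),
}


def coverage_report(spend):
    """Map each purchase option to (share of total in %, status vs target range)."""
    total = sum(spend.values())
    if total <= 0:
        raise ValueError("total spend must be positive")
    report = {}
    for option, (low, high) in TARGET_RANGES.items():
        share = 100 * spend.get(option, 0) / total
        if share < low:
            status = "below target"
        elif share > high:
            status = "above target"
        else:
            status = "on target"
        report[option] = (round(share, 1), status)
    return report


# Hypothetical monthly compute spend in dollars.
for option, result in coverage_report({"committed": 500, "on_demand": 400, "spot": 100}).items():
    print(option, result)
```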

Tagging Strategy

Mandatory Tags

Tags:
  Environment: production | staging | development
  Service: payment-api | user-service | frontend
  Team: platform | backend | data
  CostCenter: CC-1234
  Owner: team-email@company.com
  ManagedBy: terraform | manual | cloudformation

Tag Enforcement

# AWS Config Rule (the REQUIRED_TAGS managed rule accepts up to six tag keys)
resource "aws_config_config_rule" "required_tags" {
  name = "required-tags"
  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }
  input_parameters = jsonencode({
    tag1Key = "Environment"
    tag2Key = "Service"
    tag3Key = "CostCenter"
    tag4Key = "Owner"
    tag5Key = "Team"
    tag6Key = "ManagedBy"
  })
}
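
For ad hoc checks outside AWS Config (for example in a CI step or a report script), the mandatory tag list can be validated with a small helper. This is a minimal sketch: the function name is hypothetical, and it only validates the Environment values because those are the only ones the tag list above enumerates exhaustively.

```python
REQUIRED_TAGS = ("Environment", "Service", "Team", "CostCenter", "Owner", "ManagedBy")
ALLOWED_ENVIRONMENTS = {"production", "staging", "development"}


def tag_violations(tags):
    """Return a list of human-readable problems for one resource's tags."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    environment = tags.get("Environment")
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {environment}")
    return problems


# Hypothetical resource with an abbreviated environment name and missing tags.
resource_tags = {
    "Environment": "prod",
    "Service": "payment-api",
    "Owner": "team-email@company.com",
}
for problem in tag_violations(resource_tags):
    print(problem)
```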

Cost Allocation

Unit Economics

Cost per:
├── Request      = Total Cost / Total Requests
├── User         = Total Cost / Active Users
├── Transaction  = Total Cost / Transactions
└── GB stored    = Storage Cost / Data Volume
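
The ratios above translate directly into code. The sketch below uses hypothetical parameter names and made-up monthly figures; a metric with a zero denominator is reported as None rather than raising.

```python
def safe_ratio(cost, volume):
    """Cost per unit, or None when there is no volume to divide by."""
    return cost / volume if volume else None


def unit_costs(total_cost, storage_cost, requests, active_users, transactions, stored_gb):
    """Compute the four unit-economics ratios for one reporting period."""
    return {
        "per_request": safe_ratio(total_cost, requests),
        "per_user": safe_ratio(total_cost, active_users),
        "per_transaction": safe_ratio(total_cost, transactions),
        "per_gb_stored": safe_ratio(storage_cost, stored_gb),
    }


# Hypothetical month: $12,000 total spend, of which $690 is storage.
costs = unit_costs(
    total_cost=12_000, storage_cost=690,
    requests=48_000_000, active_users=8_000,
    transactions=240_000, stored_gb=30_000,
)
for metric, value in costs.items():
    print(f"{metric}: ${value:.5f}")
```

Tracking these ratios over time tends to be more informative than total spend: a rising bill with a falling cost per request usually indicates growth rather than waste.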

Showback Report Template

## Monthly Cost Report - [Team Name]

### Summary
- Total Spend: $X,XXX
- Change from Last Month: +/-X%
- Budget: $X,XXX (XX% utilized)

### Top Services by Cost
1. EC2: $X,XXX (XX%)
2. RDS: $X,XXX (XX%)
3. S3: $X,XXX (XX%)

### Optimization Opportunities
- [ ] Rightsize db.r5.2xlarge → db.r5.large (-$XXX/mo)
- [ ] Delete 5 unused EBS volumes (-$XX/mo)
- [ ] Convert to Savings Plan (-$XXX/mo)

### Action Items
- Owner A: Review EBS volumes by [date]
- Owner B: Implement auto-scaling by [date]

Anomaly Detection

Cost Spike Alert

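# CloudWatch Billing Alarm
# EstimatedCharges is a month-to-date running total, not a daily figure.
# Billing metrics are published in us-east-1 and require billing alerts to be enabled.
Resources:
  CostSpike:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: MonthToDateCostThreshold
      MetricName: EstimatedCharges
      Namespace: AWS/Billing
      Dimensions:
        - Name: Currency
          Value: USD
      Statistic: Maximum
      Period: 21600  # 6 hours; billing data is published a few times a day
      EvaluationPeriods: 1
      Threshold: 1000  # Alert when month-to-date charges exceed $1000
      ComparisonOperator: GreaterThanThreshold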
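
Because the EstimatedCharges billing metric is a month-to-date running total, a per-day spike check needs the day-over-day difference. A minimal sketch, assuming one reading per day starting on the first day of the month and no month rollover inside the series (function names and figures are hypothetical):

```python
def daily_costs(month_to_date):
    """Convert month-to-date running totals into per-day costs."""
    costs = []
    previous = 0.0
    for total in month_to_date:
        costs.append(total - previous)
        previous = total
    return costs


def spike_days(month_to_date, daily_threshold):
    """Return 1-based day numbers whose individual cost exceeds the threshold."""
    return [
        day
        for day, cost in enumerate(daily_costs(month_to_date), start=1)
        if cost > daily_threshold
    ]


# Hypothetical running totals for the first four days of a month.
running_totals = [300, 650, 1900, 2200]
print(daily_costs(running_totals))       # [300.0, 350.0, 1250.0, 300.0]
print(spike_days(running_totals, 1000))  # [3]
```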

Weekly Review Checklist

[ ] Review new resources created this week
[ ] Check for cost anomalies (>20% increase)
[ ] Verify Reserved Instance utilization
[ ] Review Spot Instance interruptions
[ ] Check for idle resources (CPU <5%)
[ ] Validate tagging compliance
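
The ">20% increase" check in this list can be automated with a simple period-over-period comparison. The sketch below is one possible approach with hypothetical names and figures; services with no previous-period baseline are skipped here, on the assumption that they are caught by the "new resources" review instead.

```python
def cost_anomalies(previous, current, threshold_pct=20.0):
    """Return {service: percent increase} for services above the threshold."""
    anomalies = {}
    for service, cost in current.items():
        baseline = previous.get(service, 0.0)
        if baseline <= 0:
            continue  # no baseline to compare against
        change_pct = 100 * (cost - baseline) / baseline
        if change_pct > threshold_pct:
            anomalies[service] = round(change_pct, 1)
    return anomalies


# Hypothetical weekly spend per service, in dollars.
last_week = {"EC2": 1000, "RDS": 500, "S3": 100}
this_week = {"EC2": 1300, "RDS": 550, "S3": 100, "Lambda": 40}
print(cost_anomalies(last_week, this_week))  # {'EC2': 30.0}
```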

Spot Instance Strategy

When to Use Spot

GOOD for Spot:
✓ Batch processing jobs
✓ CI/CD build runners
✓ Dev/Test environments
✓ Stateless web tiers with auto-scaling
✓ Big data processing (EMR, Spark)

NOT for Spot:
✗ Databases
✗ Single instance workloads
✗ Long-running stateful processes
✗ Time-sensitive operations

Spot Best Practices

# Illustrative settings (not a specific tool's schema)
# Use multiple instance types and AZs
spot_options:
  instance_pools_to_use: 4
  spot_instance_types:
    - m5.large
    - m5a.large
    - m5n.large
    - m4.large

# Handle interruptions gracefully
lifecycle:
  terminate_at_notice: true
  grace_period: 120  # 2 min to drain

Response Format

When analyzing costs:

Current State: Breakdown of current spending
Top Opportunities: Ranked by potential savings
Quick Wins: Immediate actions (<1 week)
Strategic Changes: Longer-term optimizations
Estimated Savings: Monthly/annual impact