name: ansible
description: >-
  Ansible automation conventions, patterns, and toolchain: playbook design, roles, inventory, vault, collections, execution environments, Event-Driven Ansible, testing, and performance tuning. Invoke whenever a task involves any interaction with Ansible: writing playbooks, creating roles, managing inventory, reviewing automation code, debugging runs, upgrading ansible-core, or working with AAP.
Ansible
Idempotency is the highest Ansible virtue. Every task must describe desired state, not a sequence of commands.
References
Extended examples, patterns, and rationale for the rules below live in ${CLAUDE_SKILL_DIR}/references/.
- playbook-patterns -- [${CLAUDE_SKILL_DIR}/references/playbook-patterns.md]: Play execution order, static vs dynamic reuse comparison table, batched execution with serial, verification flags, standard directory layouts
- role-structure -- [${CLAUDE_SKILL_DIR}/references/role-structure.md]: Extended directory tree, three ways to use roles (play-level, import_role, include_role) with examples, platform-specific task splitting, argument_specs example, dependency mechanics, deduplication rules
- inventory-management -- [${CLAUDE_SKILL_DIR}/references/inventory-management.md]: YAML/INI examples, group hierarchy, environment separation, AWS/Azure/GCP/NetBox/Terraform plugins, multi-cloud chaining, caching
- vault-and-security -- [${CLAUDE_SKILL_DIR}/references/vault-and-security.md]: File vs variable encryption, password sources, ansible-sign, GPG verification, hardening roles, compliance scanning, security smells
- variables-and-templating -- [${CLAUDE_SKILL_DIR}/references/variables-and-templating.md]: Full 22-level precedence list, magic variables, YAML quoting gotcha, registered variables
- error-handling -- [${CLAUDE_SKILL_DIR}/references/error-handling.md]: Block execution flow, rescue variables, failed_when/changed_when, any_errors_fatal
- handlers-and-delegation -- [${CLAUDE_SKILL_DIR}/references/handlers-and-delegation.md]: Handler execution order, flushing, delegate_to, delegate_facts, fire-and-forget async
- testing-and-performance -- [${CLAUDE_SKILL_DIR}/references/testing-and-performance.md]: Molecule lifecycle, driver comparison, CI matrix testing, strategy plugins, SSH pipelining, fact caching, serial batching
- execution-environments-and-collections -- [${CLAUDE_SKILL_DIR}/references/execution-environments-and-collections.md]: EE vs local installs, version 3 schema, FQCN migration, collection certification, Galaxy publishing, automation mesh, AAP 2.5/2.6 platform architecture
- event-driven-ansible -- [${CLAUDE_SKILL_DIR}/references/event-driven-ansible.md]: Rulebook structure, event sources, conditions, actions, Event Streams, decision environments, Kafka vs webhooks, performance tuning, troubleshooting
- porting-guide -- [${CLAUDE_SKILL_DIR}/references/porting-guide.md]: ansible-core 2.17/2.18 breaking changes, Python version requirements, removed modules, AAP 2.5/2.6 deprecations, upgrade strategy
Playbook Design
Naming and Clarity
- Always name plays, tasks, and blocks. Unnamed tasks produce opaque output that makes debugging impossible.
- Always specify state: explicitly. Different modules have different defaults. state: present / state: absent makes intent visible.
- Always use FQCN (Fully Qualified Collection Names): ansible.builtin.copy, not copy. Prevents ambiguity when multiple collections are installed.
Idempotency
- Prefer declarative modules (ansible.builtin.template, ansible.builtin.service, ansible.builtin.user) over imperative ones (ansible.builtin.command, ansible.builtin.shell)
- When command/shell is unavoidable, add creates:, removes:, or changed_when: to make it idempotent
- Move complex logic into custom modules or filter plugins -- Ansible is a desired state engine, not a scripting language
- Test idempotency: run twice, second run must report zero changes
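A minimal sketch of the creates: and changed_when: guards described above. The script paths and the marker file are hypothetical, and the first task assumes the script itself writes the marker file.

```yaml
---
- name: Idempotent use of the command module
  hosts: all
  gather_facts: false
  tasks:
    - name: Initialize the application database only once
      ansible.builtin.command:
        cmd: /opt/app/bin/init-db                 # hypothetical script
        creates: /var/lib/app/.db_initialized     # task is skipped once this file exists
      become: true

    - name: Read the schema version without reporting a change
      ansible.builtin.command:
        cmd: /opt/app/bin/schema-version          # hypothetical read-only script
      register: schema_version
      changed_when: false                         # a read-only command never changes state
```

Running this play twice should report zero changes on the second run, provided the init script creates the marker file.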
Static vs Dynamic Reuse
- import_tasks/import_role -- static, parsed at load time. Tags and when conditions propagate to every imported task. Cannot loop. Use when structure is fixed.
- include_tasks/include_role -- dynamic, evaluated at runtime. Tags and when conditions apply only to the include statement itself. Can loop. Use when inclusion is conditional.
Default to import_* for predictability.
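The difference between the two forms can be sketched as follows. The task file names and the vhosts variable are hypothetical placeholders.

```yaml
---
- name: Static and dynamic task reuse
  hosts: webservers
  tasks:
    - name: Import fixed hardening tasks (static, tags reach every imported task)
      ansible.builtin.import_tasks: tasks/hardening.yml   # hypothetical file
      tags: [hardening]

    - name: Include one task file per virtual host (dynamic, loops are allowed)
      ansible.builtin.include_tasks: tasks/vhost.yml      # hypothetical file
      loop: "{{ vhosts | default([]) }}"                  # vhosts is an assumed variable
      loop_control:
        loop_var: vhost
```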
Project Layout
inventories/
  production/
    hosts
    group_vars/
    host_vars/
  staging/
    hosts
    group_vars/
    host_vars/
site.yml                # imports tier playbooks
webservers.yml
dbservers.yml
roles/
  common/
  webserver/
  database/
site.yml imports tier playbooks. Each tier playbook maps host groups to roles.
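As a rough sketch of that layout, site.yml and one tier playbook might look like this (two files shown as two YAML documents; the role and group names follow the tree above):

```yaml
# site.yml
---
- name: Configure web tier
  ansible.builtin.import_playbook: webservers.yml

- name: Configure database tier
  ansible.builtin.import_playbook: dbservers.yml

# webservers.yml
---
- name: Apply roles to web servers
  hosts: webservers
  roles:
    - role: common
    - role: webserver
```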
Roles
A role manages one service or component — not an entire stack. Keep provisioning separate from configuration and application deployment. Roles are not programming constructs: avoid deep inheritance hierarchies, tight coupling, or hard dependencies on external variables.
Structure
roles/my_role/
  tasks/main.yml            # entry point
  handlers/main.yml         # auto-imported into play scope
  templates/*.j2            # Jinja2 templates
  files/                    # static files
  defaults/main.yml         # low-precedence (user-configurable)
  vars/main.yml             # high-precedence (internal constants)
  meta/main.yml             # dependencies
  meta/argument_specs.yml   # argument validation (2.11+)
defaults/ vs vars/
- defaults/ -- easily overridden. Use for knobs users should change (ports, paths, feature flags).
- vars/ -- hard to override. Use for internal constants the role needs to function.
Naming
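- Role names: lowercase with underscores: nginx_proxy, ssl_certs (hyphens fail the ansible-lint role-name rule and are not allowed for roles shipped in collections)
- Prefix all role variables with the role name: nginx_port, nginx_worker_count
- Prefix handler names with role name: nginx : Restart nginx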
Argument Validation
Define expected parameters in meta/argument_specs.yml. Validation runs before role tasks execute.
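A small illustrative spec, assuming a role named nginx with the two variables used as naming examples above; the descriptions and default are placeholders.

```yaml
# roles/nginx/meta/argument_specs.yml
---
argument_specs:
  main:                                  # validates the tasks/main.yml entry point
    short_description: Install and configure nginx
    options:
      nginx_port:
        type: int
        default: 80
        description: TCP port nginx listens on.
      nginx_worker_count:
        type: int
        required: false
        description: Number of nginx worker processes.
```

With this in place, passing nginx_port: "eighty" should fail validation before any role task runs.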
Dependencies
Dependencies are defined in meta/main.yml and run before the role. Each dependency is deduplicated per play unless its parameters differ or allow_duplicates: true is set.
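For illustration, a dependency declaration might look like the following. The firewall role and its variable are hypothetical.

```yaml
# roles/webserver/meta/main.yml
---
dependencies:
  - role: common
  - role: firewall                         # hypothetical role
    vars:
      firewall_allowed_ports: [80, 443]    # differing parameters defeat deduplication
```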
Inventory
Format
Prefer YAML over INI. INI :vars sections treat all values as strings, causing type confusion.
Grouping Strategy
Group along three dimensions:
- What (function): webservers, dbservers, monitoring
- Where (location): dc1, dc2, us_east
- When (environment): production, staging, development
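A sketch of a YAML inventory grouped along those three dimensions. Host names are placeholders, and the file is assumed to be saved with a .yml extension so the YAML inventory plugin picks it up.

```yaml
# inventories/production/hosts.yml
all:
  children:
    webservers:                  # what
      hosts:
        web1.example.com:
          http_port: 8080        # stays an integer in YAML
        web2.example.com:
    dbservers:                   # what
      hosts:
        db1.example.com:
    dc1:                         # where
      hosts:
        web1.example.com:
        db1.example.com:
    dc2:                         # where
      hosts:
        web2.example.com:
    production:                  # when
      children:
        webservers:
        dbservers:
```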
Environment Separation
Split large inventories by function or region — a single static file with 5,000+ hosts takes 15-30 seconds to load. Keep production and staging in separate inventory files or directories. Never mix environments in a single inventory -- developers using a mixed inventory need access to all vault passwords.
Dynamic Inventory
Use inventory plugins (not scripts) for cloud providers:
- AWS: amazon.aws.aws_ec2 -- groups from tags, instance types, regions
- Azure: azure.azcollection.azure_rm -- conditional groups, keyed groups
- GCP: google.cloud.gcp_compute -- zones, machine types, labels
- NetBox: netbox.netbox.nb_inventory -- single source of truth for hybrid environments, automatic group updates from tags/custom fields
- Terraform: cloud.terraform.terraform_state -- parse state files as inventory
Mix static and dynamic sources in the same inventory directory.
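As one possible shape for a dynamic source, here is a hedged aws_ec2 plugin configuration. The region, tag names, and cache path are assumptions; the file name must end in aws_ec2.yml or aws_ec2.yaml for the plugin to accept it.

```yaml
# inventories/production/aws_ec2.yml
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1                              # assumed region
filters:
  "tag:Environment": production            # assumed tagging scheme
keyed_groups:
  - key: tags.Role                         # yields groups such as role_webserver
    prefix: role
  - key: placement.region
    prefix: region
cache: true
cache_plugin: ansible.builtin.jsonfile
cache_connection: /tmp/ansible_inventory_cache   # assumed cache directory
cache_timeout: 3600
```

Placing this file next to a static hosts file in the same directory lets Ansible merge both sources.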
Constructed Inventory
Build groups dynamically from host metadata using Jinja2 logic. Chain multiple cloud inventories into a single constructed inventory for cross-cloud targeting. Successor to Smart Inventories in AAP.
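A minimal sketch of a constructed inventory layered on top of other sources. The group names are assumptions carried over from the grouping strategy above; the file should sort after the sources it depends on, since files in an inventory directory are parsed in order.

```yaml
# inventories/production/zz_constructed.yml
plugin: ansible.builtin.constructed
strict: false
groups:
  # Jinja2 expression evaluated per host; true adds the host to the group
  prod_web: "'webservers' in group_names and 'production' in group_names"
  dc1_db: "'dbservers' in group_names and 'dc1' in group_names"
```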
Variables and Precedence
The 22-Level Precedence Rule
Role defaults/ is the lowest-precedence variable source (only command-line connection options such as -u rank lower, and those are not variables). Extra vars (-e) always win. Most common layers:
- Overridable defaults → Role defaults/main.yml
- Environment-wide values → group_vars/all.yml
- Group-specific values → group_vars/<group>.yml
- Host-specific values → host_vars/<host>.yml
- Force a value in a role → Role vars/main.yml
- Override everything at runtime → --extra-vars
Define each variable in ONE place.
YAML Quoting
Values starting with {{ }} must be quoted:
app_path: "{{ base_path }}/app" # correct
app_path: {{ base_path }}/app # YAML parse error
Common Gotchas
- Boolean coercion: YAML treats yes, no, true, false, on, off as booleans. Quote strings that match: version: "yes", not version: yes
- Octal numbers: Leading zeros create octals in YAML 1.1. mode: 0644 becomes 420 (decimal). Use mode: "0644" for file permissions.
- Dictionary merge: combine() does shallow merge. Nested dicts are replaced, not merged. Use combine(recursive=true) for deep merge.
- Variable scope in loops: set_fact in a loop overwrites on each iteration. Use set_fact with {{ result | default([]) + [item] }} to accumulate.
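The gotchas above can be exercised in a short play. All variable names and the directory path here are illustrative only.

```yaml
---
- name: Quoting, deep merge, and loop accumulation
  hosts: localhost
  gather_facts: false
  vars:
    base_config:
      server: {port: 80, workers: 2}
    override_config:
      server: {port: 8080}
  tasks:
    - name: Create a directory with a quoted octal mode
      ansible.builtin.file:
        path: /tmp/demo_app          # illustrative path
        state: directory
        mode: "0755"                 # quoted so YAML does not turn it into decimal 493

    - name: Deep-merge nested dictionaries (workers survives the override)
      ansible.builtin.set_fact:
        merged_config: "{{ base_config | combine(override_config, recursive=true) }}"

    - name: Accumulate loop items into a list instead of overwriting
      ansible.builtin.set_fact:
        collected: "{{ collected | default([]) + [item] }}"
      loop: [alpha, beta]
```

Without recursive=true, merged_config.server would contain only port, because the nested dictionary is replaced wholesale.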
Jinja2 Templating
All templating runs on the control node before task execution.
Key Patterns
- Filters: {{ value | default('fallback') }}, {{ list | unique }}, {{ dict1 | combine(dict2) }}
- Tests: when: result is defined, when: path is file
- Template files (.j2): support loops, conditionals, macros -- full Jinja2
Templates in Tasks vs Files
- Playbooks: only variable substitution and filters. No loops or conditionals in task arguments.
- Template files (.j2): full Jinja2 including {% for %}, {% if %}, {% macro %}.
Vault and Security
The vars/vault Pattern
# group_vars/production/vars.yml (plaintext, searchable)
db_password: "{{ vault_db_password }}"
# group_vars/production/vault.yml (encrypted)
vault_db_password: "actual_secret"
Variable names remain greppable. Values stay encrypted.
Password Automation
- Never type vault passwords manually for every run
- Use a password file (--vault-password-file vault_pass.txt) for local dev
- Use a password script (.vault_pass.sh) that fetches from a secrets manager for team environments
- In CI/CD, pass vault passwords via the ANSIBLE_VAULT_PASSWORD_FILE environment variable pointing to a pipeline secret
External Secret Managers
For enterprise or compliance-heavy environments, shift from static vault files to runtime secret fetching via lookup plugins:
- HashiCorp Vault, AWS Secrets Manager, Azure Key Vault
- Eliminates manual vault file rotation
- Secrets never touch disk -- fetched at playbook runtime
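A hedged sketch of runtime fetching with the community.hashi_vault.hashi_vault lookup. It assumes the collection is installed, that the Vault address and token are supplied through the controller's environment (for example VAULT_ADDR and VAULT_TOKEN), and that a KV v2 secret exists at the hypothetical path shown.

```yaml
---
- name: Fetch a secret at runtime instead of storing it in a vault file
  hosts: dbservers
  gather_facts: false
  vars:
    # Hypothetical secret path and key; evaluated lazily on the control node
    db_password: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=secret/data/app/db:password') }}"
  tasks:
    - name: Write database credentials file
      ansible.builtin.copy:
        content: "password={{ db_password }}\n"
        dest: /etc/app/db.conf           # hypothetical destination
        mode: "0600"
      become: true
      no_log: true                       # keep the secret out of task output
```

Other secret managers follow the same shape with their own lookup plugins; check each collection's documentation for the exact term syntax.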
Content Signing
Use ansible-sign with GPG to sign project content. Creates checksum manifests (SHA256) of protected files with
detached GPG signatures. AAP automation controller verifies signatures on project sync -- tampered projects fail to
update and no jobs launch. Automate signing in CI via ANSIBLE_SIGN_GPG_PASSPHRASE environment variable.
Security Hardening
- Use community CIS benchmark roles (e.g., ansible-lockdown) for automated compliance. Customize via defaults/main.yml, select levels via tags.
- Integrate OpenSCAP for compliance scanning and report generation.
- Watch for IaC security smells: root SSH login, command injection, plaintext secrets, unvalidated paths, outdated dependencies.
Error Handling
block/rescue/always
block:
  - name: Deploy new version
    # ... tasks that might fail
rescue:
  - name: Rollback
    # ... recovery tasks
always:
  - name: Send notification
    # ... runs regardless
- rescue runs only when a block task fails
- always runs regardless of block/rescue outcome
- Rescue variables: ansible_failed_task, ansible_failed_result
- Hosts that fail in block but succeed in rescue are reported as "rescued", not "failed" -- account for this in reporting
Result Aggregation Pattern
For multi-host runs, capture per-host status in block/rescue, then aggregate in always using
ansible_play_hosts_all with delegate_to: localhost and run_once: true. This produces a single summary of all
successes and failures across the fleet.
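One way the aggregation pattern could look, assuming the default linear strategy so that all hosts reach the always section together. The deploy script path and the deploy_status fact name are hypothetical.

```yaml
---
- name: Deploy and summarize results across the fleet
  hosts: webservers
  gather_facts: false
  tasks:
    - name: Deploy with per-host status capture
      block:
        - name: Run deployment step
          ansible.builtin.command: /usr/local/bin/deploy.sh   # hypothetical script
          changed_when: true

        - name: Record success
          ansible.builtin.set_fact:
            deploy_status: ok
      rescue:
        - name: Record failure with the name of the failed task
          ansible.builtin.set_fact:
            deploy_status: "failed at: {{ ansible_failed_task.name }}"
      always:
        - name: Print one summary line per host, once
          ansible.builtin.debug:
            msg: "{{ item }}: {{ hostvars[item].deploy_status | default('unreachable') }}"
          loop: "{{ ansible_play_hosts_all }}"
          delegate_to: localhost
          run_once: true
```

Hosts that never set deploy_status (for example, unreachable ones) fall through to the default value in the summary.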
Error Control
- failed_when: -- custom failure conditions
- changed_when: -- control when a task reports "changed"
- ignore_errors: true -- continue on failure (use sparingly)
- any_errors_fatal: true -- stop entire play on any host failure
Retry
- name: Wait for service
  ansible.builtin.uri:
    url: http://localhost:8080/health
  register: result
  until: result.status == 200
  retries: 30
  delay: 10
Handlers
- Run once per play after all tasks complete (or on meta: flush_handlers)
- Execute in definition order, not notification order
- Multiple notifications to the same handler result in single execution
- Use listen: topics to group related handlers
- Never use variables in handler names -- use them in handler parameters
- Handlers from roles have global scope; prefix with role_name : handler_name
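A sketch of these rules in one play. The template name and the listen topic are hypothetical; handler names are quoted because an unquoted " : " would be parsed as a YAML mapping.

```yaml
---
- name: Handlers with a listen topic and an explicit flush
  hosts: webservers
  tasks:
    - name: Deploy nginx configuration
      ansible.builtin.template:
        src: nginx.conf.j2                 # hypothetical template
        dest: /etc/nginx/nginx.conf
        mode: "0644"
      become: true
      notify: "nginx : reload web stack"

    - name: Run pending handlers now, before the smoke test
      ansible.builtin.meta: flush_handlers

    - name: Smoke-test the site
      ansible.builtin.uri:
        url: http://localhost/
        status_code: 200

  handlers:
    # Definition order decides execution order: validate first, then reload
    - name: "nginx : Validate configuration"
      ansible.builtin.command: nginx -t
      become: true
      changed_when: false
      listen: "nginx : reload web stack"

    - name: "nginx : Reload nginx"
      ansible.builtin.service:
        name: nginx
        state: reloaded
      become: true
      listen: "nginx : reload web stack"
```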
Delegation and Async
Delegation
Execute a task on a different host: delegate_to: lb.example.com. Use for load balancer operations, centralized
notifications, cross-host coordination.
local_action: is shorthand for delegate_to: 127.0.0.1.
When multiple hosts delegate to the same target, use throttle: 1 or run_once: true to prevent race conditions.
become applies to the delegated host, not the original target -- verify escalation permissions.
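A hedged example of delegating load balancer operations, assuming HAProxy managed through the community.general.haproxy module. The backend name, socket path, and package name are placeholders.

```yaml
---
- name: Update web servers behind a load balancer
  hosts: webservers
  tasks:
    - name: Drain this host from the load balancer pool
      community.general.haproxy:
        state: disabled
        host: "{{ inventory_hostname }}"
        backend: app_backend                 # hypothetical backend name
        socket: /run/haproxy/admin.sock      # assumed admin socket path
      delegate_to: lb.example.com
      become: true                           # escalates on lb.example.com, not on the web server
      throttle: 1                            # one host at a time talks to the load balancer

    - name: Ensure the application package is installed
      ansible.builtin.package:
        name: myapp                          # hypothetical package
        state: present
      become: true

    - name: Return this host to the pool
      community.general.haproxy:
        state: enabled
        host: "{{ inventory_hostname }}"
        backend: app_backend
        socket: /run/haproxy/admin.sock
      delegate_to: lb.example.com
      become: true
      throttle: 1
```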
Async
- async: N, poll: M (M > 0) -- extended timeout, still blocks
- async: N, poll: 0 -- fire-and-forget, check later with async_status
- Do not use poll: 0 with tasks requiring exclusive locks (package managers)
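The fire-and-forget form, followed by a later status check, might be sketched like this. The reindex script is hypothetical and the timeouts are arbitrary.

```yaml
---
- name: Fire-and-forget with a later status check
  hosts: all
  gather_facts: false
  tasks:
    - name: Start a long-running reindex in the background
      ansible.builtin.command: /opt/app/bin/reindex   # hypothetical script
      async: 3600            # allow up to one hour
      poll: 0                # do not wait here
      register: reindex_job
      changed_when: true

    - name: Wait for the background job to finish
      ansible.builtin.async_status:
        jid: "{{ reindex_job.ansible_job_id }}"
      register: reindex_result
      until: reindex_result.finished
      retries: 120
      delay: 30
```

Other tasks can be placed between the two to overlap useful work with the background job.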
Event-Driven Ansible (EDA)
EDA is the "Automation Decisions" component of AAP -- a decision engine that listens to event sources and triggers automated responses via rulebooks. Rulebooks are the event-driven equivalent of playbooks: YAML files with sources, conditions, and actions.
Key concepts:
- Rulebooks define event sources, conditions (the condition key of each rule), and actions (run_job_template, run_workflow_template, set_fact, debug)
- Event Streams (AAP 2.5+) route a single webhook endpoint to multiple rulebook activations with credential integration -- use for production
- Decision environments are container images for running rulebooks (analogous to execution environments for playbooks)
- Supported controller sources: alertmanager, aws_sqs_queue, azure_service_bus, kafka, pg_listener, webhook
- Use Kafka for high-volume mission-critical streams; webhooks for simple integrations; Event Streams for production webhook scenarios
See [${CLAUDE_SKILL_DIR}/references/event-driven-ansible.md] for rulebook structure, event filters, scaling, and
troubleshooting.
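A minimal rulebook sketch using the webhook source from the ansible.eda collection. The port, the payload field, and the job template name are assumptions for illustration.

```yaml
---
- name: Respond to webhook alerts
  hosts: all
  sources:
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 5000                                   # assumed listener port
  rules:
    - name: Restart the service when an alert reports it down
      condition: event.payload.status == "down"     # assumed payload shape
      action:
        run_job_template:
          name: Restart web service                  # hypothetical job template
          organization: Default
```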
Execution Environments and AAP Platform
Container images bundling Ansible Core, Runner, collections, and all dependencies. Replace traditional virtual environments for consistent automation execution.
- ansible-builder: Creates custom EEs from definition files (version 3 schema). Specify base image, Galaxy collections, Python packages, and system dependencies.
- ansible-navigator: Interactive TUI for playbook development. Drill into task outputs, inspect variables, replay artifacts for collaborative debugging. Tightly integrated with EEs for dev-prod parity.
- Automation mesh: Overlay network distributing workloads across execution nodes via peer-to-peer connections using Receptor. Scale execution capacity independently from the control plane.
Use EEs when: enterprise scale, complex dependencies, team consistency needed. Use local installs for: simple setups, ad-hoc tasks, beginners.
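A hedged sketch of a version 3 definition file for ansible-builder. The base image, package choices, and collection are illustrative, not recommendations.

```yaml
# execution-environment.yml
---
version: 3
images:
  base_image:
    name: quay.io/fedora/fedora:latest     # illustrative base image
dependencies:
  python_interpreter:
    package_system: python3
  ansible_core:
    package_pip: ansible-core
  ansible_runner:
    package_pip: ansible-runner
  system:
    - openssh-clients
  galaxy:
    collections:
      - name: community.general
```

Building typically comes down to running ansible-builder build with a tag in the directory that holds this file.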
AAP 2.5/2.6 Awareness
AAP 2.5 introduced a unified UI, Platform Gateway (single auth entry point), and containerized installer (Podman on RHEL). AAP 2.6 adds an automation dashboard (ROI tracking), self-service automation portal, and Ansible Lightspeed intelligent assistant.
RPM-based installer is deprecated as of AAP 2.5 -- containerized and operator-based deployments are the future.
See [${CLAUDE_SKILL_DIR}/references/porting-guide.md] for AAP platform changes and upgrade guidance.
Collections
- Install with ansible-galaxy collection install community.general
- Pin versions in requirements.yml using version ranges:
    collections:
      - name: community.general
        version: ">=7.0.0,<8.0.0"
- Always use FQCN in playbooks: community.general.ufw, not ufw
- Install collection dependencies before playbook execution in CI: ansible-galaxy collection install -r requirements.yml
- Vendor collections for air-gapped environments: ansible-galaxy collection download -r requirements.yml -p ./collections/
- Scope collection installs per project -- avoid global installs that create version conflicts across projects
Collection Quality
- ansible-test sanity --docker default for coding standards; ansible-lint --profile production for certification
- galaxy-importer in CI to replicate automation hub import checks
- Semantic versioning (minimum 1.0.0), requires_ansible in meta/runtime.yml
- FQCN migration: use plugin_routing in meta/runtime.yml for backward-compatible redirects
Testing
Pipeline
Integrate ansible-lint in CI and pre-commit hooks. For enterprise environments, add policy-as-code tools (Steampunk
Spotter, Checkov) as gates before automation reaches production.
- ansible-lint -- static analysis in CI and pre-commit hooks
- --syntax-check -- parse without executing
- --check --diff -- dry run against staging
- Molecule -- role-level testing with idempotency verification
- Matrix testing -- multiple OS versions and Ansible versions in CI
- Staging environment -- full run before production
Molecule
Standard role testing framework. Drivers: Docker (fast, local dev), Podman (rootless, enterprise), Vagrant (full VM),
delegated (renamed default in Molecule 6, where it is the built-in driver). Run molecule test for the full lifecycle,
including the idempotence check. Use multiple scenarios for different conditions (default, HA cluster, upgrade).
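A minimal scenario configuration, assuming the Docker driver is available (in recent Molecule releases it ships in the separate molecule-plugins package). The container image is a hypothetical placeholder.

```yaml
# molecule/default/molecule.yml
---
dependency:
  name: galaxy
driver:
  name: docker                                              # assumes molecule-plugins[docker] is installed
platforms:
  - name: instance
    image: registry.example.com/ci/rockylinux9-ansible:latest   # hypothetical image
    pre_build_image: true
provisioner:
  name: ansible
verifier:
  name: ansible
```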
Performance
- Increase forks (default 5) for parallel host execution -- start at 2-4x CPU cores, monitor control node memory
- Enable SSH pipelining: pipelining = True with ControlPersist
- Mitogen strategy plugin: replaces SSH-based execution with RPC protocol, 1.5x-7x faster. Use mitogen_linear or mitogen_free. Most impactful for playbooks with many small tasks.
- Cache facts: gathering = smart with fact_caching = jsonfile (or Redis)
- Disable gather_facts when not needed; use gather_subset to limit scope
- Use ansible.posix.synchronize over ansible.builtin.copy for large file transfers
- Install packages as a list, not in a loop
- Use serial for staged batching in rolling deployments
- Profile before optimizing: callbacks_enabled = timer, profile_tasks
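Several of the playbook-level items above can be combined in one play. This sketch assumes RHEL-family hosts (hence ansible.builtin.dnf); the package names and batch sizes are arbitrary.

```yaml
---
- name: Performance-minded rolling play
  hosts: webservers
  gather_facts: false          # no facts are used below
  serial:
    - 1                        # canary host first
    - "25%"                    # then a quarter of the remaining hosts per batch
  tasks:
    - name: Install packages in a single transaction rather than a loop
      ansible.builtin.dnf:
        name:
          - nginx
          - curl
          - rsync
        state: present
      become: true
```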
Large Inventories
Enable inventory caching for dynamic sources (30+ seconds to under 1 second). Use constructed inventory plugin over large static groups. Flatten group hierarchies (3-4 groups per host, not 6-7). Split inventories by function/region.
Porting and Compatibility
ansible-core 2.17+ requires Python 3.7+ on managed hosts. RHEL 8 environments must stay on ansible-core 2.16 (system
Python 3.6 bindings are incompatible). Key removals: yum module (redirected to dnf), include module (use
include_tasks/import_tasks), smart connection option (select explicit plugin).
ansible-core 2.18 removes old-style vars plugins (get_host_vars/get_group_vars) and deprecates plural
COLLECTIONS_PATHS. Windows Server 2012/2012 R2 support is removed.
See [${CLAUDE_SKILL_DIR}/references/porting-guide.md] for the full list of breaking changes, deprecations, and upgrade
strategy.
Application
When writing Ansible automation: apply all conventions silently. If an existing codebase contradicts a convention, follow the codebase and flag the divergence.
When reviewing Ansible code: cite the specific violation and show the fix inline. Example:
copy: -> ansible.builtin.copy:
Integration
The coding skill governs workflow; this skill governs Ansible-specific conventions. Both are active simultaneously.
Non-Negotiable Defaults
- Every task must be idempotent -- run twice, zero changes on second run
- Always use FQCN -- no short module names
- Always name plays, tasks, and blocks
- Never store secrets in plaintext -- use Vault
- Never commit vault password files to version control
- Use no_log: true on tasks handling secrets
- Use become: true at task level, not play level
- Use SSH key authentication, not password authentication
- Sign project content with ansible-sign in regulated environments
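A short task that follows several of these defaults at once, assuming db_password is defined through the vars/vault pattern; the destination path is hypothetical.

```yaml
---
- name: Secret handling with task-level privilege escalation
  hosts: dbservers
  gather_facts: false
  tasks:
    - name: Write database credentials file
      ansible.builtin.copy:
        content: "password={{ db_password }}\n"   # db_password comes from an encrypted vault file
        dest: /etc/app/db.conf                     # hypothetical path
        owner: root
        group: root
        mode: "0600"
      become: true      # escalate for this task only
      no_log: true      # keep the secret out of logs and diffs
```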
Idempotency is the highest Ansible virtue. Describe desired state, never command sequences.