k3s-ops - SKILL.md Agent Skill

name: k3s-ops
description: |
  K3s lightweight Kubernetes cluster deployment and operations skill. Supports automated K3s cluster deployment (Server + Agent nodes), cluster upgrades, certificate management, backup and restore, troubleshooting, and daily maintenance.
  Trigger scenarios: K3s installation, K3s deployment, K3s upgrade, K3s maintenance, K3s failure, lightweight cluster, edge cluster, cluster initialization, k3s server, k3s agent, cluster setup, node join.
allowed-tools: linux_execute_command linux_get_system_info linux_get_service_status linux_service_control linux_upload_script linux_write_file linux_read_file pods_list nodes_list events_list

K3s Cluster Deployment and Operations Guide

Reference: https://github.com/k3s-io/k3s | https://docs.k3s.io

I. Automated K3s Cluster Deployment

Prerequisites

Target host is reachable over SSH (via Linux MCP)
System requirements: 64-bit Linux (Ubuntu 20.04+, CentOS 7+, or RHEL 8+ recommended)
Memory: 512 MB is the minimum for an Agent; plan for 2 GB or more on a Server
Network: nodes can reach each other, and the Server node exposes TCP port 6443

Step 1: Environment Check

Use linux_get_system_info(host) to verify:

OS and kernel version
CPU/memory resources meet minimum requirements
Network connectivity (ping between nodes)
Firewall status

# Check system requirements
uname -a
free -h
df -h
# Check if port 6443 is in use
ss -tlnp | grep 6443
# Check firewall
systemctl status firewalld 2>/dev/null || ufw status 2>/dev/null
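
The memory check can be scripted so that a failed prerequisite stops the deployment early. The following is a minimal sketch; `mem_total_mib` and `meets_min_memory` are hypothetical helper names, and the threshold is whatever value the prerequisites call for.

```shell
# Sketch only: helper names are illustrative, not part of K3s.
mem_total_mib() {
  # $1: path to a meminfo-style file (defaults to /proc/meminfo)
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

meets_min_memory() {
  # $1: required MiB, $2: optional meminfo path
  local total
  total=$(mem_total_mib "${2:-/proc/meminfo}")
  [ -n "$total" ] && [ "$total" -ge "$1" ]
}

# Example: warn when the host has less than 2048 MiB
# meets_min_memory 2048 || echo "WARN: less than 2 GiB RAM" >&2
```

Reading `/proc/meminfo` instead of parsing `free -h` avoids the human-readable unit suffixes.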

Step 2: Deploy K3s Server (Master Node)

Use linux_execute_command or linux_upload_script to run the install script:

# Install K3s Server
curl -sfL https://get.k3s.io | sh -s - server \
  --write-kubeconfig-mode 644 \
  --disable traefik \
  --disable servicelb \
  --tls-san <SERVER_IP_OR_DOMAIN>

Common install options:

Option	Description	Typical use
`--write-kubeconfig-mode 644`	kubeconfig file permissions	Allow non-root read
`--disable traefik`	Disable built-in Traefik	Use custom Ingress Controller
`--disable servicelb`	Disable built-in ServiceLB	Use MetalLB
`--tls-san`	API Server extra SAN	Domain or external IP
`--data-dir`	Data directory	Custom storage path
`--cluster-init`	Enable embedded etcd	HA mode
`--flannel-backend=none`	Disable Flannel	Use Calico/Cilium
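
The same options can also live in `/etc/rancher/k3s/config.yaml`, where each key mirrors a CLI flag without the leading dashes. A sketch of writing that file before running the installer follows; `write_k3s_config` is a hypothetical helper, and the option set simply mirrors the Step 2 command.

```shell
# Sketch only: write_k3s_config is an illustrative helper name.
write_k3s_config() {
  # $1: output path (/etc/rancher/k3s/config.yaml on a real host)
  # $2: extra SAN for the API server certificate
  cat > "$1" <<EOF
write-kubeconfig-mode: "0644"
disable:
  - traefik
  - servicelb
tls-san:
  - "$2"
EOF
}

# Example (as root, before running the install script):
# mkdir -p /etc/rancher/k3s
# write_k3s_config /etc/rancher/k3s/config.yaml k3s.example.com
```

Keeping options in the config file means they persist when the install script is re-run later without flags.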

Step 3: Get Node Token

cat /var/lib/rancher/k3s/server/node-token

Step 4: Deploy K3s Agent (Worker Node)

curl -sfL https://get.k3s.io | K3S_URL=https://<SERVER_IP>:6443 \
  K3S_TOKEN=<NODE_TOKEN> sh -s - agent

Step 5: Verify Cluster

k3s kubectl get nodes
k3s kubectl get pods -A
k3s kubectl cluster-info
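
For unattended runs, the node listing can be reduced to a pass/fail signal. The sketch below assumes the default column layout of `kubectl get nodes --no-headers` (name first, status second); `not_ready_nodes` is a hypothetical helper name.

```shell
# Sketch only: reads `kubectl get nodes --no-headers` output on stdin and
# prints every node whose STATUS column is not exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Example on a Server node:
# k3s kubectl get nodes --no-headers | not_ready_nodes
```

A cordoned node reports `Ready,SchedulingDisabled` and is therefore listed too, which is usually what a post-deploy check wants.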

II. High Availability Deployment

Embedded etcd Mode (3 Server Nodes)

# First Server node
curl -sfL https://get.k3s.io | sh -s - server \
  --cluster-init \
  --tls-san <VIP_OR_LB_IP>

# Get token
cat /var/lib/rancher/k3s/server/node-token

# Second and third Server nodes
curl -sfL https://get.k3s.io | K3S_TOKEN=<TOKEN> sh -s - server \
  --server https://<FIRST_SERVER_IP>:6443 \
  --tls-san <VIP_OR_LB_IP>
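
The reason for three Server nodes is etcd's majority rule: a cluster of n members needs floor(n/2)+1 members up to accept writes. The arithmetic can be sketched as follows; the function names are illustrative.

```shell
# Sketch only: majority-quorum arithmetic for an etcd cluster of $1 members.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

etcd_fault_tolerance() {
  # Members that can fail while the cluster still has quorum
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Example:
# etcd_quorum 3            # prints 2
# etcd_fault_tolerance 3   # prints 1
```

An even member count adds no tolerance (4 members still tolerate only 1 failure), which is why odd sizes such as 3 or 5 are the usual choice.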

III. Cluster Upgrade

Manual Upgrade

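# Server node (upgrade one at a time). Re-run the installer with the same flags
# used at install time; flags not kept in /etc/rancher/k3s/config.yaml are
# otherwise dropped from the regenerated service unit.
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=stable sh -s - server <ORIGINAL_SERVER_FLAGS>

# Agent node
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=stable \
  K3S_URL=https://<SERVER_IP>:6443 K3S_TOKEN=<TOKEN> sh -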

Upgrade Notes

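Upgrade Server nodes first, then Agent nodes
Upgrade one node at a time, and verify the node is Ready before moving to the next
In HA mode, keep etcd quorum throughout (at least 2 of 3 Server nodes available)
Take an etcd snapshot before upgrading (embedded etcd only; for the default SQLite datastore, back up /var/lib/rancher/k3s/server/db/ instead)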
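
The one-at-a-time rule can be enforced with a polling loop that blocks until the upgraded node reports Ready. This is a sketch under assumptions: `wait_node_ready` is a hypothetical helper, and the `KUBECTL` variable is an invented override that defaults to the bundled `k3s kubectl`.

```shell
# Sketch only: poll a node until its STATUS column is "Ready".
wait_node_ready() {
  # $1: node name, $2: max attempts (default 30), $3: seconds between attempts (default 10)
  local node="$1" attempts="${2:-30}" delay="${3:-10}" i=0 status
  while [ "$i" -lt "$attempts" ]; do
    # KUBECTL is intentionally unquoted so "k3s kubectl" splits into two words
    status=$(${KUBECTL:-k3s kubectl} get node "$node" --no-headers 2>/dev/null | awk '{ print $2 }')
    if [ "$status" = "Ready" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: stop the rollout if server-2 is not Ready within 5 minutes
# wait_node_ready server-2 30 10 || { echo "server-2 not Ready, aborting" >&2; exit 1; }
```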

IV. Backup and Restore

etcd Snapshot

# Manual snapshot (requires the embedded etcd datastore, e.g. a Server started with --cluster-init)
k3s etcd-snapshot save --name pre-upgrade-$(date +%Y%m%d)

# List snapshots
k3s etcd-snapshot ls

# Restore from snapshot (run after stopping k3s)
systemctl stop k3s
k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<snapshot>
systemctl start k3s
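
When restoring, the `<snapshot>` placeholder is often simply the most recent file in the snapshot directory. A sketch for locating it follows; `latest_snapshot` is a hypothetical helper, and it assumes snapshot file names contain no newlines.

```shell
# Sketch only: print the newest file in a snapshot directory.
latest_snapshot() {
  # $1: snapshot directory (defaults to the path used in the restore command)
  local dir="${1:-/var/lib/rancher/k3s/server/db/snapshots}" newest
  # ls -t sorts by modification time, newest first
  newest=$(ls -t "$dir" 2>/dev/null | head -n 1)
  [ -n "$newest" ] && printf '%s/%s\n' "$dir" "$newest"
}

# Example:
# k3s server --cluster-reset --cluster-reset-restore-path="$(latest_snapshot)"
```

The function returns a non-zero status when the directory is empty or missing, so a restore script can stop before resetting the cluster.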

Automated Snapshot Config

# /etc/rancher/k3s/config.yaml
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 5
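
The schedule and retention together determine how far back a restore can reach. As a rough sketch, assuming scheduled snapshots fire on time and only scheduled snapshots count toward retention, the span between the oldest and newest retained snapshot is the interval times (retention - 1). `snapshot_window_hours` is a hypothetical helper.

```shell
# Sketch only: hours between the oldest and newest retained scheduled snapshot.
snapshot_window_hours() {
  # $1: hours between scheduled snapshots, $2: etcd-snapshot-retention value
  echo $(( $1 * ($2 - 1) ))
}

# Example for the config above (every 6 hours, keep 5):
# snapshot_window_hours 6 5   # prints 24
```

If a full day of history is not enough, raise the retention rather than lowering the frequency.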

V. Daily Maintenance

K3s Service Management

systemctl status k3s          # Server status
systemctl status k3s-agent     # Agent status
systemctl restart k3s          # Restart Server
systemctl restart k3s-agent    # Restart Agent
journalctl -u k3s -f           # Server logs
journalctl -u k3s-agent -f     # Agent logs

Certificate Management

# Check certificate expiry
for cert in /var/lib/rancher/k3s/server/tls/*.crt; do
  echo "=== $cert ==="; openssl x509 -in "$cert" -noout -enddate
done

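# K3s renews certificates that are expired or close to expiry (within about 90 days)
# each time the service starts, so a restart is enough in that window
systemctl restart k3s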
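
The expiry loop prints raw dates; for alerting it is easier to work with days remaining. The sketch below converts one `notAfter=` line into a day count. It assumes GNU `date` (for `-d`), and `cert_days_left` is a hypothetical helper name.

```shell
# Sketch only: whole days until a certificate's notAfter date (requires GNU date).
cert_days_left() {
  # $1: a "notAfter=..." line from `openssl x509 -noout -enddate`
  # $2: optional "now" in epoch seconds (defaults to the current time)
  local end_epoch now_epoch
  end_epoch=$(date -d "${1#notAfter=}" +%s) || return 1
  now_epoch="${2:-$(date +%s)}"
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Example:
# cert_days_left "$(openssl x509 -in /var/lib/rancher/k3s/server/tls/server-ca.crt -noout -enddate)"
```

A negative result means the certificate has already expired.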

Cluster Cleanup

# Uninstall K3s Server
/usr/local/bin/k3s-uninstall.sh

# Uninstall K3s Agent
/usr/local/bin/k3s-agent-uninstall.sh

VI. Troubleshooting

K3s Server Won't Start

Check service status: systemctl status k3s
View logs: journalctl -u k3s --no-pager -n 200
Common causes:
- Port in use (6443, 10250)
- Data directory permission issues
- etcd data corruption (restore from snapshot)
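
The port-in-use cause can be confirmed before digging through logs. The sketch below filters `ss -tln` output for the ports K3s needs; it assumes the usual `ss` column layout (state first, local address fourth), and `busy_ports` is a hypothetical helper.

```shell
# Sketch only: reads `ss -tln` output on stdin and prints each port given as
# an argument that already has a listener.
busy_ports() {
  local listing port
  listing=$(cat)
  for port in "$@"; do
    # Match the trailing ":<port>" of the local address column
    if printf '%s\n' "$listing" |
      awk -v p=":$port" '$1 == "LISTEN" && $4 ~ (p "$") { found = 1 } END { exit !found }'; then
      echo "$port"
    fi
  done
}

# Example, run while k3s is stopped:
# ss -tln | busy_ports 6443 10250
```

Any port printed while the k3s service is stopped belongs to another process, which `ss -tlnp` can then identify.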

Node NotReady

Check Agent service: systemctl status k3s-agent
Check node status from a Server node: k3s kubectl get nodes
Check kubelet logs: journalctl -u k3s-agent -n 100
Common causes:
- Server unreachable (network/firewall)
- Invalid Node Token
- Certificate expired

Pod Issues

Use K8s MCP: pods_list(namespace="all")
View events: events_list(namespace="all")
View logs: pods_logs(namespace, name)

Network Issues

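# Check Flannel (embedded in the k3s process, so there is no Flannel pod;
# inspect the VXLAN interface created by the default backend instead)
ip -d link show flannel.1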
# Check CoreDNS
k3s kubectl get pods -n kube-system | grep coredns
# Check Service CIDR and Pod CIDR
k3s kubectl cluster-info dump | grep -i cidr
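
Once the CIDRs are known, a quick membership test shows whether a Pod or Service IP falls in the expected range. K3s defaults to 10.42.0.0/16 for Pods and 10.43.0.0/16 for Services. The sketch below handles IPv4 only, and the helper names are illustrative.

```shell
# Sketch only: IPv4 CIDR membership using shell integer arithmetic.
ip_to_int() {
  # Convert a dotted-quad address to a 32-bit integer
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

ip_in_cidr() {
  # $1: IPv4 address, $2: CIDR such as 10.42.0.0/16
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits="${2#*/}"
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Example:
# ip_in_cidr 10.42.1.5 10.42.0.0/16 && echo "address is in the default Pod CIDR"
```

A Pod IP outside the cluster CIDR usually points to a host-network Pod or a CNI that was configured with a different range.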

VII. Key File Paths

Path	Description
`/etc/rancher/k3s/k3s.yaml`	kubeconfig
`/etc/rancher/k3s/config.yaml`	K3s config file
`/var/lib/rancher/k3s/`	Data directory
`/var/lib/rancher/k3s/server/node-token`	Node Token
`/var/lib/rancher/k3s/server/tls/`	TLS certificates
`/var/lib/rancher/k3s/server/db/`	Embedded DB (SQLite/etcd)
`/var/log/syslog` or `journalctl -u k3s`	K3s logs
`/usr/local/bin/k3s`	K3s binary
`/usr/local/bin/kubectl`	kubectl symlink
`/usr/local/bin/crictl`	crictl symlink