name: k3s-ops
description: |
  K3s lightweight Kubernetes cluster deployment and operations skill.
  Supports automated K3s cluster deployment (Server + Agent nodes), cluster upgrades, certificate management, backup and restore, troubleshooting, and daily maintenance.
  Trigger scenarios: K3s installation, K3s deployment, K3s upgrade, K3s maintenance, K3s failure, lightweight cluster, edge cluster, cluster initialization, k3s server, k3s agent, cluster setup, node join.
allowed-tools: linux_execute_command linux_get_system_info linux_get_service_status linux_service_control linux_upload_script linux_write_file linux_read_file pods_list nodes_list events_list
K3s Cluster Deployment and Operations Guide
Reference: https://github.com/k3s-io/k3s | https://docs.k3s.io
I. Automated K3s Cluster Deployment
Prerequisites
- Target host has SSH access configured (via Linux MCP)
- System requirements: Linux 64-bit (recommended Ubuntu 20.04+/CentOS 7+/RHEL 8+)
- Minimum 512 MB RAM (Server); 2 GB or more recommended
- Network connectivity between nodes; the Server node must expose TCP port 6443
Step 1: Environment Check
Use linux_get_system_info(host) to verify:
- OS and kernel version
- CPU/memory resources meet minimum requirements
- Network connectivity (ping between nodes)
- Firewall status
# Check system requirements
uname -a
free -h
df -h
# Check if port 6443 is in use
ss -tlnp | grep 6443
# Check firewall
systemctl status firewalld 2>/dev/null || ufw status 2>/dev/null
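The memory check above can be turned into a pass/fail test. This is a minimal sketch: `mem_meets_minimum` and `read_mem_total_kb` are hypothetical helpers (not part of K3s), and the threshold is whatever minimum you pass in, for example the 512 MB quoted in the prerequisites.

```shell
# Hypothetical helper (not part of K3s): succeed when total memory,
# given in kB, is at least the required minimum, given in MB.
mem_meets_minimum() {
  total_kb=$1
  min_mb=$2
  [ "$total_kb" -ge $((min_mb * 1024)) ]
}

# Read MemTotal (in kB) from /proc/meminfo; Linux only.
read_mem_total_kb() {
  awk '/^MemTotal:/ {print $2}' /proc/meminfo
}

# Example:
#   mem_meets_minimum "$(read_mem_total_kb)" 512 || echo "not enough RAM"
```

MemTotal is usually a little below the installed RAM because the kernel reserves some memory, so a threshold slightly under the nominal figure may be more practical.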
Step 2: Deploy K3s Server (Master Node)
Use linux_execute_command or linux_upload_script to run the install script:
# Install K3s Server
curl -sfL https://get.k3s.io | sh -s - server \
--write-kubeconfig-mode 644 \
--disable traefik \
--disable servicelb \
--tls-san <SERVER_IP_OR_DOMAIN>
Common install options:
| Option | Description | Example |
|---|---|---|
| --write-kubeconfig-mode 644 | kubeconfig file permissions | Allow non-root read |
| --disable traefik | Disable built-in Traefik | Use custom Ingress Controller |
| --disable servicelb | Disable built-in ServiceLB | Use MetalLB |
| --tls-san | API Server extra SAN | Domain or external IP |
| --data-dir | Data directory | Custom storage path |
| --cluster-init | Enable embedded etcd | HA mode |
| --flannel-backend=none | Disable Flannel | Use Calico/Cilium |
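The same options can also be kept in /etc/rancher/k3s/config.yaml rather than on the command line: each key is the flag name without the leading dashes, and repeatable flags become lists. The sketch below assumes a hypothetical function `render_k3s_config` that prints a config file equivalent to the Step 2 options; the function name and the example hostname are illustrative only.

```shell
# Hypothetical helper: print a K3s config file equivalent to
#   --write-kubeconfig-mode 644 --disable traefik --disable servicelb --tls-san <value>
render_k3s_config() {
  tls_san=$1
  cat <<EOF
write-kubeconfig-mode: "0644"
disable:
  - traefik
  - servicelb
tls-san:
  - "$tls_san"
EOF
}

# Example (as root, before running the installer):
#   mkdir -p /etc/rancher/k3s
#   render_k3s_config k3s.example.internal > /etc/rancher/k3s/config.yaml
```

Keeping the options in the config file means they do not have to be repeated every time the install script is re-run.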
Step 3: Get Node Token
cat /var/lib/rancher/k3s/server/node-token
Step 4: Deploy K3s Agent (Worker Node)
curl -sfL https://get.k3s.io | K3S_URL=https://<SERVER_IP>:6443 \
K3S_TOKEN=<NODE_TOKEN> sh -s - agent
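Before running the agent installer it can help to confirm that the placeholders were actually replaced. The following pre-flight check is a sketch only; `valid_join_params` is a hypothetical helper, not something the K3s installer provides.

```shell
# Hypothetical pre-flight check: succeed only when the server URL looks like
# https://<host>:<port>, the token is non-empty, and no <PLACEHOLDER> remains.
valid_join_params() {
  url=$1
  token=$2
  case "$url$token" in
    *"<"*|*">"*) return 1 ;;   # an unreplaced <PLACEHOLDER> was left in
  esac
  case $url in
    https://*:[0-9]*) ;;       # scheme and a numeric port are present
    *) return 1 ;;
  esac
  [ -n "$token" ]
}

# Example:
#   valid_join_params "$K3S_URL" "$K3S_TOKEN" || echo "fix K3S_URL / K3S_TOKEN first"
```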
Step 5: Verify Cluster
k3s kubectl get nodes
k3s kubectl get pods -A
k3s kubectl cluster-info
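For scripted verification, the node list can be reduced to a single number. A minimal sketch, assuming a hypothetical filter `count_not_ready` that reads the output of `kubectl get nodes --no-headers` (NAME in column 1, STATUS in column 2):

```shell
# Hypothetical filter: read `kubectl get nodes --no-headers` output on stdin
# and print how many nodes have a STATUS that does not start with "Ready".
# "Ready,SchedulingDisabled" (a cordoned node) is counted as Ready here.
count_not_ready() {
  awk '$2 !~ /^Ready/ {n++} END {print n + 0}'
}

# Example:
#   k3s kubectl get nodes --no-headers | count_not_ready
```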
II. High Availability Deployment
Embedded etcd Mode (3 Server Nodes)
# First Server node
curl -sfL https://get.k3s.io | sh -s - server \
--cluster-init \
--tls-san <VIP_OR_LB_IP>
# Get token
cat /var/lib/rancher/k3s/server/node-token
# Second and third Server nodes
curl -sfL https://get.k3s.io | K3S_TOKEN=<TOKEN> sh -s - server \
--server https://<FIRST_SERVER_IP>:6443 \
--tls-san <VIP_OR_LB_IP>
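Embedded etcd stays writable only while a majority of Server nodes is up, which is why this mode uses an odd number of servers. The arithmetic can be sketched as follows; the function names are hypothetical.

```shell
# Quorum for an etcd cluster of N members: floor(N / 2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while quorum is kept: floor((N - 1) / 2).
etcd_fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

# Example:
#   etcd_quorum 3            # 2
#   etcd_fault_tolerance 3   # 1
```

A fourth server raises the quorum to 3 without raising the fault tolerance above 1, so even member counts add cost but no resilience.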
III. Cluster Upgrade
Manual Upgrade
# Server node (upgrade one at a time; re-run with the same options used at install time, e.g. --disable traefik)
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=stable sh -
# Agent node
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=stable \
K3S_URL=https://<SERVER_IP>:6443 K3S_TOKEN=<TOKEN> sh -
Upgrade Notes
- Upgrade Server nodes first, then Agent nodes
- Upgrade one node at a time and verify that it is Ready before moving to the next
- In HA mode with embedded etcd, keep a majority of Server nodes available (2 of 3) so that etcd retains quorum
- Take an etcd snapshot before upgrading
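The ordering rule in these notes (Servers first, then Agents) can be applied to a simple inventory. This is a sketch under assumptions: the inventory format (one `name role` pair per line) and the function `upgrade_order` are hypothetical and not part of K3s.

```shell
# Hypothetical helper: read "name role" lines on stdin (role is "server" or
# "agent") and print node names in upgrade order: all servers, then all agents.
upgrade_order() {
  inventory=$(cat)
  printf '%s\n' "$inventory" | awk '$2 == "server" {print $1}'
  printf '%s\n' "$inventory" | awk '$2 == "agent" {print $1}'
}

# Example:
#   printf 'w1 agent\ns1 server\n' | upgrade_order
```

Each name in the output would then be upgraded in turn, waiting for the node to report Ready before continuing.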
IV. Backup and Restore
etcd Snapshot
# Manual snapshot (requires the embedded etcd datastore, e.g. a server started with --cluster-init)
k3s etcd-snapshot save --name pre-upgrade-$(date +%Y%m%d)
# List snapshots
k3s etcd-snapshot ls
# Restore from snapshot (run after stopping k3s)
systemctl stop k3s
k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<snapshot>
systemctl start k3s
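When restoring, the most recent snapshot is often the one wanted. A minimal sketch, assuming file modification time reflects snapshot age and that snapshot file names contain no newlines; `latest_snapshot` is a hypothetical helper.

```shell
# Hypothetical helper: print the path of the most recently modified file in a
# snapshot directory; return non-zero when the directory is empty or missing.
latest_snapshot() {
  dir=$1
  newest=$(ls -t "$dir" 2>/dev/null | head -n 1)
  [ -n "$newest" ] && printf '%s/%s\n' "$dir" "$newest"
}

# Example:
#   latest_snapshot /var/lib/rancher/k3s/server/db/snapshots
```

The printed path can be passed to --cluster-reset-restore-path in the restore command above.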
Automated Snapshot Config
# /etc/rancher/k3s/config.yaml
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 5
V. Daily Maintenance
K3s Service Management
systemctl status k3s # Server status
systemctl status k3s-agent # Agent status
systemctl restart k3s # Restart Server
systemctl restart k3s-agent # Restart Agent
journalctl -u k3s -f # Server logs
journalctl -u k3s-agent -f # Agent logs
Certificate Management
# Check certificate expiry
for cert in /var/lib/rancher/k3s/server/tls/*.crt; do
echo "=== $cert ==="; openssl x509 -in "$cert" -noout -enddate
done
# K3s renews certificates that are expired or close to expiry each time it starts, so a restart is sufficient
systemctl restart k3s
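To flag certificates that need attention instead of reading dates by eye, the `-checkend` option of `openssl x509` can be used. The helper name `cert_expiring_soon` and the 30-day threshold in the example are assumptions for illustration.

```shell
# Hypothetical helper: succeed (exit 0) when the certificate expires within
# the given number of days. An unreadable file is also reported as expiring,
# because openssl then exits non-zero.
cert_expiring_soon() {
  cert=$1
  days=$2
  # -checkend N exits 0 when the certificate is still valid N seconds from now.
  if openssl x509 -in "$cert" -noout -checkend $((days * 86400)) >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Example:
#   for cert in /var/lib/rancher/k3s/server/tls/*.crt; do
#     cert_expiring_soon "$cert" 30 && echo "expiring soon: $cert"
#   done
```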
Cluster Cleanup
# Uninstall K3s Server
/usr/local/bin/k3s-uninstall.sh
# Uninstall K3s Agent
/usr/local/bin/k3s-agent-uninstall.sh
VI. Troubleshooting
K3s Server Won't Start
- Check service status: systemctl status k3s
- View logs: journalctl -u k3s --no-pager -n 200
- Common causes:
  - Port in use (6443, 10250)
  - Data directory permission issues
  - etcd data corruption (restore from snapshot)
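The "port in use" cause can be checked mechanically from `ss -tln` output. This is a sketch: `port_listening` is a hypothetical filter that assumes the local address is the fourth column, as in the default `ss` layout.

```shell
# Hypothetical filter: read `ss -tln` output on stdin and succeed when some
# socket is listening on the given port. Matching on the ":<port>" suffix of
# the local address avoids confusing 443 with 6443.
port_listening() {
  port=$1
  awk -v p=":$port" '
    substr($4, length($4) - length(p) + 1) == p {found = 1}
    END {exit !found}
  '
}

# Example:
#   for p in 6443 10250; do
#     ss -tln | port_listening "$p" && echo "port $p is already in use"
#   done
```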
Node NotReady
- Check Agent service: systemctl status k3s-agent
- Check connectivity: k3s kubectl get nodes
- Check kubelet logs: journalctl -u k3s-agent -n 100
- Common causes:
  - Server unreachable (network/firewall)
  - Invalid Node Token
  - Certificate expired
Pod Issues
- Use K8s MCP: pods_list(namespace="all")
- View events: events_list(namespace="all")
- View logs: pods_logs(namespace, name)
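When working from a shell instead of the MCP tools, the pod list can be narrowed to the pods worth inspecting. A minimal sketch, assuming the column layout of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...); `unhealthy_pods` is a hypothetical filter.

```shell
# Hypothetical filter: read `kubectl get pods -A --no-headers` output on stdin
# and print "namespace/name status" for every pod that is neither Running
# nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" {print $1 "/" $2 " " $4}'
}

# Example:
#   k3s kubectl get pods -A --no-headers | unhealthy_pods
```

A pod can be Running without being ready, so the READY column is still worth a look for pods this filter lets through.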
Network Issues
# Check Flannel
k3s kubectl get pods -n kube-system | grep flannel
# Check CoreDNS
k3s kubectl get pods -n kube-system | grep coredns
# Check Service CIDR and Pod CIDR
k3s kubectl cluster-info dump | grep -i cidr
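Once the CIDRs are known, a quick membership test shows whether an address belongs to the pod or service range. K3s defaults to 10.42.0.0/16 for pods and 10.43.0.0/16 for services unless overridden with --cluster-cidr or --service-cidr. The helpers `ip_to_int` and `ip_in_cidr` below are hypothetical and handle IPv4 only.

```shell
# Hypothetical helper: convert a dotted IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Hypothetical helper: succeed when the IPv4 address lies inside the CIDR.
ip_in_cidr() {
  ip=$1
  net=${2%/*}        # network address, e.g. 10.42.0.0
  bits=${2#*/}       # prefix length, e.g. 16
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Example:
#   ip_in_cidr 10.42.3.7 10.42.0.0/16 && echo "pod network address"
```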
VII. Key File Paths
| Path | Description |
|---|---|
| /etc/rancher/k3s/k3s.yaml | kubeconfig |
| /etc/rancher/k3s/config.yaml | K3s config file |
| /var/lib/rancher/k3s/ | Data directory |
| /var/lib/rancher/k3s/server/node-token | Node Token |
| /var/lib/rancher/k3s/server/tls/ | TLS certificates |
| /var/lib/rancher/k3s/server/db/ | Embedded DB (SQLite/etcd) |
| /var/log/syslog or journalctl -u k3s | K3s logs |
| /usr/local/bin/k3s | K3s binary |
| /usr/local/bin/kubectl | kubectl symlink |
| /usr/local/bin/crictl | crictl symlink |