This section walks through etcd backup and restore in two environments: a single-node etcd installed by kubeadm, and a multi-node etcd cluster deployed from Kubernetes binaries. After taking a snapshot, we restore it into the /var/lib/etcd--restore directory rather than overwriting the default /var/lib/etcd directory. The steps are shown below:

Single-Node Backup and Restore

The single-node example uses a cluster installed with kubeadm:

Node name      Node IP
k8s-master01 192.168.4.101

The contents of /etc/kubernetes/manifests/etcd.yaml are as follows:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.4.101:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.4.101:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://192.168.4.101:2380
    - --initial-cluster=k8s-master01=https://192.168.4.101:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.4.101:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.4.101:2380
    - --name=k8s-master01
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.9-0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /health?exclude=NOSPACE&serializable=true
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: etcd
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /health?serializable=false
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priority: 2000001000
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
status: {}

Single-Node Backup

Backing up etcd requires three certificate files: the client certificate used to connect to etcd, its private key, and the trusted CA certificate.

Their exact locations can be found in etcd's static Pod manifest, /etc/kubernetes/manifests/etcd.yaml: the --listen-client-urls entry in command gives the listen addresses, and --cert-file, --key-file, and --trusted-ca-file give the certificate paths:

--listen-client-urls=https://127.0.0.1:2379,https://192.168.4.101:2379
--cert-file=/etc/kubernetes/pki/etcd/server.crt
--key-file=/etc/kubernetes/pki/etcd/server.key
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

Using the information gathered above, back up with etcdctl to /opt/etcd-snapshot.db:

ETCDCTL_API=3 etcdctl snapshot save /opt/etcd-snapshot.db \
  --endpoints="https://127.0.0.1:2379" \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

The final message Snapshot saved at /opt/etcd-snapshot.db indicates that the backup succeeded.
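If etcdctl is not installed on the host, the same backup can be taken with the etcdctl binary that ships inside the etcd image, via the static Pod. A minimal sketch, assuming the default kubeadm Pod name etcd-k8s-master01; since the snapshot is written under the /var/lib/etcd hostPath mount, it lands on the host at the same path:

kubectl -n kube-system exec etcd-k8s-master01 -- etcdctl \
  snapshot save /var/lib/etcd/etcd-snapshot.db \
  --endpoints="https://127.0.0.1:2379" \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key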

The status of the backup can be checked with the following command:

ETCDCTL_API=3 etcdctl snapshot status /opt/etcd-snapshot.db --write-out=table
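The output is a small table along these lines (the values below are illustrative, not from a real cluster):

+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| fe01cf57 |    10935 |       1573 |     4.2 MB |
+----------+----------+------------+------------+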

Single-Node Restore

First stop the kube-apiserver service. Since the cluster was set up with kubeadm, moving /etc/kubernetes/manifests/kube-apiserver.yaml out of the manifests directory is enough to stop it:

sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /opt/
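The kubelet notices the removed manifest and tears the static Pod down within a few seconds. One way to confirm it is gone, assuming a CRI runtime with crictl available on the node:

sudo crictl ps | grep kube-apiserver   # prints nothing once the Pod has stopped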

Once it has stopped, restore etcd. This example restores into the /var/lib/etcd--restore directory; the original /var/lib/etcd directory is left untouched, and etcd is later pointed at /var/lib/etcd--restore as its data directory:

ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-snapshot.db \
  --initial-cluster="k8s-master01=https://192.168.4.101:2380" \
  --initial-cluster-token=etcd-k8s-cluster \
  --initial-advertise-peer-urls=https://192.168.4.101:2380 \
  --name=k8s-master01 \
  --data-dir=/var/lib/etcd--restore \
  --wal-dir=/var/lib/etcd--restore/wal

In earlier versions this step may have required --skip-hash-check to skip the integrity hash verification.

Pay attention to where the wal directory ends up inside /var/lib/etcd--restore. If --wal-dir= is omitted, the WAL is restored to etcd's default location (member/wal inside the data directory); check whether that matches the WAL layout of the original /var/lib/etcd deployment. If it does not, pass --wal-dir= during the restore to place the WAL where your configuration expects it.
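After the restore above, the resulting layout can be checked directly. With the --wal-dir flag used here the WAL sits beside member/; omitting the flag would instead leave it at member/wal inside the data directory:

ls /var/lib/etcd--restore
# member  wal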

For a kubeadm installation, update the path of the etcd-data hostPath volume in /etc/kubernetes/manifests/etcd.yaml:

volumes:
  - hostPath:
      path: /var/lib/etcd--restore
      type: DirectoryOrCreate
    name: etcd-data

That is, change the original path: /var/lib/etcd to path: /var/lib/etcd--restore. After the change, the kubelet automatically recreates the etcd static Pod.
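Once the Pod is running again, etcd's health can be verified from the host with the same certificates used for the backup:

ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints="https://127.0.0.1:2379" \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key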

Finally, start kube-apiserver again:

sudo mv /opt/kube-apiserver.yaml /etc/kubernetes/manifests/
sudo systemctl restart kubelet

After a short wait, you can check the environment with kubectl.
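For instance, a quick sanity check:

kubectl get nodes
kubectl get pods -n kube-system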

Multi-Node Backup and Restore

The multi-node example uses a cluster deployed from binaries, with a three-node etcd:

Node name      Node IP
k8s-master01 192.168.4.11
k8s-master02 192.168.4.12
k8s-master03 192.168.4.13

The example /etc/etcd/etcd.config.yml configuration file on k8s-master01 is as follows:

name: 'k8s-master01'
data-dir: /var/lib/etcd
wal-dir: /var/lib/etcd/wal
snapshot-count: 5000
heartbeat-interval: 100
election-timeout: 1000
quota-backend-bytes: 0
listen-peer-urls: 'https://192.168.4.11:2380'
listen-client-urls: 'https://192.168.4.11:2379,http://127.0.0.1:2379'
max-snapshots: 3
max-wals: 5
cors:
initial-advertise-peer-urls: 'https://192.168.4.11:2380'
advertise-client-urls: 'https://192.168.4.11:2379'
discovery:
discovery-fallback: 'proxy'
discovery-proxy:
discovery-srv:
initial-cluster: 'k8s-master01=https://192.168.4.11:2380,k8s-master02=https://192.168.4.12:2380,k8s-master03=https://192.168.4.13:2380'
initial-cluster-token: 'etcd-k8s-cluster'
initial-cluster-state: 'new'
strict-reconfig-check: false
enable-v2: true
enable-pprof: true
proxy: 'off'
proxy-failure-wait: 5000
proxy-refresh-interval: 30000
proxy-dial-timeout: 1000
proxy-write-timeout: 5000
proxy-read-timeout: 0
client-transport-security:
  cert-file: '/etc/kubernetes/pki/etcd/etcd.pem'
  key-file: '/etc/kubernetes/pki/etcd/etcd-key.pem'
  client-cert-auth: true
  trusted-ca-file: '/etc/kubernetes/pki/etcd/etcd-ca.pem'
  auto-tls: true
peer-transport-security:
  cert-file: '/etc/kubernetes/pki/etcd/etcd.pem'
  key-file: '/etc/kubernetes/pki/etcd/etcd-key.pem'
  peer-client-cert-auth: true
  trusted-ca-file: '/etc/kubernetes/pki/etcd/etcd-ca.pem'
  auto-tls: true
debug: false
log-package-levels:
log-outputs: [default]
force-new-cluster: false

Multi-Node Backup

Multi-node backup works the same way as the single-node case: backing up one etcd member is enough. This example runs on master01:

ETCDCTL_API=3 etcdctl snapshot save /opt/etcd-snapshot.db \
  --endpoints="https://192.168.4.11:2379" \
  --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem \
  --cert=/etc/kubernetes/pki/etcd/etcd.pem \
  --key=/etc/kubernetes/pki/etcd/etcd-key.pem

Check the snapshot details:

ETCDCTL_API=3 etcdctl snapshot status /opt/etcd-snapshot.db --write-out=table

Multi-Node Restore

Stop the kube-apiserver service on all master nodes:

systemctl stop kube-apiserver

Stop etcd on every node in the cluster:

systemctl stop etcd
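Both stop commands have to be run on every master. If passwordless SSH between the masters is set up (an assumption here), a small loop saves the round trips:

for node in k8s-master01 k8s-master02 k8s-master03; do
  ssh "$node" systemctl stop kube-apiserver
done
for node in k8s-master01 k8s-master02 k8s-master03; do
  ssh "$node" systemctl stop etcd
done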

Copy the snapshot file taken earlier to the master02 and master03 nodes:

scp /opt/etcd-snapshot.db k8s-master02:/opt
scp /opt/etcd-snapshot.db k8s-master03:/opt
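Optionally, compare checksums to make sure the copies arrived intact:

sha256sum /opt/etcd-snapshot.db
ssh k8s-master02 sha256sum /opt/etcd-snapshot.db
ssh k8s-master03 sha256sum /opt/etcd-snapshot.db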

Run on master01:

ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-snapshot.db \
  --initial-cluster="k8s-master01=https://192.168.4.11:2380,k8s-master02=https://192.168.4.12:2380,k8s-master03=https://192.168.4.13:2380" \
  --initial-cluster-token=etcd-k8s-cluster \
  --initial-advertise-peer-urls=https://192.168.4.11:2380 \
  --name=k8s-master01 \
  --data-dir=/var/lib/etcd--restore  \
  --wal-dir=/var/lib/etcd--restore/wal

Run on master02:

ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-snapshot.db \
  --initial-cluster="k8s-master01=https://192.168.4.11:2380,k8s-master02=https://192.168.4.12:2380,k8s-master03=https://192.168.4.13:2380" \
  --initial-cluster-token=etcd-k8s-cluster \
  --initial-advertise-peer-urls=https://192.168.4.12:2380 \
  --name=k8s-master02 \
  --data-dir=/var/lib/etcd--restore  \
  --wal-dir=/var/lib/etcd--restore/wal

Run on master03:

ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-snapshot.db \
  --initial-cluster="k8s-master01=https://192.168.4.11:2380,k8s-master02=https://192.168.4.12:2380,k8s-master03=https://192.168.4.13:2380" \
  --initial-cluster-token=etcd-k8s-cluster \
  --initial-advertise-peer-urls=https://192.168.4.13:2380 \
  --name=k8s-master03 \
  --data-dir=/var/lib/etcd--restore  \
  --wal-dir=/var/lib/etcd--restore/wal

On every node, update data-dir and wal-dir in /etc/etcd/etcd.config.yml to the new locations:

data-dir: /var/lib/etcd--restore
wal-dir: /var/lib/etcd--restore/wal
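A sketch of making this edit with sed, run on each node; because the new paths keep the old prefix, one substitution rewrites both data-dir and wal-dir (verify the file afterwards):

sed -i 's|/var/lib/etcd|/var/lib/etcd--restore|g' /etc/etcd/etcd.config.yml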

Start etcd on every node in the cluster:

systemctl start etcd
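Before bringing kube-apiserver back, it is worth confirming that all three members report healthy:

ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints="https://192.168.4.11:2379,https://192.168.4.12:2379,https://192.168.4.13:2379" \
  --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem \
  --cert=/etc/kubernetes/pki/etcd/etcd.pem \
  --key=/etc/kubernetes/pki/etcd/etcd-key.pem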

Start the kube-apiserver service on all master nodes:

systemctl start kube-apiserver

Once kube-apiserver is running on all masters, you can check the environment with kubectl.