Hands-On: Backing Up and Restoring an etcd Cluster in a Kubernetes Environment

DevOps云学堂
Published 2023-08-22 08:46:25
Included in the column: DevOps持续集成
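In this article we back up the etcd cluster, a core component of a Kubernetes cluster, and then restore that backup in a Kubernetes cluster with one control-plane node and one worker node. The steps and the verification of the result follow.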

Step1 Install the etcd client

Install the etcd command-line client (etcdctl) to manage the etcd cluster. Here it is installed on an Ubuntu system.

apt install etcd-client

Step2 Create an Nginx deployment

We create an nginx deployment with several replicas. These replicas will later be used to verify that the etcd data was restored.

kubectl create deployment nginx --image=nginx --replicas=5

Verify that the newly deployed Pods are in the Running state:

controlplane $ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
nginx-77b4fdf86c-6m8gl   1/1     Running   0          50s
nginx-77b4fdf86c-bfcsr   1/1     Running   0          50s
nginx-77b4fdf86c-bqmqk   1/1     Running   0          50s
nginx-77b4fdf86c-nkh7j   1/1     Running   0          50s
nginx-77b4fdf86c-x946x   1/1     Running   0          50s

Step3 Back up the etcd cluster

Create a directory for the etcd backup (mkdir etcd-backup), then run the following command to take the snapshot.

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
                      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                      --cert=/etc/kubernetes/pki/etcd/server.crt \
                      --key=/etc/kubernetes/pki/etcd/server.key \
snapshot save ./etcd-backup/etcdbackup.db
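
If backups are taken repeatedly, it can help to write each snapshot to a timestamped file instead of overwriting etcdbackup.db. The following is a minimal sketch under that assumption: it reuses the endpoint and certificate paths from the command above, while BACKUP_DIR, snapshot_path and take_snapshot are hypothetical names chosen for illustration and are not part of etcdctl.

```shell
#!/bin/sh
# Sketch: take a timestamped etcd snapshot.
# BACKUP_DIR, snapshot_path and take_snapshot are illustrative names, not etcdctl features.
BACKUP_DIR="${BACKUP_DIR:-./etcd-backup}"

# Print a snapshot file path of the form <BACKUP_DIR>/etcdbackup-<date>-<time>.db
snapshot_path() {
    printf '%s/etcdbackup-%s.db\n' "$BACKUP_DIR" "$(date +%Y%m%d-%H%M%S)"
}

# Create the backup directory, save a snapshot into it, and print the file path.
take_snapshot() {
    target="$(snapshot_path)"
    mkdir -p "$BACKUP_DIR" || return 1
    # Certificate paths are the kubeadm defaults used earlier in this article.
    ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        snapshot save "$target" || return 1
    echo "$target"
}

# Usage: take_snapshot
```

Calling take_snapshot on the control-plane node prints the path of the new snapshot file, which can then be passed to the verification step.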

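Note that you do not need to memorize the certificate paths in the command above; you can read them from the etcd Pod running in the kube-system namespace. First list the Pods in that namespace: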

controlplane $ kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS      AGE
calico-kube-controllers-784cc4bcb7-xk6q7   1/1     Running   4             38d
canal-9nszc                                2/2     Running   0             42m
canal-brzd7                                2/2     Running   0             42m
coredns-5d769bfcf4-5mwkn                   1/1     Running   0             38d
coredns-5d769bfcf4-w4xs7                   1/1     Running   0             38d
etcd-controlplane                          1/1     Running   0             38d
kube-apiserver-controlplane                1/1     Running   2             38d
kube-controller-manager-controlplane       1/1     Running   3 (41m ago)   38d
kube-proxy-5b8sx                           1/1     Running   0             38d
kube-proxy-5qlc5                           1/1     Running   0             38d
kube-scheduler-controlplane                1/1     Running   3 (41m ago)   38d

Now run get pods -o yaml to see the container command of the etcd Pod.

kubectl get pods etcd-controlplane -o yaml -n kube-system

The output contains the etcd command line, from which all certificate paths can be read:

containers:
  - command:
    - etcd
    - --advertise-client-urls=https://172.30.1.2:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://172.30.1.2:2380
    - --initial-cluster=controlplane=https://172.30.1.2:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://172.30.1.2:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://172.30.1.2:2380
    - --name=controlplane
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
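
Rather than copying the paths by hand, the values etcdctl needs can be extracted from that YAML. This is a minimal sketch, assuming each flag appears on its own list line as in the output above; etcd_flag is a hypothetical helper, not a kubectl or etcdctl feature.

```shell
#!/bin/sh
# Sketch: read the value of one etcd flag from YAML supplied on stdin.
# etcd_flag is an illustrative helper name.
etcd_flag() {
    # Match list lines such as "    - --cert-file=/path" and print the first value.
    # The pattern is anchored, so "cert-file" does not match "--peer-cert-file".
    sed -n "s/^[[:space:]]*-[[:space:]]*--$1=\(.*\)\$/\1/p" | head -n 1
}

# Usage (Pod name taken from the listing above):
#   POD_YAML="$(kubectl get pods etcd-controlplane -o yaml -n kube-system)"
#   CACERT="$(printf '%s\n' "$POD_YAML" | etcd_flag trusted-ca-file)"
#   CERT="$(printf '%s\n' "$POD_YAML" | etcd_flag cert-file)"
#   KEY="$(printf '%s\n' "$POD_YAML" | etcd_flag key-file)"
```

The three variables map to the --cacert, --cert and --key options of the backup command.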

Step4 Verify the backup data

Run the following command to show the hash, revision, total number of keys and size of the new snapshot:

controlplane $ ETCDCTL_API=3 etcdctl --write-out=table snapshot status ./etcd-backup/etcdbackup.db 
+---------+----------+------------+------------+
|  HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+---------+----------+------------+------------+
| cb4c04c |     4567 |       1346 |     6.0 MB |
+---------+----------+------------+------------+
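
A script can use this output to refuse a snapshot that looks empty. The sketch below parses the four-column table layout shown above and fails when TOTAL KEYS is missing or zero; snapshot_total_keys and check_snapshot are hypothetical helpers, and the parsing assumes exactly that column order.

```shell
#!/bin/sh
# Sketch: sanity-check "etcdctl --write-out=table snapshot status" output.
# Both function names are illustrative.

# Print the TOTAL KEYS value (fourth '|'-separated field of the data row) from stdin.
snapshot_total_keys() {
    awk -F'|' 'NF >= 5 && $4 ~ /^ *[0-9]+ *$/ { gsub(/ /, "", $4); print $4; exit }'
}

# Succeed only when the table on stdin reports at least one key.
check_snapshot() {
    keys="$(snapshot_total_keys)"
    [ -n "$keys" ] && [ "$keys" -gt 0 ]
}

# Usage:
#   ETCDCTL_API=3 etcdctl --write-out=table snapshot status ./etcd-backup/etcdbackup.db \
#       | check_snapshot || echo "snapshot looks empty or unreadable" >&2
```

This only checks that the file is readable and non-empty; it does not prove that any particular object is in the snapshot.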

Step5 Restore the backup to the cluster

Here we delete the nginx deployment created earlier and then restore the backup, which should bring the nginx deployment back.

A. Delete the nginx deployment

controlplane $ kubectl delete deploy nginx
deployment.apps "nginx" deleted

B. Restore the data from the backup

ETCDCTL_API=3 etcdctl snapshot restore etcd-backup/etcdbackup.db

This creates a folder named default.etcd in the current directory. While restoring the backup you may run into an error like the following:

controlplane $ ETCDCTL_API=3 etcdctl snapshot restore etcd-backup/etcdbackup.db 
Error:  expected sha256 [253 81 3 207 182 43 249 52 218 166 71 135 221 106 6 216 216 21 183 250 36 126 187 251 171 98 91 69 113 40 229 2], got [63 25 34 167 139 91 18 135 249 179 157 115 214 138 237 35 161 237 175 12 61 31 141 130 204 146 143 177 132 241 193 15]

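To get past this, add the --skip-hash-check=true flag to the restore command above; the default.etcd folder should then be created in the current path. Note that this flag skips the integrity check of the snapshot, so use it only for a snapshot file you trust.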

controlplane $ ETCDCTL_API=3 etcdctl snapshot restore etcd-backup/etcdbackup.db --skip-hash-check=true
2023-06-28 15:35:36.180956 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
controlplane $ ls
default.etcd  etcd-backup  filesystem

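C. Now we need to stop all running Kubernetes control-plane components so that the etcd data can be replaced. The manifest files of these components live in the /etc/kubernetes/manifests/ folder. We temporarily move these files out of that path, and the kubelet automatically removes the corresponding Pods.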

controlplane $ ls /etc/kubernetes/manifests/
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

controlplane $ kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-784cc4bcb7-xk6q7   1/1     Running   4          38d
canal-5lxjg                                2/2     Running   0          28m
canal-zv77t                                2/2     Running   0          28m
coredns-5d769bfcf4-5mwkn                   1/1     Running   0          38d
coredns-5d769bfcf4-w4xs7                   1/1     Running   0          38d
etcd-controlplane                          1/1     Running   0          38d
kube-apiserver-controlplane                1/1     Running   2          38d
kube-controller-manager-controlplane       1/1     Running   2          38d
kube-proxy-5b8sx                           1/1     Running   0          38d
kube-proxy-5qlc5                           1/1     Running   0          38d
kube-scheduler-controlplane                1/1     Running   2          38d

controlplane $ mkdir temp_yaml_files

controlplane $ mv /etc/kubernetes/manifests/* temp_yaml_files/

controlplane $ kubectl get pods -n kube-system
The connection to the server 172.30.1.2:6443 was refused - did you specify the right host or port?

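As shown above, once the files are moved out of the manifest path, the api-server Pod is terminated and the cluster can no longer be reached. You can use crictl ps to check whether the containers of these components have been stopped or are still running. Before the files are moved, the containers are running: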

controlplane $ crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                      ATTEMPT             POD ID              POD
6a2bce359c15b       6f6e73fa8162b       3 seconds ago       Running             kube-apiserver            0                   fe1be6aa651dd       kube-apiserver-controlplane
a26534b2e6244       c6b5118178229       4 seconds ago       Running             kube-controller-manager   0                   38fb48a4ebb62       kube-controller-manager-controlplane
58ac164968ec3       86b6af7dd652c       4 seconds ago       Running             etcd                      0                   170af0e603a02       etcd-controlplane
e98ef4185206b       6468fa8f98696       4 seconds ago       Running             kube-scheduler            0                   0bd26fd661a2c       kube-scheduler-controlplane
7a03436be6ce6       f9c3c1813269c       23 seconds ago      Running             calico-kube-controllers   7                   6da32eed5e939       calico-kube-controllers-784cc4bcb7-xk6q7
1edf2a857f1d4       e6ea68648f0cd       31 minutes ago      Running             kube-flannel              0                   3dac4c0c5960d       canal-5lxjg
e249d3e4b2b51       75392e3500e36       31 minutes ago      Running             calico-node               0                   3dac4c0c5960d       canal-5lxjg
039999604ba8c       ead0a4a53df89       5 weeks ago         Running             coredns                   0                   f8b31a08b4907       coredns-5d769bfcf4-5mwkn
26d7a0bc1b1b9       1780fa6665ff0       5 weeks ago         Running             local-path-provisioner    0                   1913e8d9cb757       local-path-provisioner-bf548cc96-fchvw
c86359e6bf649       fbe39e5d66b6a       5 weeks ago         Running   

Once the files are moved, the containers are terminated.

controlplane $ mv /etc/kubernetes/manifests/* temp_yaml_files/
controlplane $ crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                      ATTEMPT             POD ID              POD
7a03436be6ce6       f9c3c1813269c       2 minutes ago       Running             calico-kube-controllers   7                   6da32eed5e939       calico-kube-controllers-784cc4bcb7-xk6q7
1edf2a857f1d4       e6ea68648f0cd       34 minutes ago      Running             kube-flannel              0                   3dac4c0c5960d       canal-5lxjg
e249d3e4b2b51       75392e3500e36       34 minutes ago      Running             calico-node               0                   3dac4c0c5960d       canal-5lxjg
039999604ba8c       ead0a4a53df89       5 weeks ago         Running             coredns                   0                   f8b31a08b4907       coredns-5d769bfcf4-5mwkn
26d7a0bc1b1b9       1780fa6665ff0       5 weeks ago         Running             local-path-provisioner    0                   1913e8d9cb757       local-path-provisioner-bf548cc96-fchvw
c86359e6bf649       fbe39e5d66b6a       5 weeks ago         Running             kube-proxy                0                   d69f1cd083173       kube-proxy-5b8sx
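
The kubelet needs a moment to stop the static Pods, so a script should wait until the containers are gone before touching the data directory. This is a minimal sketch that polls crictl ps for the four component names seen in the listing above; wait_for_control_plane_stop and its default timeout of 60 seconds are illustrative choices, not part of crictl.

```shell
#!/bin/sh
# Sketch: block until no control-plane container is reported as running.
# wait_for_control_plane_stop is an illustrative helper name.
wait_for_control_plane_stop() {
    timeout="${1:-60}"   # seconds to wait before giving up
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        running=""
        for name in etcd kube-apiserver kube-controller-manager kube-scheduler; do
            # -q prints only container IDs; empty output means nothing matched.
            ids="$(crictl ps --name "$name" -q 2>/dev/null)"
            if [ -n "$ids" ]; then
                running="$running $name"
            fi
        done
        if [ -z "$running" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "still running:$running" >&2
    return 1
}

# Usage: wait_for_control_plane_stop 120 || exit 1
```

Note that --name is matched as a pattern against container names, so a node that runs other containers with "etcd" in their name would need a stricter filter.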

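D. Now that the control-plane components (api-server, controller-manager, kube-scheduler, and etcd itself) are stopped, we move the data from the default.etcd folder into the etcd data directory. The path comes from Step3, where the command of the etcd Pod sets --data-dir=/var/lib/etcd.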

controlplane $ cd default.etcd/
controlplane $ ls
member
controlplane $ ls /var/lib/etcd
member

We first keep the existing data by renaming /var/lib/etcd/member to /var/lib/etcd/member.bak, and then move the restored member folder from default.etcd into /var/lib/etcd/.

controlplane $ cd default.etcd/
controlplane $ ls
member
controlplane $ mv /var/lib/etcd/member/ /var/lib/etcd/member.bak
controlplane $ mv  member/ /var/lib/etcd/
controlplane $ ls /var/lib/etcd
member  member.bak
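
The two mv commands can be wrapped with a few safety checks so that a second run does not overwrite the saved copy. The following is a sketch only; swap_member_dir is a hypothetical helper that takes the restore folder and the etcd data directory as arguments.

```shell
#!/bin/sh
# Sketch: keep the old member directory as member.bak and move the restored one into place.
# swap_member_dir is an illustrative helper name.
swap_member_dir() {
    restored="$1"   # folder created by the restore, e.g. ./default.etcd
    data_dir="$2"   # etcd data directory, e.g. /var/lib/etcd
    if [ ! -d "$restored/member" ]; then
        echo "no member directory in $restored" >&2
        return 1
    fi
    if [ -e "$data_dir/member.bak" ]; then
        echo "$data_dir/member.bak already exists, refusing to overwrite" >&2
        return 1
    fi
    # Preserve the current data before replacing it.
    if [ -d "$data_dir/member" ]; then
        mv "$data_dir/member" "$data_dir/member.bak" || return 1
    fi
    mv "$restored/member" "$data_dir/"
}

# Usage: swap_member_dir ./default.etcd /var/lib/etcd
```

Keeping member.bak makes it possible to roll back by reversing the two moves if the restored data turns out to be wrong.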

E. Now that the data has been restored, we stop the kubelet service and move the YAML files back into the manifests folder.

controlplane $ systemctl stop kubelet
controlplane $ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: inactive (dead) since Wed 2023-06-28 16:03:32 UTC; 6s ago
       Docs: https://kubernetes.io/docs/home/
    Process: 25011 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, stat>
   Main PID: 25011 (code=exited, status=0/SUCCESS)

Jun 28 16:03:30 controlplane kubelet[25011]: E0628 16:03:30.524978   25011 controller.go:146] "Failed to ensure lease exists, will retry" err="Get \"htt>
Jun 28 16:03:31 controlplane kubelet[25011]: I0628 16:03:31.195933   25011 status_manager.go:809] "Failed to get status for pod" podUID=4ad6dc12-6828-45>
Jun 28 16:03:31 controlplane kubelet[25011]: E0628 16:03:31.196843   25011 mirror_client.go:138] "Failed deleting a mirror pod" err="Delete \"https://17>
Jun 28 16:03:31 controlplane kubelet[25011]: E0628 16:03:31.197110   25011 mirror_client.go:138] "Failed deleting a mirror pod" err="Delete \"https://17>
Jun 28 16:03:31 controlplane kubelet[25011]: E0628 16:03:31.197392   25011 mirror_client.go:138] "Failed deleting a mirror pod" err="Delete \"https://17>
Jun 28 16:03:31 controlplane kubelet[25011]: E0628 16:03:31.197721   25011 mirror_client.go:138] "Failed deleting a mirror pod" err="Delete \"https://17>
Jun 28 16:03:32 controlplane systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Jun 28 16:03:32 controlplane kubelet[25011]: I0628 16:03:32.098579   25011 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bun>
Jun 28 16:03:32 controlplane systemd[1]: kubelet.service: Succeeded.
Jun 28 16:03:32 controlplane systemd[1]: Stopped kubelet: The Kubernetes Node Agent.

controlplane $ mv temp_yaml_files/* /etc/kubernetes/manifests/
controlplane $ ls /etc/kubernetes/manifests/
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

Once the files are back in place, we start the kubelet service so that it picks up the manifests and deploys the components again.

controlplane $ systemctl start kubelet
controlplane $ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: active (running) since Wed 2023-06-28 16:05:56 UTC; 3s ago
       Docs: https://kubernetes.io/docs/home/
   Main PID: 60741 (kubelet)
      Tasks: 9 (limit: 2339)
     Memory: 70.5M
     CGroup: /system.slice/kubelet.service
             └─60741 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/>

Jun 28 16:05:57 controlplane kubelet[60741]: W0628 16:05:57.729886   60741 reflector.go:533] vendor/k8s.io/client-go/informers/factory.go:150: failed to>
Jun 28 16:05:57 controlplane kubelet[60741]: E0628 16:05:57.729952   60741 reflector.go:148] vendor/k8s.io/client-go/informers/factory.go:150: Failed to>
Jun 28 16:05:57 controlplane kubelet[60741]: W0628 16:05:57.831598   60741 reflector.go:533] vendor/k8s.io/client-go/informers/factory.go:150: failed to>
Jun 28 16:05:57 controlplane kubelet[60741]: E0628 16:05:57.832204   60741 reflector.go:148] vendor/k8s.io/client-go/informers/factory.go:150: Failed to>
Jun 28 16:05:58 controlplane kubelet[60741]: W0628 16:05:58.130322   60741 reflector.go:533] vendor/k8s.io/client-go/informers/factory.go:150: failed to>
Jun 28 16:05:58 controlplane kubelet[60741]: E0628 16:05:58.130397   60741 reflector.go:148] vendor/k8s.io/client-go/informers/factory.go:150: Failed to>
Jun 28 16:05:58 controlplane kubelet[60741]: E0628 16:05:58.274435   60741 controller.go:146] "Failed to ensure lease exists, will retry" err="Get \"htt>
Jun 28 16:05:58 controlplane kubelet[60741]: I0628 16:05:58.360755   60741 kubelet_node_status.go:70] "Attempting to register node" node="controlplane"
Jun 28 16:05:58 controlplane kubelet[60741]: E0628 16:05:58.361160   60741 kubelet_node_status.go:92] "Unable to register node with API server" err="Pos>
Jun 28 16:05:59 controlplane kubelet[60741]: I0628 16:05:59.962674   60741 kubelet_node_status.go:70] "Attempting to register node" node="controlplane"

You can see that the containers are running again. It may take a few minutes before kubectl commands work.

crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                      ATTEMPT             POD ID              POD
688cfa2890b4f       f9c3c1813269c       23 seconds ago      Running             calico-kube-controllers   12                  6da32eed5e939       calico-kube-controllers-784cc4bcb7-xk6q7
db1797e3e2e83       6468fa8f98696       28 seconds ago      Running             kube-scheduler            0                   307a1600b4346       kube-scheduler-controlplane
1dc176c2a599e       c6b5118178229       28 seconds ago      Running             kube-controller-manager   0                   f9efc6c4c8d91       kube-controller-manager-controlplane
f70e2103ec1e0       6f6e73fa8162b       29 seconds ago      Running             kube-apiserver            0                   32f49c141ea69       kube-apiserver-controlplane
2e274f5176656       86b6af7dd652c       29 seconds ago      Running             etcd                      0                   9c561113f9fcd       etcd-controlplane
1edf2a857f1d4       e6ea68648f0cd       47 minutes ago      Running             kube-flannel              0                   3dac4c0c5960d       canal-5lxjg
e249d3e4b2b51       75392e3500e36       47 minutes ago      Running             calico-node               0                   3dac4c0c5960d       canal-5lxjg
039999604ba8c       ead0a4a53df89       5 weeks ago         Running             coredns                   0                   f8b31a08b4907       coredns-5d769bfcf4-5mwkn
26d7a0bc1b1b9       1780fa6665ff0       5 weeks ago         Running             local-path-provisioner    0                   1913e8d9cb757       local-path-provisioner-bf548cc96-fchvw
c86359e6bf649       fbe39e5d66b6a       5 weeks ago         Running 
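
Instead of retrying kubectl by hand during those minutes, a script can poll the API server's readiness endpoint. This is a minimal sketch, assuming kubectl is already configured for this cluster; wait_for_apiserver, the 300-second default timeout and the 5-second interval are illustrative choices.

```shell
#!/bin/sh
# Sketch: poll the API server until it reports ready or the timeout expires.
# wait_for_apiserver is an illustrative helper name.
wait_for_apiserver() {
    timeout="${1:-300}"   # total seconds to wait
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # /readyz returns success once the API server is ready to serve requests.
        if kubectl get --raw=/readyz >/dev/null 2>&1; then
            return 0
        fi
        sleep 5
        elapsed=$((elapsed + 5))
    done
    return 1
}

# Usage: wait_for_apiserver 300 && kubectl get pods
```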

You can now run get pods to verify that the nginx deployment, which we deleted after taking the backup, has been restored:

controlplane $ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
nginx-77b4fdf86c-8n7kg   1/1     Running   0          40m
nginx-77b4fdf86c-gmbjm   1/1     Running   0          40m
nginx-77b4fdf86c-pjpnr   1/1     Running   0          40m
nginx-77b4fdf86c-qjxmd   1/1     Running   0          40m
nginx-77b4fdf86c-zhvnv   1/1     Running   0          40m

Congratulations! You have now successfully restored the etcd data.

Originally published on 2023-08-02 on the DevOps云学堂 WeChat official account.
