How to Restore Cluster from ETCD Backup on New Master Nodes in Case of Loss of All Existing Master Nodes
Problem
How can an etcd backup be restored on new master nodes when all existing master nodes in the cluster have been lost?
Environment
- Platform9 Managed Kubernetes - All Versions
- Docker or Containerd
- ETCD
Procedure
Make sure you have access to the etcd backup file. On a working cluster, the default backup storage path is /etc/pf9/etcd-backup. The worker nodes must be unaffected and in a running state.
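The prerequisite above can be checked before any change is made to the cluster. The following is a minimal sketch, not part of the original procedure: the function name `check_etcd_backup` is hypothetical, and it only confirms that the given path is a readable, non-empty regular file. It does not validate the snapshot contents.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Platform9 tool): sanity-check an etcd backup file.
# Returns 0 if the path is a readable, non-empty regular file; 1 otherwise.
check_etcd_backup() {
  local backup_file="$1"
  if [ -z "$backup_file" ]; then
    echo "usage: check_etcd_backup <path-to-backup>" >&2
    return 1
  fi
  if [ ! -f "$backup_file" ] || [ ! -r "$backup_file" ]; then
    echo "backup not found or not readable: $backup_file" >&2
    return 1
  fi
  if [ ! -s "$backup_file" ]; then
    echo "backup is empty: $backup_file" >&2
    return 1
  fi
  echo "backup looks usable: $backup_file"
}
```

It could be called as `check_etcd_backup <path-to-ETCD_BACKUP.db>` on the host where the backup is stored.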
- Use Force Remove in the Management Plane UI to deauthorize and remove the old master nodes from the Management Plane. Note: For PMK v5.3 and below, please contact Platform9 Support to Force Remove the nodes from the Management Plane.
- Create new master nodes with the same IP addresses and hostnames as the removed master nodes.
- Onboard the nodes to the Management Plane, either with the Platform9 CLI or by downloading and installing the Platform9 HostAgent manually.
- Once the nodes come online, first add the same active master (leader) to the cluster using the API below.
Obtain TOKEN and PROJECT_ID by following the steps in Keystone Identity. NODE_UUID and CLUSTER_ID can be found in the Management Plane UI by enabling the UUID column.
API
# curl -k -X POST -H "X-Auth-Token: $TOKEN" -H "Content-type: application/json" -d '[ { "uuid": "<NODE_UUID>", "isMaster": 1 }]' "https://<DU-FQDN>/qbert/v3/<project_id>/clusters/<cluster_id>/attach"
Example
# TOKEN=gAAAAABhnQ4ykxkfDvEd5QyhZaqTvXjiysB2xt9hY7QgjpSi_mWdCUMEgbS0w-5DAOVLRTtZeHRrCql9-aWfOFidF-1mjx0yUeIuKBel2wWQ91VsOoPK4Knz9KU-59u73vqhWaD_RibU0-eReaCx7Et1OsJ6aybFcL9ug-Kp_BHTpOxrlqGIeCI
# curl -k -X POST -H "X-Auth-Token: $TOKEN" -H "Content-type: application/json" -d '[{ "uuid": "992e976b-0527-4359-bc28-00c697b368eb", "isMaster": 1}]' "https://ajohn.platform9.net/qbert/v3/9dc12eee0b794b9a8e8f8dc90a88f7ec/clusters/601a4510-262d-4551-bf05-395667708957/attach"
- Once the master node is attached, the cluster looks healthy with a single master node.
Example v1.19
# kubectl get nodes
NAME             STATUS   ROLES    AGE     VERSION
10.128.145.204   Ready    master   4m36s   v1.19.6
10.128.145.248   Ready    worker   4m15s   v1.19.6
Example v1.20
# kubectl get nodes
NAME             STATUS   ROLES    AGE     VERSION
10.128.146.162   Ready    master   6m10s   v1.20.11
- Copy the etcd backup to the Master node and copy etcdctl binary.
Docker
# docker cp etcd:/usr/local/bin/etcdctl /opt/pf9/pf9-kube/bin
# export PATH=$PATH:/opt/pf9/pf9-kube/bin
- Stop the PMK stack and move the current etcd directory to some other path on the Master node.
Command
# systemctl stop pf9-{hostagent,nodeletd}
# /opt/pf9/nodelet/nodeletd phases stop
# mkdir /tmp/etcd_dir_backup
# mv /var/opt/pf9/kube/etcd/* /tmp/etcd_dir_backup
- Perform etcd DB restore from the master node.
Docker
Containerd
# ETCDCTL_API=3 etcdctl snapshot restore <ETCD_BACKUP.db> --data-dir /var/opt/pf9/kube/etcd/data --initial-advertise-peer-urls="https://<MASTER_IP>:2380" --initial-cluster="<MASTER_NODE_UUID>=https://<MASTER_IP>:2380" --name="<MASTER_NODE_UUID>"
Example
# ETCDCTL_API=3 etcdctl snapshot restore /home/centos/etcd_backup.db --data-dir /var/opt/pf9/kube/etcd/data --initial-advertise-peer-urls="https://10.128.145.204:2380" --initial-cluster="992e976b-0527-4359-bc28-00c697b368eb=https://10.128.145.204:2380" --name="992e976b-0527-4359-bc28-00c697b368eb"
2021-11-23 16:08:28.229042 I | etcdserver/membership: added member 343efe1ee1440e81 [https://10.128.145.204:2380] to cluster 90c381d304967556
- Once restore is complete, start PMK stack.
Command
# systemctl start pf9-hostagent
- Check the etcd cluster health status post-restoration.
- For cluster running pf9-kube v1.19
Docker
# /opt/pf9/pf9-kube/bin/etcdctl --ca-file /etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.key cluster-health
# /opt/pf9/pf9-kube/bin/etcdctl --ca-file /etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.key member list
Sample
# /opt/pf9/pf9-kube/bin/etcdctl --ca-file /etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.key cluster-health
member 343efe1ee1440e81 is healthy: got healthy result from https://10.128.145.204:4001
cluster is healthy
# /opt/pf9/pf9-kube/bin/etcdctl --ca-file /etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key-file /etc/pf9/kube.d/certs/etcdctl/etcd/request.key member list
343efe1ee1440e81: name=992e976b-0527-4359-bc28-00c697b368eb peerURLs=https://10.128.145.204:2380 clientURLs=https://10.128.145.204:4001 isLeader=true
- For cluster running pf9-kube v1.20
Docker
Containerd
# ETCDCTL_API=3 /opt/pf9/pf9-kube/bin/etcdctl member list
# ETCDCTL_API=3 /opt/pf9/pf9-kube/bin/etcdctl endpoint health --write-out=table
# ETCDCTL_API=3 /opt/pf9/pf9-kube/bin/etcdctl endpoint status --cluster --write-out=table --endpoints=http://127.0.0.1:2379 --cacert="/etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt" --cert="/etc/pf9/kube.d/certs/etcdctl/etcd/request.crt" --key="/etc/pf9/kube.d/certs/etcdctl/etcd/request.key"
Example
# ETCDCTL_API=3 /opt/pf9/pf9-kube/bin/etcdctl member list
b60fda6661d77461 started 70ee9439-06e8-4ad3-8ae2-b2fd5360cbd6 https://10.128.146.162:2380 https://10.128.146.162:4001 false
# ETCDCTL_API=3 /opt/pf9/pf9-kube/bin/etcdctl endpoint health --write-out=table
+-----------------------------+--------+-------------+-------+
|          ENDPOINT           | HEALTH |    TOOK     | ERROR |
+-----------------------------+--------+-------------+-------+
| https://10.128.146.162:4001 |   true | 22.107927ms |       |
+-----------------------------+--------+-------------+-------+
- Delete the stale master nodes which are in NotReady state using kubectl.
Example
# kubectl get nodes
NAME             STATUS     ROLES    AGE    VERSION
10.128.144.242   NotReady   master   136m   v1.19.6
10.128.145.204   Ready      master   136m   v1.19.6
10.128.145.248   Ready      worker   136m   v1.19.6
10.128.145.253   NotReady   master   136m   v1.19.6
# kubectl delete node 10.128.144.242 10.128.145.253
node "10.128.144.242" deleted
node "10.128.145.253" deleted
- Scale up the master nodes one at a time from the Management Plane UI, verifying the etcd cluster health after each addition using the commands from Step 10 (the etcd health check above).
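The snapshot restore step in this procedure takes three node-specific values (backup file, master IP, and master node UUID), and the IP and UUID each appear more than once in the command. As a hedged illustration, the sketch below only prints the restore command so it can be reviewed before anyone runs it. The function name `build_restore_cmd` is hypothetical; the flags and the data directory are copied from the restore command shown in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) the snapshot restore command from this
# article for one master node, so the values can be reviewed before execution.
# Arguments: <backup-file> <master-ip> <master-node-uuid>
build_restore_cmd() {
  if [ "$#" -ne 3 ]; then
    echo "usage: build_restore_cmd <backup-file> <master-ip> <master-node-uuid>" >&2
    return 1
  fi
  local backup_file="$1" master_ip="$2" node_uuid="$3"
  # Peer URL uses port 2380, as in the article's restore command.
  local peer_url="https://${master_ip}:2380"
  printf 'ETCDCTL_API=3 etcdctl snapshot restore %s --data-dir /var/opt/pf9/kube/etcd/data --initial-advertise-peer-urls="%s" --initial-cluster="%s=%s" --name="%s"\n' \
    "$backup_file" "$peer_url" "$node_uuid" "$peer_url" "$node_uuid"
}
```

Printing rather than executing is a deliberate choice here: the restore should only run after the PMK stack has been stopped and the old etcd directory moved aside, as described in the steps above.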
Restoring from an etcd backup is a complicated process. If you are not familiar with it, we recommend involving Platform9 Support before initiating the restore operation.
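For the step that deletes stale master nodes, the node names can be picked out of the `kubectl get nodes` output instead of being copied by hand. This is a minimal sketch under stated assumptions: the function name `list_stale_masters` is hypothetical, and it assumes the column layout shown in the examples in this article (NAME, STATUS, ROLES) with a ROLES value containing `master`, as on v1.19 and v1.20.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin and
# print the names of nodes whose STATUS starts with NotReady and whose ROLES
# column contains "master". Column positions are assumed from this article.
list_stale_masters() {
  awk '$2 ~ /^NotReady/ && $3 ~ /master/ { print $1 }'
}
```

One possible use is `kubectl get nodes --no-headers | list_stale_masters`, reviewing the printed names before passing them to `kubectl delete node`.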