
Restore Certificates Manually for On-Prem Deployments [Internal Only]
Problem
In an on-prem Platform9 Managed Kubernetes (PMK) deployment, the management plane (DU) may be unavailable at a time when a cluster node needs to be rebooted or needs its PMK stack restarted. In that case, the PMK stack fails to start on the node at the certificate generation step (gen_certs), because the node cannot reach the management plane.
Environment
- Platform9 Edge Cloud - v5.3 LTS Patch 10 and Lower
Procedure
Pre-conditions:
- The management plane VM is shut down or offline.
- The cluster node was initializing while the management plane was unavailable and therefore failed to start the PMK stack (nodeletd phases) at the gen_certs step.
Steps:
- On the cluster node, stop the pf9-hostagent and pf9-nodeletd services.
    # systemctl stop pf9-hostagent pf9-nodeletd
- Stop the PMK stack on the cluster node.
    # /opt/pf9/nodelet/nodeletd phases stop
- Restore the certificates from backup by using the script provided below (in Additional Information).
- Start nodeletd phases on the cluster node.
    # /opt/pf9/nodelet/nodeletd phases start
- Start pf9-hostagent service on the cluster node.
    # systemctl start pf9-hostagent
Additional Information
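The manual steps in the procedure can also be chained in a small wrapper so that they run in the documented order and stop at the first failure. The following is a minimal sketch, not a supported tool: the `run` and `restore_sequence` helpers, the `DRY_RUN` switch, and the `./restore_certs.sh` path (wherever the restore script from this article was saved) are assumptions made for illustration. Only the `systemctl` and `nodeletd phases` commands come from the procedure itself.

```shell
#!/usr/bin/env bash
# Sketch of the manual procedure as one sequence. Defaults to a dry run
# that only prints each command; set DRY_RUN=0 and run as root to execute.
DRY_RUN=${DRY_RUN:-1}
# Hypothetical location where the restore script from this article was saved.
RESTORE_SCRIPT=${RESTORE_SCRIPT:-./restore_certs.sh}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run the documented steps in order and stop at the first failure.
restore_sequence() {
    run systemctl stop pf9-hostagent pf9-nodeletd &&
    run /opt/pf9/nodelet/nodeletd phases stop &&
    run bash "$RESTORE_SCRIPT" &&
    run /opt/pf9/nodelet/nodeletd phases start &&
    run systemctl start pf9-hostagent
}

restore_sequence
```

With the default `DRY_RUN=1` the wrapper only prints the five commands, which is a cheap way to review the sequence before running it for real with `DRY_RUN=0`. The restore script still prompts for confirmation, so the wrapper is not fully unattended.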
Refer to Zendesk ticket 1352090.
Restore Certificate Script:
    #!/usr/bin/env bash

    CONFIG_DIR=${CONFIG_DIR:-/etc/pf9}
    source $CONFIG_DIR/kube.env
    if [ -e "$CONFIG_DIR/kube_override.env" ]; then
        source $CONFIG_DIR/kube_override.env
    fi

    KUBE_DIR=$CONFIG_DIR/kube.d
    cert_dir=$KUBE_DIR/certs

    # refuse to run while nodeletd is still active
    nodelet_running=`systemctl is-active pf9-nodeletd`
    if [ "$nodelet_running" == "active" ]; then
        echo "pf9-nodeletd is running. Please stop the service before running this script."
        echo "systemctl stop pf9-hostagent pf9-nodeletd"
        exit 1
    fi

    # never overwrite an existing certs directory
    if [ -d $cert_dir ]; then
        echo "Actual cert dir: $cert_dir already present. Exiting!"
        exit 1
    fi

    # ls -ltr lists oldest first, so the last match is the most recent backup
    cert_dir_name=`ls -ltr $KUBE_DIR | grep 'certs.teardown' | tail -1 | awk -F ' ' '{print $NF}' | xargs`
    if [ "x$cert_dir_name" == "x" ]; then
        echo "No certificate backup directories found"
        exit 1
    fi

    backup_cert_dir=$KUBE_DIR/$cert_dir_name
    # kubelet certs get created on both worker and master
    kubelet_ca_cert=$backup_cert_dir/kubelet/server/ca.crt
    backup_done_file=$backup_cert_dir/.done

    if ! [ -f $backup_done_file ]; then
        echo "Unable to determine if latest backup of certs were generated correctly"
        exit 1
    fi

    # the .done file must match the role and cluster ID loaded from kube.env
    exit_code=$(grep -q "CLUSTER_ROLE ${ROLE}" $backup_done_file; echo $?)
    if ! [ $exit_code -eq 0 ]; then
        echo "Unable to validate if certs were generated for current role"
        exit 1
    fi

    exit_code=$(grep -q "CLUSTER_ID ${CLUSTER_ID}" $backup_done_file; echo $?)
    if ! [ $exit_code -eq 0 ]; then
        echo "Unable to validate if certs were generated for current cluster"
        exit 1
    fi

    if ! [ -f $kubelet_ca_cert ]; then
        echo "Kubelet ca cert file missing"
        exit 1
    fi

    # check if cert end date is at least greater than now
    cert_date=`openssl x509 -noout -enddate -in $kubelet_ca_cert | awk -F '=' '{print $2}'`
    epoch_cert_date=`date -d "$cert_date" +%s`
    epoch_curr_date=`date +%s`
    if [ $epoch_cert_date -lt $epoch_curr_date ]; then
        echo "Cert end date is less than current date"
        exit 1
    fi

    echo "Validations complete."
    echo "Latest certificate to restore from: $backup_cert_dir"
    echo "Backed up certs expiring on: $cert_date"
    echo "Proceed with restore (y/n) [default: n]?"
    read choice
    if [ "x$choice" != "xy" ]; then
        echo "Exiting on user input."
        exit 1
    fi

    cp -pr $backup_cert_dir $cert_dir
    echo "Certificates restored successfully!"

Note: Existing certificates are backed up on the worker node automatically when the PMK stack is stopped.
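Before running the restore script, it can help to preview which backup it is meant to pick up. The sketch below is read-only and rests on the same assumptions the script makes: backups live under `/etc/pf9/kube.d` in directories whose names contain `certs.teardown`, and a usable backup has a `.done` file and a kubelet CA certificate. The helper names `latest_backup` and `backup_is_complete` are hypothetical and are not part of the PMK tooling.

```shell
#!/usr/bin/env bash
# Read-only preview helpers (hypothetical names, not shipped with PMK).
KUBE_DIR=${KUBE_DIR:-/etc/pf9/kube.d}

# Print the name of the most recently modified backup directory, if any.
latest_backup() {
    # ls -t sorts newest first; keep the first name containing certs.teardown.
    ls -t "$1" 2>/dev/null | grep 'certs.teardown' | head -n 1
}

# Succeed only if the backup has a .done marker and a kubelet CA certificate.
backup_is_complete() {
    [ -f "$1/.done" ] && [ -f "$1/kubelet/server/ca.crt" ]
}

name=$(latest_backup "$KUBE_DIR")
if [ -z "$name" ]; then
    echo "No certificate backup directories found in $KUBE_DIR"
elif backup_is_complete "$KUBE_DIR/$name"; then
    echo "Latest backup looks complete: $KUBE_DIR/$name"
else
    echo "Latest backup is incomplete: $KUBE_DIR/$name"
fi
```

This preview does not check the cluster role, the cluster ID, or the certificate expiry date; those validations remain the job of the restore script.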
Starting with 5.3 LTS Patch 11, the gen_certs phase can be skipped when the management plane is offline and the PMK stack is restarted on a node using nodeletd phases. This resolves the issue seen when the PMK stack is restarted explicitly with nodeletd phases restart while the management plane is offline: previously, the stop action ran the gen_certs phase, which removed the certificates, and the start action then failed to fetch them because the management plane was unavailable.
Note: After a node reboot, the pf9-nodeletd service also skips the stop and start functions of the gen_certs phase to avoid reaching out to the management plane.