Excessive Kubernetes Master Pod Restarts Due To ETCD Latency.
Problem
- One or more "k8s-master" pods (dependent on the number of master nodes) within the kube-system namespace of a Platform9 Managed Kubernetes cluster are showing an excessive number of restarts, e.g.
    ❯ kubectl get po -n kube-system k8s-master-172.17.0.14
    NAME                                                    READY   STATUS    RESTARTS   AGE
    k8s-master-fe9d1e3a-4c43-417b-9720-c2a3d0732d9d000003   3/3     Running   119        27d
ETCD logs:
    {"log":"{\"level\":\"warn\",\"ts\":\"2023-11-16T22:23:08.031Z\",\"caller\":\"etcdserver/util.go:163\",\"msg\":\"apply request took too long\",\"took\":\"9.069436957s\",\"expected-duration\":\"100ms\",\"prefix\":\"read-only range \",\"request\":\"key:\\\"/registry/horizontalpodautoscalers/\\\" range_end:\\\"/registry/horizontalpodautoscalers0\\\" limit:10000 \",\"response\":\"\",\"error\":\"etcdserver: request timed out\"}\n","stream":"stderr","time":"2023-11-16T22:23:08.031887945Z"}
Kube-controller log:
    {"log":"E1116 23:58:01.449568 1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get \"https://localhost:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n","stream":"stderr","time":"2023-11-16T23:58:01.450015023Z"}
Kube-apiserver log:
    {"log":"E1116 23:59:06.732861 1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:\"etcdserver: request timed out\"}: etcdserver: request timed out\n","stream":"stderr","time":"2023-11-16T23:59:06.742609594Z"}
Nodelet log:
    {"L":"INFO","T":"2023-11-16T17:21:19.642-0700","C":"command/command.go:120","M":"[2023-11-16 17:21:19] I1116 17:21:19.625769 3204532 request.go:1123] Response Body: {\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"etcdserver: request timed out\",\"code\":500}"}
Environment
- Platform9 Managed Kubernetes - v5.4 and above.
- ETCD.
Cause
- Etcd heartbeats are timing out, resulting in frequent leader elections.
- The kube-controller-manager and kube-scheduler container logs show etcd read timeouts due to the leader elections, resulting in the restart of these containers.
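The slow etcd requests behind these timeouts can be listed from a saved etcd container log. The sketch below is a minimal helper, assuming the log lines look like the ETCD sample in the Problem section: either raw etcd JSON, or the same JSON wrapped by the container runtime so that inner quotes appear as \". The function name slow_apply_durations and the file name etcd.log are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the "took" value of every
# "apply request took too long" warning read from stdin.
slow_apply_durations() {
  grep -F 'apply request took too long' |
    # Skip any mix of backslashes and quotes around the colon, then capture
    # everything up to the next backslash, quote, or comma.
    sed -n 's/.*took[\\"]*:[\\"]*\([^\\",]*\).*/\1/p'
}

# Example (etcd.log is an assumed name for a saved etcd container log):
#   slow_apply_durations < etcd.log | sort | uniq -c
```

Values far above the 100ms expected duration reported in the warning point at the etcd backend rather than at the API server or controller manager.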
Resolution
Identify whether the ETCD latency is caused by a slow or overloaded ETCD disk. There are two ways to test ETCD latency:
- Using the FIO tool: install fio and run the following commands on the master node:
    # sudo apt-get install fio
    # fio --version ## should be 3.5 or higher
    # sudo mkdir /var/opt/pf9/kube/etcd/data/test-data
    # sudo fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/opt/pf9/kube/etcd/data/test-data --size=22m --bs=2300 --name=hk-prod-test
- Using ETCD Perf: run the following commands on the master node:
    # cd /opt/pf9/pf9-kube/bin/
    # ./etcdctl check perf --load='l'
    # /opt/pf9/pf9-kube/bin/etcdctl --cacert=/etc/pf9/kube.d/certs/etcdctl/etcd/ca.crt --cert=/etc/pf9/kube.d/certs/etcdctl/etcd/request.crt --key=/etc/pf9/kube.d/certs/etcdctl/etcd/request.key check perf --load='l'
Make sure the hardware requirements in the official ETCD documentation are met to avoid ETCD latency issues, and make any disk-level changes it recommends.
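When reading the fio result, the figure that matters for etcd is the fdatasync latency, because etcd syncs its write-ahead log to disk before acknowledging writes. The sketch below is one possible way to extract the 99th percentile from fio's default text output and compare it with a budget. It assumes the "sync percentiles" block printed by fio 3.x. The function names are hypothetical, and the 10 ms budget comes from etcd's general disk guidance rather than from this article, so treat it as an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read fio text output on stdin and print the
# 99th-percentile fdatasync latency in milliseconds.
fio_fdatasync_p99_ms() {
  awk '
    /sync percentiles \(/ {
      unit = $0
      sub(/.*\(/, "", unit); sub(/\).*/, "", unit)  # nsec, usec or msec
      in_sync = 1
      next
    }
    # Only look at 99.00th after the sync block starts, so the clat/lat
    # percentile blocks printed earlier are ignored.
    in_sync && /99\.00th=\[/ {
      v = $0
      sub(/.*99\.00th=\[ */, "", v); sub(/\].*/, "", v)
      if (unit == "nsec") v /= 1000000
      else if (unit == "usec") v /= 1000
      printf "%.3f\n", v
      exit
    }
  '
}

# Succeeds when p99 ($1, in ms) is below the budget ($2, default 10 ms;
# the default is an assumption taken from etcd disk guidance).
p99_within_budget() {
  awk -v p="$1" -v b="${2:-10}" 'BEGIN { exit !(p + 0 < b + 0) }'
}

# Example: pipe the fio command from this article into the helper:
#   sudo fio ... --name=hk-prod-test | fio_fdatasync_p99_ms
```

A result above the budget suggests the etcd data directory sits on a disk that is too slow or is shared with other heavy writers.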
The default values of heartbeat-interval and election-timeout are 100ms and 1000ms, respectively.
For Azure, we've had to increase these values to 1000ms and 10000ms. These defaults are included in Platform9 Managed Kubernetes v4.1+.
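For reference, heartbeat-interval and election-timeout are etcd options given in milliseconds, set either as command-line flags or through the ETCD_HEARTBEAT_INTERVAL and ETCD_ELECTION_TIMEOUT environment variables. The sketch below only formats and sanity-checks a pair of values. The function names are hypothetical, and how Platform9 applies these settings on a node is not covered here. The bounds checked (election timeout at least 5 times the heartbeat interval and at most 50000 ms) are an assumption based on etcd's documented startup validation, so verify them against the etcd version in use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the etcd flags for a heartbeat ($1) and
# election timeout ($2), both in milliseconds.
etcd_timing_flags() {
  printf '%s\n' "--heartbeat-interval=$1 --election-timeout=$2"
}

# Hypothetical check: election timeout must be at least 5x the heartbeat
# and no more than 50000 ms (assumed etcd limits, see lead-in).
etcd_timing_ok() {
  [ "$2" -ge $(( $1 * 5 )) ] && [ "$2" -le 50000 ]
}

# Example with the Azure values from this article:
#   etcd_timing_ok 1000 10000 && etcd_timing_flags 1000 10000
```

Both the default pair (100, 1000) and the Azure pair (1000, 10000) keep the election timeout at ten times the heartbeat interval.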
Additional Information