From 4175af7f92ad7357b83ceb56f2a6d42a8243cd80 Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan"
Date: Mon, 29 Jul 2024 22:32:00 +0200
Subject: Update documentation & users

---
 docs/problems.txt | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

(limited to 'docs/problems.txt')

diff --git a/docs/problems.txt b/docs/problems.txt
index 3b652ec..49137aa 100644
--- a/docs/problems.txt
+++ b/docs/problems.txt
@@ -170,4 +170,33 @@ Orphaning / pod termination problems in the logs
 Scenario:
  * Reported on long running pods with persistent volumes (katrin, adai-db)
  * Also seems an unrelated set of the problems.


Evicted Pods
============
 Pods are evicted if the node running them becomes unavailable or does not have enough resources to run them.
 - It is possible to look up which resource is likely triggering the eviction with
    > oc describe node ipekatrin2.ipe.kit.edu
      Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----             ------  -----------------                 ------------------                ------                       -------
      OutOfDisk        False   Tue, 05 Apr 2022 03:24:54 +0200   Tue, 21 Dec 2021 19:09:33 +0100   KubeletHasSufficientDisk     kubelet has sufficient disk space available
      MemoryPressure   False   Tue, 05 Apr 2022 03:24:54 +0200   Tue, 21 Dec 2021 19:09:33 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
      DiskPressure     False   Tue, 05 Apr 2022 03:24:54 +0200   Mon, 04 Apr 2022 10:00:23 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
      Ready            True    Tue, 05 Apr 2022 03:24:54 +0200   Tue, 21 Dec 2021 19:09:43 +0100   KubeletReady                 kubelet is posting ready status
   The latest transition, 'DiskPressure', happened on Apr 04. So, the disk is likely the issue.

 - DiskPressure eviction
   * This might happen because the pod writes too much output to the logs (standard output). These logs are stored under '/var/lib/origin/openshift.local.volumes/pods/...'
     and, if they grow large, may use up all the space in the '/var' file system.
     OpenShift does not rotate these logs and has no other mechanism to prevent a large output from eventually
     causing space issues. So, pods have to rate-limit their output to stdout. Otherwise, we need to find the misbehaving pods which write too much...
   * Another problem is 'inode' pressure. This can be checked with 'df'; anything above 80% usage is definitely a sign of a problem:
      df -i
     The particular folders containing lots of inodes can be found with the following command:
      { find / -xdev -printf '%h\n' | sort | uniq -c | sort -k 1 -n; } 2>/dev/null
     Likely there will be some OpenShift-related volume logs in '/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/'.
     Particularly check the logs of cron jobs mounting volumes, e.g. the various 'adei' jobs. They can be cleaned with
      find /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/ -name '*.log' -delete

 - If a resource stays unavailable for a long time, the node becomes NotReady and all its pods are evicted. However, short-term problems caused by a pod itself likely cause the eviction of only this
   particular pod (once the pod is evicted, disk/memory space is reclaimed and its logs are deleted). So, it is possible to find the problematic pod by looking at which pod was evicted most frequently.
--
cgit v1.2.3
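 The last step above can be scripted. A minimal sketch, assuming the default column order of
 'oc get pods --all-namespaces --no-headers' (NAMESPACE NAME READY STATUS RESTARTS AGE); the
 helper name and the suffix-stripping pattern are illustrative, not part of any OpenShift tooling:

```shell
# Sketch: count evictions per workload. Keeps only pods whose STATUS
# column is "Evicted", strips the per-pod name suffix so repeated
# evictions of the same deployment/cron job are grouped together,
# then sorts by eviction count (most frequently evicted last).
count_evictions() {
    awk '$4 == "Evicted" { print $1 "/" $2 }' \
        | sed 's/-[0-9a-z]*$//' \
        | sort | uniq -c | sort -n
}

# Usage against a live cluster:
#   oc get pods --all-namespaces --no-headers | count_evictions
```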