From 4175af7f92ad7357b83ceb56f2a6d42a8243cd80 Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan"
Date: Mon, 29 Jul 2024 22:32:00 +0200
Subject: Update documentation & users

---
 docs/problems.txt | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

(limited to 'docs/problems.txt')

diff --git a/docs/problems.txt b/docs/problems.txt
index 3b652ec..49137aa 100644
--- a/docs/problems.txt
+++ b/docs/problems.txt
@@ -170,4 +170,33 @@ Orphaning / pod termination problems in the logs
 Scenario:
  * Reported on long running pods with persistent volumes (katrin, adai-db)
  * Also seems an unrelated set of the problems.


Evicted Pods
============
 Pods are evicted if the node running them becomes unavailable or does not have enough resources to run them.
 - It is possible to look up which resource is likely triggering the eviction with
    > oc describe node ipekatrin2.ipe.kit.edu
      Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----             ------  -----------------                 ------------------                ------                       -------
      OutOfDisk        False   Tue, 05 Apr 2022 03:24:54 +0200   Tue, 21 Dec 2021 19:09:33 +0100   KubeletHasSufficientDisk     kubelet has sufficient disk space available
      MemoryPressure   False   Tue, 05 Apr 2022 03:24:54 +0200   Tue, 21 Dec 2021 19:09:33 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
      DiskPressure     False   Tue, 05 Apr 2022 03:24:54 +0200   Mon, 04 Apr 2022 10:00:23 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
      Ready            True    Tue, 05 Apr 2022 03:24:54 +0200   Tue, 21 Dec 2021 19:09:43 +0100   KubeletReady                 kubelet is posting ready status
   The latest transition, 'DiskPressure', happened on Apr 04. So, the disk is likely the issue.

 - DiskPressure eviction
   * This might happen because the pod writes too much output to the logs (standard output). These logs are stored under '/var/lib/origin/openshift.local.volumes/pods/...'
     and, if they grow large, may use up all the space in the '/var' file system.
     OpenShift does not rotate these logs and has no other mechanism to prevent a large output from eventually
     causing space issues. So, pods have to rate-limit their output to stdout. Otherwise, we need to find the misbehaving pods which write too much...
   * Another problem is 'inode' pressure. This can be checked with 'df'; anything above 80% usage is definitely a sign of a problem:
      df -i
     The particular folders containing lots of inodes can be found with the following command:
      { find / -xdev -printf '%h\n' | sort | uniq -c | sort -k 1 -n; } 2>/dev/null
     Likely there will be some OpenShift-related volume logs in '/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/'.
     Particularly check the logs of cron jobs mounting volumes, e.g. the various 'adei' jobs. They can be cleaned with
      find /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/ -name '*.log' -delete

 - If a resource stays unavailable for a long time, the node becomes NotReady and all its pods are evicted. However, short-term problems caused by a pod itself likely cause the eviction of only this
   particular pod (once the pod is evicted, disk/memory space is reclaimed and its logs are deleted). So, it is possible to find the problematic pod by looking at which pod was evicted most frequently.
--
cgit v1.2.3
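 The last step above can be scripted. A minimal sketch, assuming the default column order of
 'oc get pods --all-namespaces --no-headers' (NAMESPACE NAME READY STATUS RESTARTS AGE); the
 helper name and the suffix-stripping pattern are illustrative, not part of any OpenShift tooling:

```shell
# Sketch: count evictions per workload. Keeps only pods whose STATUS
# column is "Evicted", strips the per-pod name suffix so repeated
# evictions of the same deployment/cron job are grouped together,
# then sorts by eviction count (most frequently evicted last).
count_evictions() {
    awk '$4 == "Evicted" { print $1 "/" $2 }' \
        | sed 's/-[0-9a-z]*$//' \
        | sort | uniq -c | sort -n
}

# Usage against a live cluster:
#   oc get pods --all-namespaces --no-headers | count_evictions
```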