From 18da6e4b5942f4fcaa9db3ba3bf1dfcd1857e9ea Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan" <csa@suren.me>
Date: Thu, 10 Jan 2019 06:43:26 +0100
Subject: Update troubleshooting documentation

---
 docs/problems.txt | 59 +++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 53 insertions(+), 6 deletions(-)

(limited to 'docs/problems.txt')
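
Note on the flock advice in the stuck-pod section added below: the recommendation
there is to poll with sleep instead of issuing a blocking 'flock'. A minimal Python
sketch of that idea follows. The helper names (acquire_lock_polling, release_lock),
the timeout and the polling interval are illustrative assumptions and are not part
of ADEI or of this patch. It also assumes that the GlusterFS mount honours
non-blocking flock requests, which has not been verified here.

```python
import errno
import fcntl
import os
import time


def acquire_lock_polling(path, timeout=30.0, interval=0.5):
    """Take an exclusive flock on `path` without blocking inside the syscall.

    Returns an open file descriptor holding the lock, or None if the lock
    could not be obtained before `timeout` seconds elapsed.
    (Hypothetical helper, for illustration only.)
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting, so the
            # process never sits in an uninterruptible lock wait.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError as exc:
            # EAGAIN/EACCES mean "locked by someone else"; anything else is a real error.
            if exc.errno not in (errno.EAGAIN, errno.EACCES):
                os.close(fd)
                raise
        if time.monotonic() >= deadline:
            os.close(fd)
            return None
        # The sleep happens in user-controlled code, so the process stays killable.
        time.sleep(interval)


def release_lock(fd):
    """Drop the lock and close the descriptor returned by acquire_lock_polling."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A caller would treat a None result as "lock is busy or stale" and either retry later
or report the problem, rather than hanging in state 'D'.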
diff --git a/docs/problems.txt b/docs/problems.txt
index 4be9dc7..fa88afe 100644
--- a/docs/problems.txt
+++ b/docs/problems.txt
@@ -17,6 +17,9 @@ Rogue network interfaces on OpenVSwitch bridge
   * As number of rogue interfaces grow, it start to have impact on performance. Operations with
   ovs slows down and at some point the pods schedulled to the affected node fail to start due to
   timeouts. This is indicated in 'oc describe' as: 'failed to create pod sandbox'
+  * Over time, new rogue interfaces are created faster and faster. At some point, this noticeably
+  slows down the system and causes pod failures (if many pods are re-scheduled in parallel) even
+  if only a few rogue interfaces are still present.
 
  Cause:
   * Unclear, but it seems periodic ADEI cron jobs causes the issue.
@@ -25,7 +28,7 @@ Rogue network interfaces on OpenVSwitch bridge
 
          
  Solutions:
-  * According to RedHat the temporal solution is to reboot affected node (not tested yet). The problem
+  * According to RedHat, the temporary solution is to reboot the affected node (it did not help in my case). The problem
   should go away, but may re-apper after a while. 
   * The simplest work-around is to just remove rogue interface. They will be re-created, but performance
   problems only starts after hundreds accumulate.
@@ -35,6 +38,54 @@ Rogue network interfaces on OpenVSwitch bridge
    * Cron job is installed which cleans rogue interfaces as they number hits 25.
 
 
+Hung pods
+=========
+ Pod processes may get stuck. Normally, such processes are detected by the 'liveness' probe and are
+ restarted by OpenShift if necessary. However, occasionally processes may get stuck in syscalls (such processes
+ are marked with 'D' in ps). These processes can't be killed even with SIGKILL, and OpenShift is not able
+ to terminate them, leaving the pods in 'Terminating' status indefinitely.
+ 
+ Problems:
+  * Pods stuck in 'Terminating' status prevent new replicas from starting. In the case of 'jobs', a large number
+  of 'Terminating' pods could overload the OpenShift controllers.
+
+ Cause:
+  * One reason is spurious locks on the GlusterFS file system. On CentOS 7, it is impossible to interrupt a
+  process waiting for a lock requested by a blocking 'flock' call. It gets stuck in a syscall and is indicated
+  by state 'D' in the ps output. Sometimes, GlusterFS may keep files locked even though the processes holding these
+  locks have already exited/crashed. I am not sure about the exact conditions under which this happens, but it seems, for
+  instance, that a crashed Docker daemon may cause this effect if some of the running containers were holding locks on
+  GFS at the moment of the crash.
+    - We can verify whether this is the case by checking if the process associated with the problematic pod is stuck in
+     state 'D' and by analyzing its backtrace (/proc/<pid>/stack).
+    
+
+ Solutions: 
+  * Avoid blocking flock on GlusterFS. Use polling with sleep instead. To release already stuck pods, we need
+  to find and destroy the problematic locks. GlusterFS allows debugging locks using 'statedump'; check the GlusterFS
+  documentation for details. There is also a mechanism to clear such locks, but it does not always work.
+  An alternative is to remove the locked files AND keep them removed for a while until all blocked 'flock' syscalls
+  are released.
+
+
+Hung MySQL connection
+=====================
+ Stale MySQL locks may prevent new clients from accessing certain tables in the MySQL database.
+ 
+ Problems:
+  * The problem may affect either only clients trying to obtain 'write' access or all usage patterns. In the first case,
+  it causes ADEI 'caching' threads to hang indefinitely, and 'maintain' threads are terminated after the specified timeout,
+  leaving administrative scripts unprocessed.
+  
+ Cause:
+  * For whatever reason, some crashed clients may leave their locks in place. I believe a crashed 'docker' daemon
+  is one possible cause. The problem can be found by executing 'SHOW PROCESSLIST'
+  on the MySQL server. More diagnostic possibilities are discussed in the MySQL notes.
+  
+ Solutions:
+  * Normally, restarting the MySQL pod should be enough.
+
+
 Orphaning / pod termination problems in the logs
 ================================================
  There is several classes of problems reported with unknow reprecursions in the system log. Currently, I
@@ -96,8 +147,4 @@ Orphaning / pod termination problems in the logs
   Scenario:
     * Reported on long running pods with persistent volumes (katrin, adai-db)
     * Also seems an unrelated set of the problems.
-
-
-
-
-
+    
\ No newline at end of file
-- 
cgit v1.2.3