Document another problem with lost IPs and exhausting of SDN IP range

author: Suren A. Chilingaryan <csa@suren.me> 2020-01-22 03:16:06 +0100
committer: Suren A. Chilingaryan <csa@suren.me> 2020-01-22 03:16:06 +0100
commit: 1e8153c2af051ce48d5aa08d3dbdc0d0970ea532 (patch)
tree: 7bb1441a87521aa8c3c5524f95fa645850a6826e
parent: e0b1b53f21095707af87a095934e971d788a90c7 (diff)
6 files changed, 65 insertions, 12 deletions
diff --git a/docs/logs.txt b/docs/logs.txt
index d33ef0a..0a3b269 100644
--- a/docs/logs.txt
+++ b/docs/logs.txt
@@ -11,6 +11,11 @@
     Despite the errror, the containers are actually killed and pods destroyed. However, this error likely triggers
     problem with rogue interfaces staying on the OpenVSwitch bridge.
 
+  - RunPodSandbox from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "kdb-server-testing-180-build_katrin" network: CNI request failed with status 400: 'failed to run IPAM for 4b56e403e2757d38dca67831ce09e10bc3b3f442b6699c20dcd89556763e2d5d: failed to run CNI IPAM ADD: no IP addresses available in network: openshift-sdn
+    CreatePodSandbox for pod "kdb-server-testing-180-build_katrin(65640902-3bd6-11ea-bbd6-0cc47adef0e6)" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "kdb-server-testing-180-build_katrin" network: CNI request failed with status 400: 'failed to run IPAM for 4b56e403e2757d38dca67831ce09e10bc3b3f442b6699c20dcd89556763e2d5d: failed to run CNI IPAM ADD: no IP addresses available in network: openshift-sdn
+     Indicates exhaustion of the IP range of the pod network on the node. This also seems to be triggered by problems with resource management, and
+     periodic manual clean-up is required.
+
  - containerd: unable to save f7c3e6c02cdbb951670bc7ff925ddd7efd75a3bb5ed60669d4b182e5337dec23:d5b9394468235f7c9caca8ad4d97e7064cc49cd59cadd155eceae84545dc472a starttime: read /proc/81994/stat: no such process
    containerd: f7c3e6c02cdbb951670bc7ff925ddd7efd75a3bb5ed60669d4b182e5337dec23:d5b9394468235f7c9caca8ad4d97e7064cc49cd59cadd155eceae84545dc472a (pid 81994) has become an orphan, killing it
     Seems a bug in docker 1.12* which is resolved in 1.13.0rc2. No side effects according to the issue.
diff --git a/docs/problems.txt b/docs/problems.txt
index 099193a..3b652ec 100644
--- a/docs/problems.txt
+++ b/docs/problems.txt
@@ -13,13 +13,14 @@ Client Connection
    box pops up.
 
 
-Rogue network interfaces on OpenVSwitch bridge
-==============================================
+Leaked resources after pod termination: Rogue network interfaces on OpenVSwitch bridge, unreclaimed IPs in pod-network, ...
+=======================================
  Sometimes OpenShift fails to clean-up after terminated pod properly. The actual reason is unclear, but
  severity of the problem is increased if extreme amount of images is presented in local Docker storage.
  Several thousands is defenitively intensifies this problem.
-  * The issue is discussed here:
+  * The issues are discussed here:
         https://bugzilla.redhat.com/show_bug.cgi?id=1518684
+        https://bugzilla.redhat.com/show_bug.cgi?id=1518912
   * And can be determined by looking into:
     ovs-vsctl show
 
@@ -30,6 +31,12 @@ Rogue network interfaces on OpenVSwitch bridge
   * With time, the new rogue interfaces are created faster and faster. At some point, it really
   slow downs system and causes pod failures (if many pods are re-scheduled in paralllel) even 
   if not so many rogue interfaces still present
+  * Furthermore, there is a limited range of IPs allocated for the pod-network at each node. Whether
+  it is caused by the lost bridges or is an unrelated resource-management problem in OpenShift,
+  these IPs also start to leak. As the number of leaked IPs increases, it takes longer for OpenShift
+  to find an IP which is still free and pod scheduling slows down further. At some point, the complete
+  range of IPs will get exhausted and pods will fail to start (after long waiting in Scheduling state)
+  on the affected node.
   * Even if not failed, it takes several minutes to schedule the pod on the affected nodes.
 
  Cause:
@@ -38,7 +45,6 @@ Rogue network interfaces on OpenVSwitch bridge
   * Could be related to 'container kill failed' problem explained in the section bellow.
      Cannot kill container ###: rpc error: code = 2 desc = no such process
 
-         
  Solutions:
   * According to RedHat the temporal solution is to reboot affected node (just temporarily reduces the rate how 
   often the new spurious interfaces appear, but not preventing the problem completely in my case). The problem
@@ -46,8 +52,10 @@ Rogue network interfaces on OpenVSwitch bridge
   * The simplest work-around is to just remove rogue interface. They will be re-created, but performance
   problems only starts after hundreds accumulate.
     ovs-vsctl del-port br0 <iface>
-  * It seems helpful to purge unused docker images to reduce the rate of interface apperance.
-  
+  * Similarly, the unused IPs could be cleaned in "/var/lib/cni/networks/openshift-sdn", just check if the docker
+  container referenced in each IP file is still running with "docker ps". Afterwards, the 'origin-node' service
+  should be restarted.
+  * It also seems helpful to purge unused docker images to reduce the rate of interface appearance.
   
  Status:
    * Cron job is installed which cleans rogue interfaces as they number hits 25.
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
index fd57150..1f52fe9 100644
--- a/docs/troubleshooting.txt
+++ b/docs/troubleshooting.txt
@@ -132,7 +132,7 @@ etcd (and general operability)
  certificate verification code which introduced in etcd 3.2. There are multiple bug repports on
  the issue.
  
-pods (failed pods, rogue namespaces, etc...)
+pods: very slow scheduling (normal start time in seconds range), failed pods, rogue namespaces, etc...
 ====
  - OpenShift has numerous problems with clean-up resources after the pods. The problems are more likely to happen on the 
  heavily loaded systems: cpu, io, interrputs, etc.
@@ -151,6 +151,8 @@ pods (failed pods, rogue namespaces, etc...)
     * Apart from overloaded nodes (max cpu%, io, interrupts), PLEG issues can be caused by 
         1. Excessive amount of resident docker images on the node (see bellow)
         2. This can cause and will be further amplified by the spurious interfaces on OpenVSwich (see bellow)
+        3. Another side effect is exhausting IPs in the pod network on the node as they are also not cleaned properly (see below).
+        As IPs get exhausted, scheduling penalties also rise and at some point pods will fail to schedule (but will be displayed as Ready)
         x. Nuanced issues between kubelet, docker,   logging, networking and so on, with remediation of the issue sometimes being brutal (restarting all nodes etc, depending on the case).
             https://github.com/kubernetes/kubernetes/issues/45419#issuecomment-496818225
     * The problem is not bound to CronJobs, but having regular scheduled jobs make it presence significantly more visible. 
@@ -194,6 +196,24 @@ pods (failed pods, rogue namespaces, etc...)
         https://bugzilla.redhat.com/show_bug.cgi?id=1518684
         https://bugzilla.redhat.com/show_bug.cgi?id=1518912
 
+ - Another related problem causing long delays in pod scheduling is indicated by the following lines 
+ "failed to run CNI IPAM ADD: no IP addresses available in network" in the system logs:
+        Jan 21 14:43:01 ipekatrin2 origin-node: E0121 14:43:01.066719   93115 remote_runtime.go:91] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "kdb-server-testing-180-build_katrin" network: CNI request failed with status 400: 'failed to run IPAM for 4b56e403e2757d38dca67831ce09e10bc3b3f442b6699c20dcd89556763e2d5d: failed to run CNI IPAM ADD: no IP addresses available in network: openshift-sdn
+        Jan 21 14:43:01 ipekatrin2 origin-node: E0121 14:43:01.068021   93115 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "kdb-server-testing-180-build_katrin(65640902-3bd6-11ea-bbd6-0cc47adef0e6)" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "kdb-server-testing-180-build_katrin" network: CNI request failed with status 400: 'failed to run IPAM for 4b56e403e2757d38dca67831ce09e10bc3b3f442b6699c20dcd89556763e2d5d: failed to run CNI IPAM ADD: no IP addresses available in network: openshift-sdn
+  * The problem is that OpenShift (due to "ovs" problems or something else) fails to clean up the network interfaces. You can check the currently
+    allocated IPs in:
+        /var/lib/cni/networks/openshift-sdn
+  * This can be cleaned (but better not from cron, as I don't know what happens if an IP is already assigned but the docker container is not
+    running yet). Anyway, it is better to first disable scheduling on the node (and maybe even evict all running pods):
+        oc adm manage-node <nodes> --schedulable=false
+        oc adm manage-node <nodes> --evacuate
+        for hash in $(tail -n +1 * | grep '^[A-Za-z0-9]*$' | cut -c 1-8); do if [ -z $(docker ps -a | grep $hash | awk '{print $1}') ]; then grep -ilr $hash ./; fi; done | xargs rm
+    After this, the origin-node should be restarted and scheduling could be re-enabled
+        systemctl restart origin-node
+        oc adm manage-node <nodes> --schedulable=true
+  * It doesn't seem to be directly triggered by lost ovs interfaces (more interfaces are lost than ips). So, it is not possible to release
+    IPs one by one.
+
  - After crashes / upgrades some pods may end up in 'Error' state. This is quite often happen to
     * kube-service-catalog/controller-manager
     * openshift-template-service-broker/api-server
@@ -322,7 +342,7 @@ MySQL
  the load).
         SHOW PROCESSLIST;
  The remedy is to restart slave MySQL with 'slave_parallel_workers=0', give it a time to go, and then
- restart back in the standard multithreading mode. This can be achieved by editing 'statefulset/mysql-slave-0'
+ restart back in the standard multithreading mode. This can be achieved by editing 'statefulset/mysql-slave'
  and setting environmental vairable 'MYSQL_SLAVE_WORKERS' to 0 and, then, back to original value (16 currently).
 
 - This could be not end of this. The execution of statments from the log could 'stuck' because of the some "bad"
@@ -344,12 +364,12 @@ MySQL
         SET @@SESSION.GTID_NEXT='4ab8feff-5272-11e8-9320-08002715584a:201840'
     This is the gtid of the next transaction.
  * So, the following commands should be executed on the slave MySQL server (see details, https://www.thegeekdiary.com/how-to-skip-a-transaction-on-mysql-replication-slave-when-gtids-are-enabled/)
-        SLAVE STOP;
+        STOP SLAVE;
         SET @@SESSION.GTID_NEXT='<found_gtid_of_next_transaction>';
         BEGIN;
         COMMIT;
         SET GTID_NEXT='AUTOMATIC';
-        SLAVE START;
+        START SLAVE;
  * It is also possible to review the stuck transaction on the slave mysql node. In the '/var/lib/mysql/data' run 
         mysqlbinlog --start-position=<Relay_Log_Pos> <Relay_Log_File>
     
diff --git a/docs/vision.txt b/docs/vision.txt
index 0be70ba..bf6de57 100644
--- a/docs/vision.txt
+++ b/docs/vision.txt
@@ -4,6 +4,7 @@ Ands v.2
     * This actually seems problematic in CentOS-8. Something, like 'rsync portage portage/.tmp' is EXREMELY slow (<1 MB/s). Just check eix-sync.
  - Integrate fast Ethernet and use conteiner native networking. OpenVSwitch is slow and causes problems.
  - Do not run pods on Master nodes, but Gluster and a few databases pods (MySQL) are OK (multiple reasons, especially mounting a lot of Gluster Volumes)
+    * Restrict all periodic jobs to a specific node: easy to re-install (non-master), fast SSD storage, ...?
  - Object Storage should be integrated, either Gluster Block is ready for production or we have to use Ceph as well
  - Automatic provisioning would be much better then handling volumes trough Ands. Basically, this will render Ands redundant. We can switch to Helm, etc.
    But, we need ability to easily understand which volume belong to which pod/namespace and automatically kill redundant volumes. 
diff --git a/roles/ands_monitor/templates/scripts/check_server_status.sh.j2 b/roles/ands_monitor/templates/scripts/check_server_status.sh.j2
index c2849f4..e49ec97 100755
--- a/roles/ands_monitor/templates/scripts/check_server_status.sh.j2
+++ b/roles/ands_monitor/templates/scripts/check_server_status.sh.j2
@@ -4,6 +4,8 @@ fs=`df -lm / | grep -vi Filesystem | sed -e 's/[[:space:]]\+/ /g' | cut -d ' ' -
 datafs=`df -lm /mnt/ands | grep -vi Filesystem | sed -e 's/[[:space:]]\+/ /g' | cut -d ' ' -f 4`
 mem=`free -g | grep "Mem" | sed -e 's/[[:space:]]\+/ /g' | cut -d ' ' -f 7`
 cpu=`uptime | sed -e "s/[[:space:]]/\n/g" -e s/,/./g | tail -n 1`
+max_cpu=$(cat /proc/cpuinfo | grep processor | tail -n 1 | cut -d ':' -f 2)
+cpu_usage=$(echo "100 * $cpu / ( $max_cpu + 1)" | bc) #"
 
 if [ $fs -le 8192 ]; then
     echo "Only $(($fs / 1024)) GB left in the root file system"
@@ -17,8 +19,8 @@ if [ $mem -le 16 ]; then
     echo "The system is starving on memory, $mem GB left free"
 fi
 
-if [ `echo "$cpu < 20" | bc` -eq 0 ]; then
-    echo "The system is starving on cpu, $cpu is load average for the last 15 min"
+if [ `echo "$cpu_usage < 80" | bc` -eq 0 ]; then
+    echo "The system is starving on cpu, $cpu ($cpu_usage%) is load average for the last 15 min"
 fi
 
 vol=$(/opt/MegaRAID/storcli/storcli64 /c0/v0 show | grep -P "^0/0" | grep "Optl" | wc -l)
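The CPU check changed above no longer compares the raw 15-minute load average against a fixed threshold, but normalises it by the number of logical CPUs first. A minimal sketch of that arithmetic follows. The helper name 'cpu_usage_percent' is hypothetical (not part of the commit), it uses awk instead of bc, so the result is rounded rather than truncated, and it takes the CPU count from 'nproc' instead of parsing /proc/cpuinfo.

```bash
#!/bin/bash

# cpu_usage_percent LOAD NCPU
# Hypothetical helper: express a load average as a percentage of NCPU
# logical CPUs, rounded to the nearest integer.
cpu_usage_percent() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.0f\n", 100 * load / ncpu }'
}

# A load average of 38.40 on a 48-CPU node is exactly at the 80% threshold.
cpu_usage_percent 38.40 48

# On a Linux host, the third field of /proc/loadavg is the 15-minute average.
if [ -r /proc/loadavg ] && command -v nproc > /dev/null; then
    cpu_usage_percent "$(cut -d ' ' -f 3 /proc/loadavg)" "$(nproc)"
fi
```

With this normalisation the same 80% threshold can be used on nodes with different core counts.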
diff --git a/roles/ands_monitor/templates/scripts/clean_sdn_ips.sh.j2 b/roles/ands_monitor/templates/scripts/clean_sdn_ips.sh.j2
new file mode 100755
index 0000000..c938121
--- /dev/null
+++ b/roles/ands_monitor/templates/scripts/clean_sdn_ips.sh.j2
@@ -0,0 +1,17 @@
+#! /bin/bash
+
+host=$(uname -n)
+
+# Check node is in the cluster and we have permissions to access OpenShift
+oc get node "$host" &> /dev/null
+[ $? -ne 0 ] && { echo "Can't query node $host, check cluster configuration and permissions"; exit 1; }
+
+oc adm manage-node "$host" --schedulable=false &> /dev/null
+[ $? -ne 0 ] && { echo "Failed to disable scheduling on the node $host"; exit 1; }
+
+for hash in $(find /var/lib/cni/networks/openshift-sdn/* -mmin +120 -print0 | xargs -0 -r tail -n +1 | grep '^[A-Za-z0-9]*$' | cut -c 1-8); do if [ -z "$(docker ps -a | grep "$hash" | awk '{print $1}')" ]; then grep -ilr "$hash" /var/lib/cni/networks/openshift-sdn/; fi; done | xargs -r rm
+
+systemctl restart origin-node
+
+oc adm manage-node "$host" --schedulable=true &> /dev/null
+[ $? -ne 0 ] && echo "Failed to re-enable scheduling on the node $host"
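
The one-liner in the script above mixes three steps: reading the container ID from each IP file, checking it against docker, and removing the file. A minimal sketch of the detection step alone is given here so that it can be tried on a scratch directory without touching docker or the node. The function name 'list_stale_leases', the sample IDs and the bookkeeping file name are illustrative only. It also assumes that the first line of each IP file holds the container ID, and it compares full IDs instead of the 8-character prefixes used above.

```bash
#!/bin/bash

# list_stale_leases DIR LIVE_IDS_FILE
# Print every file in DIR whose first line is a container ID that is not
# listed (one ID per line) in LIVE_IDS_FILE.
list_stale_leases() {
    local dir=$1 live_ids=$2 file id
    for file in "$dir"/*; do
        [ -f "$file" ] || continue
        # Assumption: the first line of a lease file is the container ID
        id=$(head -n 1 "$file" | tr -d '[:space:]')
        # Skip files whose content does not look like a container ID
        case "$id" in
            ''|*[!A-Za-z0-9]*) continue ;;
        esac
        # Exact, fixed-string match against the list of known containers
        grep -qxF -- "$id" "$live_ids" || echo "$file"
    done
}

# Demo on a scratch directory with made-up IDs
demo=$(mktemp -d)
ids=$(mktemp)
printf 'aaaa1111\n' > "$demo/10.130.0.2"          # container still known
printf 'bbbb2222\n' > "$demo/10.130.0.3"          # container gone: stale
printf '10.130.0.3\n' > "$demo/last_reserved_ip"  # not a container ID, skipped
printf 'aaaa1111\n' > "$ids"

list_stale_leases "$demo" "$ids"   # prints only the path of 10.130.0.3

rm -rf "$demo" "$ids"
```

On a real node the ID list could be produced with 'docker ps -a -q --no-trunc' and the printed files reviewed before removal. As in the script above, scheduling should be disabled first and 'origin-node' restarted afterwards.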