From 1e8153c2af051ce48d5aa08d3dbdc0d0970ea532 Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan"
Date: Wed, 22 Jan 2020 03:16:06 +0100
Subject: Document another problem with lost IPs and exhaustion of the SDN IP range

---
 docs/problems.txt | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/docs/problems.txt b/docs/problems.txt
index 099193a..3b652ec 100644
--- a/docs/problems.txt
+++ b/docs/problems.txt
@@ -13,13 +13,14 @@
    Client Connection box pops up.
 
 
-Rogue network interfaces on OpenVSwitch bridge
-==============================================
+Leaked resources after node termination: rogue network interfaces on OpenVSwitch bridge, unreclaimed IPs in pod-network, ...
+=======================================
  Sometimes OpenShift fails to clean-up after terminated pod properly. The actual reason is unclear, but
  severity of the problem is increased if extreme amount of images is presented in local Docker storage.
  Several thousands is defenitively intensifies this problem.
- * The issue is discussed here:
+ * The issues are discussed here:
     https://bugzilla.redhat.com/show_bug.cgi?id=1518684
+    https://bugzilla.redhat.com/show_bug.cgi?id=1518912
 
  * And can be determined by looking into:
     ovs-vsctl show
@@ -30,6 +31,12 @@ Rogue network interfaces on OpenVSwitch bridge
  * With time, the new rogue interfaces are created faster and faster. At some point, it really
  slow downs system and causes pod failures (if many pods are re-scheduled in paralllel) even
  if not so many rogue interfaces still present
+ * Furthermore, only a limited range of IPs is allocated for the pod-network on each node. Whether
+ it is caused by the lost bridges or by an unrelated resource-management problem in OpenShift,
+ these IPs also start to leak. As the number of leaked IPs grows, it takes OpenShift longer to
+ find an IP which is still free, and pod scheduling slows down further. At some point, the
+ complete range of IPs gets exhausted and pods fail to start on the affected node (after a
+ long wait in the Scheduling state).
  * Even if not failed, it takes several minutes to schedule the pod on the affected nodes.
 
  Cause:
@@ -38,7 +45,6 @@ Rogue network interfaces on OpenVSwitch bridge
  * Could be related to 'container kill failed' problem explained in the section bellow.
     Cannot kill container ###: rpc error: code = 2 desc = no such process
 
-
  Solutions:
  * According to RedHat the temporal solution is to reboot affected node (just temporarily reduces the
  rate how often the new spurious interfaces appear, but not preventing the problem completely in my case). The problem
@@ -46,8 +52,10 @@ Rogue network interfaces on OpenVSwitch bridge
  * The simplest work-around is to just remove rogue interface. They will be re-created, but performance
  problems only starts after hundreds accumulate.
     ovs-vsctl del-port br0
- * It seems helpful to purge unused docker images to reduce the rate of interface apperance.
-
+ * Similarly, unused IPs can be cleaned up in "/var/lib/cni/networks/openshift-sdn": check whether the docker
+ container referenced in each IP file is still running with "docker ps" and remove the files of the dead ones.
+ Afterwards, the 'origin-node' service should be restarted.
+ * It also seems helpful to purge unused docker images to reduce the rate at which new interfaces appear.
 
  Status:
  * Cron job is installed which cleans rogue interfaces as they number hits 25.
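
A minimal sketch of the IP clean-up described in the patch above, assuming the host-local IPAM layout
used by openshift-sdn: each file in /var/lib/cni/networks/openshift-sdn is named after an allocated IP
and contains the ID of the container that owns it. The script only prints the leaked entries; the 'rm'
is left commented out so the list can be reviewed first, and it deliberately compares against all
containers docker still knows about (not only running ones) to stay on the safe side.

    #!/bin/bash
    # List (and optionally remove) leaked pod-network IPs on one node.
    # Assumption: each file in the CNI directory is named after an IP and
    # holds the ID of the container that allocated it.
    CNI_DIR="/var/lib/cni/networks/openshift-sdn"

    # Full IDs of all containers docker still knows about (running or stopped)
    active_ids=$(docker ps -aq --no-trunc)

    for ip_file in "$CNI_DIR"/*.*.*.*; do
        [ -e "$ip_file" ] || continue
        container_id=$(head -n1 "$ip_file" | tr -d '[:space:]')
        if ! grep -qF "$container_id" <<< "$active_ids"; then
            echo "leaked IP $(basename "$ip_file") (container $container_id is gone)"
            # rm -f "$ip_file"    # uncomment once the printed list looks sane
        fi
    done

    # As the patch notes, restart the node service afterwards, e.g.:
    #   systemctl restart origin-node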
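
The cron job mentioned under 'Status' is not part of the patch; the following is a hypothetical sketch
of that kind of clean-up, assuming a "rogue" port is an entry still listed on br0 whose backing veth
device no longer exists in the kernel (the ports 'ovs-vsctl show' flags with "No such device"). The
25-port threshold comes from the text; the rest is an assumption, not the deployed script.

    #!/bin/bash
    # Delete stale ports from the br0 OpenVSwitch bridge once too many accumulate.
    THRESHOLD=25
    rogue=()

    for port in $(ovs-vsctl list-ports br0); do
        # vxlan0 and tun0 are openshift-sdn infrastructure ports, never touch them
        case "$port" in vxlan0|tun0) continue ;; esac
        # a port whose backing network device is gone is considered rogue
        if ! ip link show dev "$port" >/dev/null 2>&1; then
            rogue+=("$port")
        fi
    done

    echo "found ${#rogue[@]} rogue ports on br0"
    if [ "${#rogue[@]}" -ge "$THRESHOLD" ]; then
        for port in "${rogue[@]}"; do
            echo "removing rogue port: $port"
            ovs-vsctl --if-exists del-port br0 "$port"
        done
    fi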