Diffstat (limited to 'logs/2019.09.26/analysis.txt')
-rw-r--r-- | logs/2019.09.26/analysis.txt | 35 |
1 file changed, 35 insertions, 0 deletions
diff --git a/logs/2019.09.26/analysis.txt b/logs/2019.09.26/analysis.txt
new file mode 100644
index 0000000..26c123b
--- /dev/null
+++ b/logs/2019.09.26/analysis.txt
@@ -0,0 +1,35 @@
+Sep 24 13:34:18 ipekatrin2 kernel: Memory cgroup out of memory: Kill process 57372 (mongod) score 1984 or sacrifice child
+Sep 24 13:34:22 ipekatrin2 origin-node: I0924 13:34:22.704691 93115 kubelet.go:1921] SyncLoop (container unhealthy): "mongodb-2-6j5w7_services(b350130e-ac45-11e9-bbd6-0cc47adef0e6)"
+Sep 24 13:34:29 ipekatrin2 origin-node: I0924 13:34:29.774596 93115 kubelet.go:1888] SyncLoop (PLEG): "mongodb-2-6j5w7_services(b350130e-ac45-11e9-bbd6-0cc47adef0e6)", event: &pleg.PodLifecycleEvent{ID:"b350130e-ac45-11e9-bbd6-0cc47adef0e6", Type:"ContainerStarted", Data:"1d485a4dd86b8f7ff24649789eee000d55319ef64d9b447c532a43fadce2831e"}
+Sep 24 13:34:35 ipekatrin2 origin-node: I0924 13:34:35.177258 93115 roundrobin.go:310] LoadBalancerRR: Setting endpoints for services/mongodb:mongo to [10.130.0.91:27017]
+Sep 24 13:34:35 ipekatrin2 origin-node: I0924 13:34:35.177323 93115 roundrobin.go:240] Delete endpoint 10.130.0.91:27017 for service "services/mongodb:mongo"
+... Nothing about mongod on any node until the mass destruction ....
+====
+Sep 25 07:52:00 ipekatrin2 origin-node: I0925 07:52:00.422291 93115 kubelet.go:1796] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.448988393s ago; threshold is 3m0s]
+Sep 25 07:52:31 ipekatrin2 origin-master-controllers: I0925 07:52:31.761961 109653 nodecontroller.go:617] Node is NotReady. Adding Pods on Node ipekatrin2.ipe.kit.edu to eviction queue
+Sep 25 07:52:47 ipekatrin2 origin-master-controllers: I0925 07:52:47.584394 109653 controller_utils.go:89] Starting deletion of pod services/mongodb-2-6j5w7
+Sep 25 07:56:04 ipekatrin2 origin-node: ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111)
+Sep 25 08:07:41 ipekatrin2 systemd-logind: Failed to start session scope session-118144.scope: Connection timed out
+====
+Sep 26 08:53:19 ipekatrin2 origin-master-controllers: I0926 08:53:19.435468 109653 nodecontroller.go:644] Node is unresponsive. Adding Pods on Node ipekatrin3.ipe.kit.edu to eviction queues:
+Sep 26 08:54:09 ipekatrin3 kernel: glustertimer invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=-999
+Sep 26 08:54:27 ipekatrin3 kernel: Out of memory: Kill process 91288 (mysqld) score 1075 or sacrifice child
+Sep 26 08:54:14 ipekatrin2 etcd: lost the TCP streaming connection with peer 2696c5f68f35c672 (stream MsgApp v2 reader)
+Sep 26 08:55:02 ipekatrin2 etcd: established a TCP streaming connection with peer 2696c5f68f35c672 (stream MsgApp v2 writer)
+Sep 26 08:57:54 ipekatrin3 origin-node: ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111)
+
+Sep 26 09:34:20 ipekatrin2 origin-node: I0926 09:34:20.361306 93115 kubelet.go:1796] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 8m12.284528292s ago; threshold is 3m0s]
+
+
+0. ipekatrin1 (and to a lesser degree ipekatrin2) was affected by a huge number of images slowing down Docker communication.
+ Scheduling on ipekatrin1 was disabled for development purposes.
+1. On the 24th mongodb used more memory than allowed by the 'dc' configuration and was killed by the OpenShift/cgroup OOM killer (see the limits sketch below).
+2. For some reason, the service was not restarted, leaving rocketchat non-operational.
+3. On 25.09 7:52 ipekatrin2 became unhealthy and unschedulable, apparently due to PLEG timeouts.
+ * Pods migrated to ipekatrin3. The mass migration caused performance problems, including systemd (and mount) issues.
+ * The system recovered relatively quickly, but only a few pods were left running on ipekatrin2 and ipekatrin3 was severely overloaded.
+4. On 26.09 8:53 the system OOM killer was triggered on ipekatrin3 due to an overall lack of memory (see the eviction sketch below).
+ * The node was marked unhealthy and pod eviction was triggered.
+ * etcd problems were registered, causing real problems in the cluster fabric.
+5. On 26.09 9:34 PLEG recovered for some reason.
+ * Most of the pods were rescheduled automatically and the system eventually recovered on its own.
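+====
+Note on point 1: a minimal sketch of how such a memory limit typically looks inside the
+mongodb 'dc' (DeploymentConfig). The container name and values here are assumptions for
+illustration, not taken from the actual services/mongodb configuration; limits.memory is
+what becomes the memory cgroup limit whose violation produced the "Memory cgroup out of
+memory" kill on Sep 24.
+
+    spec:
+      template:
+        spec:
+          containers:
+          - name: mongodb          # assumed container name
+            resources:
+              requests:
+                memory: 512Mi      # assumed value
+              limits:
+                memory: 1Gi        # assumed value; enforced via the memory cgroup,
+                                   # exceeding it gets mongod OOM-killed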
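+
+Note on point 4: the system-wide OOM killer firing means ipekatrin3 ran out of memory
+before the kubelet evicted any pods. A hedged sketch of node-side eviction thresholds
+(OpenShift 3.x node-config.yaml, kubeletArguments); the values are assumptions, not the
+actual node configuration:
+
+    kubeletArguments:
+      eviction-hard:
+      - memory.available<500Mi     # assumed: evict pods before memory is fully exhausted
+      eviction-soft:
+      - memory.available<1Gi       # assumed soft threshold
+      eviction-soft-grace-period:
+      - memory.available=1m        # grace period required for the soft threshold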