The services has to be running ------------------------------ Etcd: - etcd Node: - origin-node Master nodes: - origin-master-api - origin-master-controllers - origin-master is not running Required Services: - lvm2-lvmetad.socket - lvm2-lvmetad.service - docker - it may happen that service is alive according to systemd, but does not respond ('docker ps' timeouts) - NetworkManager - firewalld - dnsmasq - openvswitch Extra Services: - ssh - ntp - openvpn - ganesha (on master nodes, optional) Pods has to be running ---------------------- Kubernetes System - Integration with public cloud resources as it seems - kube-service-catalog/apiserver - kube-service-catalog/controller-manager - this seems optional OpenShift Main Services - default/docker-registry - default/registry-console - default/router (3 replicas) - openshift-template-service-broker/api-server (daemonset, on all nodes) OpenShift Secondary Services - openshift-ansible-service-broker/asb - this is optional - openshift-ansible-service-broker/asb-etcd GlusterFS - glusterfs-storage (daemonset, on all storage nodes) - glusterblock-storage-provisioner-dc - heketi-storage Metrics (openshift-infra): - hawkular-cassandra - hawkular-metrics - heapster Debugging ========= - Ensure system consistency as explained in 'consistency.txt' (incomplete) - Check current pod logs and possibly logs for last failed instance oc logs --tail=100 [-p] - dc/name or ds/name as well - Verify initialization steps (check if all volumes are mounted) oc describe - Security (SCC) problems are visible if replica controller is queried oc -n adei get rc/mysql-1 -o yaml - It worth looking the pod environment oc env po --list - It worth connecting running container with 'rsh' session and see running processes, internal logs, etc. The 'debug' session will start a new instance of the pod. - If try looking if corresponding pv/pvc are bound. Check logs for pv. * Even if 'pvc' is bound. The 'pv' may have problems with its backend. * Check logs here: /var/lib/origin/plugins/kubernetes.io/glusterfs/ - Another frequent problems is failing 'postStart' hook. Or 'livenessProbe'. As it immediately crashes it is not possible to connect. Remedies are: * Set larger initial delay to check the probe. * Try to remove hook and execute it using 'rsh'/'debug' - Determine node running the pod and check the host logs in '/var/log/messages' * Particularly logs of 'origin-master-controllers' are of interest - Check which docker images are actually downloaded on the node docker images network ======= - There is a NetworkManager script which should adjust /etc/resolv.conf to use local dnsmasq server. This is based on '/etc/NetworkManager/dispatcher.d/99-origin-dns.sh' which does not play well if OpenShift is running on non-default network interface. I provided a patched version, but it worth verifying * that nameserver is pointing to the host itself (but not localhost, this is important to allow running pods to use it) * that correct upstream nameservers are listed in '/etc/dnsmasq.d/origin-upstream-dns.conf' * that correct upstream nameservers are listed in '/etc/origin/node/resolv.conf' * In some cases, it was necessary to restart dnsmasq (but it could be also for different reasons) If script misbehaves, it is possible to call it manually like that DEVICE_IFACE="eth1" ./99-origin-dns.sh eth1 up etcd (and general operability) ==== - Few of this sevices may seem running accroding to 'systemctl', but actually misbehave. Then, it may be needed to restart them manually. I have noticed it with * lvm2-lvmetad.socket (pvscan will complain on problems) * node-origin * glusterd in container (just kill the misbehaving pod, it will be recreated) * etcd but BEWARE of too entusiastic restarting: - However, restarting etcd many times is BAD as it may trigger a severe problem with 'kube-service-catalog/apiserver'. The bug description is here https://github.com/kubernetes/kubernetes/issues/47131 - Due to problem mentioned above, all 'oc' queries are very slow. There is not proper solution suggested. But killing the 'kube-service-catalog/apiserver' helps for a while. The pod is restarted and response times are back in order. * Another way to see this problem is quering 'healthz' service which would tell that there is too many clients and, please, retry later. curl -k https://apiserver.kube-service-catalog.svc/healthz - On node crash, the etcd database may get corrupted. * There is no easy fix. Backup/restore is not working. * Easiest option is to remove the failed etcd from the cluster. etcdctl3 --endpoints="192.168.213.1:2379" member list etcdctl3 --endpoints="192.168.213.1:2379" member remove * Add it to [new_etcd] section in inventory and run openshift-etcd to scale-up etcd cluster. - There is a helth check provided by the cluster curl -k https://apiserver.kube-service-catalog.svc/healthz it may complain about etcd problems. It seems triggered by OpenShift upgrade. The real cause and remedy is unclear, but the installation is mostly working. Discussion is in docs/upgrade.txt - There is also a different etcd which is integral part of the ansible service broker: 'openshift-ansible-service-broker/asb-etcd'. If investigated with 'oc logs' it complains on: 2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "") WARNING: 2018/03/07 20:54:48 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry. Nevertheless, it seems working without much trouble. The error message seems caused by certificate verification code which introduced in etcd 3.2. There are multiple bug repports on the issue. services ======== - kube-service-catalog/controller-manager might stuck in CrashLoopBackOff. It seems doesn't matter in current setup. * The problem is expired certificate of kube-service-catalog/apiserver. This can be checked with curl 'https://172.30.183.21:443/apis/servicecatalog.k8s.io/v1beta1 * The certificates are located in '/etc/origin/service-catalog' and can be verified. * There is possibly a way to renew it. However, while prototyping the cluster, it got severely broken on each time upgrade was executed. The new certificate in 'service-catalog' was one of very few things which actually changed in the upgrade. Therefore, it might be dangerous to replace it. * On other hand, it seems no missin services in the current configuration nodes: domino failures ===== - If OpenShift cluster is overloaded, we might get a domino failures if a single node goes off (even temporarily disconnected, e.g. due to restart of origin-node) and all pods are rescheduled to oterh nodes of the cluster. * Increased load, then, may trigger some other nodes offline (for a short while) and cause all nodes to be rescheduled from them as well. * This might continue infinitely as one node is gets disconnected after another, pods get rescheduled, and process never stops * The only solution is to remove temporarily some pods, e.g. ADEI pods could be easily removed and, then, provivisioned back pods: failed or very slow scheduling (normal start time in seconds range), failed pods, rogue namespaces, etc... ==== - LSDF mounts might cause pod-scheduling to fail * It seems OpenShift tries to index (chroot+chmod) files on mount and timeouts if LSDF volume has too many small files... * Reducing number of files with 'subPath' doesn't help here, but setting more specific 'networkPath' in pv helps * Suggestion is to remove fsGroup from 'dc' definition, but it is added automatically if pods use network volumes, setting volume 'gid' (cifs mount parameters specified in 'mountOptions' in pv definition) to match fsGroup doesn't help either * Timeout seems to be fixed to 2m and is not configurable... * Later versions of OpenShift has 'fsGroupChangePolicy=OnRootMismatch' parameter, but it is not present in 3.9 => Honestly, solution is unclear besides reducing number of files or mounting a small share subset with little fieles - OpenShift has numerous problems with clean-up resources after the pods. The problems are more likely to happen on the heavily loaded systems: cpu, io, interrputs, etc. * This may be indicated in the logs with various errors reporting inability to stop containers/processes, free network and storage resources. A few examples (not complete) dockerd-current: time="2019-09-30T18:46:12.298297013Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 00a456097fcf8d70a0461f05813e5a1f547446dd10b3b43ebc1f0bb09e841d1b: rpc error: code = 2 desc = no such process" origin-node: W0930 18:46:11.286634 2497 util.go:87] Warning: Unmount skipped because path does not exist: /var/lib/origin/openshift.local.volumes/pods/6aecbed1-e3b2-11e9-bbd6-0cc47adef0e6/volumes/kubernetes.io~glusterfs/adei-tmp Error syncing pod 1ed138cd-e2fc-11e9-bbd6-0cc47adef0e6 ("adei-smartgrid-maintain-1569790800-pcmdp_adei(1ed138cd-e2fc-11e9-bbd6-0cc47adef0e6)"), skipping: failed to "CreatePodSandbox" for "adei-smartgrid-maintain-1569790800-pcmdp_adei(1ed138cd-e2fc-11e9-bbd6-0cc47adef0e6)" with CreatePodSandboxError: "CreatePodSandbox for pod \"adei-smartgrid-maintain-1569790800-pcmdp_adei(1ed138cd-e2fc-11e9-bbd6-0cc47adef0e6)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"adei-smartgrid-maintain-1569790800-pcmdp_adei\" network: CNI request failed with status 400: 'failed to Statfs \"/proc/28826/ns/net\": no such file or directory\n'" * A more severe form is then PLEG (POD Lifecycle Event Generator) errors are reported: origin-node: I0925 07:52:00.422291 93115 kubelet.go:1796] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.448988393s ago; threshold is 3m0s] This indicates a severe delays in communication with docker daemon (can be checked with 'docker info') and may result in node marked temporarily NotReady causing 'pod' eviction. As pod eviction causes extensive load on the other nodes (which may also be affected of the same problem), the initial single-node issue may render all cluster unusable. * With mass evictions, the things could get even worse causing faults in etcd communication. This is reported like: etcd: lost the TCP streaming connection with peer 2696c5f68f35c672 (stream MsgApp v2 reader) * Apart from overloaded nodes (max cpu%, io, interrupts), PLEG issues can be caused by 1. Excessive amount of resident docker images on the node (see bellow) 2. This can cause and will be further amplified by the spurious interfaces on OpenVSwich (see bellow) 3. Another side effect is exhausing IPs in pod network on the node as their are also not cleaned properly (see bellow). As IPs get exhausted, scheduling penalities also rise and at some point pods will fail to schedule (but will be displayed as Ready) x. Nuanced issues between kubelet, docker, logging, networking and so on, with remediation of the issue sometimes being brutal (restarting all nodes etc, depending on the case). https://github.com/kubernetes/kubernetes/issues/45419#issuecomment-496818225 * The problem is not bound to CronJobs, but having regular scheduled jobs make it presence significantly more visible. Furthermore, CronJobs especially scheduling fat containers, like ADEI, significantly add to the I/O load on the system and may cause more severe form. - After a while, the 'pods' schedulling may get more-and-more sluggish, in general or if assigned to a specific node. * The docker images are accumulating on the nodes over time. After a threshold it will start adding the latency to the operation of docker daemon, slow down the pod scheduling (on the affected nodes), and may cause other sever side effects. The problems will start appearing at around 500-1000 images accumulated at a specific node. With 2000-3000, it will get severe and almost unusable (3-5 minutes to start a pod). So, eventually the unused images should be cleaned oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --confirm or alternatively per-node: docker rmi $(docker images --filter "dangling=true" -q --no-trunc) * Some images could be orphanned by OpenShift infrastructure (there was not a major number of orphaned images on KaaS yet). OpenShift supports 'hard' prunning to handle such images. https://docs.openshift.com/container-platform/3.7/admin_guide/pruning_resources.html * Even afterwards, a significant number of images may stay resident. There is two inter-related problems: 1. Docker infrastructure relies on the intermediate images. Consequently, very long Dockerfiles will create a LOT of images. 2. OpenShift keeps history of 'rc' which may refence several versions of old docker images. This will be not cleaned by the described approach. Furthermore, stopped containers lost by OpenShift infrastructure (see above) also prevent clean-up of the images Currenly, a dozen KDB pods produce about 200-300 images. In some cases, optimization of dockerfiles and, afterwards, a trough cleanup of old images may become necessity. The intermediate images can be found with 'docker images -a' (all images with as repository and the name), but there is no easy way to find pod populating them. One, but not very convinient is the following project (press F5 on startup): https://github.com/TomasTomecek/sen - In a more sever4 form, the 'pods' scheduling may fail all together on one (or more) of the nodes. After a long waiting, the 'oc logs' will report timeout. The 'oc describe' reports 'failed to create pod sandbox'. This can be caused by failure to clean-up after terminated pod properly. It causes rogue network interfaces to remain in OpenVSwitch fabric. * This can be determined by errors reported using 'ovs-vsctl show' or present in the log '/var/log/openvswitch/ovs-vswitchd.log' which may quickly grow over 100MB quickly. could not open network device vethb9de241f (No such device) * The work-around is to delete rogue interfaces with ovs-vsctl del-port br0 More info: ovs-ofctl -O OpenFlow13 show br0 ovs-ofctl -O OpenFlow13 dump-flows br0 This does not solve the problem, however. The new interfaces will get abandoned by OpenShift. * The issue is discussed here: https://bugzilla.redhat.com/show_bug.cgi?id=1518684 https://bugzilla.redhat.com/show_bug.cgi?id=1518912 - Another related problem causing long delays in pod scheduling is indicated by the following lines "failed to run CNI IPAM ADD: no IP addresses available in network" in the system logs: Jan 21 14:43:01 ipekatrin2 origin-node: E0121 14:43:01.066719 93115 remote_runtime.go:91] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "kdb-server-testing-180-build_katrin" network: CNI request failed with status 400: 'failed to run IPAM for 4b56e403e2757d38dca67831ce09e10bc3b3f442b6699c20dcd89556763e2d5d: failed to run CNI IPAM ADD: no IP addresses available in network: openshift-sdn Jan 21 14:43:01 ipekatrin2 origin-node: E0121 14:43:01.068021 93115 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "kdb-server-testing-180-build_katrin(65640902-3bd6-11ea-bbd6-0cc47adef0e6)" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "kdb-server-testing-180-build_katrin" network: CNI request failed with status 400: 'failed to run IPAM for 4b56e403e2757d38dca67831ce09e10bc3b3f442b6699c20dcd89556763e2d5d: failed to run CNI IPAM ADD: no IP addresses available in network: openshift-sdn * The problem that OpenShift (due to "ovs" problems or something else) fails to clean the network interfaces. You can check for the currently allocated ips in: /var/lib/cni/networks/openshift-sdn * This can be cleaned (but better not for cron, as I don't know what happens if IP is already assigned but the docker container is not running yet). Anyway it is better first to disable scheduling on the node (and may be even evict all running pods): oc adm manage-node --schedulable=false oc adm manage-node --evacuate for hash in $(tail -n +1 * | grep '^[A-Za-z0-9]*$' | cut -c 1-8); do if [ -z $(docker ps -a | grep $hash | awk '{print $1}') ]; then grep -ilr $hash ./; fi; done | xargs rm After this, the origin-node should be restarted and scheduling could be re-enabled systemctl restart origin-node oc adm manage-node --schedulable=true * It doesn't seem to be directly triggered by lost ovs interfaces (more interfaces are lost than ips). So, it is not possible to release IPs one by one. - After crashes / upgrades some pods may end up in 'Error' state. This is quite often happen to * kube-service-catalog/controller-manager * openshift-template-service-broker/api-server Normally, they should be deleted. Then, OpenShift will auto-restart pods and they likely will run without problems. for name in $(oc get pods -n openshift-template-service-broker | grep Error | awk '{ print $1 }' ); do oc -n openshift-template-service-broker delete po $name; done for name in $(oc get pods -n kube-service-catalog | grep Error | awk '{ print $1 }' ); do oc -n kube-service-catalog delete po $name; done - Other pods will fail with 'ImagePullBackOff' after cluster crash. The problem is that ImageStreams populated by 'builds' will not be recreated automatically. By default OpenShift docker registry is stored on ephemeral disks and is lost on crash. The build should be re-executed manually. oc -n adei start-build adei - Furthermore, after long outtages the CronJobs will stop functioning. The reason can be found by analyzing '/var/log/messages' or specially systemctl status origin-master-controllers it will contain something like: 'Cannot determine if / needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.' * The reason is that after 100 missed (or failed) launch periods it will stop trying to avoid excive load. The remedy is set 'startingDeadlineSeconds' which tells the system that if cronJob has failed to start in the allocated interval we stop trying until the next start period. Then, 100 is only counted the specified period. I.e. we should set period bellow the 'launch period / 100'. https://github.com/kubernetes/kubernetes/issues/45825 * The running CronJobs can be easily patched with oc -n adei patch cronjob/adei-autogen-update --patch '{ "spec": {"startingDeadlineSeconds": 120 }}' - Sometimes there is rogue namespaces in 'deleting' state. This is also hundreds of reasons, but mainly * Crash of both masters during population / destruction of OpenShift resources * Running of 'oc adm diagnostics' It is unclear how to remove them manually, but it seems if we run * OpenShift upgrade, the namespaces are gone (but there could be a bunch of new problems). * ... i don't know if install, etc. May cause the trouble... - There is also rogue pods (mainly due to some problems with unmounting lost storage) remaining "Deleting" state, etc. There are two possible situations: * The containers are actually already terminated, but OpenShift is not aware of it for some reason. * The containers are actually still running, but OpenShift is not able to terminate them for some reason. It is relatively easy to find out which is the case: * Finding the host running the failed pod with 'oc get pods -o wide' * Checking if associated containers are still running on the host with 'docker ps' The first case it relatively easy to handle, - one can simply enforce pod removal with oc delete --grace-period=0 --force In the second case we need * To actually stop containers before proceeding (enforcing will just leave them running forever). This can be done directly using 'docker' commands. * It also may be worth trying to clean associated resources. Check 'maintenace' documentation for details. - Permission problems will arise if non-KaaS namespace (using high range supplemental-group for GlusterFS mounts) is converted to KaaS (gid ranges within 1000 - 10,000 at the moment). The allowed gids should be configured in the namespace specification and the pods should be allowed to access files. Possible errors: unable to create pods: pods "mongodb-2-" is forbidden: no providers available to validate pod request Pod Networking ============== - Run commands in pod network, particularly execute packet sniffers (which would not work in the container due to missing capabilities) * Get container cid docker ps -f label=io.kubernetes.pod.name=epics-archappl-46-h6j62 -f label=io.kubernetes.pod.namespace=epics -f label=io.kubernetes.docker.type=podsandbox -q * Run command with container networking, e.g. tcpdump nsenter -n -t $(docker inspect --format "{{ .State.Pid }}" "f5a0ad4f5793") tcpdump -nv -i eth0 - Check if service properly exposed ports * 'nc' will not properly tell if UDP port is open or not due to underlaying firewall. iptables -n -L -t nat | grep 5064 | grep 172 Builds ====== - After changing storage for integrated docker registry, it may refuse builds with HTTP error 500. It is necessary to run: oadm policy reconcile-cluster-roles Storage ======= - The offline bricks can be brough back into the service with the follwoing command gluster volume start openshift force If this doesn't help, the volume should be stopped and started again gluster volume stop openshift gluster volume start openshift This might cause problems to the services. Likely pods will continue to run, but will be not be able to access mounted volumes. Particularly, adei-frontends/adei-cachers are affected. So, this services have to be restarted manually in some cases. - Running a lot of pods may exhaust available storage. It worth checking if * There is enough Docker storage for containers (lvm) * There is enough Heketi storage for dynamic volumes (lvm) * The root file system on nodes still has space for logs, etc. Particularly there is a big problem for ansible-ran virtual machines. The system disk is stored under '/root/VirtualBox VMs' and is not cleaned/destroyed unlike second hard drive on 'vagrant destroy'. So, it should be cleaned manually. - Too many parallel mounts (above 500 per node) may cause systemd slow-down/crashes. It is indicated by the following messages in the log: E0926 09:29:50.744454 93115 mount_linux.go:172] Mount failed: exit status 1 Output: Failed to start transient scope unit: Connection timed out * Solution is unclear, there are some suggestions to use 'setsid' in place of 'systemd-run' to do mounting, but not clear how. Discussion: https://github.com/kubernetes/kubernetes/issues/79194 * Can we do some rate-limiting? - Problems with pvc's can be evaluated by running oc -n openshift-ansible-service-broker describe pvc etcd Furthermore it worth looking in the folder with volume logs. For each 'pv' it stores subdirectories with pods executed on this host which are mount this pod and holds the log for this pods. /var/lib/origin/plugins/kubernetes.io/glusterfs/ - Heketi is problematic. * Worth checking if topology is fine and running. heketi-cli -s http://heketi-storage-glusterfs.openshift.suren.me --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)" - Furthermore, the heketi gluster volumes may be started, but with multiple bricks offline. This can be checked with gluster volume status detail * If not all bricks online, likely it is just enought to restart the volume gluster volume stop gluster volume start * This may break services depending on provisioned 'pv' like 'openshift-ansible-service-broker/asb-etcd' - If something gone wrong, heketi may end-up creating a bunch of new volumes, corrupt database, and crash refusing to start. Here is the recovery procedure. * Sometimes, it is still possible to start by setting 'HEKETI_IGNORE_STALE_OPERATIONS' environmental variable on the container. oc -n glusterfs env dc heketi-storage -e HEKETI_IGNORE_STALE_OPERATIONS=true * Even if it works, it does not solve the main issue with corruption. It is necessary to start a debugging pod for heketi (oc debug) export corrupted databased, fix it, and save back. Having database backup could save a lot of hussle to find that is amiss. heketi db export --dbfile heketi.db --jsonfile /tmp/q.json oc cp glusterfs/heketi-storage-3-jqlwm-debug:/tmp/q.json q.json cat q.json | python -m json.tool > q2.json ...... Fixing ..... oc cp q2.json glusterfs/heketi-storage-3-jqlwm-debug:/tmp/q2.json heketi db import --dbfile heketi2.db --jsonfile /tmp/q2.json cp heketi2.db /var/lib/heketi/heketi.db * If bunch of disks is created, there are still various left-overs. First, the Gluster volumes have to be cleaned. The idea is to compare 'vol_' prefixed volumes in Heketi and Gluster. And remove ones not present in heketi. There is the script in 'ands/scripts'. * There is LVM volumes left from Gluster (or even allocated, but not associated with Gluster for various failurs. so this clean-up is worth making independently). On each node we can easily find volumes created today lvdisplay -o name,time -S 'time since "2018-03-16"' or again we can compare lvm volumes which are used by Gluster bricks and which are not. The later ones should be cleaned up. Again there is the script. - Status of RAID can be checked with storcli utility /opt/MegaRAID/storcli/storcli64 /c0/v0 show /opt/MegaRAID/storcli/storcli64 /c0 show to further check smart attributes of a specific disk, one first needs find a corresponding DID of the disk. This is reported by storcli64 in JSON output mode /opt/MegaRAID/storcli/storcli64 /c0 /eall /sall show j Then, smartmontools can be used to extract smart attributes from this disk smartctl --all --device megaraid, /dev/sda MySQL ===== - MySQL may stop connecting to the master. While correct username and password are set in the container environment, mysql slave actually uses values stored somewhere in the database. It seems occasionally a corruption might happen causing authentication to break and slave report error 1045 Error connecting to master. retry-time: retries: 1, Error_code: 1045 The work-around is to set correct username/password again (using the values from environment): STOP SLAVE; CHANGE MASTER TO MASTER_USER='replication' MASTER_PASSWORD='...'; START SLAVE; This should fix it and the new username/password will be remembered over restarts... - MySQL may stop replicating from the master. There is some kind of deadlock in multi-threaded SLAVE SQL. This can be seen by exexuting (which should show a lot of slave threads waiting on coordinator to provide the load). SHOW PROCESSLIST; The remedy is to restart slave MySQL with 'slave_parallel_workers=0', give it a time to go, and then restart back in the standard multithreading mode. This can be achieved by editing 'statefulset/mysql-slave' and setting environmental vairable 'MYSQL_SLAVE_WORKERS' to 0 and, then, back to original value (16 currently). - This could be not end of this. The execution of statments from the log could 'stuck' because of the some "bad" transaction. This can be detected by looking into the replication status on the slave database: SHOW SLAVE STATUS\G It may happen what while 'Retrieved_Gtid_Set' increments (binary log is successfully transfered from the master and contains new transactions), the 'Executed_Gtid_Set' staying constant, i.e. new transactions are not applied to the slave server. Also, in this case 'Relay_Log_Pos' and 'Exec_Master_Log_Pos' will not advance. And the 'Slave_SQL_Running_State' will likely stuck in the 'Reading event from the relay log' state. The solution is to skip misbehaving transaction. Basically, we need to skip problematic transaction and to continue from the next one. The following considers that GTID_MODE is 'on' as we have it on the current KaaS platform. * First, we need to find the next transaction. Likely it the just an increment of the 'Executed_Gtid_Set', i.e. if 'Executed_Gtid_Set=3beaee24-2e55-11e8-9612-0a580a80000c:1-953787306' on the slave, the next GTID would be: 'Executed_Gtid_Set=3beaee24-2e55-11e8-9612-0a580a80000c:953787307' (yes "1-" also goes away). But this is not guaranteed. The more reliable way to find the misbehaving transaction is to look into the 'bin-log' on the master server (information is taken from SHOW SLAVE STATUS executed on the slave). In '/var/lib/mysql/data' run mysqlbinlog --start-position= | head -n 50 Here you will find which transaction has likely caused the problem. Furhermore, there will be the line looking like SET @@SESSION.GTID_NEXT='4ab8feff-5272-11e8-9320-08002715584a:201840' This is the gtid of the next transaction. * So, the following commands should be executed on the slave MySQL server (see details, https://www.thegeekdiary.com/how-to-skip-a-transaction-on-mysql-replication-slave-when-gtids-are-enabled/) STOP SLAVE; SET @@SESSION.GTID_NEXT=''; BEGIN; COMMIT; SET GTID_NEXT='AUTOMATIC'; START SLAVE; * It is also possible to review the stuck transaction on the slave mysql node. In the '/var/lib/mysql/data' run mysqlbinlog --start-position= Administration ============== - Some management tasks may require to login on ipekatrin* nodes. Thereafter, the password-less execution of 'oc' may fail on master nodes complaining on invalid authentication token. To fix it, it is necessary to check /root/.kube/config and remove references on logged users keeping only 'system:admin/kaas-kit-edu:8443' alkso check listed contexts and current-context. Performance =========== - To find if OpenShift restricts the usage of system resources, we can 'rsh' to container and check cgroup limits in sysfs /sys/fs/cgroup/cpuset/cpuset.cpus /sys/fs/cgroup/memory/memory.limit_in_bytes Various ======= - IPMI may cause problems as well. Particularly, the mounted CDrom may start complaining. Easiest is just to remove it from the running system with echo 1 > /sys/block/sdd/device/delete - 'oc get scc' reports the server doesn't have a resource type "scc" Delete (will be restarted) 'apiserver-*' pod in the 'kube-service-catalog' namespace