Diffstat (limited to 'docs/troubleshooting.txt')
-rw-r--r-- | docs/troubleshooting.txt | 49 |
1 files changed, 49 insertions, 0 deletions
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
index b4ac8e7..ef3c206 100644
--- a/docs/troubleshooting.txt
+++ b/docs/troubleshooting.txt
@@ -60,6 +60,8 @@ Debugging
         oc logs <pod name> --tail=100 [-p] - dc/name or ds/name as well
  - Verify initialization steps (check if all volumes are mounted)
         oc describe <pod name>
+ - Security (SCC) problems are visible when the replication controller is queried
+        oc -n adei get rc/mysql-1 -o yaml
  - It worth looking the pod environment
         oc env po <pod name> --list
  - It worth connecting running container with 'rsh' session and see running processes,
@@ -85,6 +87,7 @@ network
         * that nameserver is pointing to the host itself (but not localhost, this is important
           to allow running pods to use it)
         * that correct upstream nameservers are listed in '/etc/dnsmasq.d/origin-upstream-dns.conf'
+        * that correct upstream nameservers are listed in '/etc/origin/node/resolv.conf'
         * In some cases, it was necessary to restart dnsmasq (but it could be also for different reasons)
    If script misbehaves, it is possible to call it manually like that
         DEVICE_IFACE="eth1" ./99-origin-dns.sh eth1 up
@@ -96,6 +99,7 @@ etcd (and general operability)
    may be needed to restart them manually. I have noticed it with
         * lvm2-lvmetad.socket (pvscan will complain on problems)
         * node-origin
+        * glusterd in container (just kill the misbehaving pod, it will be recreated)
         * etcd but BEWARE of too entusiastic restarting:
  - However, restarting etcd many times is BAD as it may trigger a severe problem with
    'kube-service-catalog/apiserver'. The bug description is here
@@ -181,6 +185,13 @@ pods (failed pods, rogue namespaces, etc...)
         docker ps -aq --no-trunc | xargs docker rm
 
 
+Builds
+======
+ - After changing the storage for the integrated docker registry, it may refuse builds with HTTP error 500.
+   It is necessary to run:
+        oadm policy reconcile-cluster-roles
+
+
 Storage
 =======
  - Running a lot of pods may exhaust available storage. It worth checking if
@@ -208,3 +219,41 @@ Storage
         gluster volume start <vol>
     * This may break services depending on provisioned 'pv' like 'openshift-ansible-service-broker/asb-etcd'
 
+ - If something goes wrong, heketi may end up creating a bunch of new volumes, corrupt its database, and crash,
+   refusing to start. Here is the recovery procedure.
+    * Sometimes, it is still possible to start it by setting the 'HEKETI_IGNORE_STALE_OPERATIONS' environment
+      variable on the container.
+        oc -n glusterfs env dc heketi-storage -e HEKETI_IGNORE_STALE_OPERATIONS=true
+    * Even if it works, it does not solve the main issue with corruption. It is necessary to start a
+      debugging pod for heketi (oc debug), export the corrupted database, fix it, and save it back. Having
+      a database backup could save a lot of hassle in finding what is amiss.
+        heketi db export --dbfile heketi.db --jsonfile /tmp/q.json
+        oc cp glusterfs/heketi-storage-3-jqlwm-debug:/tmp/q.json q.json
+        cat q.json | python -m json.tool > q2.json
+        ...... Fixing .....
+        oc cp q2.json glusterfs/heketi-storage-3-jqlwm-debug:/tmp/q2.json
+        heketi db import --dbfile heketi2.db --jsonfile /tmp/q2.json
+        cp heketi2.db /var/lib/heketi/heketi.db
+    * If a bunch of disks was created, there are still various left-overs. First, the Gluster volumes
+      have to be cleaned. The idea is to compare 'vol_' prefixed volumes in Heketi and Gluster and
+      remove the ones not present in Heketi. There is a script in 'ands/scripts'.
+    * There are LVM volumes left over from Gluster (or even allocated but not associated with Gluster due to
+      various failures, so this clean-up is worth doing independently). On each node we can easily find
+      volumes created today
+        lvdisplay -o name,time -S 'time since "2018-03-16"'
+      or again we can compare LVM volumes which are used by Gluster bricks and which are not. The latter
+      ones should be cleaned up. Again, there is a script.
+
+Performance
+===========
+ - To find out whether OpenShift restricts the usage of system resources, we can 'rsh' into the container
+   and check the cgroup limits in sysfs
+        /sys/fs/cgroup/cpuset/cpuset.cpus
+        /sys/fs/cgroup/memory/memory.limit_in_bytes
+
+
+Various
+=======
+ - IPMI may cause problems as well. Particularly, the mounted CD-ROM may start complaining. The easiest
+   fix is to remove it from the running system with
+        echo 1 > /sys/block/sdd/device/delete
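
Note on the heketi clean-up step in this patch: the text describes comparing 'vol_' prefixed volumes known to Gluster with those known to Heketi and removing the ones Heketi does not know about. The comparison itself could look roughly like the sketch below. This is not the script from 'ands/scripts' (which is not shown here); the function name orphan_volumes and the two input files (one volume name per line) are hypothetical, and the commented commands for producing those files assume that 'heketi-cli volume list' prints a 'Name:vol_...' field on each line.

```shell
# Hypothetical helper: print 'vol_' prefixed volumes that appear in the Gluster
# list but not in the Heketi list. Both arguments are plain text files with one
# volume name per line. Nothing is deleted; the output is only a candidate list.
orphan_volumes() {
    gluster_list=$1
    heketi_list=$2
    # First file: remember every name Heketi knows about.
    # Second file: print 'vol_' names that were not remembered.
    # FILENAME is compared with ARGV[1] (instead of the NR==FNR idiom) so that
    # an empty Heketi list is handled correctly.
    awk 'FILENAME == ARGV[1] { known[$0] = 1; next }
         /^vol_/ && !($0 in known) { print }' "$heketi_list" "$gluster_list"
}

# Possible way to prepare the inputs (assumptions, adjust to the real output):
#   gluster volume list > gluster.txt
#   heketi-cli volume list | sed -n 's/.*Name:\(vol_[^ ]*\).*/\1/p' > heketi.txt
#   orphan_volumes gluster.txt heketi.txt
```

Volumes without the 'vol_' prefix (for example ones created by hand) are skipped on purpose, since only Heketi-provisioned volumes follow that naming. The resulting list should be reviewed manually before any 'gluster volume delete' is issued.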