Services that have to be running
--------------------------------
Etcd:
- etcd
Node:
- origin-node
Master nodes:
- origin-master-api
- origin-master-controllers
 - origin-master itself is not running (it is replaced by the api and controllers services above)
Required Services:
- lvm2-lvmetad.socket
- lvm2-lvmetad.service
- docker
- NetworkManager
- firewalld
- dnsmasq
- openvswitch
Extra Services:
- ssh
- ntp
- openvpn
- ganesha (on master nodes, optional)
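The state of these services can be quickly verified with 'systemctl' on each host; a minimal check
(the exact unit list depends on the node role, adjust as needed):
    for svc in origin-node docker NetworkManager dnsmasq openvswitch; do
        systemctl is-active "$svc"
    done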
Pods that have to be running
----------------------------
Kubernetes System
- kube-service-catalog/apiserver
- kube-service-catalog/controller-manager
OpenShift Main Services
- default/docker-registry
- default/registry-console
- default/router (3 replicas)
- openshift-template-service-broker/api-server (daemonset, on all nodes)
OpenShift Secondary Services
- openshift-ansible-service-broker/asb
- openshift-ansible-service-broker/asb-etcd
GlusterFS
- glusterfs-storage (daemonset, on all storage nodes)
- glusterblock-storage-provisioner-dc
- heketi-storage
Metrics (openshift-infra):
- hawkular-cassandra
- hawkular-metrics
- heapster
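A quick way to spot pods which are not in a healthy state (the output should normally be empty,
'Completed' is filtered out because finished jobs are fine):
    oc get pods --all-namespaces -o wide | grep -vE 'Running|Completed'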
Debugging
=========
- Ensure system consistency as explained in 'consistency.txt' (incomplete)
- Check current pod logs and possibly logs for last failed instance
oc logs <pod name> --tail=100 [-p] - dc/name or ds/name as well
- Verify initialization steps (check if all volumes are mounted)
oc describe <pod name>
- Security (SCC) problems become visible if the replication controller is queried
oc -n adei get rc/mysql-1 -o yaml
- It is worth looking at the pod environment
oc env po <pod name> --list
- It is worth connecting to the running container with an 'rsh' session to see running processes,
internal logs, etc. The 'debug' session will start a new instance of the pod instead.
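For example (the 'adei' project and the 'mysql' deployment name are only placeholders here):
    oc -n adei rsh <pod name>                 # interactive shell in the running container
    oc -n adei rsh <pod name> ps aux          # or run a single command directly
    oc -n adei debug dc/mysql                 # start a fresh debug copy of the pod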
- Try checking if the corresponding pv/pvc are bound. Check the logs for the pv.
* Even if the 'pvc' is bound, the 'pv' may have problems with its backend.
* Check the logs here: /var/lib/origin/plugins/kubernetes.io/glusterfs/
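For example (the 'adei' project is used only for illustration):
    oc -n adei get pvc                        # the claims should be 'Bound'
    oc get pv                                 # cluster-wide list of volumes and their status
    oc describe pv <pv name>                  # details and events of a particular volume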
- Another frequent problem is a failing 'postStart' hook or 'livenessProbe'. As the pod
immediately crashes, it is not possible to connect. Remedies are:
* Set a larger initial delay before the probe is checked.
* Try to remove the hook and execute it manually using 'rsh'/'debug'
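If the probe is the culprit, the initial delay can be raised with a patch along these lines (the
'adei' project and the 'mysql' deployment/container names are placeholders, the pattern is a standard
strategic-merge patch for the containers list):
    oc -n adei patch dc/mysql --patch '{"spec":{"template":{"spec":{"containers":[{"name":"mysql","livenessProbe":{"initialDelaySeconds":120}}]}}}}'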
- Determine the node running the pod and check the host logs in '/var/log/messages'
* Particularly the logs of 'origin-master-controllers' are of interest
- Check which docker images are actually downloaded on the node
docker images
Network
=======
- There is a NetworkManager script which should adjust /etc/resolv.conf to use the local dnsmasq server.
This is based on '/etc/NetworkManager/dispatcher.d/99-origin-dns.sh', which does not play well
if OpenShift is running on a non-default network interface. I provided a patched version, but it is
worth verifying
* that the nameserver points to the host itself (not to localhost, this is important
to allow running pods to use it)
* that correct upstream nameservers are listed in '/etc/dnsmasq.d/origin-upstream-dns.conf'
* that correct upstream nameservers are listed in '/etc/origin/node/resolv.conf'
* In some cases, it was necessary to restart dnsmasq (but it could be also for different reasons)
If the script misbehaves, it is possible to call it manually like this:
DEVICE_IFACE="eth1" ./99-origin-dns.sh eth1 up
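A quick sanity check of the resulting DNS configuration (all paths as described above):
    cat /etc/resolv.conf                                  # nameserver should be the host IP, not 127.0.0.1
    cat /etc/dnsmasq.d/origin-upstream-dns.conf           # upstream nameservers used by dnsmasq
    cat /etc/origin/node/resolv.conf                      # upstream nameservers used by the node service
    systemctl restart dnsmasq                             # only if dnsmasq itself got stuck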
etcd (and general operability)
==============================
- A few of these services may seem to be running according to 'systemctl', but actually misbehave. Then, it
may be necessary to restart them manually. I have noticed it with
* lvm2-lvmetad.socket (pvscan will complain about problems)
* origin-node
* glusterd in the container (just kill the misbehaving pod, it will be recreated)
* etcd, but BEWARE of too enthusiastic restarting:
- However, restarting etcd many times is BAD as it may trigger a severe problem with
'kube-service-catalog/apiserver'. The bug description is here
https://github.com/kubernetes/kubernetes/issues/47131
- Due to the problem mentioned above, all 'oc' queries become very slow. There is no proper
solution suggested, but killing the 'kube-service-catalog/apiserver' pod helps for a while.
The pod is restarted and response times are back in order.
* Another way to see this problem is to query the 'healthz' service, which will report that
there are too many clients and ask to retry later.
curl -k https://apiserver.kube-service-catalog.svc/healthz
- On node crash, the etcd database may get corrupted.
* There is no easy fix. Backup/restore is not working.
* The easiest option is to remove the failed etcd member from the cluster.
etcdctl3 --endpoints="192.168.213.1:2379" member list
etcdctl3 --endpoints="192.168.213.1:2379" member remove <hexid>
* Add it to the [new_etcd] section of the inventory and run the openshift-etcd playbook to scale up the etcd cluster.
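The health of the remaining members can be checked with the same 'etcdctl3' alias (assuming the alias
passes through the v3 'endpoint' subcommands):
    etcdctl3 --endpoints="192.168.213.1:2379" endpoint health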
- There is a health check provided by the cluster
curl -k https://apiserver.kube-service-catalog.svc/healthz
It may complain about etcd problems. This seems to be triggered by an OpenShift upgrade. The real cause and
remedy are unclear, but the installation keeps mostly working. The discussion is in docs/upgrade.txt
- There is also a different etcd which is an integral part of the ansible service broker:
'openshift-ansible-service-broker/asb-etcd'. If investigated with 'oc logs', it complains
about:
2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "")
WARNING: 2018/03/07 20:54:48 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
Nevertheless, it seems to work without much trouble. The error message seems to be caused by the
certificate verification code introduced in etcd 3.2. There are multiple bug reports on
the issue.
Pods (failed pods, rogue namespaces, etc.)
==========================================
- After crashes / upgrades some pods may end up in the 'Error' state. This quite often happens to
* kube-service-catalog/controller-manager
* openshift-template-service-broker/api-server
Normally, they should just be deleted. OpenShift will then auto-restart the pods and they will likely run without problems.
for name in $(oc get pods -n openshift-template-service-broker | grep Error | awk '{ print $1 }' ); do oc -n openshift-template-service-broker delete po $name; done
for name in $(oc get pods -n kube-service-catalog | grep Error | awk '{ print $1 }' ); do oc -n kube-service-catalog delete po $name; done
- Other pods will fail with 'ImagePullBackOff' after a cluster crash. The problem is that the ImageStreams populated by 'builds' will
not be recreated automatically. By default the OpenShift docker registry is stored on ephemeral disks and is lost on a crash. The builds should be
re-executed manually.
oc -n adei start-build adei
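The affected pods and the image streams they depend on can be located with, e.g. (the 'adei' project
is only an illustration):
    oc get pods --all-namespaces | grep ImagePullBackOff
    oc -n adei get is                         # check whether the image stream tags are populated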
- Furthermore, after long outages the CronJobs will stop functioning. The reason can be found by analyzing '/var/log/messages' or, specifically,
systemctl status origin-master-controllers
It will contain something like:
'Cannot determine if <namespace>/<cronjob> needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.'
* The reason is that after 100 missed (or failed) launch periods the controller stops trying, to avoid excessive load. The remedy is to set 'startingDeadlineSeconds',
which tells the system that if the CronJob has failed to start within the allocated interval, it stops trying until the next start period. The 100 misses are then only
counted within the specified deadline, so the deadline should be kept well below 100 launch periods (e.g. shorter than a single launch period, as in the patch below).
https://github.com/kubernetes/kubernetes/issues/45825
* The running CronJobs can be easily patched with
oc -n adei patch cronjob/adei-autogen-update --patch '{ "spec": {"startingDeadlineSeconds": 120 }}'
- Sometimes there are rogue namespaces stuck in the 'deleting' state. There are hundreds of possible reasons, but mainly
* A crash of both masters during population / destruction of OpenShift resources
* Running 'oc adm diagnostics'
It is unclear how to remove them manually, but it seems that
* after an OpenShift upgrade the namespaces are gone (but there could be a bunch of new problems).
* it is not known whether a fresh install, etc., has the same effect...
- There are also rogue pods (mainly due to problems with unmounting lost storage), etc. If 'oc delete' does not
work for a long time, it is worth
* Determining the host running the failed pod with 'oc get pods -o wide'
* Going to that host, killing the processes and stopping the container using docker commands
* Looking in '/var/lib/origin/openshift.local.volumes/pods' for the remnants of the container
- This can be done with 'find . -name "heketi*"' or similar
- There could be problematic mounts which can be freed with a lazy umount
- The folders of removed pods may (and should) be removed.
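A rough clean-up sequence on the affected node (all names are placeholders, double-check the paths
before removing anything):
    oc get pods -o wide | grep <pod name>                 # find the node running the stuck pod
    docker ps | grep <pod name>                           # on that node: find its containers
    docker stop <container id>
    umount -l <stuck mount point>                         # lazy unmount if a mount refuses to go away
    rm -rf /var/lib/origin/openshift.local.volumes/pods/<pod uid>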
- Looking into '/var/log/messages', it is sometimes possible to spot various errors like
* Orphaned pod "212074ca-1d15-11e8-9de3-525400225b53" found, but volume paths are still present on disk.
The volumes can be removed in '/var/lib/origin/openshift.local.volumes/pods' on the corresponding node
* PodSandbox "aa28e9c7605cae088838bb4c9b92172083680880cd4c085d93cbc33b5b9e8910" from runtime service failed: ...
- We can find and remove the corresponding container (the short id is just the first characters of the long id)
docker ps -a | grep aa28e9c76
docker rm <id>
- We can further just destroy all containers which are not running (it will actually try to remove all of them,
but only an error message will be printed for the running ones)
docker ps -aq --no-trunc | xargs docker rm
Builds
======
- After changing the storage for the integrated docker registry, it may refuse builds with HTTP error 500. It is necessary
to run:
oadm policy reconcile-cluster-roles
Storage
=======
- Running a lot of pods may exhaust the available storage. It is worth checking if
* There is enough Docker storage for containers (lvm)
* There is enough Heketi storage for dynamic volumes (lvm)
* The root file system on the nodes still has space for logs, etc.
In particular, there is a big problem for ansible-run virtual machines. The system disk is stored
under '/root/VirtualBox VMs' and, unlike the second hard drive, is not cleaned/destroyed on 'vagrant
destroy'. So, it should be cleaned manually.
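For example (volume group and thin-pool names differ between installations):
    df -h /                                               # free space on the root file system
    vgs                                                   # free extents in the Docker / Heketi volume groups
    lvs -o lv_name,data_percent,metadata_percent          # fill level of the Docker thin pool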
- Problems with pvc's can be evaluated by running
oc -n openshift-ansible-service-broker describe pvc etcd
Furthermore, it is worth looking in the folder with the volume logs. For each 'pv' it stores subdirectories
for the pods executed on this host which mounted this volume, and it holds the logs of these pods.
/var/lib/origin/plugins/kubernetes.io/glusterfs/
- Heketi is problematic.
* It is worth checking if the topology is fine and all nodes/devices are online, e.g.
heketi-cli -s http://heketi-storage-glusterfs.openshift.suren.me --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)" topology info
- Furthermore, the heketi gluster volumes may be started, but with multiple bricks offline. This can
be checked with
gluster volume status <vol> detail
* If not all bricks are online, it is likely enough to just restart the volume
gluster volume stop <vol>
gluster volume start <vol>
* This may break services depending on the provisioned 'pv', like 'openshift-ansible-service-broker/asb-etcd'
- If something goes wrong, heketi may end up creating a bunch of new volumes, corrupting its database, and crashing,
refusing to start. Here is the recovery procedure.
* Sometimes, it is still possible to start it by setting the 'HEKETI_IGNORE_STALE_OPERATIONS' environment
variable on the container.
oc -n glusterfs env dc heketi-storage -e HEKETI_IGNORE_STALE_OPERATIONS=true
* Even if it works, it does not solve the main issue with the corruption. It is necessary to start a
debugging pod for heketi (oc debug), export the corrupted database, fix it, and save it back. Having a
database backup could save a lot of hassle in finding what is amiss.
heketi db export --dbfile heketi.db --jsonfile /tmp/q.json
oc cp glusterfs/heketi-storage-3-jqlwm-debug:/tmp/q.json q.json
cat q.json | python -m json.tool > q2.json
...... Fixing .....
oc cp q2.json glusterfs/heketi-storage-3-jqlwm-debug:/tmp/q2.json
heketi db import --dbfile heketi2.db --jsonfile /tmp/q2.json
cp heketi2.db /var/lib/heketi/heketi.db
* If a bunch of volumes was created, there are still various left-overs. First, the Gluster volumes
have to be cleaned. The idea is to compare the 'vol_'-prefixed volumes in Heketi and Gluster and
remove the ones not present in Heketi. There is a script for this in 'ands/scripts'.
* There are LVM volumes left over from Gluster (or even allocated but never associated with Gluster due to
various failures, so this clean-up is worth doing independently). On each node we can easily find
volumes created after a certain date, e.g.
lvdisplay -o name,time -S 'time since "2018-03-16"'
or again we can compare which LVM volumes are used by Gluster bricks and which are not. The latter
ones should be cleaned up. Again, there is a script for this.
Performance
===========
- To find out if OpenShift restricts the usage of system resources, we can 'rsh' into the container and check the
cgroup limits in sysfs:
/sys/fs/cgroup/cpuset/cpuset.cpus
/sys/fs/cgroup/memory/memory.limit_in_bytes
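For example, directly from the host (the pod name is a placeholder):
    oc rsh <pod name> cat /sys/fs/cgroup/cpuset/cpuset.cpus
    oc rsh <pod name> cat /sys/fs/cgroup/memory/memory.limit_in_bytes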
Various
=======
- IPMI may cause problems as well. Particularly, the mounted CDrom may start complaining. The easiest fix is
just to remove it from the running system with
echo 1 > /sys/block/sdd/device/delete