Unused resources
================
 ! Cleaning of images is necessary if the number of resident images grows above 1000. Everything else has not caused problems yet and can
 be ignored unless it blocks other actions (e.g. clean-up of old images).
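 A rough count of resident images (to check against the 1000 threshold) can be obtained with standard commands, e.g.
        oc get images --no-headers | wc -l          # images known to OpenShift
        docker images -q | wc -l                    # images present on a particular node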

 - Deployments. By itself this hasn't caused problems yet, but old versions of 'rc' may block removal of old images and this may
 have a negative impact on performance.
        oc adm prune deployments --orphans --keep-complete=3 --keep-failed=1 --keep-younger-than=60m --confirm
        oc adm prune builds --orphans --keep-complete=3 --keep-failed=1 --keep-younger-than=60m --confirm
    * This, however, does not clean old 'rc' controllers which are kept because of 'revisionHistoryLimit' (and possibly for other reasons
    as well). There is a script included to clean such controllers, 'prunerc.sh'; a sketch of the idea is shown below.
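    * A minimal sketch of the idea behind such a script (the bundled prunerc.sh may use different criteria; this assumes the default
    'oc get rc' column layout with DESIRED and CURRENT in columns 2 and 3): delete controllers scaled down to zero in the current project
        oc get rc --no-headers | awk '$2 == 0 && $3 == 0 { print $1 }' | xargs -r oc delete rc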

 - OpenShift sometimes fails to clean stopped containers. These containers, again, may block removal of images (and, if they accumulate,
 likely cause Docker performance penalties on their own).
    * The lost containers can be identified by looking into /var/log/messages.
        PodSandbox "aa28e9c7605cae088838bb4c9b92172083680880cd4c085d93cbc33b5b9e8910" from runtime service failed: ...
    * We can find and remove the corresponding container (the short id is just the first letters of the long id)
        docker ps -a | grep aa28e9c76
        docker rm <id>
    * In general, any container which remains in the stopped state for a long time can be considered lost. We can remove
    all of them or just the ones related to a specific image (if we are cleaning images and something blocks deletion of an old version)
        docker rm $(docker ps -a | grep Exited | grep adei | awk '{ print $1 }')
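    * To collect the ids of all sandboxes reported as failed in one go, something like the following can be used (a sketch assuming the log format shown above)
        grep 'PodSandbox' /var/log/messages | grep -o '"[0-9a-f]\{64\}"' | tr -d '"' | sort -u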

 - If containers are cleaned manually and/or pod termination is forced, some remnants may be left in '/var/lib/origin/openshift.local.volumes/pods'
    * This probably can also happen in other cases. It can be detected by looking in /var/log/messages for messages like
            Orphaned pod "212074ca-1d15-11e8-9de3-525400225b53" found, but volume paths are still present on disk.
    * If the location is unknown, the directory of the pod in question can be found with something like 'find . -name heketi*' (the container names are listed
    under this subdirectory, so they can be used in the search)
    * There may be problematic mounts which can be freed with a lazy umount
    * The folders of removed pods may (and should) be removed, e.g. as sketched below
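    * A possible cleanup sequence (a sketch only; verify with 'oc get pods --all-namespaces' that the pod is really gone first), using the pod uid from the log message
        pod=212074ca-1d15-11e8-9de3-525400225b53
        ls /var/lib/origin/openshift.local.volumes/pods/$pod       # inspect what is left
        mount | grep $pod                                          # check for stuck mounts
        umount -l <mount point>                                    # lazy umount if needed
        rm -rf /var/lib/origin/openshift.local.volumes/pods/$pod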

 - Pruning unused images (this is required because, if a large number of images accumulates, additional latencies in communication with the docker
 daemon are introduced and result in severe penalties to scheduling performance). The official way to clean unused images is
         oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --confirm
    * This, however, will keep all images referenced by existing bc, dc, rc, and pods (see above). So, it is worth cleaning OpenShift resources
      before proceeding with images. If images still don't go away, it is also worth trying to clean orphaned containers.
    * Some images could also be orphaned by the OpenShift infrastructure. OpenShift supports 'hard' pruning to handle such images.
        https://docs.openshift.com/container-platform/3.7/admin_guide/pruning_resources.html
      First check if something needs to be done:
        oc -n default exec -i -t "$(oc -n default get pods -l deploymentconfig=docker-registry -o jsonpath=$'{.items[0].metadata.name}\n')" -- /usr/bin/dockerregistry -prune=check
      If there are many orphans, hard pruning can be executed. This requires additional permissions
      for the service account running docker-registry
        service_account=$(oc get -n default -o jsonpath=$'system:serviceaccount:{.metadata.namespace}:{.spec.template.spec.serviceAccountName}\n' dc/docker-registry)
        oc adm policy add-cluster-role-to-user system:image-pruner ${service_account}
       and should be done with the docker registry in read-only mode (requires a restart of the default/docker-registry containers)
        oc env -n default dc/docker-registry 'REGISTRY_STORAGE_MAINTENANCE_READONLY={"enabled":true}'               # wait until new pods rolled out
        oc -n default exec -i -t "$(oc -n default get pods -l deploymentconfig=docker-registry -o jsonpath=$'{.items[0].metadata.name}\n')" -- /usr/bin/dockerregistry -prune=delete
        oc env -n default dc/docker-registry REGISTRY_STORAGE_MAINTENANCE_READONLY-
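    * One way to verify that the read-only setting actually rolled out before pruning (and that it is gone again afterwards) is
        oc rollout status -n default dc/docker-registry
        oc env -n default dc/docker-registry --list | grep MAINTENANCE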

 - Cleaning old images which don't want to go.
    * Investigating image streams and manually deleting the old versions of the images
        oc get is adei -o yaml
        oc delete image sha256:04afd4d4a0481e1510f12d6d071f1dceddef27416eb922cf524a61281257c66e
    * Cleaning old dangling images using docker (on all nodes). This has been tried and, as it seems, caused no issues to the operation of the cluster.
        docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
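    * The digests referenced by an image stream can also be listed directly, e.g. with a jsonpath query like the following (a sketch assuming the usual ImageStream status layout)
        oc get is adei -o jsonpath='{range .status.tags[*]}{.tag}{" "}{.items[*].image}{"\n"}{end}'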

 - Cleaning log files which over-use inodes, etc.
    * Volume log files
        find /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/ -name '*.log' -delete
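    * Inode usage can be checked before and after the cleanup to confirm that it helped
        df -i /var/lib/origin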