8 files changed, 203 insertions, 31 deletions
diff --git a/docs/consistency.txt b/docs/consistency.txt
index 91a0ee7..3769a60 100644
--- a/docs/consistency.txt
+++ b/docs/consistency.txt
@@ -9,6 +9,10 @@ General overview
     oc get pvc --all-namespaces -o wide
  - API health check
     curl -k https://apiserver.kube-service-catalog.svc/healthz
+ - Docker status (at each node)
+    docker info
+    * Enough Data and Metadata Space is available 
+    * The number of resident images is in check (>500-1000 - bad, >2000-3000 - terrible)
 
 Nodes
 =====
@@ -31,7 +35,7 @@ Storage
 Networking
 ==========
  - Check that correct upstream name servers are listed for both DNSMasq (host) and SkyDNS (pods).
- If not fix and restart 'origin-node' and 'dnsmasq'.
+ If not, fix them and restart 'origin-node' and 'dnsmasq' (sometimes DNSMasq is simply stuck).
     * '/etc/dnsmasq.d/origin-upstream-dns.conf'
     * '/etc/origin/node/resolv.conf'
 
@@ -46,12 +50,14 @@ Networking
  - Ensure, we don't have override of cluster_name to first master (which we do during the
  provisioning of OpenShift plays)
 
- - Sometimes OpenShift fails to clean-up after terminated pod properly. This causes rogue
- network interfaces to remain in OpenVSwitch fabric. This can be determined by errors like:
+ - Sometimes OpenShift fails to clean up properly after a terminated pod (this problem is particularly
+ likely on systems with a huge number of resident docker images). This causes rogue network
+ interfaces to remain in the OpenVSwitch fabric. This can be determined by errors like:
     could not open network device vethb9de241f (No such device)
  reported by 'ovs-vsctl show' or present in the log '/var/log/openvswitch/ovs-vswitchd.log' 
  which may quickly grow over 100MB quickly. If number of rogue interfaces grows too much,
- the pod scheduling will start time-out on the affected node. 
+ the pod scheduling gets even worse (compared to delays caused only by docker images) and
+ will start to time out on the affected node.
   * The work-around is to delete rogue interfaces with 
     ovs-vsctl del-port br0 <iface>
  This does not solve the problem, however. The new interfaces will get abandoned by OpenShift.
diff --git a/docs/logs.txt b/docs/logs.txt
index e27b1ff..d33ef0a 100644
--- a/docs/logs.txt
+++ b/docs/logs.txt
@@ -2,6 +2,10 @@
 =================
  - Various RPC errors. 
     ... rpc error: code = # desc = xxx ...
+
+ - PLEG is not healthy: pleg was last seen active 3m0.448988393s ago; threshold is 3m0s
+    This is severe and indicates a communication problem (or at least high latency) with the docker daemon. As a result, the node can be
+    temporarily marked NotReady, causing eviction of all resident pods.
  
  - container kill failed because of 'container not found' or 'no such process': Cannot kill container ###: rpc error: code = 2 desc = no such process"
     Despite the errror, the containers are actually killed and pods destroyed. However, this error likely triggers
@@ -25,10 +29,14 @@
     There are no adverse effects to this.  It is a potential kernel issue, but should be just ignored by the customer.  Nothing is going to break.
         https://bugzilla.redhat.com/show_bug.cgi?id=1425278
 
-
  - E0625 03:59:52.438970   23953 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
     seems fine and can be ignored.
 
+ - E0926 09:29:50.744454   93115 mount_linux.go:172] Mount failed: exit status 1
+   Output: Failed to start transient scope unit: Connection timed out
+    It seems that too many parallel mounts (about 500 per node) may cause systemd to hang.
+    Details: https://github.com/kubernetes/kubernetes/issues/79194
+        * Suggested to use 'setsid' to mount volumes instead of 'systemd-run'
     
 /var/log/openvswitch/ovs-vswitchd.log
 =====================================
diff --git a/docs/maintenance.txt b/docs/maintenance.txt
new file mode 100644
index 0000000..9f52e18
--- /dev/null
+++ b/docs/maintenance.txt
@@ -0,0 +1,55 @@
+Unused resources
+================
+ ! Cleaning of images is necessary if the number of resident images grows above 1000. Everything else has not caused problems yet and can
+ be ignored unless it blocks other actions (e.g. clean-up of old images).
+
+ - Deployments. By themselves, these haven't caused problems yet, but old versions of 'rc' may block removal of the old images, and this may
+ have a negative impact on performance.
+        oc adm prune deployments --orphans --keep-complete=3 --keep-failed=1 --keep-younger-than=60m --confirm
+        oc adm prune builds --orphans --keep-complete=3 --keep-failed=1 --keep-younger-than=60m --confirm
+    * This, however, does not clean old 'rc' controllers which are allowed by 'revisionHistoryLimit' (and maybe by something else as
+    well). The included script 'prunerc.sh' cleans such controllers.
+
+ - OpenShift sometimes fails to clean stopped containers. These containers, in turn, may block removal of images (and likely also
+ cause Docker performance penalties by themselves if they accumulate).
+    * The lost containers can be identified by looking into /var/log/messages:
+        PodSandbox "aa28e9c7605cae088838bb4c9b92172083680880cd4c085d93cbc33b5b9e8910" from runtime service failed: ...
+    *  We can find and remove the corresponding container (the short id is just first letters of the long id)
+        docker ps -a | grep aa28e9c76
+        docker rm <id>
+    * In general, any non-running container which remains in the stopped state for a long time can be considered lost. We can remove
+    all of them or just the ones related to a specific image (if we are cleaning images and something blocks deletion of an old version)
+        docker rm $(docker ps -a | grep Exited | grep adei | awk '{ print $1 }')
+
+ - If cleaning containers manually and/or forcing termination of pods, some remnants could be left in '/var/lib/origin/openshift.local.volumes/pods'
+    * Probably, this could also happen in other cases. It can be detected by looking in /var/log/messages for something like
+            Orphaned pod "212074ca-1d15-11e8-9de3-525400225b53" found, but volume paths are still present on disk.
+    * If unknown, the location of the pod in question can be found with 'find . -name heketi*' or something similar (the container names are listed
+    under this subdirectory, so they can be used in the search)...
+    * There could be problematic mounts which can be freed with lazy umount
+    * The folders for removed pods may (and should) be removed.
+
+ - Pruning unused images (this is required because, if a large number accumulates, additional latencies in communication with the docker
+ daemon will be introduced, resulting in severe penalties to scheduling performance). The official way to clean unused images is
+         oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --confirm
+    * This, however, will keep all images referenced by existing bc, dc, rc, and pods (see above). So, it could be worth cleaning OpenShift resources
+      before proceeding with images. If images still don't go, it is also worth trying to clean orphaned containers.
+    * Some images could also be orphaned by the OpenShift infrastructure. OpenShift supports 'hard' pruning to handle such images.
+        https://docs.openshift.com/container-platform/3.7/admin_guide/pruning_resources.html
+      First check if something needs to be done:
+        oc -n default exec -i -t "$(oc -n default get pods -l deploymentconfig=docker-registry -o jsonpath=$'{.items[0].metadata.name}\n')" -- /usr/bin/dockerregistry -prune=check
+      If there are many orphans, the hard pruning can be executed. This requires additional permissions
+      for the service account running docker-registry
+        service_account=$(oc get -n default -o jsonpath=$'system:serviceaccount:{.metadata.namespace}:{.spec.template.spec.serviceAccountName}\n' dc/docker-registry)
+        oc adm policy add-cluster-role-to-user system:image-pruner ${service_account}
+      and should be done with the docker registry in read-only mode (requires restart of default/docker-registry containers)
+        oc env -n default dc/docker-registry 'REGISTRY_STORAGE_MAINTENANCE_READONLY={"enabled":true}'               # wait until new pods rolled out
+        oc -n default exec -i -t "$(oc -n default get pods -l deploymentconfig=docker-registry -o jsonpath=$'{.items[0].metadata.name}\n')" -- /usr/bin/dockerregistry -prune=delete
+        oc env -n default dc/docker-registry REGISTRY_STORAGE_MAINTENANCE_READONLY-
+
+ - Cleaning old images which refuse to go.
+    * Investigating image streams and manually deleting the old versions of the images
+        oc get is adei -o yaml
+        oc delete image sha256:04afd4d4a0481e1510f12d6d071f1dceddef27416eb922cf524a61281257c66e
+    * Cleaning old dangling images using docker (on all nodes). This was tried and, as it seems, caused no issues to the operation of the cluster.
+        docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
diff --git a/docs/managment.txt b/docs/managment.txt
index cfc6aff..96ae559 100644
--- a/docs/managment.txt
+++ b/docs/managment.txt
@@ -43,6 +43,9 @@ DOs and DONTs
  - Few administrative tools could cause troubles. Don't run
     * oc adm diagnostics
 
+ - Old docker 1.12 has many problems. Red Hat recommends updating. Don't do it! OpenShift 3.7 only supports
+ docker 1.12 and will not work with 1.13 or later.
+
 
 Failures / Immidiate
 ========
diff --git a/docs/problems.txt b/docs/problems.txt
index 1d729cd..e616fe4 100644
--- a/docs/problems.txt
+++ b/docs/problems.txt
@@ -7,7 +7,9 @@ Actions Required
 
 Rogue network interfaces on OpenVSwitch bridge
 ==============================================
- Sometimes OpenShift fails to clean-up after terminated pod properly. The actual reason is unclear.
+ Sometimes OpenShift fails to clean up properly after a terminated pod. The actual reason is unclear, but
+ the severity of the problem increases if an extreme number of images is present in the local Docker storage.
+ Several thousand images definitely intensify this problem.
   * The issue is discussed here:
         https://bugzilla.redhat.com/show_bug.cgi?id=1518684
   * And can be determined by looking into:
@@ -23,7 +25,8 @@ Rogue network interfaces on OpenVSwitch bridge
   * Even if not failed, it takes several minutes to schedule the pod on the affected nodes.
 
  Cause:
-  * Unclear, but it seems periodic ADEI cron jobs causes the issue.
+  * Unclear, but it seems periodic ADEI cron jobs cause the issue if many images are present
+  in docker.
   * Could be related to 'container kill failed' problem explained in the section bellow.
      Cannot kill container ###: rpc error: code = 2 desc = no such process
 
@@ -35,6 +38,8 @@ Rogue network interfaces on OpenVSwitch bridge
   * The simplest work-around is to just remove rogue interface. They will be re-created, but performance
   problems only starts after hundreds accumulate.
     ovs-vsctl del-port br0 <iface>
+  * It seems helpful to purge unused docker images to reduce the rate of interface appearance.
+  
   
  Status:
    * Cron job is installed which cleans rogue interfaces as they number hits 25.
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
index ea987b5..2290901 100644
--- a/docs/troubleshooting.txt
+++ b/docs/troubleshooting.txt
@@ -134,9 +134,53 @@ etcd (and general operability)
  
 pods (failed pods, rogue namespaces, etc...)
 ====
- - The 'pods' scheduling may fail on one (or more) of the nodes after long waiting with 'oc logs' reporting
- timeout. The 'oc describe' reports 'failed to create pod sandbox'. This can be caused by failure to clean-up 
- after terminated pod properly. It causes rogue network interfaces to remain in OpenVSwitch fabric. 
+ - OpenShift has numerous problems with cleaning up resources after pods. The problems are more likely to happen on
+ heavily loaded systems: cpu, io, interrupts, etc.
+    * This may be indicated in the logs by various errors reporting inability to stop containers/processes or to free network
+    and storage resources. A few examples (not complete):
+        dockerd-current: time="2019-09-30T18:46:12.298297013Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 00a456097fcf8d70a0461f05813e5a1f547446dd10b3b43ebc1f0bb09e841d1b: rpc error: code = 2 desc = no such process"
+        origin-node: W0930 18:46:11.286634    2497 util.go:87] Warning: Unmount skipped because path does not exist: /var/lib/origin/openshift.local.volumes/pods/6aecbed1-e3b2-11e9-bbd6-0cc47adef0e6/volumes/kubernetes.io~glusterfs/adei-tmp
+        Error syncing pod 1ed138cd-e2fc-11e9-bbd6-0cc47adef0e6 ("adei-smartgrid-maintain-1569790800-pcmdp_adei(1ed138cd-e2fc-11e9-bbd6-0cc47adef0e6)"), skipping: failed to "CreatePodSandbox" for "adei-smartgrid-maintain-1569790800-pcmdp_adei(1ed138cd-e2fc-11e9-bbd6-0cc47adef0e6)" with CreatePodSandboxError: "CreatePodSandbox for pod \"adei-smartgrid-maintain-1569790800-pcmdp_adei(1ed138cd-e2fc-11e9-bbd6-0cc47adef0e6)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"adei-smartgrid-maintain-1569790800-pcmdp_adei\" network: CNI request failed with status 400: 'failed to Statfs \"/proc/28826/ns/net\": no such file or directory\n'"
+    * A more severe form is when PLEG (Pod Lifecycle Event Generator) errors are reported:
+        origin-node: I0925 07:52:00.422291   93115 kubelet.go:1796] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.448988393s ago; threshold is 3m0s]
+    This indicates severe delays in communication with the docker daemon (can be checked with 'docker info') and may result in the node being marked
+    temporarily NotReady, causing 'pod' eviction. As pod eviction causes extensive load on the other nodes (which may also be affected by the
+    same problem), the initial single-node issue may render the whole cluster unusable.
+    * With mass evictions, things could get even worse, causing faults in etcd communication. This is reported like:
+         etcd: lost the TCP streaming connection with peer 2696c5f68f35c672 (stream MsgApp v2 reader)
+    * Apart from overloaded nodes (max cpu%, io, interrupts), PLEG issues can be caused by
+        1. Excessive number of resident docker images on the node (see below)
+        2. This can cause and will be further amplified by the spurious interfaces on OpenVSwitch (see below)
+        x. Nuanced issues between kubelet, docker, logging, networking and so on, with remediation of the issue sometimes being brutal (restarting all nodes etc., depending on the case).
+            https://github.com/kubernetes/kubernetes/issues/45419#issuecomment-496818225
+    * The problem is not bound to CronJobs, but having regularly scheduled jobs makes its presence significantly more visible.
+    Furthermore, CronJobs, especially ones scheduling fat containers like ADEI, significantly add to the I/O load on the system
+    and may cause a more severe form.
+
+ - After a while, the 'pods' scheduling may get more and more sluggish, in general or if assigned to a specific node.
+  * The docker images accumulate on the nodes over time. After a threshold, this starts adding latency to the
+  operation of the docker daemon, slows down pod scheduling (on the affected nodes), and may cause other severe side effects.
+  The problems start appearing at around 500-1000 images accumulated on a specific node. With 2000-3000, it gets
+  severe and almost unusable (3-5 minutes to start a pod). So, eventually the unused images should be cleaned
+    oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --confirm
+  or alternatively per-node:
+    docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
+  * Some images could be orphaned by the OpenShift infrastructure (there has not been a major number of orphaned images on KaaS yet).
+  OpenShift supports 'hard' pruning to handle such images.
+    https://docs.openshift.com/container-platform/3.7/admin_guide/pruning_resources.html
+  * Even afterwards, a significant number of images may stay resident. There are two inter-related problems:
+    1. Docker infrastructure relies on intermediate images. Consequently, very long Dockerfiles will create a LOT of images.
+    2. OpenShift keeps a history of 'rc' which may reference several versions of old docker images. These will not be cleaned by the
+    described approach. Furthermore, stopped containers lost by the OpenShift infrastructure (see above) also prevent clean-up of
+    the images
+  Currently, a dozen KDB pods produce about 200-300 images. In some cases, optimization of Dockerfiles and, afterwards, a thorough
+  cleanup of old images may become a necessity. The intermediate images can be found with 'docker images -a' (all images with
+  <none> as repository and tag), but there is no easy way to find the pod populating them. One option, though not a very convenient one, is the following
+  project (press F5 on startup): https://github.com/TomasTomecek/sen
+
+ - In a more severe form, the 'pods' scheduling may fail altogether on one (or more) of the nodes. After a long wait,
+ 'oc logs' will report a timeout. The 'oc describe' reports 'failed to create pod sandbox'. This can be caused by failure
+ to clean up properly after a terminated pod. It causes rogue network interfaces to remain in the OpenVSwitch fabric.
   * This can be determined by errors reported using 'ovs-vsctl show' or present in the log '/var/log/openvswitch/ovs-vswitchd.log' 
     which may quickly grow over 100MB quickly. 
         could not open network device vethb9de241f (No such device)
@@ -149,7 +193,7 @@ pods (failed pods, rogue namespaces, etc...)
   * The issue is discussed here:
         https://bugzilla.redhat.com/show_bug.cgi?id=1518684
         https://bugzilla.redhat.com/show_bug.cgi?id=1518912
-        
+
  - After crashes / upgrades some pods may end up in 'Error' state. This is quite often happen to
     * kube-service-catalog/controller-manager
     * openshift-template-service-broker/api-server
@@ -180,26 +224,24 @@ pods (failed pods, rogue namespaces, etc...)
     * OpenShift upgrade, the namespaces are gone (but there could be a bunch of new problems).
     * ... i don't know if install, etc. May cause the trouble...
 
- - There is also rogue pods (mainly due to some problems with unmounting lost storage), etc. If 'oc delete' does not
- work for a long time. It worth
-    * Determining the host running failed pod with 'oc get pods -o wide'
-    * Going to the pod and killing processes and stopping the container using docker command
-    * Looking in the '/var/lib/origin/openshift.local.volumes/pods' for the remnants of the container
-        - This can be done with 'find . -name heketi*' or something like...
-        - There could be problematic mounts which can be freed with lazy umount
-        - The folders for removed pods may (and should) be removed.
-
- - Looking into the '/var/log/messages', it is sometimes possible to spot various erros like
-    * Orphaned pod "212074ca-1d15-11e8-9de3-525400225b53" found, but volume paths are still present on disk.
-        The volumes can be removed in '/var/lib/origin/openshift.local.volumes/pods' on the corresponding node
-    * PodSandbox "aa28e9c7605cae088838bb4c9b92172083680880cd4c085d93cbc33b5b9e8910" from runtime service failed: ...
-        - We can find and remove the corresponding container (the short id is just first letters of the long id)
-                docker ps -a | grep aa28e9c76
-                docker rm <id>
-        - We further can just destroy all containers which are not running (it will actually try to remove all,
-        but just error message will be printed for running ones)
-                docker ps -aq --no-trunc | xargs docker rm
+ - There are also rogue pods (mainly due to some problems with unmounting lost storage) remaining in the "Deleting" state, etc.
+ There are two possible situations:
+    * The containers are actually already terminated, but OpenShift is not aware of it for some reason.
+    * The containers are actually still running, but OpenShift is not able to terminate them for some reason.
+ It is relatively easy to find out which is the case:
+    * Finding the host running the failed pod with 'oc get pods -o wide'
+    * Checking if associated containers are still running on the host with 'docker ps'
+ The first case is relatively easy to handle: one can simply enforce pod removal with
+        oc delete pod <pod> --grace-period=0 --force
+ In the second case we need
+    * To actually stop the containers before proceeding (enforcing will just leave them running forever). This can
+    be done directly using 'docker' commands.
+    * It may also be worth trying to clean associated resources. Check the 'maintenance' documentation for details.
 
+ - Permission problems will arise if a non-KaaS namespace (using a high-range supplemental-group for GlusterFS mounts) is converted
+ to KaaS (gid ranges within 1000 - 10,000 at the moment). The allowed gids should be configured in the namespace specification
+ and the pods should be allowed to access files. Possible errors:
+    unable to create pods: pods "mongodb-2-" is forbidden: no providers available to validate pod request
 
 
 
@@ -219,6 +261,14 @@ Storage
   Particularly there is a big problem for ansible-ran virtual machines. The system disk is stored
   under '/root/VirtualBox VMs' and is not cleaned/destroyed unlike second hard drive on 'vagrant
   destroy'. So, it should be cleaned manually.
+
+ - Too many parallel mounts (above 500 per node) may cause systemd slow-down/crashes. It is indicated by
+ the following messages in the log:
+        E0926 09:29:50.744454   93115 mount_linux.go:172] Mount failed: exit status 1
+        Output: Failed to start transient scope unit: Connection timed out
+    * The solution is unclear. There are some suggestions to use 'setsid' in place of 'systemd-run' for mounting,
+    but it is not clear how. Discussion: https://github.com/kubernetes/kubernetes/issues/79194
+    * Can we do some rate-limiting?
   
  - Problems with pvc's can be evaluated by running 
         oc  -n openshift-ansible-service-broker describe pvc etcd
@@ -271,7 +321,12 @@ MySQL
  The remedy is to restart slave MySQL with 'slave_parallel_workers=0', give it a time to go, and then
  restart back in the standard multithreading mode.
 
-
+Administration
+==============
+ - Some management tasks may require logging in on ipekatrin* nodes. Thereafter, the password-less execution of
+ 'oc' may fail on master nodes, complaining about an invalid authentication token. To fix it, it is necessary to check
+ /root/.kube/config and remove references to logged-in users, keeping only 'system:admin/kaas-kit-edu:8443'. Also check
+ the listed contexts and current-context.
      
 Performance
 ===========
diff --git a/docs/users.txt b/docs/users.txt
new file mode 100644
index 0000000..c28f400
--- /dev/null
+++ b/docs/users.txt
@@ -0,0 +1,21 @@
+Dockerfiles
+===========
+ - Long (many-layer) Dockerfiles may cause a significant disruption to the OpenShift cluster (far beyond the mere performance penalty
+ of working with these layers). Currently, it is imperative to reduce the number of intermediate images resident on the OpenShift nodes.
+    * The better approach is to optimize Dockerfiles: all ENV defined at once, as few ARG as possible, a single COPY, and a single RUN to
+    set up everything.
+    * Alternatively, the final image can be squashed with 'docker build --squash ...' (requires enabling experimental features in the docker daemon).
+    However, this is incompatible with the OpenShift build process.
+ If optimizing already running applications, it is not enough just to rebuild images. Old images could be referenced by the old
+ 'rc' left in the system or even by stopped containers lost by the OpenShift infrastructure.
+    * Check if old images are still present (https://github.com/TomasTomecek/sen is the only application I am aware of capable of showing it)
+    * See the maintenance section for how to get rid of old images if they are still present
+
+Deployments
+===========
+ - CronJobs are currently a bit problematic and periodically cause some lost resources, etc. They can be used if necessary, but it is better
+ to minimize their use. In any case, it is crucial to minimize the size of frequently scheduled Job containers (otherwise the I/O load is large).
+
+
+Storage
+=======
diff --git a/docs/vision.txt b/docs/vision.txt
new file mode 100644
index 0000000..bfef287
--- /dev/null
+++ b/docs/vision.txt
@@ -0,0 +1,19 @@
+Ands v.2
+========
+ - Try overlay2 storage driver (LVM is used in Ands v.1). Also check further docker configuration options: 'cgroup-driver', ...
+ - Integrate fast Ethernet and use container native networking. OpenVSwitch is slow and causes problems.
+ - Do not run pods on Master nodes, but Gluster and a few database pods (MySQL) are OK (multiple reasons, especially mounting a lot of Gluster Volumes)
+ - Object Storage should be integrated: either Gluster Block is ready for production or we have to use Ceph as well
+ - Automatic provisioning would be much better than handling volumes through Ands. Basically, this will render Ands redundant. We can switch to Helm, etc.
+   But we need the ability to easily understand which volume belongs to which pod/namespace and automatically kill redundant volumes.
+
+Questions
+=========
+ - Updates to cluster configuration (evaluate current load, etc.)? Non-scheduling masters? Something with storage? Specify appropriate node parameters
+ - Shall we switch to plain Kubernetes or keep using OpenShift? Discussion (just take a note about security - it is right to ban containers running as
+ root, otherwise it is a hazard to our data model):
+     https://cloudowski.com/articles/10-differences-between-openshift-and-kubernetes/
+ - Can we find a good distributed storage for data-intensive databases? The current slave-master model requires too much manual attention.
+ - Can we find a way to run GUI applications in containers? Having something like the CUDA profiler would be nice.
+ - Think about monitoring. Probably SNMP + it would be nice to have some kind of SQL database with performance metrics.
+