Update documentation & users

author: Suren A. Chilingaryan <csa@suren.me> 2024-07-29 22:32:00 +0200
committer: Suren A. Chilingaryan <csa@suren.me> 2024-07-29 22:32:00 +0200
commit: 4175af7f92ad7357b83ceb56f2a6d42a8243cd80 (patch)
tree: bb4af8cbb7495c179b0e257ca337f10132f63170 /docs/troubleshooting.txt
parent: 0fbe7da54cd41846d7debfc49d25397ad8fc69a0 (diff)
1 files changed, 13 insertions, 1 deletions
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
index 315f9f4..0621b25 100644
--- a/docs/troubleshooting.txt
+++ b/docs/troubleshooting.txt
@@ -151,8 +151,17 @@ nodes: domino failures
     * This might continue indefinitely as one node after another gets disconnected, pods get rescheduled, and the process never stops
     * The only solution is to temporarily remove some pods, e.g. ADEI pods can be easily removed and then provisioned back
 
-pods: very slow scheduling (normal start time in seconds range), failed pods, rogue namespaces, etc...
+pods: failed or very slow scheduling (normal start time in seconds range), failed pods, rogue namespaces, etc...
 ====
+  - LSDF mounts might cause pod-scheduling to fail
+    * It seems OpenShift tries to index (chown+chmod) files on mount and times out if the LSDF volume has too many small files...
+    * Reducing the number of files with 'subPath' doesn't help here, but setting a more specific 'networkPath' in the pv does
+    * The usual suggestion is to remove fsGroup from the 'dc' definition, but it is added automatically if pods use network volumes;
+    setting the volume 'gid' (cifs mount parameters specified in 'mountOptions' in the pv definition) to match fsGroup doesn't help either
+    * The timeout seems to be fixed at 2m and is not configurable...
+    * Later versions of OpenShift have the 'fsGroupChangePolicy=OnRootMismatch' parameter, but it is not present in 3.9
+    => Honestly, the solution is unclear besides reducing the number of files or mounting a small subset of the share with few files
+
  - OpenShift has numerous problems with cleaning up resources after the pods. The problems are more likely to happen on
  heavily loaded systems: cpu, io, interrupts, etc.
     * This may be indicated in the logs with various errors reporting inability to stop containers/processes, free network
@@ -450,3 +459,6 @@ Various
  - IPMI may cause problems as well. Particularly, the mounted CDrom may start complaining. Easiest is
  just to remove it from the running system with
      echo 1 > /sys/block/sdd/device/delete
+
+ - 'oc get scc' reports the server doesn't have a resource type "scc"
+    Delete the 'apiserver-*' pod in the 'kube-service-catalog' namespace (it will be restarted)
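
The last note in the diff (restarting the service-catalog apiserver when 'oc get scc' fails) could be scripted roughly as follows. This is a minimal sketch, not part of the commit: the namespace and the 'apiserver-' name prefix are taken from the note, while the helper names select_apiserver_pods and restart_catalog_apiserver are made up for illustration. It assumes 'oc' is already logged in with cluster-admin rights.

```shell
#!/bin/sh
# Sketch only: namespace and pod-name prefix come from the note above;
# the function names are hypothetical helpers, not part of any tool.
NS=kube-service-catalog

# Read `oc get pods -o name` output on stdin and keep only the apiserver pods.
# Accepts both the "pods/NAME" and "pod/NAME" forms, since the prefix differs
# between client versions.
select_apiserver_pods() {
    grep -E '^pods?/apiserver-' || true
}

# Delete the matching pods one by one; per the note, they get restarted.
restart_catalog_apiserver() {
    oc -n "$NS" get pods -o name | select_apiserver_pods |
    while read -r pod; do
        oc -n "$NS" delete "$pod"
    done
}

# Usage (nothing runs until the function is called):
#   restart_catalog_apiserver
#   oc get scc
```

Filtering by name rather than by label keeps the sketch independent of how the pods happen to be labelled on a given cluster.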