author     Suren A. Chilingaryan <csa@suren.me>    2018-03-11 19:56:38 +0100
committer  Suren A. Chilingaryan <csa@suren.me>    2018-03-11 19:56:38 +0100
commit     f3c41dd13a0a86382b80d564e9de0d6b06fb1dbf (patch)
tree       3522ce77203da92bb2b6f7cfa2b0999bf6cc132c /docs/upgrade.txt
parent     6bc3a3ac71e11fb6459df715536fec373c123a97 (diff)
Various fixes before moving to hardware installation
Diffstat (limited to 'docs/upgrade.txt')
-rw-r--r--   docs/upgrade.txt   64
1 file changed, 64 insertions(+), 0 deletions(-)
diff --git a/docs/upgrade.txt b/docs/upgrade.txt
new file mode 100644
index 0000000..b4f22d6
--- /dev/null
+++ b/docs/upgrade.txt
@@ -0,0 +1,64 @@

Upgrade
-------
 - The 'upgrade' may break things, causing long cluster outages, or may even require a complete re-install.
   Currently, I have found a problem with 'kube-service-catalog', but I am not sure the problems are limited
   to it. Furthermore, we are currently using the 'latest' tag of several docker images (heketi is an example
   of a critical service on the 'latest' tag). An update may break things.

kube-service-catalog
--------------------
 - An update of 'kube-service-catalog' breaks the OpenShift health check
    curl -k https://apiserver.kube-service-catalog.svc/healthz
   It complains about 'etcd'. The specific etcd check
    curl -k https://apiserver.kube-service-catalog.svc/healthz/etcd
   reports that all servers are unreachable.

 - In fact etcd is working and the cluster is mostly functional. Occasionally, it may suffer from the bug
   described here:
    https://github.com/kubernetes/kubernetes/issues/47131
   The 'oc' queries are extremely slow and the healthz service reports that there are too many connections.
   Killing the 'kube-service-catalog/apiserver' pod helps for a while, but the problem returns occasionally.

 - The information below is an attempt to understand the reason. In fact, it is a list of things that are
   NOT the reason. The only solution found so far is to prevent the update of 'kube-service-catalog' by
   setting
    openshift_enable_service_catalog: false

 - The problem only occurs if the 'openshift_service_catalog' role is executed. It results in some
   miscommunication between 'apiserver' and/or 'controller-manager' and etcd. The cluster is still
   operational, so the connection is not completely lost, but it does not work as expected in some
   circumstances.

 - There are no significant changes. Exactly the same docker images are installed. The only change in
   '/etc' is the updated certificates used by 'apiserver' and 'controller-manager'.
   * The certificates are located in '/etc/origin/service-catalog/' on the first master server.
     'oc adm ca' is used for generation. However, the certificates in this folder are not used directly. They
     are merely temporary files used to generate 'secrets/service-catalog-ssl', which is used by
     'apiserver' and 'controller-manager'. The provisioning code is in:
      openshift-ansible/roles/openshift_service_catalog/tasks/generate_certs.yml
     It cannot be disabled completely because the registered 'apiserver_ca' variable is used in install.yml,
     but the actual generation can be skipped and the old files re-used to generate the secret.
   * I have tried to modify the role to keep the old certificates. The healthz check was still broken
     afterwards. So, this update is not the problem (or at least not the sole problem).
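   * A minimal sketch of how the certificate actually served from the secret could be compared with the
     temporary files in '/etc/origin/service-catalog/'. The 'tls.crt' key name and the 'apiserver.crt' file
     name are assumptions and should be checked against the real output first:
      # list the keys actually stored in the secret
      oc -n kube-service-catalog get secret service-catalog-ssl -o jsonpath='{.data}'; echo
      # decode the served certificate and print its subject, validity and SANs
      oc -n kube-service-catalog get secret service-catalog-ssl -o jsonpath='{.data.tls\.crt}' \
          | base64 -d | openssl x509 -noout -text | grep -E 'Subject:|Not Before|Not After|DNS:'
      # compare with the temporary files generated by the ansible role (file name is an assumption)
      ls /etc/origin/service-catalog/
      openssl x509 -noout -text -in /etc/origin/service-catalog/apiserver.crt \
          | grep -E 'Subject:|Not Before|Not After|DNS:'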
 - The 'etcd' cluster seems OK. On all nodes, etcd can be verified using
    etcdctl3 member list
   * The last command is actually a bash alias which executes
    ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints https://`hostname`:2379 member list
   etcd actually serves two ports: 2379 (clients) and 2380 (peers). One idea was that maybe the second
   port has problems. I tried changing 2379 to 2380 in the command above and it failed. However, it fails
   in the same way when the cluster is in a healthy state, so this proves nothing.
   * Another idea was that the certificates are re-generated for the wrong IPs/names and, hence, certificate
     validation fails. Or that the originally generated CA is still registered with etcd. This is certainly
     not the (only) issue, as the problem persists even if we keep the certificates intact. However, I have
     also verified that the newly generated certificates are identical to the old ones and contain the
     correct hostnames.
   * The last idea was that 'asb-etcd' is actually broken. It complains
      2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "")
     However, the same error is present in the log directly after the install, while the cluster is
     completely healthy.

 - Networking also does not seem to be an issue. The configuration during install and upgrade is exactly
   the same. All names are defined in /etc/hosts. Furthermore, the names in /etc/hosts are resolved (and
   back-resolved) by the provided dnsmasq server, i.e. ipeshift1 resolves to 192.168.13.1 using nslookup
   and 192.168.13.1 resolves back to ipeshift1. So, the configuration is indistinguishable from a proper
   setup with a properly configured DNS.
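   * A small sketch to re-check this, assuming the node names follow the 'ipeshift' pattern from the
     example above (adjust the pattern for other hosts):
      # verify forward and reverse resolution for every node listed in /etc/hosts
      for name in $(awk '/ipeshift/ {print $2}' /etc/hosts); do
          ip=$(dig +short "$name")
          rev=$(dig +short -x "$ip")
          echo "$name -> $ip -> ${rev%.}"
      done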