From 0b0b9954c2d0602b1e9d0a387d2a195a790f8084 Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan"
Date: Thu, 22 Mar 2018 04:37:46 +0100
Subject: Various fixes and provide ADEI admin container...

---
 docs/databases.txt |  14 ++++++-
 docs/info.txt      |  31 --------------
 docs/kickstart.txt |  13 ++++++
 docs/status.txt    | 119 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 144 insertions(+), 33 deletions(-)
 delete mode 100644 docs/info.txt
 create mode 100644 docs/kickstart.txt
 create mode 100644 docs/status.txt

(limited to 'docs')

diff --git a/docs/databases.txt b/docs/databases.txt
index 331313b..7f8468e 100644
--- a/docs/databases.txt
+++ b/docs/databases.txt
@@ -8,7 +8,7 @@
  Gluster/Block   MyISAM (no logs)  5 MB/s    slow, but OK  200%      ~ 50%  No problems on reboot, but requires manual work if node crashes to detach volume.
  Galera          INNODB            3.5 MB/s  fast          3 x 200%  -      Should be perfect, but I am not sure about automatic recovery...
  Galera/Hostnet  INNODB            4.6 MB/s  fast          3 x 200%  -      -
- MySQL Slaves    INNODB            5-8 MB/s  fast          2 x 250%  -      Available data is HA, but caching is not. We can easily turn the slave to master.
+ MySQL Slaves    INNODB            5-6 MB/s  fast          2 x 250%  -      Available data is HA, but caching is not. We can easily turn the slave to master.
  DRBD            MyISAM (no logs)  4-6 exp.  ?             I expect it as an faster option, but does not fit the OpenShift concept that well.
@@ -150,5 +150,15 @@ Master/Slave replication
     slave side. Network is not a problem, it is able to get logs from the master, but it is significantly
     slower in applying it. The main performance killer is disk sync operations triggered by 'sync_binlog',
     INNODB log flashing, etc. Disabling it allows to bring performance on reasonable level. Still,
-    the master is caching at about 6-8 MB/s and slave at 4-5 MB/s only.
+    the master is caching at about 6-8 MB/s and slave at 4-5 MB/s only (sometimes drops below 2 MB/s).
+
+  - The trouble, I think, is that the slave performs a lot of disk writes to 'mysql-relay-bin.*' and 'mysql-bin.*'.
+    Taken all together, we get ~ 18 MB/s. The solution is to disable binary logging on the slave
+    side. We need the 'relay' log to perform replication, but the binary log on the slave will only be needed
+    if another slave chain-replicates from it. However, it is better to disable only the logging of
+    data replicated from the master by disabling 'log_slave_updates'. Then, if the slave is converted to master,
+    it will automatically start logging.
+
+
+
\ No newline at end of file
diff --git a/docs/info.txt b/docs/info.txt
deleted file mode 100644
index ea00f58..0000000
--- a/docs/info.txt
+++ /dev/null
@@ -1,31 +0,0 @@
-oc -n adei patch dc/mysql --type=json --patch '[{"op": "remove", "path": "/spec/template/spec/nodeSelector"}]'
-oc process -f mysql.yml | oc -n adei replace dc/mysql -f -
-oc -n adei delete --force --grace-period=0 pod mysql-1-m4wcq
-We use rpcbind from the host.
-we need isciinitiators, rpcbind is used for host but check with telnet. The mother volumes are provisioned 100GiB large. So we can't allocate more.
-
-We can use rpcbind (and other services) from the host. Host networking.
-oc -n adei delete --force --grace-period=0 pod mysql-1-m4wcq
-| grep -oP '^GBID:\s*\K.*'
-
-Top level (nodeSelector restarPolciy SecurityContext)
- dnsPolicy: ClusterFirstWithHostNet
- dnsPolicy: ClusterFirst
- hostNetwork: true
-oc -n kaas adm policy add-scc-to-user hostnetwork -z default
-Check (in users list)
-oc get scc hostnetwork -o yaml
-firewall-cmd --add-port=5002/tcp
-
- OnDelete: This is the default update strategy for backward-compatibility. With OnDelete update strategy, after you update a DaemonSet template, new DaemonSet pods will only be created when you manually delete old DaemonSet pods. This is the same behavior of DaemonSet in Kubernetes version 1.5 or before.
- RollingUpdate: With RollingUpdate update strategy, after you update a DaemonSet template, old DaemonSet pods will be killed, and new DaemonSet pods will be created automatically, in a controlled fashion.
-
-Caveat: Updating DaemonSet created from Kubernetes version 1.5 or before
-.spec.updateStrategy.rollingUpdate.maxUnavailable (default to 1) and .spec.minReadySeconds
-
-
-
- “Default”: The Pod inherits the name resolution configuration from the node that the pods run on. See related discussion for more details.
- “ClusterFirst”: Any DNS query that does not match the configured cluster domain suffix, such as “www.kubernetes.io”, is forwarded to the upstream nameserver inherited from the node. Cluster administrators may have extra stub-domain and upstream DNS servers configured. See related discussion for details on how DNS queries are handled in those cases.
- “ClusterFirstWithHostNet”: For Pods running with hostNetwork, you should explicitly set its DNS policy “ClusterFirstWithHostNet”.
- “None”: A new option value introduced in Kubernetes v1.9. This Alpha feature allows a Pod to ignore DNS settings from the Kubernetes environment. All DNS settings are supposed to be provided using the dnsConfig field in the Pod Spec. See DNS config subsection below.
diff --git a/docs/kickstart.txt b/docs/kickstart.txt
new file mode 100644
index 0000000..fb2b5da
--- /dev/null
+++ b/docs/kickstart.txt
@@ -0,0 +1,13 @@
+Troubleshooting
+===============
+ - After a re-install, there are some leftovers from LVM/Device Mapper that cause the CentOS installer
+   to crash (even if 'clearpart' is specified). After the first crash, it may be useful to:
+    * Clean partition tables with
+        dd if=/dev/zero of=/dev/ bs=512 count=1024
+    * Destroy rogue LVM VGs
+        vgremove
+    * Destroy rogue device mapper devices
+        dmsetup info -C
+        dmsetup remove
+
+
\ No newline at end of file
diff --git a/docs/status.txt b/docs/status.txt
new file mode 100644
index 0000000..681c953
--- /dev/null
+++ b/docs/status.txt
@@ -0,0 +1,119 @@
+ OpenShift cluster is up and running. I don't plan to re-install it unless new severe problems turn up for Bora/KatrinDB.
+It seems the system is fine for ADEI. You can take a look at http://adei-katrin.kaas.kit.edu
+
+So, what we have:
+ - Automatic kickstart of the servers. This is normally done over DHCP. Since I don't have direct control over DHCP,
+   I made a simple system to kickstart over IPMI. Scripts instruct servers to boot from Virtual CD and fetch the kickstart
+   from the web server (ufo.kit.edu actually). The kickstart is generated by a PHP script based on the server name and/or
+   DHCP address.
+ - Ansible-based playbooks to configure the complete cluster. Kickstart produces minimal systems with an ssh server up
+   and running. Here, I have the scripts to build the complete cluster, including databases and the ADEI installation. There
+   are also playbooks for maintenance tasks like adding new nodes, etc.
+ - Upgrades do not always (or actually rarely) run smoothly. To test new configurations before applying them
+   to the production system, I also support provisioning of the staging cluster in Vagrant-controlled virtual machines.
+   This is currently running on ipepdvcompute3.
+ - Replicated GlusterFS storage and some security provisions to prevent containers in one project from destroying data
+   belonging to another. A selected subset of data can be made available over NFS to external hosts, but I'd
+   rather not overuse this possibility.
+ - To simplify creating containers with complex storage requirements (like ADEI), there are also Ansible scripts
+   to generate OpenShift templates based on the configured storage and the provided container specification.
+ - To ensure data integrity, the database engines do a lot of locking, syncing, and small writes. This does not
+   play well with network file systems like GlusterFS. It is possible to tune database parameters a bit and run
+   databases with a low intensity of writes, but this is unsuitable for large and write-intensive workloads. The
+   alternative is to store data directly on local storage and use the replication engine of the database itself to ensure
+   high availability. I have prepared containers to quickly bootstrap two options: Master-Master replication with
+   Galera and standard MySQL Master-Slave replication. Master/Slave replication is asynchronous and, because of
+   that, significantly faster, so I use it as a good compromise. It will take about 2 weeks to re-cache all Katrin
+   data. Quite long, but it is even longer with the other options. If the master server crashes, the users will still have
+   access to all the historical archives and will be able to proxy data requests to the source database (i.e.
+   BORA will also work). And there is no need to re-cache everything, as the slave can easily be converted to master.
+   The Master/Master replication is about 50% slower, but can still be used for smaller databases if uninterrupted
+   writes are also crucial.
+ - Distributed ADEI. All setups are now completely independent (use different databases, can be stopped and started
+   independently, etc.). Each setup consists of 3 main components:
+   1) Frontends: There are 3 frontends: production, debugging, and for logs. They are accessible individually,
+      like:
+        http://adei-katrin.kaas.kit.edu
+        http://adei-katrin-debug.kaas.kit.edu
+        http://adei-katrin-logs.kaas.kit.edu
+      * The production frontend can be scaled to run several replicas.
+        This is not required for performance,
+        but if 2 copies are running, there will be no service interruption if one of the servers crashes. Otherwise,
+        there is an outage of a dozen minutes until OpenShift detects that the node is gone for good.
+   2) Maintenance: There are cron-style containers performing various maintenance tasks. In particular, they
+      analyze the current data source configuration and schedule the caching.
+   3) Cachers: The caching is performed by 3 groups of containers. One is responsible for current data,
+      the second for archives, and the third for logs. Each group can be scaled independently. E.g., in the beginning
+      I run multiple archive-caching replicas to get the data in. Then, the focus is shifted to getting current data
+      faster. It also may differ significantly between setups. Katrin will run multiple caching replicas, but less
+      important or small data sources will get only one.
+      * This architecture also allows removing the 1-minute update latency. I am not sure we can be very fast with
+        the large Katrin setup on the current minimalistic cluster, but technically the updates can be as fast as the hardware
+        allows.
+ - There is an OpenShift template to instantiate all these containers in one go by providing a few parameters. The
+   only requirement is to have the 'setup' configuration in my repository. I also included in the ADEI sources a bunch of
+   scripts to start all known setups with preconfigured parameters and to perform various maintenance tasks. Like,
+      ./provision.sh -s katrin      - to launch the katrin setup on the cluster
+      ./stop-caching.sh -s katrin   - to temporarily stop caching
+      ./start-caching.sh -s katrin  - to restart caching with the pre-configured number of replicas
+      ...
+ - Load-balancing and high-availability using 2 ping-pong IPs. By default, the katrin1 & katrin2 IPs are assigned to
+   the two masters of the cluster. To balance load, kaas.kit.edu is resolved in round-robin fashion to one of them.
+   If one master fails, its IP will migrate to the remaining master and no service interruption will occur. Both masters
+   run OpenVPN tunnels to the Katrin network. The remaining node is routed through one of the masters. This configuration
+   is also highly available and should not suffer if one of the masters crashes.
+
+What we are still missing:
+ - Katrin database. Marco has prepared containers using the prototype I ran last year. Hopefully, Jan can run it on
+   the new system with a minimal number of problems.
+ - BORA still needs to be moved.
+ - Then, I will decommission the old Katrin server.
+
+Fault-tolerance and high-availability
+=====================================
+ - I have tested a bit for fault tolerance and recoverability. Both GlusterFS and OpenShift work fine if a single
+   node fails. All data is available and new data can be written without problems. There is also no service
+   interruption, as ADEI runs two frontends and also includes a backup MySQL server. Only caching may stop if
+   the master MySQL server is hit.
+ - If the node recovers, it will be re-integrated automatically. We may only need to manually convert the MySQL slave
+   to master. Adding replacement nodes is also quite easy using the provided provisioning playbooks. But the
+   Gluster bricks need to be migrated manually. I also provide some scripts to simplify this task.
+ - The situation is worse if the cluster is completely turned off and on again. Storage survives quite well, but
+   it is necessary to check that all volumes are fully healthy (sometimes a volume loses some bricks and needs to be
+   restarted to reconnect them). Also, some pods running before the reboot may fail to start. Overall, it is better
+   to avoid this. If a reboot is needed for some reason, the best approach is to perform a rolling reboot, restarting
+   one node after another to keep the cluster always alive.
+
+Performance
+===========
+ I'd say it is more-or-less on par with the old system (which is expected).
+ The computing capabilities are
+ considerably faster (still, there is a significant load on the master servers to run the cluster), but network and
+ storage have more-or-less the same speed.
+ - In fact, we have only a single storage node. The second is used for replication and the third node is required
+   for arbitration in the split-brain case. I will be able to use the third node for storage as well, but I need at least
+   a 4th node in the cluster to do it. The new drives are slightly faster, but the added replication slows down
+   performance considerably.
+ - The containers can't use the Infiniband network efficiently. Bridging is used to allow fast networking
+   in containers. Unfortunately, IPoIB is a high-level network layer and does not provide the Ethernet L2 support
+   required to create the bridge. Consequently, there is a lot of packet re-building going on and the network
+   performance is capped at about 3 GBit/s for containers. It is not really a big problem now, as the host systems
+   are not limited. So the storage is able to use all the bandwidth.
+
+Upgrades
+========
+ We may need to scale the cluster slightly if it is to be used beyond Katrin needs or with a significant
+ increase of load by Katrin. Having 1-2 more nodes should be helpful to the storage system. It also may be worth
+ adding a 40Gbit Ethernet switch. The Mellanox cards work in both Ethernet and Infiniband modes. Their switch
+ actually does too, but they want 5 kEUR for the license to enable this feature. I guess for this money we
+ should be able to buy a new piece of hardware.
+
+ Having more storage nodes, we can also prototype new data management solutions without disturbing Katrin
+ tasks. One idea would be to run Apache Cassandra and try to re-use the data broker developed by the university
+ group to get ADEI data into their cluster. Then, we can add an analysis framework on top of ADEI using
+ Jupyter notebooks with Cassandra. Furthermore, there is an Alpha version of NVIDIA GPU support for
+ OpenShift.
+ So, we can also try to integrate some computing workload and potentially run WAVe inside as well.
+ I'd not do it on the production system, but if we get new nodes we may first try to set up a second OpenShift
+ cluster for testing (pair of storage nodes + GPU node), and later re-integrate it with the main one.
+
+ As running without shutdowns is pretty crucial, another question is whether we can put the servers somewhere
+ at SCC with reliable power supplies, air conditioning, etc. I guess we can't expect to run without shutdowns
+ in our server room.
-- 
cgit v1.2.3