Diffstat (limited to 'docs')
-rw-r--r--   docs/README                                                  91
-rw-r--r--   docs/benchmarks/netpipe-hostnet-connect2clusterip.txt (renamed from docs/benchmarks/netpipe-hostnet-clusterip.txt)    0
-rw-r--r--   docs/benchmarks/netpipe-hostnet-connect2hostip.txt (renamed from docs/benchmarks/netpipe-hostnet-hostip.txt)          0
-rw-r--r--   docs/benchmarks/netpipe-pod2host-ownhost.txt                124
-rw-r--r--   docs/benchmarks/netpipe-pod2host.txt                        124
-rw-r--r--   docs/benchmarks/netpipe-pod2pod-clusternet2hostnet.txt      124
-rw-r--r--   docs/configs.txt                                              3
-rw-r--r--   docs/databases.txt                                           62
-rw-r--r--   docs/info.txt                                                 31
-rw-r--r--   docs/infrastructure.txt                                     110
-rw-r--r--   docs/links.txt                                                16
-rw-r--r--   docs/managment.txt                                             8
-rw-r--r--   docs/network.txt                                               9
-rw-r--r--   docs/performance.txt                                          54
-rw-r--r--   docs/vagrant.txt                                               4
15 files changed, 742 insertions, 18 deletions
diff --git a/docs/README b/docs/README
new file mode 100644
index 0000000..4f75b5b
--- /dev/null
+++ b/docs/README
@@ -0,0 +1,91 @@
+OpenShift Platform
+------------------
+The OpenShift web frontend is running at
+ https://kaas.kit.edu:8443
+
+However, I find it simpler to use the command-line tool 'oc':
+ - On RedHat platforms the package is called 'origin-clients' and is installed from the
+   OpenShift repository, which is available via the 'centos-release-openshift-origin' package.
+ - For other distributions check here (we are running version 3.7):
+   https://docs.openshift.org/latest/cli_reference/get_started_cli.html#installing-the-cli
+
+The following is also a good starting point for using it:
+ https://docs.openshift.com/container-platform/3.7/dev_guide/index.html
+
+Infrastructure
+--------------
+ - We have 3 servers named ipekatrin[1-3].ipe.kit.edu. These are internal names. External
+ access is provided using 2 virtual ping-pong IPs, katrin[1-2].ipe.kit.edu. By default they are assigned
+ to the two master servers of the cluster, but both will migrate to the single surviving server if one of the
+ masters dies. This is handled by the keepalived daemon and ensures load balancing and high availability.
+ The domain name 'kaas.kit.edu' resolves to both IPs in round-robin fashion.
+
+ - By default, deployed services get names in the form '<service-name>.kaas.kit.edu'. For instance,
+ you can test the following (see the example below):
+      adei-katrin.kaas.kit.edu  - An ADEI service running on the new platform
+      adas-autogen.kaas.kit.edu - Sample ADEI with generated data
+      katrin.kaas.kit.edu       - Placeholder for the future KATRIN router
+      etc.
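+   For example, a quick reachability check from any machine (hostname taken from the list above):
+      curl -k -I https://adei-katrin.kaas.kit.edu/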
+
+ - The OpenVPN connection to the KATRIN virtual network runs on the master servers. Non-masters route the traffic
+ through the masters using the keepalived IP. So, the KATRIN network should be transparently visible from any pod in
+ the cluster.
+
+Users
+-----
+ I have configured a few user accounts using ADEI and UFO passwords. Furthermore, to avoid a mess of
+containers, I have created a number of projects with appropriate administrators.
+ kaas (csa, kopmann) - This is a routing service (basically Apache mod_rewrite) to set redirects from http://katrin.kit.edu/*
+ katrin (katrin) - Katrin database
+ adei (csa) - All ADEI setups
+ bora (ntj) - BORA
+ web (kopmann) - Various web sites, etc.
+ mon (csa) - Monitoring
+ test (*) - Project for testing
+
+If needed, I can create more projects/users. Just let me know.
+
+Storage
+-------
+ I have created a couple of gluster volumes for different purposes:
+    katrin_data - For KATRIN data files
+    datastore   - Other non-KATRIN large data files
+    openshift   - 3-times-replicated volume for configuration, sources, and other important small files
+    temporary   - Logs, temporary files, etc.
+
+ Again, to avoid mixing data from the different projects, each volume has subfolders for all projects. Furthermore,
+ I have tried to add a bit of protection and assigned each project a range of group ids. The subfolders can only be read
+ by the appropriate group. I also pre-created corresponding PersistentVolumes (pv) and PersistentVolumeClaims (pvc): 'katrin', 'data', ... (see the sketch below)
+
+ There is a special pvc called 'host'. This is for saving data on the local RAID array, bypassing gluster (i.e. on each OpenShift node
+ the content of the folder will be different).
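+ As an illustration, a minimal pod mounting one of these claims could look like this (pod and volume names are hypothetical;
+ only the 'katrin' claim is taken from the list above):
+   apiVersion: v1
+   kind: Pod
+   metadata:
+     name: storage-demo                  # hypothetical name
+   spec:
+     containers:
+     - name: shell
+       image: centos:7
+       command: [ "sleep", "infinity" ]
+       volumeMounts:
+       - name: katrin-vol
+         mountPath: /mnt/katrin
+     volumes:
+     - name: katrin-vol
+       persistentVolumeClaim:
+         claimName: katrin               # one of the pre-created pvc's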
+
+ WARNING: Gluster supports dynamic provisioning using Heketi. It is installed and works. However, heketi is far from
+ production quality. I think it is OK to use it for some temporary data if you want, but I would suggest using pre-created
+ volumes for important data.
+
+ - Currently, I don't plan to provide access to the servers themselves. The storage should be managed solely from OpenShift pods.
+ I made a sample 'manager' pod equipped with scp, lftp, curl, etc. It mounts all default storage. You need to start it; then
+ you can connect interactively using either the web interface or the console app.
+ oc -n katrin scale dc/kaas-manager --replicas 1
+ oc -n katrin rsh dc/kaas-manager
+ This is just an example; build your own configuration with the required set of packages.
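+ For instance, data can be copied into the mounted volumes with 'oc rsync' (the pod name below is hypothetical, look it up
+ first; the target path is an assumption):
+     oc -n katrin get pods
+     oc -n katrin rsync ./mydata/ kaas-manager-1-abcde:/mnt/katrin/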
+
+Databases
+---------
+ Gluster works fine if you mostly read data or if you perform mostly sequential writes. It behaves very badly with 'databases' and similar
+ loads. I guess this should not be an issue for the Katrin database as it is relatively small (AFAIK) and does not perform many writes. For something
+ like ADEI, gluster is not a viable option to back the MySQL server. There are several options to handle volumes for applications performing a
+ large amount of small random writes:
+  - If High Availability (HA) is not important, just pin a pod to a certain node and use the 'host' pvc (see the sketch after this list).
+  - For databases, Master/Slave replication can be enabled (you will still need to pin the node and use the 'host' pvc). Alternatively, a Galera cluster
+  can be installed for multi-master replication. It is configured using the StatefulSet feature of OpenShift. I have not tested recovery thoroughly,
+  but it is working, quite performant, and the masters are synchronized without problems.
+  - For non-database applications, Gluster block storage may be used. The block storage is not shared between multiple pods, but is private
+  to a specific pod. So, it is possible to avoid a certain amount of locking and context switches, and performance is significantly better. I was
+  even able to run the ADEI database on top of such a device, though it is still significantly slower than native host performance. There is again a
+  heketi-based provisioner, but it works even worse than the one provisioning standard Gluster volumes. So, please ask me to create block
+  devices manually if necessary.
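+ A minimal sketch of such pinning inside a pod template (the node name is from the cluster above; image and the rest are hypothetical):
+   spec:
+     nodeSelector:
+       kubernetes.io/hostname: ipekatrin1.ipe.kit.edu
+     containers:
+     - name: mysql
+       image: centos/mysql-57-centos7      # example image
+       volumeMounts:
+       - name: db
+         mountPath: /var/lib/mysql/data
+     volumes:
+     - name: db
+       persistentVolumeClaim:
+         claimName: host                   # the node-local pvc described above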
+
+ Overall, if you have a data-intensive workload, we can discuss the best approach.
+ \ No newline at end of file
diff --git a/docs/benchmarks/netpipe-hostnet-clusterip.txt b/docs/benchmarks/netpipe-hostnet-connect2clusterip.txt
index 452a59b..452a59b 100644
--- a/docs/benchmarks/netpipe-hostnet-clusterip.txt
+++ b/docs/benchmarks/netpipe-hostnet-connect2clusterip.txt
diff --git a/docs/benchmarks/netpipe-hostnet-hostip.txt b/docs/benchmarks/netpipe-hostnet-connect2hostip.txt
index 494289d..494289d 100644
--- a/docs/benchmarks/netpipe-hostnet-hostip.txt
+++ b/docs/benchmarks/netpipe-hostnet-connect2hostip.txt
diff --git a/docs/benchmarks/netpipe-pod2host-ownhost.txt b/docs/benchmarks/netpipe-pod2host-ownhost.txt
new file mode 100644
index 0000000..d49e340
--- /dev/null
+++ b/docs/benchmarks/netpipe-pod2host-ownhost.txt
@@ -0,0 +1,124 @@
+ 1 0.657660 0.00001160
+ 2 1.261032 0.00001210
+ 3 1.993782 0.00001148
+ 4 2.558493 0.00001193
+ 6 3.738559 0.00001224
+ 8 5.187374 0.00001177
+ 12 7.518725 0.00001218
+ 13 8.196018 0.00001210
+ 16 10.198276 0.00001197
+ 19 12.731722 0.00001139
+ 21 13.796771 0.00001161
+ 24 15.639661 0.00001171
+ 27 16.674961 0.00001235
+ 29 18.053967 0.00001226
+ 32 19.608304 0.00001245
+ 35 21.442525 0.00001245
+ 45 28.959251 0.00001186
+ 48 32.887048 0.00001114
+ 51 32.061613 0.00001214
+ 61 36.213737 0.00001285
+ 64 41.694947 0.00001171
+ 67 43.056353 0.00001187
+ 93 57.911407 0.00001225
+ 96 59.014557 0.00001241
+ 99 61.014493 0.00001238
+ 125 76.480392 0.00001247
+ 128 78.037922 0.00001251
+ 131 83.034218 0.00001204
+ 189 119.373976 0.00001208
+ 192 124.676617 0.00001175
+ 195 122.816000 0.00001211
+ 253 153.867359 0.00001254
+ 256 156.339712 0.00001249
+ 259 160.935138 0.00001228
+ 381 236.756142 0.00001228
+ 384 233.925416 0.00001252
+ 387 260.149962 0.00001135
+ 509 323.528140 0.00001200
+ 512 307.977445 0.00001268
+ 515 313.039822 0.00001255
+ 765 481.796856 0.00001211
+ 768 481.708998 0.00001216
+ 771 486.042697 0.00001210
+ 1021 645.735673 0.00001206
+ 1024 633.657979 0.00001233
+ 1027 636.196119 0.00001232
+ 1533 938.614280 0.00001246
+ 1536 869.867765 0.00001347
+ 1539 930.918606 0.00001261
+ 2045 802.337366 0.00001945
+ 2048 807.090888 0.00001936
+ 2051 772.193892 0.00002026
+ 3069 1170.196266 0.00002001
+ 3072 1219.905239 0.00001921
+ 3075 1174.839492 0.00001997
+ 4093 1464.200824 0.00002133
+ 4096 1566.409830 0.00001995
+ 4099 1512.849103 0.00002067
+ 6141 2286.945465 0.00002049
+ 6144 2288.367849 0.00002048
+ 6147 2207.269593 0.00002125
+ 8189 2923.440347 0.00002137
+ 8192 2941.778132 0.00002125
+ 8195 2904.909553 0.00002152
+ 12285 4165.044677 0.00002250
+ 12288 4039.677896 0.00002321
+ 12291 4252.381651 0.00002205
+ 16381 5656.041761 0.00002210
+ 16384 5614.844855 0.00002226
+ 16387 5382.844899 0.00002323
+ 24573 7719.968664 0.00002428
+ 24576 7414.582998 0.00002529
+ 24579 7817.458860 0.00002399
+ 32765 9775.709423 0.00002557
+ 32768 9442.714388 0.00002648
+ 32771 9770.830694 0.00002559
+ 49149 12826.142657 0.00002924
+ 49152 12626.640048 0.00002970
+ 49155 12477.873858 0.00003006
+ 65533 10571.284885 0.00004730
+ 65536 10570.211691 0.00004730
+ 65539 10084.322046 0.00004958
+ 98301 14402.304952 0.00005207
+ 98304 14642.413170 0.00005122
+ 98307 13935.428925 0.00005382
+ 131069 15142.825700 0.00006604
+ 131072 15790.566346 0.00006333
+ 131075 15281.133509 0.00006544
+ 196605 16728.089456 0.00008967
+ 196608 17013.589640 0.00008816
+ 196611 16828.700106 0.00008913
+ 262141 18961.512211 0.00010548
+ 262144 18106.104774 0.00011046
+ 262147 18399.271190 0.00010870
+ 393213 20323.824073 0.00014761
+ 393216 20333.341617 0.00014754
+ 393219 20240.703379 0.00014822
+ 524285 21133.676444 0.00018927
+ 524288 21646.010507 0.00018479
+ 524291 20236.384690 0.00019766
+ 786429 22905.368103 0.00026195
+ 786432 24588.762530 0.00024401
+ 786435 23509.571645 0.00025522
+ 1048573 24178.948658 0.00033087
+ 1048576 23115.860503 0.00034608
+ 1048579 24086.879088 0.00033213
+ 1572861 24694.199111 0.00048594
+ 1572864 26163.572804 0.00045865
+ 1572867 26574.262072 0.00045157
+ 2097149 25834.680617 0.00061932
+ 2097152 27875.424112 0.00057398
+ 2097155 28140.234242 0.00056858
+ 3145725 28596.052642 0.00083928
+ 3145728 28682.984164 0.00083673
+ 3145731 23866.770612 0.00100558
+ 4194301 26493.728895 0.00120783
+ 4194304 27059.094386 0.00118260
+ 4194307 24088.422769 0.00132844
+ 6291453 24962.945897 0.00192285
+ 6291456 27125.867373 0.00176953
+ 6291459 26348.212014 0.00182176
+ 8388605 25510.943520 0.00250873
+ 8388608 24568.383940 0.00260497
+ 8388611 26392.930637 0.00242489
diff --git a/docs/benchmarks/netpipe-pod2host.txt b/docs/benchmarks/netpipe-pod2host.txt
new file mode 100644
index 0000000..4d18f48
--- /dev/null
+++ b/docs/benchmarks/netpipe-pod2host.txt
@@ -0,0 +1,124 @@
+ 1 0.420894 0.00001813
+ 2 0.796803 0.00001915
+ 3 1.211204 0.00001890
+ 4 1.649525 0.00001850
+ 6 2.424251 0.00001888
+ 8 3.125162 0.00001953
+ 12 4.826660 0.00001897
+ 13 5.124437 0.00001935
+ 16 6.351908 0.00001922
+ 19 8.006427 0.00001811
+ 21 8.302512 0.00001930
+ 24 9.368963 0.00001954
+ 27 10.343990 0.00001991
+ 29 11.566140 0.00001913
+ 32 12.787361 0.00001909
+ 35 14.036485 0.00001902
+ 45 17.719754 0.00001938
+ 48 19.013013 0.00001926
+ 51 20.357754 0.00001911
+ 61 23.945906 0.00001944
+ 64 24.833939 0.00001966
+ 67 27.664993 0.00001848
+ 93 35.644683 0.00001991
+ 96 37.158695 0.00001971
+ 99 39.587251 0.00001908
+ 125 51.079576 0.00001867
+ 128 50.130411 0.00001948
+ 131 51.457388 0.00001942
+ 189 75.380130 0.00001913
+ 192 76.342438 0.00001919
+ 195 79.420945 0.00001873
+ 253 105.186533 0.00001835
+ 256 106.372298 0.00001836
+ 259 106.029274 0.00001864
+ 381 151.168028 0.00001923
+ 384 152.753542 0.00001918
+ 387 159.197729 0.00001855
+ 509 207.619883 0.00001870
+ 512 208.712379 0.00001872
+ 515 209.517685 0.00001875
+ 765 314.054051 0.00001858
+ 768 299.822502 0.00001954
+ 771 287.433917 0.00002046
+ 1021 397.548577 0.00001959
+ 1024 408.368406 0.00001913
+ 1027 420.775950 0.00001862
+ 1533 583.916264 0.00002003
+ 1536 572.817784 0.00002046
+ 1539 580.157425 0.00002024
+ 2045 499.904926 0.00003121
+ 2048 507.540741 0.00003079
+ 2051 524.322916 0.00002984
+ 3069 770.975516 0.00003037
+ 3072 746.563147 0.00003139
+ 3075 789.196027 0.00002973
+ 4093 1005.896826 0.00003104
+ 4096 1001.910613 0.00003119
+ 4099 1066.934103 0.00002931
+ 6141 1354.959563 0.00003458
+ 6144 1390.924953 0.00003370
+ 6147 1469.109813 0.00003192
+ 8189 1849.608991 0.00003378
+ 8192 1839.807660 0.00003397
+ 8195 1856.771767 0.00003367
+ 12285 2577.255660 0.00003637
+ 12288 2559.043820 0.00003663
+ 12291 2595.115904 0.00003613
+ 16381 3310.384291 0.00003775
+ 16384 3202.585996 0.00003903
+ 16387 3410.545389 0.00003666
+ 24573 4461.016945 0.00004203
+ 24576 4183.724225 0.00004482
+ 24579 4243.889480 0.00004419
+ 32765 5288.958972 0.00004726
+ 32768 5328.798686 0.00004691
+ 32771 5277.353091 0.00004738
+ 49149 6339.504613 0.00005915
+ 49152 6402.924842 0.00005857
+ 49155 6480.738141 0.00005787
+ 65533 4709.518059 0.00010616
+ 65536 4613.364349 0.00010838
+ 65539 4932.498325 0.00010137
+ 98301 6066.768938 0.00012362
+ 98304 5998.359888 0.00012503
+ 98307 6098.265480 0.00012299
+ 131069 6117.938429 0.00016345
+ 131072 6324.061201 0.00015813
+ 131075 6285.324621 0.00015910
+ 196605 6557.328589 0.00022875
+ 196608 8022.864625 0.00018697
+ 196611 8524.213528 0.00017597
+ 262141 8846.468887 0.00022608
+ 262144 8678.411984 0.00023046
+ 262147 8237.604968 0.00024279
+ 393213 11383.947500 0.00026353
+ 393216 11671.364535 0.00025704
+ 393219 12134.274110 0.00024724
+ 524285 10564.415738 0.00037863
+ 524288 10541.035553 0.00037947
+ 524291 12139.945493 0.00032949
+ 786429 13031.143983 0.00046043
+ 786432 13255.902187 0.00045263
+ 786435 13528.481196 0.00044351
+ 1048573 12102.584918 0.00066101
+ 1048576 11365.676465 0.00070387
+ 1048579 13355.335488 0.00059901
+ 1572861 11314.688623 0.00106057
+ 1572864 14604.826569 0.00082165
+ 1572867 11649.668141 0.00103007
+ 2097149 8779.830027 0.00182236
+ 2097152 12092.373128 0.00132315
+ 2097155 10640.598403 0.00150368
+ 3145725 11399.940287 0.00210527
+ 3145728 11327.829590 0.00211868
+ 3145731 12330.448131 0.00194640
+ 4194301 10110.197684 0.00316512
+ 4194304 1853.739580 0.01726240
+ 4194307 8969.381449 0.00356770
+ 6291453 10336.475983 0.00464375
+ 6291456 10847.818034 0.00442485
+ 6291459 12336.471285 0.00389090
+ 8388605 13628.581728 0.00469601
+ 8388608 10168.895623 0.00629370
+ 8388611 13799.286656 0.00463792
diff --git a/docs/benchmarks/netpipe-pod2pod-clusternet2hostnet.txt b/docs/benchmarks/netpipe-pod2pod-clusternet2hostnet.txt
new file mode 100644
index 0000000..4d18f48
--- /dev/null
+++ b/docs/benchmarks/netpipe-pod2pod-clusternet2hostnet.txt
@@ -0,0 +1,124 @@
+ 1 0.420894 0.00001813
+ 2 0.796803 0.00001915
+ 3 1.211204 0.00001890
+ 4 1.649525 0.00001850
+ 6 2.424251 0.00001888
+ 8 3.125162 0.00001953
+ 12 4.826660 0.00001897
+ 13 5.124437 0.00001935
+ 16 6.351908 0.00001922
+ 19 8.006427 0.00001811
+ 21 8.302512 0.00001930
+ 24 9.368963 0.00001954
+ 27 10.343990 0.00001991
+ 29 11.566140 0.00001913
+ 32 12.787361 0.00001909
+ 35 14.036485 0.00001902
+ 45 17.719754 0.00001938
+ 48 19.013013 0.00001926
+ 51 20.357754 0.00001911
+ 61 23.945906 0.00001944
+ 64 24.833939 0.00001966
+ 67 27.664993 0.00001848
+ 93 35.644683 0.00001991
+ 96 37.158695 0.00001971
+ 99 39.587251 0.00001908
+ 125 51.079576 0.00001867
+ 128 50.130411 0.00001948
+ 131 51.457388 0.00001942
+ 189 75.380130 0.00001913
+ 192 76.342438 0.00001919
+ 195 79.420945 0.00001873
+ 253 105.186533 0.00001835
+ 256 106.372298 0.00001836
+ 259 106.029274 0.00001864
+ 381 151.168028 0.00001923
+ 384 152.753542 0.00001918
+ 387 159.197729 0.00001855
+ 509 207.619883 0.00001870
+ 512 208.712379 0.00001872
+ 515 209.517685 0.00001875
+ 765 314.054051 0.00001858
+ 768 299.822502 0.00001954
+ 771 287.433917 0.00002046
+ 1021 397.548577 0.00001959
+ 1024 408.368406 0.00001913
+ 1027 420.775950 0.00001862
+ 1533 583.916264 0.00002003
+ 1536 572.817784 0.00002046
+ 1539 580.157425 0.00002024
+ 2045 499.904926 0.00003121
+ 2048 507.540741 0.00003079
+ 2051 524.322916 0.00002984
+ 3069 770.975516 0.00003037
+ 3072 746.563147 0.00003139
+ 3075 789.196027 0.00002973
+ 4093 1005.896826 0.00003104
+ 4096 1001.910613 0.00003119
+ 4099 1066.934103 0.00002931
+ 6141 1354.959563 0.00003458
+ 6144 1390.924953 0.00003370
+ 6147 1469.109813 0.00003192
+ 8189 1849.608991 0.00003378
+ 8192 1839.807660 0.00003397
+ 8195 1856.771767 0.00003367
+ 12285 2577.255660 0.00003637
+ 12288 2559.043820 0.00003663
+ 12291 2595.115904 0.00003613
+ 16381 3310.384291 0.00003775
+ 16384 3202.585996 0.00003903
+ 16387 3410.545389 0.00003666
+ 24573 4461.016945 0.00004203
+ 24576 4183.724225 0.00004482
+ 24579 4243.889480 0.00004419
+ 32765 5288.958972 0.00004726
+ 32768 5328.798686 0.00004691
+ 32771 5277.353091 0.00004738
+ 49149 6339.504613 0.00005915
+ 49152 6402.924842 0.00005857
+ 49155 6480.738141 0.00005787
+ 65533 4709.518059 0.00010616
+ 65536 4613.364349 0.00010838
+ 65539 4932.498325 0.00010137
+ 98301 6066.768938 0.00012362
+ 98304 5998.359888 0.00012503
+ 98307 6098.265480 0.00012299
+ 131069 6117.938429 0.00016345
+ 131072 6324.061201 0.00015813
+ 131075 6285.324621 0.00015910
+ 196605 6557.328589 0.00022875
+ 196608 8022.864625 0.00018697
+ 196611 8524.213528 0.00017597
+ 262141 8846.468887 0.00022608
+ 262144 8678.411984 0.00023046
+ 262147 8237.604968 0.00024279
+ 393213 11383.947500 0.00026353
+ 393216 11671.364535 0.00025704
+ 393219 12134.274110 0.00024724
+ 524285 10564.415738 0.00037863
+ 524288 10541.035553 0.00037947
+ 524291 12139.945493 0.00032949
+ 786429 13031.143983 0.00046043
+ 786432 13255.902187 0.00045263
+ 786435 13528.481196 0.00044351
+ 1048573 12102.584918 0.00066101
+ 1048576 11365.676465 0.00070387
+ 1048579 13355.335488 0.00059901
+ 1572861 11314.688623 0.00106057
+ 1572864 14604.826569 0.00082165
+ 1572867 11649.668141 0.00103007
+ 2097149 8779.830027 0.00182236
+ 2097152 12092.373128 0.00132315
+ 2097155 10640.598403 0.00150368
+ 3145725 11399.940287 0.00210527
+ 3145728 11327.829590 0.00211868
+ 3145731 12330.448131 0.00194640
+ 4194301 10110.197684 0.00316512
+ 4194304 1853.739580 0.01726240
+ 4194307 8969.381449 0.00356770
+ 6291453 10336.475983 0.00464375
+ 6291456 10847.818034 0.00442485
+ 6291459 12336.471285 0.00389090
+ 8388605 13628.581728 0.00469601
+ 8388608 10168.895623 0.00629370
+ 8388611 13799.286656 0.00463792
diff --git a/docs/configs.txt b/docs/configs.txt
new file mode 100644
index 0000000..df8eeda
--- /dev/null
+++ b/docs/configs.txt
@@ -0,0 +1,3 @@
+- GlusterFS stripe size
+ For RAID 6, the stripe unit size must be chosen such that the full stripe size (stripe unit * number of data disks) is between 1 MiB and 2 MiB, preferably in the lower end of the range.
+ Hardware RAID controllers usually allow stripe unit sizes that are a power of 2. For RAID 6 with 12 disks (10 data disks), the recommended stripe unit size is 128 KiB.
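+ Example: with 10 data disks and a 128 KiB stripe unit, the full stripe is 10 x 128 KiB = 1280 KiB (1.25 MiB), which sits in the lower part of the recommended 1-2 MiB window.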
diff --git a/docs/databases.txt b/docs/databases.txt
index 254674e..331313b 100644
--- a/docs/databases.txt
+++ b/docs/databases.txt
@@ -7,8 +7,9 @@
Gluster MyISAM (no logs) 1 MB/s unusable 150% 600-800% Perfect. But too slow (up to completely unusable if bin-logs are on). Slow MyISAM recovery!
Gluster/Block MyISAM (no logs) 5 MB/s slow, but OK 200% ~ 50% No problems on reboot, but requires manual work if node crashes to detach volume.
Galera INNODB 3.5 MB/s fast 3 x 200% - Should be perfect, but I am not sure about automatic recovery...
- MySQL Slaves INNODB 6-8 exp. fast Available data is HA, but caching is not. We can easily turn the slave to master.
- DRBD MyISAM (no logs) 4-6 exp. ? I expect it as an faster option, but does not fit complete concept.
+ Galera/Hostnet INNODB 4.6 MB/s fast 3 x 200% -
+ MySQL Slaves INNODB 5-8 MB/s fast 2 x 250% - Available data is HA, but caching is not. We can easily turn the slave to master.
+ DRBD MyISAM (no logs) 4-6 exp. ? I expect it to be a faster option, but it does not fit the OpenShift concept that well.
Gluster is a way too slow for anything. If node crashes, MyISAM tables may be left in corrupted state. The recovery will take ages to complete.
@@ -29,9 +30,13 @@ So, there is no realy a full HA capable solution at the moment. The most reasona
(i.e. status displays), current data is available. And we can easily switch the master if necessary.
The other reasonable options have some problems at the moment and can't be used.
- - Galera. Is a fine solution, but would need some degree of initial maintenance to work stabily. Furthermore, the caching is quite slow. And the
- resync is a big issue.
- - Gluster/Block would be a good solution if volume detachment is fixed. As it stands, we don't have HA without manual intervention. Furthermore, the
+ - Galera. Is a fine solution. The caching is still quite slow. If the networking problem is solved (see the performance section in network.txt) or host
+ networking is used, it is more-or-less on par with Gluster/Block, but provides a much better service to the data-reading clients. However, extra
+ investigation is required to understand the robustness of crash recovery. In some cases, after a crash Galera performed a full resync of all
+ data (but I was re-creating the statefulset, which is not recommended practice; not sure if it happens if the software is maintained properly). Also, at
+ some point one of the nodes was not able to join back (even after re-initializing from scratch), but again this hopefully does not happen if the
+ service is not periodically recreated.
+ - Gluster/Block would be a good solution if volume detachment is fixed. As it stands, we don't have HA without manual intervention. Furthermore, the
MyISAM recovery is quite slow.
- HostMount will be using our 3-node storage optimally. But if something crashes there is 1 week to recache the data.
@@ -80,16 +85,21 @@ Galera
* If all nodes crashed, then again one node should restart the cluster and others join
later. For older versions, it is necessary to run mysqld with '--wsrep-new-cluster'.
The new tries to automatize it and will recover automatically if 'safe_to_bootstrap' = 1
- in 'grstate.dat' in mysql data folder. It should be set by Galera based on some heuristic,
- but in fact I always had to set it manually. IMIMPORTANT, it should be set only on one of
- the nodes.
-
- - Synchrinization only works for INNODB tables. Furthermore, binary logging should be turned
- on (yes, it is possible to turn it off and there is no complains, but only the table names are
- synchronized, no data is pushed between the nodes).
+ in 'grastate.dat' in the mysql data folder. If the cluster was shut down orderly, Galera will
+ set it automatically on the last node to stop the service. In case of a crash, however,
+ it has to be configured manually on the most up-to-date node. IMPORTANT: it should be
+ set only on one of the nodes. Otherwise, the cluster will become nearly unrecoverable.
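+ For reference, a sketch of what the state file looks like after manually marking one node as bootstrappable (uuid is a placeholder):
+       # /var/lib/mysql/grastate.dat
+       # GALERA saved state
+       version: 2.1
+       uuid:    00000000-0000-0000-0000-000000000000
+       seqno:   -1
+       safe_to_bootstrap: 1    # set on ONE node only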
+ * So, to recover a failed cluster (unless automatic recovery works) we must revert to a manual
+ procedure now. There is a 'gmanager' pod which can be scaled to 3 nodes. We recover a full
+ cluster in these pods in the required order. Then, we stop the first node and init a StatefulSet.
+ Once the first node in the StatefulSet is ready, we stop the second node in 'gmanager', and so on.
+
+ - IMPORTANT: Synchronization only works for INNODB tables. Furthermore, binary logging should
+ be turned on (yes, it is possible to turn it off and there are no complaints, but only the table
+ names are synchronized, no data is pushed between the nodes).
- OpenShift uses 'StatefulSet' to perform such initialization. Particularly, it starts first
- node and waits until it is running before starting next one.
+ node and waits until it is running (and ready) before starting the next one.
* Now the nodes need to talk between each other. The 'headless' service is used for that.
Unlinke standard service, the DNS does not load balance service pods, but returns IPs of
all service members if appropriate DNS request is send (SRV). In Service spec we specify.
@@ -112,7 +122,33 @@ Galera
serviceName: adei-ss
There are few other minor differences. For instance, the 'selector' have more flexible notation
and should include 'matchLabels' before specifying the 'pod' selector, etc.
+
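+ A minimal sketch of such a headless service (the 'adei-ss' name is taken from the snippet above; the selector label is an assumption):
+   apiVersion: v1
+   kind: Service
+   metadata:
+     name: adei-ss
+   spec:
+     clusterIP: None          # 'headless': DNS returns the IPs of all member pods instead of load balancing
+     selector:
+       app: galera            # hypothetical pod label
+     ports:
+     - port: 3306
+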
+ - IMPORTANT: If we use hostPath (or even a hostPath-based pv/pvc pair), the pods will be assigned
+ to the nodes randomly. This is not ideal if we want to shut down and restart the cluster. In general,
+ we always want the first pod to end up on the same storage, as it will likely be the one able to
+ bootstrap. Instead, we should use the 'local' volume feature (alpha in OpenShift 3.7; it has to be
+ enabled in the origin-node and origin-master configurations). Then OpenShift binds the 'pvc' to a specific node
+ and the 'pod' is executed on the node where its 'pvc' is bound (see the sketch below).
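+ A rough sketch of such a 'local' pv pinned to one node (names and path are hypothetical; note that the alpha implementation
+ in OpenShift 3.7 may expect the node affinity as an annotation rather than the 'nodeAffinity' field shown here):
+     apiVersion: v1
+     kind: PersistentVolume
+     metadata:
+       name: galera-data-0
+     spec:
+       capacity:
+         storage: 10Gi
+       accessModes: [ "ReadWriteOnce" ]
+       persistentVolumeReclaimPolicy: Retain
+       storageClassName: local-storage
+       local:
+         path: /mnt/ands/galera/0           # node-local path, hypothetical
+       nodeAffinity:
+         required:
+           nodeSelectorTerms:
+           - matchExpressions:
+             - key: kubernetes.io/hostname
+               operator: In
+               values: [ "ipekatrin1.ipe.kit.edu" ]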
+
+ - IMPORTANT: StatefulSet ensures ordering and local volume data binding. Consequently, we should
+ not destroy the StatefulSet object, which saves the state information. Otherwise, the node assignments
+ will change and the cluster would be hard or impossible to recover.
+
+ - Another problem of our setup is the slow internal network (since bridging over Infiniband is not
+ possible). One solution to overcome this is to run Galera using 'hostNetwork'. Then, however,
+ the 'peer-finder' fails. It tries to match the service names to its 'hostname', expecting
+ that it will be in the form of 'galera-0.galera.adei.svc.cluster.local', but with host networking
+ enabled the actual hostname is used (i.e. ipekatrin1.ipe.kit.edu). I had to patch peer-finder
+ to resolve the IPs and match against them.
- To check current status of the cluster
SHOW STATUS LIKE 'wsrep_cluster_size';
+
+Master/Slave replication
+========================
+ - This configuration seems more robust, but strangely has a lot of performance issues on the
+ slave side. Network is not a problem, it is able to get logs from the master, but it is significantly
+ slower in applying them. The main performance killer is the disk sync operations triggered by 'sync_binlog',
+ INNODB log flushing, etc. Disabling them brings performance to a reasonable level. Still,
+ the master is caching at about 6-8 MB/s and the slave at 4-5 MB/s only.
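+ A sketch of the relaxed settings meant above (exact values are a judgment call; they trade crash durability for write speed):
+   [mysqld]
+   sync_binlog = 0                       # do not fsync the binary log on every commit
+   innodb_flush_log_at_trx_commit = 2    # flush the INNODB log roughly once per second instead of per transaction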
\ No newline at end of file
diff --git a/docs/info.txt b/docs/info.txt
new file mode 100644
index 0000000..ea00f58
--- /dev/null
+++ b/docs/info.txt
@@ -0,0 +1,31 @@
+oc -n adei patch dc/mysql --type=json --patch '[{"op": "remove", "path": "/spec/template/spec/nodeSelector"}]'
+oc process -f mysql.yml | oc -n adei replace dc/mysql -f -
+oc -n adei delete --force --grace-period=0 pod mysql-1-m4wcq
+We use rpcbind from the host.
+We need iSCSI initiators; rpcbind is used from the host, but check with telnet. The mother volumes are provisioned 100 GiB large, so we can't allocate more.
+
+We can use rpcbind (and other services) from the host (host networking).
+oc -n adei delete --force --grace-period=0 pod mysql-1-m4wcq
+| grep -oP '^GBID:\s*\K.*'
+
+Top level of the pod spec (next to nodeSelector, restartPolicy, securityContext); see the fragment below:
+ dnsPolicy: ClusterFirstWithHostNet
+ dnsPolicy: ClusterFirst
+ hostNetwork: true
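+A sketch of where these keys sit in a pod template (fragment; the container is hypothetical):
+  spec:
+    template:
+      spec:
+        hostNetwork: true
+        dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS resolution while on the host network
+        containers:
+        - name: app
+          image: centos:7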
+oc -n kaas adm policy add-scc-to-user hostnetwork -z default
+Check (in users list)
+oc get scc hostnetwork -o yaml
+firewall-cmd --add-port=5002/tcp
+
+ OnDelete: This is the default update strategy for backward-compatibility. With OnDelete update strategy, after you update a DaemonSet template, new DaemonSet pods will only be created when you manually delete old DaemonSet pods. This is the same behavior of DaemonSet in Kubernetes version 1.5 or before.
+ RollingUpdate: With RollingUpdate update strategy, after you update a DaemonSet template, old DaemonSet pods will be killed, and new DaemonSet pods will be created automatically, in a controlled fashion.
+
+Caveat: Updating DaemonSet created from Kubernetes version 1.5 or before
+.spec.updateStrategy.rollingUpdate.maxUnavailable (default to 1) and .spec.minReadySeconds
+
+
+
+ “Default”: The Pod inherits the name resolution configuration from the node that the pods run on. See related discussion for more details.
+ “ClusterFirst”: Any DNS query that does not match the configured cluster domain suffix, such as “www.kubernetes.io”, is forwarded to the upstream nameserver inherited from the node. Cluster administrators may have extra stub-domain and upstream DNS servers configured. See related discussion for details on how DNS queries are handled in those cases.
+ “ClusterFirstWithHostNet”: For Pods running with hostNetwork, you should explicitly set its DNS policy “ClusterFirstWithHostNet”.
+ “None”: A new option value introduced in Kubernetes v1.9. This Alpha feature allows a Pod to ignore DNS settings from the Kubernetes environment. All DNS settings are supposed to be provided using the dnsConfig field in the Pod Spec. See DNS config subsection below.
diff --git a/docs/infrastructure.txt b/docs/infrastructure.txt
new file mode 100644
index 0000000..dc6a57e
--- /dev/null
+++ b/docs/infrastructure.txt
@@ -0,0 +1,110 @@
+Networks
+========
+ 192.168.11.0/24 (18-port IB switch): Legacy network, non-production systems including storage
+ 192.168.12.0/24 (12-port IB switch): KATRIN Storage network
+ 192.168.13.0/24 (12-port IB switch): HPC Cloud & Computing network
+ 192.168.26.0/24 (Ethernet): Infrastructure network (OpenShift nodes and everything else)
+ 192.168.16.0/22 External IPs for testing and production
+ 192.168.111.0/24 (OpenVPN): Gateway to Katrin network using Master1 tunnel
+ 192.168.112.0/24 (OpenVPN): Gateway to Katrin network using Master2 tunnel
+
+ 192.168.212.0/24
+ 192.168.213.0/24
+ 192.168.226.0/24 (Ethernet): Staging network (Virtual OpenShift and other nodes)
+ 192.168.216.0/22 External IPs for staging
+ 192.168.221.0/24 (OpenVPN): Gateway to Katrin network using staging Master1 tunnel
+ 192.168.222.0/24 (OpenVPN): Gateway to Katrin network using staging Master2 tunnel
+
+KIT resources
+=============
+ - ipekatrin*.ipe.kit.edu Cluster nodes
+ - ipekatrin[1:2].ipe.kit.edu Master nodes with fixed IPs (one could be dead)
+ + katrin[1:2].ipe.kit.edu Virtual IPs assigned to master nodes (HA)
+ + kaas.kit.edu (katrin.ipe.kit.edu) DNS-based load balancer between katrin[1:2].ipe.kit.edu
+ + *.kaas.kit.edu (*.katrin.ipe.kit.edu) Default application domain?
+ - katrin.kit.edu Apache/mod_proxy pod (In DNS put CN to katrin.ipe.kit.edu)
+
+ + openshift.ipe.kit.edu Gateway (VIPS) to staging cluster (Just one IP migrating between 2 nodes)
+ - *.openshift.ipe.kit.edu Default application domain for staging cluster
+
+Storage
+=======
+ LVM VGs
+ VolGroup00
+ -> LogVol*: System partitions
+ -> docker-pool: Docker storage
+ Katrin
+ -> Heketi PD (we reserve space, but do not configure heketi so far)
+ -> vg_*
+ -> Heketi-managed Gluster Volumes
+ -> Katrin (mounted at '/mnt/ands')
+ -> Space for manually-managed Gluster Bricks
+ -> Storage for Galera / Cassandra / etc.?
+
+ Gluster Volume Types:
+   tmp:   distribute ?     Various data which should be preserved, but not critical if lost or temporarily inaccessible (logs, etc.) [ check if we can still write if one brick is gone ]
+ cfg: replica=3 Small and critical data sets (configs, sources, etc.)
+ cache: replica+arbiter Large re-generatable data which anyway should be always available [ potentially we can use disperse to save space ]
+ data: replica+arbiter Very large and critical data
+ db: dispersed A few very large files, like large single-table database (ADEI many tables)
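+   As an illustration (brick paths are hypothetical, not the exact commands used here), a 'replica+arbiter' volume of this kind would be created manually like:
+     gluster volume create adei-db replica 3 arbiter 1 \
+       ipekatrin1:/mnt/ands/bricks/adei-db \
+       ipekatrin2:/mnt/ands/bricks/adei-db \
+       ipekatrin3:/mnt/ands/bricks/adei-db
+     gluster volume start adei-db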
+
+ Scaling storage:
+   cfg: 3 nodes is enough
+   cache/data: [d][d][a] => [da][d ][ad][ d] => [d ][d ][ d][ d][aa] => further increase in pairs, at some point add a second arbiter node
+
+ Gluster Volumes:
+ provision cfg /mnt/provision Provisioning volume which is not expected to be mounted in the containers (temporarily may contain secret information, etc.)
+ openshift cfg /mnt/openshift Multi-purpose: Various small size configurations (adei, apache, etc.)
+ temporary tmp /mnt/temporary Multi-purpose: Various logs & temporary files
+ ?adei cfg /mnt/adei/adei
+ adei-db cache /mnt/adei/db
+ adei-tmp tmp /mnt/adei/tmp
+ katrin-mysql data /mnt/katrin/mysql
+ katrin-data cfg /mnt/katrin/archive
+ katrin-kali cache /mnt/katrin/storage
+ katrin-tmp tmp /mnt/katrin/workspace
+
+ OpenShift Volumes:
+ etc cfg/ro openshift Various configurations (ADEI & Apache configs, other stuff in etc.)
+ src cfg/ro openshift Interpreted source files
+   log            tmp/rw     tmp            Stuff in /var/log
+   tmp            tmp/rw     tmp            Various temporary files
+   adei-db        data/rw    adei-db        ADEI cache database and a few primary sources [ will take ages to regenerate, so we can't really consider it a dispensable cache ]
+   adei-tmp       tmp/rw     adei-tmp       ADEI, Apache, and Cron logs [Technically we also have downloads here, which are more cache than tmp... But I think it is fine for now...]
+ adei-cfg cfg/ro adei? ADEI & Apache configs
+ adei-src cfg/ro adei? ADEI sources
+ katrin-mysql cfg/rw katrin-mysql KATRIN Database with configurations, etc.
+ katrin-data data/rw katrin-data KATRIN data archives, all primary raw data from Orca, etc.
+ katrin-kali cache/rw katrin-kali Generated ROOT files [ Can we make this separation? Marco uses hardlinks ]
+ katrin-proc tmp/rw katrin-proc Data processing volume (inbox, etc.)
+
+Services
+========
+ - Keepalived
+ - OpenVPN
+ - Gluster
+ - MySQL Galera (?)
+ - Cassandra (?)
+ - oVirt (?)
+ - OpenShift Master / Node
+ - Heketi
+ - Apache Router
+ - ADEI Services
+ - Apache Spark & etc.
+
+Inventories
+===========
+ - staging & production will be operating in parallel (staging in vagrant and production on bare-metal)
+ - testing is just pre-production tests which will be removed once production is running
+
+Labels
+======
+ - We specify whether a node is a master and whether it provides fat storage for glusterfs
+ - All nodes are currently in the 'infra' region (for example, student computers will be non-infra nodes; nodes outside of KIT as well)
+ - The servers in the cellar are in the 'default' zone (if we put something in the 4th-floor server room, we would define a new zone there)
+
+Computing
+=========
+ - Define CUDA nodes and OpenCL nodes
+ - The Intel Xeon Phi is replaced by the new Tesla in ipepdvcompute2
+ - Gen1 UFO servers do not support "Above 64G decoding" and can't run Xeon Phi. Maybe we can put it in the new Phi server.
diff --git a/docs/links.txt b/docs/links.txt
new file mode 100644
index 0000000..003cffe
--- /dev/null
+++ b/docs/links.txt
@@ -0,0 +1,16 @@
+- PXE boot on second network interface (put a small hub for this purpose) or Mellanox Flex Boot (check)
+ https://github.com/jonschipp/vagrant/tree/master/pxe-multiboot
+ http://www.tecmint.com/multiple-centos-installations-using-kickstart/
+
+- Ovirt
+ https://docs.ansible.com/ansible/ovirt_vms_module.html
+ http://www.ovirt.org/develop/release-management/features/infra/ansible_modules/
+ https://github.com/rhevm-qe-automation/ovirt-ansible
+
+- Galera on OpenShift
+ https://github.com/openshift/origin/tree/master/examples/statefulsets/mysql/galera
+
+- CUDA on OpenShift
+ https://blog.openshift.com/use-gpus-openshift-kubernetes/
+
+
diff --git a/docs/managment.txt b/docs/managment.txt
index 9436c3c..cfc6aff 100644
--- a/docs/managment.txt
+++ b/docs/managment.txt
@@ -17,7 +17,9 @@ DOs and DONTs
openshift_enable_service_catalog: false
Then, it is left in 'Error' state, but can be easily recovered by deteleting and
allowing system to re-create a new pod.
- * However, as cause is unclear, it is possible that something else with break as time
+    * On the other hand, ksc completely breaks down if kept unchanged while upgrading from
+      3.7.1 to 3.7.2. Updating ksc fixes the problem, except for the error mentioned above.
+ * As the cause is unclear, it is possible that something else will break as time
passes and new images are released. It is ADVISED to check upgrade in staging first.
* During upgrade also other system pods may stuck in Error state (as explained
in troubleshooting) and block the flow of upgrade. Just delete them and allow
@@ -34,6 +36,10 @@ DOs and DONTs
openshift_storage_glusterfs_heketi_is_missing: False
But I am not sure if it is only major issue.
+ - Master/node configuration updates cause no problems. They are executed with:
+ * playbooks/openshift-node/config.yml
+ * playbooks/openshift-master/config.yml
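+    For instance (run from the openshift-ansible checkout; the inventory path is just the ansible default and may differ):
+      ansible-playbook -i /etc/ansible/hosts playbooks/openshift-node/config.yml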
+
- Few administrative tools could cause troubles. Don't run
* oc adm diagnostics
diff --git a/docs/network.txt b/docs/network.txt
index bcd45e1..52c0058 100644
--- a/docs/network.txt
+++ b/docs/network.txt
@@ -70,10 +70,11 @@ Performance
4 kEUR for SX6018). License is called: UPGR-6036-GW.
- Measured performance
- Standard: ~ 3.2 Gb/s
- Standard (pods on the same node) ~ 20 - 30 Gb/s
- hostNet (using cluster IP ) ~ 3.6 Gb/s
- hostNet (using host IP) ~ 12 - 15 Gb/s
+ Standard: ~ 3.2 Gb/s 28 us
+ Standard (pods on the same node) ~ 20 - 30 Gb/s 12 us
+ hostNet (using cluster IP ) ~ 3.6 Gb/s 23 us
+ hostNet (using host IP) ~ 12 - 15 Gb/s 15 us
+ Standard to hostNet ~ 10 - 12 Gb/s 18 us
- So, I guess the optimal solution is really to introduce a second router for the cluster, but with Ethernet interface. Then, we can
reconfigure the second Infiniband adapter for the Ethernet mode. The switch to native routing should be possible also with running
diff --git a/docs/performance.txt b/docs/performance.txt
new file mode 100644
index 0000000..b31c02a
--- /dev/null
+++ b/docs/performance.txt
@@ -0,0 +1,54 @@
+Divergence from the best practices
+==================================
+ Due to various constraints, I had to take some decisions contradicting the best practices. There were also some
+ hardware limitations resulting in a suboptimal configuration.
+
+ Storage
+ -------
+ - RedHat documentation strongly discourages running Gluster over a large RAID-60. The best performance is achieved
+ if disks are organized as JBOD and each is assigned a brick. The problem is that heketi is not really ready for
+ production yet; I got numerous problems while testing it. Managing '3 x 24' gluster bricks manually would be a nightmare.
+ Consequently, I opted for RAID-60 to simplify maintenance and ensure no data is lost due to mismanagement of gluster
+ volumes.
+
+ - In general, the architecture is more suitable for many small servers, not just a couple of fat storage servers. Then
+ the disk load will be distributed between multiple nodes. Furthermore, we can't use all the storage with 3 nodes.
+ We need 3 nodes to ensure arbitration (quorum) in case of failure or network outages. Even if the 3rd node only stores the
+ checksums, we can't easily use it to store data. OK, technically we can create 3 sets of 3 bricks and put the arbiter
+ brick on different nodes. But this again will complicate maintenance. Unless proper ordering is maintained, the replication
+ may happen between bricks on the same node, etc. So, again I decided to favour fault tolerance over performance. We still
+ can use the space when the cluster is scaled.
+
+ Network
+ -------
+ - To ensure high-speed communication between pods running on different nodes, RedHat recommends enabling Container Native
+ Routing. This is done by creating a bridge for docker containers on the hardware network device instead of the OpenVSwitch fabric.
+ Unfortunately, IPoIB does not provide Ethernet L2/L3 capabilities and it is impossible to use IB devices for bridging.
+ It may still be possible to solve this somehow, but further research is required. The easier solution is just to switch the OpenShift
+ fabric to Ethernet. Anyway, we had the idea to separate the storage and OpenShift networks.
+
+ Memory
+ ------
+ - There are multiple docker storage engines. We are currently using the LVM-based 'devicemapper'. To build a container, the data is
+ copied from all image layers. The newer 'overlay2' provides a virtual file system (overlayfs) joining all layers and performing
+ COW if the data is modified. It saves space, but more importantly it also enables page cache sharing, reducing the memory
+ footprint if multiple containers share the same layers (and they share at least the CentOS base image). Another advantage is a
+ slightly faster startup of containers with large images (as we don't need to copy all files). On the negative side, it is not
+ fully POSIX compliant. Some applications may have problems because of this. For major applications there are work-arounds provided by
+ RedHat. But again, I opted for the more standard 'devicemapper' to avoid hard-to-debug problems.
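+ Should we ever switch, the driver is selected before (re)building docker storage; a rough sketch for CentOS (the exact tool
+ and migration steps should be double-checked, and /var/lib/docker is effectively wiped in the process):
+   # /etc/sysconfig/docker-storage-setup
+   STORAGE_DRIVER=overlay2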
+
+
+What is required
+================
+ - We need to add at least another node. It will double the available storage and I expect a significant improvement of storage
+ performance. It would be even better to have 5-6 nodes to split the load.
+ - We need to switch to an Ethernet fabric for the OpenShift network. Currently, this is not critical and would only add about 20% to ADEI
+ performance. However, it may become an issue if we optimize ADEI database handling or get more network-intensive applications in
+ the cluster.
+ - We need to re-evaluate RDMA support in GlusterFS. Currently, it is unreliable, causing pods to hang indefinitely. If it is
+ fixed we can re-enable RDMA support for our volumes. It may hopefully further improve storage performance. Similarly, Gluster
+ block storage is significantly faster for the single-pod use case, but has significant stability issues at the moment.
+ - We need to check if OverlayFS causes any problems for the applications we plan to run. Enabling overlayfs should be good for
+ our cron services and may reduce the memory footprint.
+
+
diff --git a/docs/vagrant.txt b/docs/vagrant.txt
new file mode 100644
index 0000000..2cf3b43
--- /dev/null
+++ b/docs/vagrant.txt
@@ -0,0 +1,4 @@
+The staging setup is optimized to run in vagrant containers to perform tests before applying major modifications to the
+production system. However, there are several peculiarities to take care of:
+ - Vagrant uses NAT networking on eth0 (mandatory) and generates the same IP on all nodes. This confuses OpenShift.
+   As a solution: customize the NAT IPs and remove the default route on eth0 (configure standard dhcp on the second public interface); see the sketch below.
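+   A rough sketch of the route cleanup on each node (interface and connection names are assumptions):
+     ip route del default dev eth0
+     nmcli connection modify "System eth0" ipv4.never-default yes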