Second revision: includes hostpath mounts, gluster block storage, kaas apps, etc.

author: Suren A. Chilingaryan <csa@suren.me> 2018-03-18 22:59:31 +0100
committer: Suren A. Chilingaryan <csa@suren.me> 2018-03-18 22:59:31 +0100
commit: 47f350bc3aa85a8bd406d95faf084df2abf74ae9 (patch)
tree: 72ad1e91bac46d3457f89781dc90f0d6c1c074d5 /docs
parent: 006f333828db373435daa15483d2ab753048f62a (diff)
download: ands-47f350bc3aa85a8bd406d95faf084df2abf74ae9.tar.gz
ands-47f350bc3aa85a8bd406d95faf084df2abf74ae9.tar.bz2
ands-47f350bc3aa85a8bd406d95faf084df2abf74ae9.tar.xz
ands-47f350bc3aa85a8bd406d95faf084df2abf74ae9.zip
8 files changed, 581 insertions, 2 deletions
diff --git a/docs/benchmarks/netpipe-hostnet-clusterip.txt b/docs/benchmarks/netpipe-hostnet-clusterip.txt
new file mode 100644
index 0000000..452a59b
--- /dev/null
+++ b/docs/benchmarks/netpipe-hostnet-clusterip.txt
@@ -0,0 +1,124 @@
+       1 0.326958   0.00002333
+       2 0.618103   0.00002469
+       3 0.938810   0.00002438
+       4 1.140540   0.00002676
+       6 1.810863   0.00002528
+       8 2.440223   0.00002501
+      12 4.014228   0.00002281
+      13 4.244945   0.00002336
+      16 5.265391   0.00002318
+      19 6.288176   0.00002305
+      21 6.873981   0.00002331
+      24 7.626945   0.00002401
+      27 8.464006   0.00002434
+      29 9.132165   0.00002423
+      32 10.071211   0.00002424
+      35 11.350914   0.00002352
+      45 14.462787   0.00002374
+      48 15.551991   0.00002355
+      51 16.617742   0.00002341
+      61 17.937454   0.00002595
+      64 19.940809   0.00002449
+      67 19.670239   0.00002599
+      93 27.923937   0.00002541
+      96 30.696302   0.00002386
+      99 31.378657   0.00002407
+     125 39.836959   0.00002394
+     128 39.851527   0.00002451
+     131 41.123237   0.00002430
+     189 59.994843   0.00002403
+     192 52.568072   0.00002787
+     195 60.825254   0.00002446
+     253 80.036908   0.00002412
+     256 78.572397   0.00002486
+     259 82.453495   0.00002397
+     381 121.586010   0.00002391
+     384 122.643994   0.00002389
+     387 119.587204   0.00002469
+     509 152.953007   0.00002539
+     512 156.751101   0.00002492
+     515 147.461225   0.00002665
+     765 223.078148   0.00002616
+     768 245.267636   0.00002389
+     771 252.201504   0.00002332
+    1021 291.274243   0.00002674
+    1024 288.122902   0.00002712
+    1027 314.918782   0.00002488
+    1533 455.244190   0.00002569
+    1536 440.315120   0.00002661
+    1539 444.559116   0.00002641
+    2045 408.429719   0.00003820
+    2048 408.361919   0.00003826
+    2051 403.367349   0.00003879
+    3069 598.249055   0.00003914
+    3072 608.830745   0.00003850
+    3075 605.542765   0.00003874
+    4093 691.850246   0.00004514
+    4096 712.818517   0.00004384
+    4099 698.066606   0.00004480
+    6141 907.399011   0.00005163
+    6144 909.222865   0.00005156
+    6147 920.949895   0.00005092
+    8189 1106.942777   0.00005644
+    8192 1118.442648   0.00005588
+    8195 1138.471355   0.00005492
+   12285 1456.435686   0.00006435
+   12288 1473.988562   0.00006360
+   12291 1432.599036   0.00006546
+   16381 1672.451589   0.00007473
+   16384 1687.110914   0.00007409
+   16387 1594.364859   0.00007842
+   24573 1820.900468   0.00010296
+   24576 1927.109643   0.00009730
+   24579 1925.807752   0.00009737
+   32765 2039.948799   0.00012254
+   32768 2264.455285   0.00011040
+   32771 2053.911942   0.00012173
+   49149 2329.879339   0.00016094
+   49152 2251.567470   0.00016655
+   49155 2376.570618   0.00015780
+   65533 2087.316837   0.00023953
+   65536 2090.007791   0.00023923
+   65539 2240.546493   0.00022317
+   98301 2261.214485   0.00033167
+   98304 2236.528922   0.00033534
+   98307 2267.504025   0.00033077
+  131069 2506.301596   0.00039899
+  131072 2574.001159   0.00038850
+  131075 2499.398059   0.00040011
+  196605 2679.266208   0.00055985
+  196608 2577.617790   0.00058193
+  196611 2655.729790   0.00056483
+  262141 2866.098615   0.00069780
+  262144 2952.146697   0.00067747
+  262147 2921.582565   0.00068457
+  393213 3280.847971   0.00091439
+  393216 3145.640621   0.00095370
+  393219 3190.458883   0.00094031
+  524285 3293.829390   0.00121439
+  524288 3395.057727   0.00117818
+  524291 3213.113808   0.00124491
+  786429 3433.707485   0.00174738
+  786432 3564.531089   0.00168325
+  786435 3343.824065   0.00179436
+ 1048573 3461.698116   0.00231100
+ 1048576 3411.340450   0.00234512
+ 1048579 3603.069459   0.00222034
+ 1572861 3253.106873   0.00368877
+ 1572864 3502.997228   0.00342564
+ 1572867 3457.981793   0.00347024
+ 2097149 3331.709227   0.00480233
+ 2097152 3296.412690   0.00485376
+ 2097155 3635.046705   0.00440160
+ 3145725 3713.207547   0.00646341
+ 3145728 3398.330126   0.00706229
+ 3145731 3455.172928   0.00694611
+ 4194301 3667.624405   0.00872499
+ 4194304 3244.050612   0.00986421
+ 4194307 3076.097496   0.01040280
+ 6291453 3498.607536   0.01371974
+ 6291456 3473.348924   0.01381952
+ 6291459 3392.586555   0.01414850
+ 8388605 3522.519503   0.01816881
+ 8388608 3694.116745   0.01732484
+ 8388611 3279.530110   0.01951500
diff --git a/docs/benchmarks/netpipe-hostnet-hostip.txt b/docs/benchmarks/netpipe-hostnet-hostip.txt
new file mode 100644
index 0000000..494289d
--- /dev/null
+++ b/docs/benchmarks/netpipe-hostnet-hostip.txt
@@ -0,0 +1,124 @@
+       1 0.518634   0.00001471
+       2 1.076815   0.00001417
+       3 1.516004   0.00001510
+       4 2.144302   0.00001423
+       6 3.214913   0.00001424
+       8 4.166316   0.00001465
+      12 6.289028   0.00001456
+      13 6.744819   0.00001470
+      16 8.533129   0.00001431
+      19 10.729789   0.00001351
+      21 11.221507   0.00001428
+      24 13.006231   0.00001408
+      27 14.385727   0.00001432
+      29 15.045281   0.00001471
+      32 15.624612   0.00001563
+      35 17.163339   0.00001556
+      45 23.883358   0.00001437
+      48 25.888550   0.00001415
+      51 26.756400   0.00001454
+      61 33.091611   0.00001406
+      64 34.624658   0.00001410
+      67 35.725842   0.00001431
+      93 49.945160   0.00001421
+      96 50.993414   0.00001436
+      99 52.297713   0.00001444
+     125 66.283502   0.00001439
+     128 68.269142   0.00001430
+     131 70.562884   0.00001416
+     189 97.002458   0.00001487
+     192 101.594727   0.00001442
+     195 102.730659   0.00001448
+     253 136.727875   0.00001412
+     256 126.018401   0.00001550
+     259 123.015865   0.00001606
+     381 187.900047   0.00001547
+     384 182.281966   0.00001607
+     387 187.505026   0.00001575
+     509 261.417114   0.00001486
+     512 243.113544   0.00001607
+     515 248.983291   0.00001578
+     765 411.716257   0.00001418
+     768 404.020586   0.00001450
+     771 393.655066   0.00001494
+    1021 523.139717   0.00001489
+    1024 528.470385   0.00001478
+    1027 500.212191   0.00001566
+    1533 709.342507   0.00001649
+    1536 745.302649   0.00001572
+    1539 748.867382   0.00001568
+    2045 701.949001   0.00002223
+    2048 641.125753   0.00002437
+    2051 725.006704   0.00002158
+    3069 1149.080142   0.00002038
+    3072 1117.869559   0.00002097
+    3075 1066.876626   0.00002199
+    4093 1234.821755   0.00002529
+    4096 1392.067164   0.00002245
+    4099 1364.273095   0.00002292
+    6141 1978.643297   0.00002368
+    6144 2001.046782   0.00002343
+    6147 1981.921823   0.00002366
+    8189 2528.235274   0.00002471
+    8192 2421.728225   0.00002581
+    8195 2545.005298   0.00002457
+   12285 3266.040928   0.00002870
+   12288 3574.432125   0.00002623
+   12291 3525.409573   0.00002660
+   16381 4179.351534   0.00002990
+   16384 4412.705178   0.00002833
+   16387 4302.765466   0.00002906
+   24573 5694.202878   0.00003292
+   24576 5592.149917   0.00003353
+   24579 5454.255077   0.00003438
+   32765 5895.412790   0.00004240
+   32768 5999.816085   0.00004167
+   32771 6242.962316   0.00004005
+   49149 7676.810025   0.00004885
+   49152 8149.771111   0.00004601
+   49155 7758.390037   0.00004834
+   65533 6147.722626   0.00008133
+   65536 6261.159737   0.00007986
+   65539 6070.463017   0.00008237
+   98301 7584.744000   0.00009888
+   98304 7358.504685   0.00010192
+   98307 7510.619199   0.00009986
+  131069 8733.644100   0.00011450
+  131072 9127.358840   0.00010956
+  131075 8955.343297   0.00011167
+  196605 10044.567820   0.00014933
+  196608 10429.810847   0.00014382
+  196611 11000.068259   0.00013636
+  262141 11544.030028   0.00017325
+  262144 13201.924092   0.00015149
+  262147 13291.843558   0.00015047
+  393213 13995.239140   0.00021436
+  393216 13089.487805   0.00022919
+  393219 13791.682522   0.00021752
+  524285 13814.348542   0.00028955
+  524288 12914.849923   0.00030972
+  524291 12709.636315   0.00031472
+  786429 14904.555026   0.00040256
+  786432 18599.059647   0.00032260
+  786435 15589.196556   0.00038488
+ 1048573 18949.720391   0.00042217
+ 1048576 17809.447022   0.00044920
+ 1048579 14751.871863   0.00054231
+ 1572861 15167.107603   0.00079118
+ 1572864 15495.174755   0.00077443
+ 1572867 13418.656978   0.00089428
+ 2097149 12213.237268   0.00131005
+ 2097152 12667.175804   0.00126311
+ 2097155 14579.639694   0.00109742
+ 3145725 12379.481055   0.00193869
+ 3145728 13692.207318   0.00175282
+ 3145731 15210.394578   0.00157787
+ 4194301 13674.480625   0.00234012
+ 4194304 13844.661275   0.00231136
+ 4194307 13219.895707   0.00242060
+ 6291453 11450.993213   0.00419177
+ 6291456 12668.370717   0.00378896
+ 6291459 11095.094395   0.00432624
+ 8388605 13020.052869   0.00491549
+ 8388608 13845.907563   0.00462230
+ 8388611 11884.989086   0.00538495
diff --git a/docs/benchmarks/netpipe-standard.txt b/docs/benchmarks/netpipe-standard.txt
new file mode 100644
index 0000000..d6a0cc9
--- /dev/null
+++ b/docs/benchmarks/netpipe-standard.txt
@@ -0,0 +1,124 @@
+       1 0.271739   0.00002808
+       2 0.527352   0.00002893
+       3 0.864740   0.00002647
+       4 1.172232   0.00002603
+       6 1.830063   0.00002501
+       8 2.420999   0.00002521
+      12 3.820573   0.00002396
+      13 4.073067   0.00002435
+      16 5.173979   0.00002359
+      19 5.999408   0.00002416
+      21 6.502058   0.00002464
+      24 7.103063   0.00002578
+      27 8.566173   0.00002405
+      29 8.598646   0.00002573
+      32 10.166434   0.00002401
+      35 10.302855   0.00002592
+      45 14.213910   0.00002415
+      48 14.378267   0.00002547
+      51 14.637744   0.00002658
+      61 18.021413   0.00002582
+      64 18.878820   0.00002586
+      67 21.798199   0.00002345
+      93 30.280693   0.00002343
+      96 30.892160   0.00002371
+      99 28.912132   0.00002612
+     125 39.606498   0.00002408
+     128 40.560404   0.00002408
+     131 41.434631   0.00002412
+     189 61.836905   0.00002332
+     192 61.110074   0.00002397
+     195 62.601410   0.00002377
+     253 82.003349   0.00002354
+     256 78.382060   0.00002492
+     259 76.431690   0.00002585
+     381 126.142014   0.00002304
+     384 121.997385   0.00002401
+     387 122.661813   0.00002407
+     509 148.937476   0.00002607
+     512 155.770935   0.00002508
+     515 163.799277   0.00002399
+     765 229.839666   0.00002539
+     768 231.459755   0.00002531
+     771 239.810229   0.00002453
+    1021 302.868551   0.00002572
+    1024 298.703317   0.00002615
+    1027 311.172883   0.00002518
+    1533 444.020226   0.00002634
+    1536 447.831634   0.00002617
+    1539 451.182634   0.00002602
+    2045 401.368200   0.00003887
+    2048 380.786363   0.00004103
+    2051 394.082308   0.00003971
+    3069 607.290098   0.00003856
+    3072 599.903348   0.00003907
+    3075 567.715333   0.00004132
+    4093 714.630832   0.00004370
+    4096 674.709949   0.00004632
+    4099 688.044295   0.00004545
+    6141 907.731892   0.00005161
+    6144 911.516656   0.00005143
+    6147 911.682774   0.00005144
+    8189 972.290335   0.00006426
+    8192 1090.124017   0.00005733
+    8195 1058.496177   0.00005907
+   12285 1349.474209   0.00006945
+   12288 1368.770226   0.00006849
+   12291 1370.611598   0.00006842
+   16381 1717.159412   0.00007278
+   16384 1625.251103   0.00007691
+   16387 1622.023570   0.00007708
+   24573 1889.056976   0.00009924
+   24576 1864.732089   0.00010055
+   24579 1877.212570   0.00009989
+   32765 2231.157775   0.00011204
+   32768 2152.925316   0.00011612
+   32771 2084.435045   0.00011995
+   49149 2283.518678   0.00016421
+   49152 2318.047630   0.00016177
+   49155 2335.055110   0.00016061
+   65533 2043.928666   0.00024462
+   65536 2014.455634   0.00024821
+   65539 2110.398618   0.00023693
+   98301 2183.428273   0.00034349
+   98304 2177.638569   0.00034441
+   98307 2169.611321   0.00034569
+  131069 2355.683170   0.00042450
+  131072 2390.702707   0.00041829
+  131075 2413.261164   0.00041439
+  196605 2282.562339   0.00065715
+  196608 2494.585589   0.00060130
+  196611 2406.210727   0.00062340
+  262141 2955.537329   0.00067669
+  262144 3020.178557   0.00066221
+  262147 3024.809433   0.00066121
+  393213 3010.209455   0.00099660
+  393216 3210.869736   0.00093433
+  393219 3005.822496   0.00099807
+  524285 3055.047980   0.00130930
+  524288 3319.176826   0.00120512
+  524291 3354.251597   0.00119252
+  786429 3411.484135   0.00175876
+  786432 3446.052653   0.00174112
+  786435 3262.586754   0.00183904
+ 1048573 3323.591745   0.00240703
+ 1048576 3399.406018   0.00235335
+ 1048579 3472.808936   0.00230362
+ 1572861 3406.392100   0.00352278
+ 1572864 3306.084582   0.00362967
+ 1572867 3370.341516   0.00356048
+ 2097149 3361.769733   0.00475939
+ 2097152 3280.636487   0.00487710
+ 2097155 3191.766247   0.00501291
+ 3145725 3162.564558   0.00758877
+ 3145728 3355.820730   0.00715175
+ 3145731 3327.611546   0.00721239
+ 4194301 3386.779090   0.00944850
+ 4194304 3387.249473   0.00944719
+ 4194307 3441.420898   0.00929849
+ 6291453 3268.329719   0.01468639
+ 6291456 3201.892445   0.01499113
+ 6291459 3244.450787   0.01479450
+ 8388605 3271.733339   0.01956149
+ 8388608 3182.658022   0.02010898
+ 8388611 3253.521074   0.01967100
diff --git a/docs/consistency.txt b/docs/consistency.txt
index 127d9a7..c648a9a 100644
--- a/docs/consistency.txt
+++ b/docs/consistency.txt
@@ -22,6 +22,11 @@ Storage
 
 Networking
 ==========
+ - Check that the correct upstream name servers are listed for both DNSMasq (host) and SkyDNS (pods).
+ If not, fix them and restart 'origin-node' and 'dnsmasq'.
+    * '/etc/dnsmasq.d/origin-upstream-dns.conf'
+    * '/etc/origin/node/resolv.conf'
+
  - Check that both internal and external addresses are resolvable from all hosts.
     * I.e. we should be able to resolve 'google.com'
     * And we should be able to resolve 'heketi-storage.glusterfs.svc.cluster.local'
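Note on the docs/consistency.txt check above: the resolvability test can be scripted. The following is a minimal sketch, not part of the commit; the function name 'unresolvable' and the injectable 'resolve' parameter are hypothetical and exist only to make the check testable without network access.

```python
import socket

# Names taken from docs/consistency.txt: one external, one cluster-internal.
NAMES = ["google.com", "heketi-storage.glusterfs.svc.cluster.local"]


def unresolvable(names, resolve=socket.gethostbyname):
    """Return the names that the given resolver fails to resolve.

    'resolve' is a hypothetical hook; by default the system resolver is used,
    which on a cluster host goes through dnsmasq.
    """
    failed = []
    for name in names:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed
```

Running 'unresolvable(NAMES)' on every host should return an empty list if both upstream and cluster DNS are configured correctly.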
diff --git a/docs/databases.txt b/docs/databases.txt
new file mode 100644
index 0000000..254674e
--- /dev/null
+++ b/docs/databases.txt
@@ -0,0 +1,118 @@
+- The storage for HA databases is problematic. There are several ways to organize storage. I list the major 
+  characteristics here (INNODB is generally faster, but takes about 20% more disk space. Initially it is
+  significantly faster and takes 5x the disk space, but it normalizes...)
+
+ Method         Database                Performance     Clnt/Cache      MySQL           Gluster         HA
+ HostMount      MyISAM/INNODB           8 MB/s          fast            250%            -               Nope. But otherwise least problems to run.        
+ Gluster        MyISAM (no logs)        1 MB/s          unusable        150%            600-800%        Perfect. But too slow (up to completely unusable if bin-logs are on). Slow MyISAM recovery!
+ Gluster/Block  MyISAM (no logs)        5 MB/s          slow, but OK    200%            ~ 50%           No problems on reboot, but requires manual work to detach the volume if a node crashes.
+ Galera         INNODB                  3.5 MB/s        fast            3 x 200%        -               Should be perfect, but I am not sure about automatic recovery...
+ MySQL Slaves   INNODB                  6-8 exp.        fast                                            Available data is HA, but caching is not. We can easily turn the slave to master.
+ DRBD           MyISAM (no logs)        4-6 exp.        ?                                               I expect it to be a faster option, but it does not fit the complete concept.
+ 
+
+ Gluster is way too slow for anything. If a node crashes, MyISAM tables may be left in a corrupted state, and the recovery will take ages to complete.
+Gluster/Block is faster, but HA suffers. The volume stays attached to the pod that was running on the crashed node. It does not seem to be detached
+automatically until the failed pod (in 'Unknown' state) is killed with
+    oc -n adei delete --force --grace-period=0 pod mysql-1-m4wcq
+Then, after some delay, it is re-attached to the new running pod. Technically, we could run some kind of monitoring service which will detect such nodes
+and restart the pods. Still, this solution is limited to MyISAM with binary logging disabled. Unlike the plain Gluster solution, the clients may use the system
+while caching is in progress, but it is quite slow. The main trouble is MyISAM corruption, as the recovery is slow.
+
+ Galera is slower than Gluster/Block, but is fully available. The clients also have more servers to query data from. The cluster start-up is a bit
+tricky and I am not sure that everything will work smoothly now. Some tuning may be necessary. Furthermore, it seems that if the cluster crashes, we 
+can recover from one of the nodes, but all the data will be destroyed on the other members and they will pull the complete dataset. The synchronization
+is faster than caching (~ 140 MB/s), but it will still take about 10 hours to synchronize 5 TB of KATRIN data.
+ 
+So, there is really no fully HA-capable solution at the moment. The most reasonable option seems to be compromising on caching HA.
+ - MySQL with slaves.  The asynchronous replication should be significantly faster than Galera. The passthrough to source databases will keep working 
+ (i.e. status displays) and current data stays available. And we can easily switch the master if necessary.
+
+The other reasonable options have some problems at the moment and can't be used.
+ - Galera is a fine solution, but would need some degree of initial maintenance to work stably. Furthermore, the caching is quite slow. And the 
+ resync is a big issue.
+ - Gluster/Block would be a good solution if volume detachment is fixed. As it stands, we don't have HA without manual intervention. Furthermore, the
+ MyISAM recovery is quite slow.
+ - HostMount will be using our 3-node storage optimally. But if something crashes, it takes 1 week to recache the data.
+
+Gluster/Block
+=============
+ The idea is pretty simple. A standard gluster file system is used to store 'block' files (just normal files). These files are used as block devices
+ with a single-pod access policy. The GFApi interface is used to access the data on Gluster (avoiding context switches) and is exposed over iSCSI to the clients.
+
+ There are a couple of problems with configuration and run-time.
+ - The default Gluster containers will complain about rpcbind. We are using host networking in this case and the required port (111) conflicts between
+ the container and the host system. We are, however, able to just use the host rpcbind. Consequently, rpcbind should be removed from the Gluster container
+ and the requirement removed from the gluster-blockd systemd service. It is still worth checking that the port is accessible from the container (but it 
+ should be). We additionally need 'iscsi-initiator-utils' in the container.
+
+ - Only a single pod should have access to the block device. Consequently, when the volume is attached to the client, other pods can't use it any more.
+ The problem starts if the node running the pod dies. This is not handled perfectly by OpenShift now. The volume remains attached to the pod in the 'Unknown' state
+ until it is manually killed. Only then, after another delay, is it detached and available for the replacement pod (which will be stuck in the ContainerCreating 
+ phase until then). Pods in the 'Unknown' state are not easy to kill; the deletion has to be forced:
+    oc delete --force --grace-period=0 pod/mysql-1-m4wcq
+
+ - Heketi is buggy. 
+  * If something goes wrong, it starts creating multitudes of Gluster volumes and finally crashes with a broken database. It is possible to remove the 
+  volumes and recover the database from backup, but it is time consuming and unreliable for an HA solution.
+  * Particularly, this happens if we try to allocate more disk space than available. OpenShift configures the size of the Gluster file system used
+  to back block devices. It is 100 GB by default. If we specify 500Gi in the pvc, it will try to create 15 such devices (another maximum configured by
+  OpenShift) before crashing.
+  * Overall, I'd rather use only manual provisioning.
+
+ - Even without heketi it is still problematic (maybe it is better with the official RH container running on GlusterFS 3.7, but I did not check). We
+ can try again with GlusterFS 4.1. There are probably multiple problems, but
+  * GlusterFS may fail on one of the nodes (while showing it as up and running). If any of the block services has problems communicating with the local gluster
+  daemon, most requests to the gluster daemon will time out (info/list will still work, but slowly).
+  
+Galera
+======
+ - To bring a new cluster up, there are several steps.
+  * All members need to initialize standard standalone databases.
+  * One node should perform the initialization and the other nodes join after it is completed.
+  * The nodes will delete their mysql folders and re-synchronize from the first node.
+  * Then, the cluster will be up and all nodes will be in the so-called primary state.
+
+ - The procedure is similar for crash recovery:
+  * If a node leaves the cluster, it may just come back and be re-synchronized from the other
+  cluster members if there is a quorum. For this reason, it is necessary to keep at least
+  3 nodes running.
+  * If all nodes crashed, then again one node should restart the cluster and the others join
+  later. For older versions, it is necessary to run mysqld with '--wsrep-new-cluster'.
+  Newer versions try to automate it and will recover automatically if 'safe_to_bootstrap' = 1
+  in 'grastate.dat' in the mysql data folder. It should be set by Galera based on some heuristic,
+  but in fact I always had to set it manually. IMPORTANT: it should be set only on one of 
+  the nodes.
+ 
+ - Synchronization only works for INNODB tables. Furthermore, binary logging should be turned
+ on (yes, it is possible to turn it off and there are no complaints, but then only the table names are
+ synchronized and no data is pushed between the nodes).
+ 
+ - OpenShift uses 'StatefulSet' to perform such initialization. Particularly, it starts the first
+ node and waits until it is running before starting the next one. 
+  * Now the nodes need to talk to each other. The 'headless' service is used for that. 
+  Unlike a standard service, the DNS does not load-balance service pods, but returns the IPs of
+  all service members if an appropriate DNS request is sent (SRV). In the Service spec we specify
+        clusterIP: None                                 - old version
+  For clients we still need a load-balancing service. So, we need to add a second service 
+  to serve their needs.
+  * To decide if it should perform cluster initialization, the node tries to resolve the members
+  of the service. If it is alone, it initializes the cluster. Otherwise, it tries to join the other
+  members already registered in the service. The problem is that by default, OpenShift will only
+  add a member when it is ready (readiness check). Consequently, all nodes will try to 
+  initialize. There are two methods to prevent this. One works up to 3.7 and the other from 3.8 on 
+  (but there is no harm in using both for now). 
+    The new one is to set in the Service spec:
+        publishNotReadyAddresses: True
+    The old one is to specify in the Service metadata.annotations:
+        service.alpha.kubernetes.io/tolerate-unready-endpoints: true
+  * Still, we should quickly check for peers until other pods had chance to start.
+  * Furthermore, there are some differences from the 'dc' definition. We need to specify 'serviceName'
+  in the StatefulSet spec.
+      serviceName: adei-ss
+  There are a few other minor differences. For instance, the 'selector' has a more flexible notation
+  and should include 'matchLabels' before specifying the 'pod' selector, etc.
+  
+ - To check current status of the cluster
+        SHOW STATUS LIKE 'wsrep_cluster_size';
+ 
+\ No newline at end of file
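Note on the Galera section of docs/databases.txt above: the two services it describes (a headless one for peer discovery and a normal load-balancing one for clients) can be sketched as manifests. This is an illustration under assumptions, not the manifests used by the project: the pod label 'app: mysql', the port 3306, and the '-clients' name suffix are hypothetical; 'adei-ss', 'clusterIP: None', 'publishNotReadyAddresses' and the 'tolerate-unready-endpoints' annotation come from the text. The manifests are built as Python dictionaries and can be dumped to JSON.

```python
import json


def galera_services(name="adei-ss", app_label="mysql", port=3306):
    """Return (headless, clients) Service manifests as plain dictionaries."""
    selector = {"app": app_label}              # hypothetical pod label
    ports = [{"name": "mysql", "port": port}]  # hypothetical port definition
    headless = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "annotations": {
                # old method: publish pods before they pass the readiness check
                "service.alpha.kubernetes.io/tolerate-unready-endpoints": "true",
            },
        },
        "spec": {
            "clusterIP": "None",               # headless: DNS returns all member IPs
            "publishNotReadyAddresses": True,  # new method, same purpose
            "selector": selector,
            "ports": ports,
        },
    }
    clients = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name + "-clients"},  # hypothetical name
        "spec": {"selector": selector, "ports": ports},
    }
    return headless, clients
```

The output of 'json.dumps(manifest, indent=2)' can be reviewed and then passed to 'oc create -f -'. The StatefulSet itself would reference the headless service through 'serviceName'.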
diff --git a/docs/managment.txt b/docs/managment.txt
index 1eca8a8..9436c3c 100644
--- a/docs/managment.txt
+++ b/docs/managment.txt
@@ -96,17 +96,23 @@ Problems
 
 Storage / Recovery
 =======
+ - We have some manually provisioned resources which need to be fixed.
+    * GlusterFS endpoints should be pointing to the new nodes.
+    * If we use Gluster/Block storage, all 'pv' refer to iscsi 'portals'. They also have to be updated to 
+    the new server names. I am not sure how this is handled for auto-provisioned resources.
  - Furthermore, it is necessary to add glusterfs nodes on a new storage nodes. It is not performed 
  automatically by scale plays. The 'glusterfs' play should be executed with additional options
  specifying that we are just re-configuring nodes. We can check if all pods are serviced
     oc -n glusterfs get pods -o wide
  Both OpenShift and etcd clusters should be in proper state before running this play. Fixing and re-running
  should be not an issue.
- 
+
  - More details:
     https://docs.openshift.com/container-platform/3.7/day_two_guide/host_level_tasks.html
 
 
+
+
 Heketi
 ------
  - With heketi things are straighforward, we need to mark node broken. Then heketi will automatically move the
@@ -160,7 +166,13 @@ KaaS Volumes
  
 Scaling
 =======
-We have currently serveral assumptions which will probably not hold true for larger clusters
+ - If we use container native routing, we need to add routes to new nodes on the Infiniband routes, 
+ see docs:
+    https://docs.openshift.com/container-platform/3.7/install_config/configuring_native_container_routing.html#install-config-configuring-native-container-routing
+ Basically, the Infiniband switch should send packets destined to the network 11.11.<hostid>.0/24 to corresponding node, i.e. 192.168.13.<hostid>
+    
+We also currently have several assumptions which will probably not hold true for larger clusters
  - Gluster
     To simplify matters we just reference servers in the storage group manually
     Arbiter may work for several groups and we should define several brick path in this case
+
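Note on the Scaling section of docs/managment.txt above: the mapping from pod networks 11.11.<hostid>.0/24 to node addresses 192.168.13.<hostid> can be generated for a list of host ids. This is a minimal sketch; the function name is hypothetical, and the emitted Linux 'ip route add' syntax is an assumption, since the actual command depends on the router or switch being configured.

```python
import ipaddress


def native_routes(host_ids, pod_net="11.11.{}.0/24", gateway="192.168.13.{}"):
    """Return one route command per node for container native routing.

    The address templates follow the scheme described in the text; the
    'ip route add' form is only an example of how a Linux router would take it.
    """
    routes = []
    for host_id in host_ids:
        # ipaddress validates the result and raises ValueError for bad host ids
        net = ipaddress.ip_network(pod_net.format(host_id))
        gw = ipaddress.ip_address(gateway.format(host_id))
        routes.append("ip route add {} via {}".format(net, gw))
    return routes
```

For example, 'native_routes([1, 3, 4])' yields three route lines, one for each node id.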
diff --git a/docs/network.txt b/docs/network.txt
index a164d36..bcd45e1 100644
--- a/docs/network.txt
+++ b/docs/network.txt
@@ -56,3 +56,26 @@ Hostnames
  The linux host name (uname -a) should match the hostnames assigned to openshift nodes. Otherwise, the certificate verification
  will fail. It seems minor issue as system continue functioning, but better to avoid. The check can be performed with etcd:
     etcdctl3  --key=/etc/etcd/peer.key --cacert=/etc/etcd/ca.crt --endpoints="192.168.213.1:2379,192.168.213.3:2379,192.168.213.4:2379"
+
+Performance
+===========
+ - Red Hat recommends using Native Container Routing for speeds above 1Gb/s. It creates a new bridge connected to the fast fabric, and docker is
+ configured to use it instead of the docker0 bridge. The docker0 is routed through the Open vSwitch fabric, while the new bridge should go directly.
+ Unfortunately, this does not work with Infiniband. IPoIB is not fully Ethernet compatible and does not work as a slave in bridges. 
+  * There are projects for full Ethernet compatibility (eipoib) providing Ethernet L2 interfaces.  But it seems there is no really mature 
+  solution ready for production. It also penalizes performance (by about 2x).
+  * Mellanox cards work in both Ethernet and Infiniband modes. It is no problem to select the current mode with:
+     echo "eth|ib|auto" >  /sys/bus/pci/devices/0000\:06\:00.0/mlx4_port1
+  However, while the switch supports Ethernet, this requires an additional license costing basically 50% of the original switch price (about
+  4 kEUR for SX6018). The license is called UPGR-6036-GW.
+
+ - Measured performance
+    Standard:                           ~ 3.2 Gb/s
+    Standard (pods on the same node)    ~ 20 - 30 Gb/s
+    hostNet (using cluster IP )         ~ 3.6 Gb/s
+    hostNet (using host IP)             ~ 12 - 15 Gb/s
+  
+  - So, I guess the optimal solution is really to introduce a second router for the cluster, but with an Ethernet interface. Then, we can
+  reconfigure the second Infiniband adapter for the Ethernet mode. The switch to native routing should also be possible on a running
+  cluster with a short downtime. As a temporary solution, we may use hostNetwork.
+  
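Note on the measured performance in docs/network.txt above: the figures can be cross-checked against the files in docs/benchmarks. The sketch below assumes the three columns are message size in bytes, throughput in Mbps, and transfer time in seconds (the data is consistent with this reading, but the commit does not state it). The function names and the 1 MiB threshold for "large messages" are hypothetical.

```python
def parse_netpipe(text):
    """Parse NetPIPE-style rows into (bytes, mbps, seconds) tuples.

    A leading '+' from the diff view is tolerated; anything that is not
    three numeric fields (diff headers, blank lines) is skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.lstrip("+").split()
        if len(fields) != 3:
            continue
        try:
            rows.append((int(fields[0]), float(fields[1]), float(fields[2])))
        except ValueError:
            continue
    return rows


def plateau_mbps(rows, min_bytes=1 << 20):
    """Mean throughput over messages of at least min_bytes, or None if there are none."""
    large = [mbps for size, mbps, _ in rows if size >= min_bytes]
    return sum(large) / len(large) if large else None
```

Applied to each of the three benchmark files, 'plateau_mbps(parse_netpipe(text))' gives a single number per setup that can be compared with the table above.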
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
index b4ac8e7..ef3c206 100644
--- a/docs/troubleshooting.txt
+++ b/docs/troubleshooting.txt
@@ -60,6 +60,8 @@ Debugging
         oc logs <pod name> --tail=100 [-p]                  - dc/name or ds/name as well
  - Verify initialization steps (check if all volumes are mounted)
         oc describe <pod name>
+ - Security (SCC) problems are visible if the replication controller is queried
+        oc -n adei get  rc/mysql-1 -o yaml
  - It worth looking the pod environment
         oc env po <pod name> --list
  - It worth connecting running container with 'rsh' session and see running processes,
@@ -85,6 +87,7 @@ network
     * that nameserver is pointing to the host itself (but not localhost, this is important
     to allow running pods to use it)
     * that correct upstream nameservers are listed in '/etc/dnsmasq.d/origin-upstream-dns.conf'
+    * that correct upstream nameservers are listed in '/etc/origin/node/resolv.conf'
     * In some cases, it was necessary to restart dnsmasq (but it could be also for different reasons)
  If script misbehaves, it is possible to call it manually like that
     DEVICE_IFACE="eth1" ./99-origin-dns.sh eth1 up
@@ -96,6 +99,7 @@ etcd (and general operability)
  may be needed to restart them manually. I have noticed it with 
     * lvm2-lvmetad.socket       (pvscan will complain on problems)
     * node-origin
+    * glusterd in container     (just kill the misbehaving pod, it will be recreated)
     * etcd               but BEWARE of too entusiastic restarting:
  - However, restarting etcd many times is BAD as it may trigger a severe problem with 
  'kube-service-catalog/apiserver'. The bug description is here
@@ -181,6 +185,13 @@ pods (failed pods, rogue namespaces, etc...)
                 docker ps -aq --no-trunc | xargs docker rm
 
 
+Builds
+======
+ - After changing storage for integrated docker registry, it may refuse builds with HTTP error 500. It is necessary
+ to run:
+    oadm policy reconcile-cluster-roles
+
+
 Storage
 =======
  - Running a lot of pods may exhaust available storage. It worth checking if 
@@ -208,3 +219,41 @@ Storage
         gluster volume start <vol>
     * This may break services depending on provisioned 'pv' like 'openshift-ansible-service-broker/asb-etcd'
     
+ - If something has gone wrong, heketi may end up creating a bunch of new volumes, corrupting the database, and crashing,
+ refusing to start. Here is the recovery procedure.
+    * Sometimes, it is still possible to start it by setting the 'HEKETI_IGNORE_STALE_OPERATIONS' environment
+    variable on the container.
+        oc -n glusterfs env dc  heketi-storage -e HEKETI_IGNORE_STALE_OPERATIONS=true
+    * Even if it works, it does not solve the main issue with corruption. It is necessary to start a 
+    debugging pod for heketi (oc debug), export the corrupted database, fix it, and save it back. Having
+    a database backup could save a lot of hassle finding what is amiss.
+        heketi db export --dbfile heketi.db --jsonfile /tmp/q.json
+        oc cp glusterfs/heketi-storage-3-jqlwm-debug:/tmp/q.json q.json
+        cat q.json | python -m json.tool > q2.json
+        ...... Fixing .....
+        oc cp q2.json glusterfs/heketi-storage-3-jqlwm-debug:/tmp/q2.json 
+        heketi db import --dbfile heketi2.db --jsonfile /tmp/q2.json
+        cp heketi2.db /var/lib/heketi/heketi.db
+    * If a bunch of disks was created, there are still various left-overs. First, the Gluster volumes
+    have to be cleaned. The idea is to compare 'vol_' prefixed volumes in Heketi and Gluster, and
+    remove the ones not present in heketi. There is a script in 'ands/scripts'.
+    * There are LVM volumes left over from Gluster (or even allocated, but not associated with Gluster due to
+    various failures, so this clean-up is worth doing independently). On each node we can easily find
+    volumes created today
+        lvdisplay -o name,time -S 'time since "2018-03-16"'
+    or again we can compare the lvm volumes which are used by Gluster bricks with those which are not. The latter
+    ones should be cleaned up. Again, there is a script.
+     
+Performance
+===========
+ - To find if OpenShift restricts the usage of system resources, we can 'rsh' to container and check
+ cgroup limits in sysfs
+    /sys/fs/cgroup/cpuset/cpuset.cpus
+    /sys/fs/cgroup/memory/memory.limit_in_bytes
+
+
+Various
+=======
+ - IPMI may cause problems as well. Particularly, the mounted CDrom may start complaining. Easiest is
+ just to remove it from the running system with
+     echo 1 > /sys/block/sdd/device/delete
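Note on the heketi recovery procedure in docs/troubleshooting.txt above: the comparison of 'vol_' prefixed volumes between Gluster and Heketi is a set difference. The sketch below is not the script from 'ands/scripts'; the function name is hypothetical, and both inputs are assumed to be plain lists of volume names that have already been extracted from the respective tools.

```python
def orphaned_volumes(gluster_volumes, heketi_volumes, prefix="vol_"):
    """Return Gluster volumes with the given prefix that Heketi does not know about.

    Volumes without the prefix (for instance manually provisioned ones) are
    never reported, so they cannot be selected for removal by mistake.
    """
    known = set(heketi_volumes)
    return sorted(
        name
        for name in set(gluster_volumes)
        if name.startswith(prefix) and name not in known
    )
```

The returned names are only candidates; it is worth reviewing the list by hand before stopping and deleting any volume.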