| author | Suren A. Chilingaryan <csa@suren.me> | 2025-10-27 20:16:31 +0100 | 
|---|---|---|
| committer | Suren A. Chilingaryan <csa@suren.me> | 2025-10-27 20:16:31 +0100 | 
| commit | d35216ee0cbf9f1a84a6d4151daf870b1ff00395 (patch) | |
| tree | 4f8d0b6fc4878c50cb2d82ada9837f4ab4b75f36 | |
| parent | e4ef3c8c1bbf0e4ebde8a5479fd7b79180080970 (diff) | |
Document current katrin1 problems along with debugging performed and draft the action plan (HEAD, master)
| -rw-r--r-- | log.txt | 80 | 
1 files changed, 75 insertions, 5 deletions
@@ -1,11 +1,81 @@
-Hardware
---------
- - ipekatrin1: Replaced disk in section 9. LSI software reports all is OK, but hardware led indicates a error (red). Probably indicator is broken.
+ System
+ -------
+ 2025.09.28
+    - ipekatrin1:
+        * RAID controller doesn't see 10 disks and behaves erratically.
+        * Turned off the server and ordered a replacement.
+    - Storage:
+        * Restarted the degraded GlusterFS nodes and made them work on the remaining 2 nodes (1 replica + metadata for most of our storage needs).
+        * Turned out the 'database' volume was created in RAID-0 mode and was used as the backend for the KDB database. So, the data is gone.
+        * Recovered the KDB database from backups and moved it to the glusterfs/openshift volume. Nothing is left on the 'database' volume. It can be turned off.
+
+ 2025.10.27
+    - ipekatrin1:
+        * Disconnected all disks from the server and started preparing it as an application node
+    - Software:
+        * I have temporarily suspended all ADEI cronJobs to avoid resource contention on ipekatrin2 (as a restart would be dangerous now) [clean (logs, etc.) / maintain (re-caching, etc.) / update (detecting new databases)]
+    - Research:
+        * DaemonSet/GlusterFS selects nodes based on the following nodeSelector:
+            $ oc -n glusterfs get ds glusterfs-storage -o yaml | grep -B 5 -A 5 nodeSelector
+                  nodeSelector:
+                    glusterfs: storage-host
+          All nodes have the corresponding label in their metadata:
+            $ oc get node/ipekatrin1.ipe.kit.edu --show-labels -o yaml | grep  -A 20 labels:
+                  labels:
+                    ...
+                    glusterfs: storage-host
+                    ...
+        * This label is now removed from ipekatrin1 and should be restored if we bring storage back
+            oc label --dry-run node/ipekatrin1.ipe.kit.edu glusterfs-
+        * We further need to remove 192.168.12.1 from 'endpoints/gfs' (per namespace) to avoid possible problems.
+        * On ipekatrin1, /etc/fstab glusterfs mounts should be changed from 'localhost' to some other server (or commented out altogether). GlusterFS mounts
+        should be changed from localhost to
+            192.168.12.2,192.168.12.3:<vol>  /mnt/vol  glusterfs  defaults,_netdev  0 0
+        * All RAID volumes should also be temporarily commented out in /etc/fstab
+        * Further configuration changes are required to run the node without GlusterFS without causing damage to the rest of the system
+            GlusterFS might be referenced via: /etc/hosts, /etc/fstab, /etc/systemd/system/*.mount, /etc/auto.*, scripts/cron,
+                endpoints (per namespace), inline gluster volumes in PVs (global),
+                gluster-block endpoints / tcmu gateway list, sc (heketi storageclass) and controllers (ds, deploy, sts); just in case, also check heketi cm/secrets.
+    - Plan:
+        * Prepare application node [double-check before implementing]
+            > Adjust /etc/fstab and check systemd-based mounts. Shall we do something with hosts?
+            > Check/change cron & monitoring scripts
+            > Adjust node label and edit 'gfs' endpoints in all namespaces.
+            > Check glusterblock/heketi, strange PVs.
+            > Google other possible culprits besides the above.
+            > Boot ipekatrin1 and check that all is fine
+        * cronJobs
+            > Set affinity to ipekatrin1.
+            > Restart cronJobs (maybe reduce intervals)
+        * ToDo
+            > Ideally eliminate cronJobs altogether for the rest of the KaaS1 lifetime and replace them with a continuously running cron daemon inside a container
+            > Rebuild ipekatrinbackupserv1 as a new gluster node (using the disks) and try connecting it to the cluster
+
+ Hardware
+ --------
+ 2024
+    - ipekatrin1: Replaced disk in section 9. LSI software reports all is OK, but the hardware LED indicates an error (red). Probably the indicator is broken.
+
+ 2025.09 (early month)
+    - ipekatrin1: Replaced 3 disks (don't remember slots). Two of them had already been replaced once.
+    - Ordered spare disks
+ 2025.10.23
+    - ipekatrin1:
+        * Replaced RAID controller. Attempted a rebuild, but the disks get disconnected after about 30-40 minutes (recovered after shutoff, not reboot)
+        * Checked power issues: cabling bypassing PSU and monitoring voltages (12V system should not go below 11.9V). No change, voltages seemed fine.
+        * Checked cabling issues by disconnecting first one cable and then the other (supported mode, a single cable connects all disks). No change
+        * Tried to improve cooling by setting fan speeds to maximum (kept) and even temporarily installing an external cooler. Radiators were cool, also checked reported temperatures. No change, still goes down in 30-40 minutes.
+        * Suspect backplane problems. The radiators were quite hot before adjusting cooling. There seem to be known stability problems due to bad signal management in firmware when overheated. Firmware updates are suggested to stabilize it.
+        * No support by SuperMicro. Queried Tootlec about the possibility of getting a firmware update and/or ordering a backplane [Order RG_014523_001_Chilingaryan from 16.12.2016, Offer 14.10, Contract: 28.11]
+          Hardware: Chassis CSE-846BE2C-R1K28B, Backplane BPN-SAS3-846EL2, 2x MCX353A-FCB ConnectX-3 VPI
+        * KATRINBackupServ1 (3 years older) has a backplane with enough bays to mount the disks. We still need to be able to fit the RAID card and Mellanox ConnectX-3 board(s) with 2 ports (can live with 1).
+    - ipekatrin2: Noticed and cleared RAID alarm attributed to the battery subsystem.
+        * No apparent problems at the moment. Temperatures are all in order. Battery reports healthy. System works as usual.
+        * Set up temperature monitoring of the RAID card, currently 76-77C
+
 
 Software
 --------
  2023.06.13
     - Instructed MySQL slave to ignore 1062 errors as well (I have skipped a few manually, but errors appeared non-stop)
     - Also ADEI-KATRIN pod got stuck. Pod was running, but apache was stuck and not replying. This caused POD state to report 'not-ready' but for some reason it was still 'live' and pod was not restarted.
-
-
\ No newline at end of file
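
The /etc/fstab adjustment noted in the Research section (pointing GlusterFS mounts away from 'localhost') could be scripted roughly as below. This is a minimal sketch and not part of the commit: the function name `rewrite_glusterfs_mounts` and the sample entries are hypothetical, and the default server addresses are simply the two mentioned in the log. It only transforms text and never touches the real /etc/fstab, so the result can be reviewed before anything is applied.

```python
def rewrite_glusterfs_mounts(fstab_text, servers=("192.168.12.2", "192.168.12.3")):
    """Return fstab text with 'localhost:' glusterfs sources pointed at `servers`.

    Hypothetical helper: comment lines and non-glusterfs entries pass through
    unchanged; rewritten entries are re-joined with two spaces between fields.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_entry = len(fields) >= 3 and not line.lstrip().startswith("#")
        if is_entry and fields[2] == "glusterfs" and fields[0].startswith("localhost:"):
            # Keep the volume part, swap only the host part of the source field.
            volume = fields[0].split(":", 1)[1]
            fields[0] = ",".join(servers) + ":" + volume
            line = "  ".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical sample entries, not taken from the real host.
    sample = (
        "# sample fstab\n"
        "localhost:katrin  /mnt/katrin  glusterfs  defaults,_netdev  0 0\n"
        "/dev/sda1  /boot  ext4  defaults  1 2\n"
    )
    print(rewrite_glusterfs_mounts(sample), end="")
```

Commenting out the RAID volumes, which the log also calls for, is left out here because the log does not list which entries those are.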
