Hardware
--------
2024
 - ipekatrin1: Replaced disk in section 9. The LSI software reports all is OK, but the hardware LED indicates an error (red). Probably the indicator is broken.

2025.09 (early month)
 - ipekatrin2: Replaced 3 disks (don't remember the slots); two of them had already been replaced once before.
 - Ordered spare disks.

2025.10.23
 - ipekatrin2: Noticed and cleared a RAID alarm attributed to the battery subsystem.
   * No apparent problems at the moment. Temperatures are all in order, the battery reports healthy, and the system works as usual.

2025.09.28 - 2025.11.03
 - ipekatrin1: RAID controller failed. The system was not running stably after the replacement (disks disconnect after 20-30 minutes of operation).
 - ipekatrin1: Temporarily converted into a master-only node (app scheduling disabled, glusterfs stopped).
 - ipekatrin1: New disks (from ipekatrinbackupserv1) were assembled into the RAID, added to gluster, and a manual (file walk-through) healing was started (see the command sketch at the end of this log). Expected to take about 2-3 weeks (at a rate of about 2TB per day). No LVM configured, direct mount.
 - The application node will be recovered once we replace the system SSDs with larger ones (there is currently no space for images/containers, and I don't want to put them on the new RAID).
 - The original disks from ipekatrin1 are assembled in ipekatrinbackupserv1. The disconnect problem persists: some disks stop answering SENSE queries and the backplane then restarts a whole group of 10 disks. Still, all disks are accessible in JBOD mode and can be copied.
   * The XFS filesystem is severely damaged and needs repair. I tried accessing some files via the xfs debugger and it worked, so the directory structure and file content are, at least partially, intact and repair should be possible.
   * If recovery becomes necessary: buy 24 new disks, copy them one-by-one, assemble the RAID, recover the FS (see the recovery sketch at the end of this log).

2025.12.08
 - Copied the ipekatrin1 system SSDs to new 4TB drives and reinstalled them in the server (only 2TB is usable due to MBR limitations).

Software
--------
2023.06.13
 - Instructed the MySQL slave to ignore 1062 (duplicate entry) errors as well (I had skipped a few manually, but the errors kept appearing non-stop); see the sketch at the end of this log.
 - Also, the ADEI-KATRIN pod got stuck. The pod was running, but apache was stuck and not replying. This caused the pod to report 'not-ready', but for some reason it was still 'live' and the pod was not restarted.

2025.09.28
 - Restarted the degraded GlusterFS nodes and made them work on the remaining 2 nodes (1 replica + metadata for most of our storage needs).
 - It turned out the 'database' volume was created in RAID-0 mode and was used as the backend for the KDB database. So, that data is gone.
 - Recovered the KDB database from backups and moved it to a glusterfs/openshift volume. Nothing is left on the 'database' volume; it can be turned off.

2025.09.28 - 2025.11.03
 - GlusterFS endpoints temporarily changed to use only ipekatrin2 (see details in the dedicated logs).
 - Heketi and gluster-blockd were disabled and will not be available anymore. Existing heketi volumes are preserved.

2025.12.09
 - Re-enabled scheduling on ipekatrin1.
 - Manually ran 'adei-clean' on katrin & darwin, but keeping the 'cron' scripts stopped for now.
 - Restored configs: fstab and the */gfs endpoints. Heketi/gluster-block stays disabled. No other system changes.
 - ToDo: Re-enable the 'cron' scripts if we decide to keep the system running in parallel with KaaS2.
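
Command sketches (reference)
----------------------------
A minimal sketch of the manual (file walk-through) GlusterFS healing from the 2025.09.28 - 2025.11.03 hardware entry. The mount point /mnt/gluster and the volume name 'katrin_data' are placeholders, not our actual configuration: stat-ing every file through a FUSE client forces the replicas to be compared and healed, and the heal queue can be watched from any gluster node.

    # walk the volume through the FUSE mount to trigger self-heal on every file
    find /mnt/gluster -noleaf -print0 | xargs -0 stat > /dev/null 2>&1

    # watch progress: entries still pending heal
    gluster volume heal katrin_data info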
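
A sketch of the disk-by-disk copy and XFS repair path mentioned for the old ipekatrin1 disks, in case recovery becomes necessary. Device names and the map-file path are placeholders; the XFS tools run against the reassembled RAID device (shown here as /dev/sdb), not the individual members, and xfs_repair is only attempted after a read-only inspection and a dry run.

    # copy each original disk to a new one, tolerating read errors;
    # the map file lets an interrupted copy resume
    ddrescue -f /dev/sdX /dev/sdY /root/sdX.map

    # inspect the damaged filesystem read-only (this is the "xfs debugger")
    xfs_db -r /dev/sdb

    # dry run first, then the actual repair
    xfs_repair -n /dev/sdb
    xfs_repair /dev/sdb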
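
The MySQL slave change from 2023.06.13, assuming a classic option-file based replication setup (the file path is a placeholder). The option makes the slave skip duplicate-entry (1062) errors permanently and needs a mysqld restart; the one-liner is the manual per-error skip that was used a few times before making it permanent.

    # on the slave, e.g. in /etc/my.cnf
    [mysqld]
    slave-skip-errors = 1062

    # one-off manual skip of a single replication error
    mysql -e "STOP SLAVE; SET GLOBAL sql_slave_skip_counter = 1; START SLAVE;"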