The Frowning Pony
Administrator
13 blinks per minute
For those wanting a tl;dr of wtf happened, have some bullet points:
- Server handling arbitration, DNS and backups dies
- Lack of proper arbitration causes data desync, hidden images
- Backups temporarily set to a separate partition on cadance, the image server
- New backup server goes online, but needs setting up
- Datacenter where cadance lives suffers a network fault
- cadance does not come back after network fault is resolved
- Per host’s intervention guidelines, server is rebooted to ensure it’s not a software issue
- Reboot triggers a file system check, as expected
- Host tech notices a trivial fault with a remote control module and reboots the system during the check, causing data corruption
- After 8 hours on the phone with the host insisting the machine was in perfect health and that they were looking at a login screen, we pay for KVM over IP to get direct access to the machine
- Machine has obviously never gotten anywhere near a login screen; file system is corrupted
- Due to the gap in the backups, we begin very slow data recovery
- To speed up the process, we order a USB drive for the server
- After 18 hours, host plugs the drive in and reboots the server in the middle of data recovery, severely setting the process back
- All recoverable data is restored from cadance and the main backup; site goes back up
- You are here