Saturday, January 13, 2018

Proxmox postmortem

A list of failures:

  • No reliable monitoring of the server or its equipment.
  • Using a USB drive as the primary boot drive.
  • Not backing up said USB drive.
  • Not backing up said drive even after disk symptoms appeared. (The I/O issues were a hint that failure was imminent.)
  • Not having an automated deployment process.
  • Keeping data behind what was (possibly?) a single point of failure. If recovery of the data is possible at all, it will have to wait a while.
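The backup failures above would have been cheap to avoid. A rough sketch of imaging a boot drive to a file, assuming GNU coreutils; the device and destination paths are hypothetical examples (check `lsblk` for the real device, and image it while unmounted):

```shell
# Image a small boot drive and verify the copy.
backup_image() {
    src="$1"    # device or file to image, e.g. /dev/sdb (hypothetical)
    dest="$2"   # where the image lands, e.g. /mnt/nas/server-boot.img
    dd if="$src" of="$dest" bs=4M conv=fsync 2>/dev/null
    # Byte-for-byte comparison of source and image.
    cmp -s "$src" "$dest" && echo "image verified"
}

# Example (hypothetical paths):
# backup_image /dev/sdb /mnt/nas/server-boot.img
```

Dropping the resulting image anywhere off the failing drive would have made the rest of this post unnecessary.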

Last week, I noticed that my basement server was down. I attempted a reboot, to no avail. After plugging in a monitor, I saw that the boot disk had issues. By chance, I plugged the USB disk into another slot, and the server booted the OS with the proper output.
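"Noticed" is doing a lot of work there; the outage was only found by chance. Even a trivial reachability check on a cron schedule from another machine would have flagged it. A minimal sketch, where the hostname and log path are hypothetical placeholders:

```shell
# Ping a host once, log the result, and print up/DOWN.
check_host() {
    host="$1"
    log="$2"
    if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
        status="up"
    else
        status="DOWN"   # a real setup would hook a notification here
    fi
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $host $status" >> "$log"
    echo "$status"
}

# Example cron line (hypothetical), checking every five minutes:
# */5 * * * * /usr/local/bin/check_host proxmox.lan /var/log/healthcheck.log
```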

Mission accomplished. /s

I was (innocently) happy with the easy resolution and went to my basement computer to kick off some patches on the server, where I encountered my disk encryption prompt. [1]

After logging into my client computer, I SSH'd into the server and manually kicked off an update. The update ran for a while and left the disk in a read-only state. At that point I was likely screwed regardless, so I won't go into further detail. (I don't know how to migrate the good partitions onto a new bootable disk, though learning how seems a worthwhile endeavor.)

There are a few problems here, chief among them not having backups of my primary disk. A better approach would have been to automate the full deployment process: there was no version control at any step of the image deployment or installation.
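A low-effort starting point for that automation is simply keeping the rebuild steps in a versioned script, so a redeploy is a checkout and a run rather than an exercise in memory. A sketch with hypothetical paths and package names:

```shell
# Create a small git-tracked repo holding a provisioning script.
mkdir -p /tmp/server-provision && cd /tmp/server-provision
git init -q .

cat > provision.sh <<'EOF'
#!/bin/sh
# Steps to rebuild the server from a fresh install, e.g.:
# apt-get update && apt-get -y dist-upgrade
# apt-get -y install openssh-server smartmontools
EOF
chmod +x provision.sh

git add provision.sh
# Identity flags are inline so the commit works without global git config.
git -c user.name=example -c user.email=example@example.com \
    commit -q -m "initial provisioning script"
```

From there, every manual change to the server becomes an edit and a commit instead of an undocumented one-off.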

These issues will be corrected in the near future.