Posted by Deliverator on March 10th, 2008

As Ryan noted earlier today, Silverfir had an attack of the hiccups. Ryan noticed what appeared to be some sort of file corruption. A few weeks ago, we had some notices from Silverfir’s RAID controller, a Compaq Smart Array 5302, that one of the drives in the 14-disk array (which Silverfir uses to store all the websites it hosts) was having problems. At the time, I didn’t worry much, as the array was constructed in such a way that two drives could fail at the same time without data loss. In addition, I designated a drive in each array as a hot spare. Essentially, a hot spare is a drive kept unused in reserve so that if a drive fails, it can automatically be brought in as a replacement without the server maintainers needing to so much as lift a finger. Still, given the age of the drives in the array and my own real-world experience of hard disk failure rates, this time I was more than a little worried.

I headed over to check on the server in person, while Ryan went to work troubleshooting the problem remotely. By the time I got there, Ryan had already unmounted the array and was running a fsck (a type of filesystem integrity check) on it. The scan found a number of file system issues, most likely due to power failures during disk writes. Silverfir had its own UPS, but it was recently removed due to a dead battery. We never set up UPS monitoring/management agents, so all the UPS was doing was preventing the briefest brownouts and power outages from causing problems. Thankfully, Silverfir uses a journaling file system, and the file system check was able to repair all the issues with little to no apparent data loss.

On my end, I found no idiot lights or other indications of actual drive failure. Nonetheless, I went to work tarring up important (to me at least) data directories and transferring them over the network to my laptop. While I do occasionally back up my WordPress database, user account home directory and other assorted files, I don’t do it nearly as often as I should or in a really systematic fashion. Silverfir’s users are largely left to back up their own damn data themselves, thank you very much! This is a nice sentiment in theory, but the lack of a locally available backup drive or of a fat pipe for internet-based backup has made this problematic.

Silverfir has largely been a stompbox for Ryan and me, and as a result hasn’t exactly received the same diligent administrative attention that we would apply to a “real” server. We have largely just let it coast. I think we are both reaching the point where a little systematic backup and other prophylactic measures would give us both some peace of mind. To that end, I installed a PCI-to-USB 2.0 adapter card to allow an external hard disk to be hooked up for backup purposes. I am going to purchase a backup drive large enough that we should be able to keep quite a few backup points before needing to delete intermediate ones. I will also likely donate a UPS and some other hardware for the good of the order.
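For the curious, here is a rough sketch of what "systematic" could look like once the external drive is attached: write one timestamped tarball per run and delete the oldest ones once there are more than a set number. This is not what we actually run on Silverfir. The function name `make_backup`, the `keep` count, and the paths in the usage note are all made up for illustration.

```python
import tarfile
import time
from pathlib import Path


def make_backup(source_dirs, dest_dir, keep=10, stamp=None):
    """Write one timestamped .tar.gz of source_dirs into dest_dir,
    then delete the oldest archives so that at most `keep` remain.

    All names and defaults here are hypothetical, not Silverfir's setup.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # A sortable timestamp means alphabetical order is chronological order.
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"backup-{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        for src in source_dirs:
            src = Path(src)
            # Store each directory under its own name, not its full path.
            tar.add(src, arcname=src.name)

    # Oldest archives sort first; drop everything except the newest `keep`.
    archives = sorted(dest.glob("backup-*.tar.gz"))
    for old in archives[:-keep]:
        old.unlink()

    return archive
```

Something like `make_backup(["/home/someuser", "/var/www"], "/mnt/backup", keep=14)` run nightly from cron would keep two weeks of backup points, assuming the external drive is mounted at the (made up) `/mnt/backup`. Full tarballs are the crudest possible approach; incremental tools would use the drive space far better, but this shows the idea.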

Long term, I would like to get Silverfir on some more modern, compact, lower-power, quieter server hardware, achieving as much reliability through hardware as possible. I think Ryan, on the other hand, would just as soon move Silverfir onto commodity desktop hardware and achieve peace of mind through virtualization and/or rigorous backup. I think both approaches have their merits, so at some point we will need to have a discussion. For now, I think we are both happy to let the server go back to coasting (once we get the backup drive set up).