Systems Nearly Normal
There was just a bit of a thunderstorm last night. Twice, as we were lying down for bed, the lights flickered off and back on for a second or two, long enough for the smoke detectors to warn they'd lost power and spook the dogs. I cringed thinking of the computers usually left on, especially the web/mail/database/app server in the basement.
I didn't race down to check on anything. The laptops, of course, have their own batteries. The workstations are configured to just stay off. The "servers" are on UPSs, and should have been able to withstand the minor outage.
In the morning, I booted the workstations in the office and then went to the basement to check on the server. It was still on, as I expected, but when I powered on the monitor, I was greeted with an unwelcome little bit of text:
Starting up ...
No keystrokes seemed to have any effect. I rebooted. Same prompt. I rebooted again, brought up the GRUB menu, picked the recovery mode option, and got the same prompt. Overdue for a day at the client's, I shrugged and figured I'd just bang on it later. I powered the machine off and left for the day.
On the drive I chastised myself for not having a better backup and for not completing the migration to the new server. I also went through the checklist of things to work on: a dozen websites, mail service for those domains, the databases beneath the sites... I cringed, anticipating a headache ahead.
Things had ground to a halt at the client's as they waited for their service provider and DBAs to repair the databases they'd corrupted overnight, so I thought I'd do what I could remotely. Unfortunately, I couldn't reach either of the workstations that should have been available. At first I thought SSH filtering was to blame, but nothing could connect at all.
I zipped home at lunch, reset the router, double-checked it worked from the workstations, and returned to the office.
Able to connect, I set to rearranging the DNS to point at the new server (something like the zone fragment below) and double-checked that the server software was at least installed, if not configured right. Shortly after, the client's database came back and work resumed. The day wound down, and I returned home with a mental list of things to fix.
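For the curious, the DNS change amounted to repointing a few records and shortening the TTL so the switch takes hold quickly. A sketch, assuming BIND-style zone files; the names and addresses here are made up:

    ; Hypothetical zone fragment -- real names and addresses differ.
    $TTL 300                         ; short TTL while things are in flux
    www    IN  A    203.0.113.20     ; new server (old box was 203.0.113.10)
    mail   IN  A    203.0.113.20
    @      IN  MX   10 mail
    ; Remember to bump the SOA serial before reloading the zone.

If the zone lives in a registrar's control panel instead, the same change is just a couple of form fields.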
The first thing I did after returning home was to try booting the server again. Still no go. I decided to see if I could at least access the server's disk by booting an Ubuntu Live USB. The disk was still there. I ran a quick SMART check from the disk utility and an fsck for good measure, then rebooted, hoping that would do the trick. Nope, still broken.
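For reference, the command-line equivalent of those checks looks roughly like this, assuming the server's drive shows up as /dev/sda with the root filesystem on /dev/sda1 (device names vary):

    # From the Live USB, before mounting anything:
    sudo apt-get install smartmontools   # if smartctl isn't already on the live image
    sudo smartctl -H /dev/sda            # quick pass/fail health verdict
    sudo smartctl -a /dev/sda            # full attributes; watch Reallocated_Sector_Ct
    sudo fsck -f /dev/sda1               # force a check on the unmounted partition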
I rebooted into the Live USB again and started copying files over to the bigger workstation. It has a terabyte drive, so it can hold everything on the failing server's 250GB disk. I started with the websites, configuration files, and other important bits.
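The copying itself is nothing fancy. A sketch, with a made-up mount point and host name, mounting the old disk read-only to spare it any more writes:

    # Mount the ailing disk read-only, then push the important bits across.
    sudo mkdir -p /mnt/olddisk
    sudo mount -o ro /dev/sda1 /mnt/olddisk
    rsync -aHv /mnt/olddisk/var/www/ user@bigworkstation:/srv/rescue/www/
    rsync -aHv /mnt/olddisk/etc/     user@bigworkstation:/srv/rescue/etc/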
While the copies (still) run, I set to restoring the databases and making sure Apache and Tomcat work. I had a little trouble with PHP (which runs this blog), but worked that out (hopefully with JPEG support this time, so the CAPTCHA plug-in will finally work). Tomcat took a little tweaking, as the JDK was in a different place (a symlink, really). After just a few hours of poking, the servers are running again. The file copies aren't done, but the databases have been restored and the servers are reconfigured.
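The restore and the fixes, roughly, assuming MySQL dumps and with placeholder names throughout:

    # Recreate and reload one site's database from its dump:
    mysqladmin -u root -p create sitedb
    mysql -u root -p sitedb < sitedb.sql
    # Confirm PHP's GD was built with JPEG support (what the CAPTCHA needs):
    php -r 'var_dump(gd_info());'        # look for "JPEG Support" => true
    # Tomcat expected the JDK at the old path; recreate it as a symlink
    # to wherever this machine actually keeps Java (both paths made up):
    sudo ln -s /usr/lib/jvm/default-java /usr/lib/jvm/old-jdk-path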
Now I just need to wait for DNS propagation, confirm Sendmail is behaving, reconfigure SpamAssassin... the list is about half done...
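In the meantime I can at least watch for the DNS change to land and smoke-test the mail path. Roughly, with placeholder names:

    # Ask an outside resolver whether the new address is visible yet:
    dig +short www.example.com @8.8.8.8
    # Send myself a test message and watch Sendmail handle it:
    echo "test" | mail -s "post-crash test" me@example.com
    sudo tail -f /var/log/mail.log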