Server Struggles After Power Flicker
This afternoon we had a little bit of a power outage. The whole neighborhood went dark just as dinner was finishing. There was enough light to finish eating, and we gathered around a cell phone to watch Shrek on Netflix and listen to the rain outside.
It's remarkable how quiet everything is when the power is out.
One thing that wasn't quiet was the chirping of the UPSs in the basement. There are three 1500VA UPSs attached to some of the big servers' power supplies, so they can carry on after the power drops. That's the idea, anyway. I let it ride for a few minutes, helping to get the kids started on dinner. Then I ducked downstairs to politely power off the servers, since it was apparent the power wasn't coming back soon.
Plugged into the UPSs are the two "big" servers, a usually-powered-off LCD display, and the network gear. The bigger of the two has four 950W (peak) PSUs but only needs one to run (though it does need all four to have power in order to boot), so only one of them is on the battery side of its UPS. The smaller one has two 200W PSUs and likewise only needs one to run (and only one to boot), so only one of those is on the battery side of the other UPS. The monitor and a network switch share that battery side with the smaller server. All of the other PSUs are plugged into the surge-protection-only outlets. The third UPS used to have another of the big servers' PSUs plugged into it, but because of cables everywhere, it now only carries the DSL router and my WiFi/firewall.
When I trekked down there, one of the UPSs was already out of juice and had shut down. This, of course, was the one the big server was plugged into, so my plan to shut the servers down neatly was immediately foiled. The other UPS, with the smaller server attached, reported about 15 minutes remaining, so it was clearly not draining very fast. That little server hosts the database engines and the big file storage, its flurry of ZFS drives giving it the best chance of surviving an impolite shutdown. The plan, normally, is to use the big server to SSH to the little server and shut it down nicely first, then shut down the big server, and finally silence the UPS alarms. The servers are configured (in theory) to turn on and boot again automatically when the power returns.
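For the record, the orderly version looks something like this (the hostname is a placeholder, and the exact shutdown flags depend on which OS each box runs):

    # storage/database box first, from a shell on the big server
    ssh root@little-server 'shutdown -h now'
    # then the big server itself
    shutdown -h now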
A couple hours later, the power returned.
The littlest server, a hand-me-down, since-rebuilt workstation still sitting under my desk in the office, started just fine. Unsettlingly, my main desktop, sitting right next to it under the same desk, did not. I bonked its power button and let it boot; I'll look into that later.
I used a tablet to SSH to the littlest server, and from there to the ILOM on the little big server. I told it to start, which it did, and the OS booted correctly once the hardware finished powering up. From the ILOM I connected to the console (it's like sitting in front of the machine, except this server is headless, so it's all virtual), logged into the OS, and started the database zone (I still can't find how to make that start automatically when the server restarts...maybe when I get some time...).
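A note to future me: I believe the missing piece is just the zone's autoboot property, plus having the zones service enabled in the global zone. A sketch, with a made-up zone name:

    # "dbzone" stands in for the real zone name
    zonecfg -z dbzone "set autoboot=true"
    # autoboot only kicks in if the zones service is running
    svcadm enable svc:/system/zones:default
    # confirm the setting took
    zonecfg -z dbzone info autoboot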
I couldn't SSH to the ILOM on the big big server, though. Some recent OS update on my other systems has tightened up the SSH ciphers and whatnot, so the old, old, old ILOM is now offering only algorithms the new OSes have decided are insecure. Likewise, I can't get to the ILOM web interfaces any more: the version of TLS they speak isn't supported by the browser, never mind that I can't get them to accept a modern SSL certificate, and the CA they're using is ten years out of date. So I went back downstairs to work on its screen and keyboard.
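For future reference, the usual workaround (which I still haven't set up) is a per-host exception in ~/.ssh/config that re-enables the legacy algorithms for just the ILOM. Roughly this, with a made-up host name and address:

    # ~/.ssh/config
    Host bigbig-ilom
        HostName 10.0.0.50
        # the ancient ILOM firmware only speaks these; scoped to this host only
        KexAlgorithms +diffie-hellman-group1-sha1
        HostKeyAlgorithms +ssh-rsa,ssh-dss
        Ciphers +aes128-cbc,3des-cbc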
The big big server had booted just fine, but was reporting that one of its disks couldn't be mounted. Grr. This is the disk that gave me some guff last year, the one I had intended to replace when I had a chance. I still haven't made the time. I did the same thing I'd done before: booted from a USB copy of the OS and used the disk tool there to repair the disk. That didn't work this time, so I ultimately ended up editing the drive out of the fstab; it's still plugged in, just no longer (pointlessly) mounted. After that, the server booted just fine.
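So I don't forget what I actually did: the entry is just commented out for now. Assuming the big box is Linux-flavored enough for it, adding nofail would be the gentler option if I ever want it to try mounting again without blocking boot. The device name and mount point below are placeholders:

    # /etc/fstab
    # the failing disk, commented out so boot can't hang on it:
    #/dev/disk/by-id/ata-EXAMPLE-DISK  /data/old  ext4  defaults  0 2
    #
    # alternative: keep the entry, but let boot continue if the mount fails:
    #/dev/disk/by-id/ata-EXAMPLE-DISK  /data/old  ext4  defaults,nofail  0 2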
I popped back to my desk and was now able to SSH into the OS running on that server. A quick scan of the system showed everything running just fine, even without the failing-even-more drive mounted. This blog runs on the big big server, fronted by a web server on the little server, with the database hosted on the little big server. The fact that I can look up the old blog entry and write this one shows that all is well again.
I've made some post-it notes reminding me to identify and replace the failed drives in the server. I have the spares. I need to figure out which physical disks they are, pull them, swap in the new ones, partition and format them...and then decide whether I need to use them at all. The server leans on the persistent storage on the other server, so it doesn't really need the local disks yet. But I could do without the frustration of a dead disk blocking a reboot.
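When I do get to it, the identification step is mostly about matching serial numbers to drive sleds. Something like this, where sdX stands in for whatever device the old fstab entry pointed at:

    # the stable ID includes the serial number printed on the drive label
    ls -l /dev/disk/by-id/ | grep sdX
    # double-check health and grab the serial straight from SMART
    smartctl -a /dev/sdX | grep -E -i 'serial|reallocated|pending'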
And I need to figure out a better plan for the UPSs. The OS does support them via the apcupsd package, which lets the server recognize that the UPS is running on battery and, when the charge hits a configured level, shut itself down gracefully. That should prevent the disks from being "dirty" when the server reboots. Plus, I should be able to add some triggers in there to call the CDN API and flag the domains as being in "maintenance mode" instead of serving error pages. And other fun stuff like that.
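The relevant knobs, as I understand them; the thresholds are guesses I'd want to tune, and the CDN endpoint is obviously made up:

    # /etc/apcupsd/apcupsd.conf
    UPSCABLE usb
    UPSTYPE usb
    DEVICE
    BATTERYLEVEL 30    # start shutting down when the charge falls to 30%
    MINUTES 10         # ...or when estimated runtime falls to 10 minutes
    TIMEOUT 0          # no fixed "shut down after N seconds on battery" timer

The CDN trigger could live in a hook script that apccontrol runs just before the shutdown:

    #!/bin/sh
    # /etc/apcupsd/doshutdown -- a natural spot for the "maintenance mode" call
    curl -s -X POST https://cdn.example.invalid/api/maintenance/enable   # placeholder URL
    exit 0   # any exit code other than 99 lets apccontrol carry on with the real shutdown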
All I need is time.
Everything seems to be working now, though. It just took some time to get it back.