DB Server Host Hiccups
Something you don't want to see when you tell your system to boot:
{0} ok boot
ERROR: boot-read fail
This usually means a hard disk failure, which is pretty bad. In my case, it means I forgot that I can't soft-reset this server.
Earlier, my server monitor started reporting that this web server was down. I checked, and it wasn't responding. I restarted the web server, and everything returned to normal. A few hours later, it was reported as down again. I checked again, and again it wasn't responding. I restarted the web server and poked around a little more. It seems the problem wasn't with the web server, but with the DB server on which this blog and other software rely. The web server's own problem was that a status page I created to check the health of the database connection was running "forever," and those stuck requests eventually consumed all of the available web server threads, causing it to die.
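One way to keep a status page like that from hanging is to put a hard deadline on the check. What follows is only a minimal sketch in Python, not the code behind my actual status page: the function names, the default two-second timeout, and the host and port you pass in are all hypothetical. It only proves that the DB host accepts a TCP connection in time; a real check would also want a timeout on the database driver's connect and query calls.

```python
import socket


def db_port_reachable(host, port, timeout_seconds=2.0):
    """Return True if a TCP connection to host:port opens before the deadline.

    Assumption: a reachable port is a good-enough proxy for "the DB host is
    alive". It does not prove the database can answer queries.
    """
    try:
        # create_connection gives up after timeout_seconds instead of waiting
        # indefinitely on a host that has stopped responding.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


def status_line(host, port, timeout_seconds=2.0):
    """Build the one-line answer a status page could show (hypothetical format)."""
    if db_port_reachable(host, port, timeout_seconds):
        return "database: reachable"
    return "database: unreachable"
```

With a bound like this, a dead DB host costs each status request a couple of seconds at most, rather than tying up a web server thread until the pool runs dry.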
So I set out to restart the DB server. First, I couldn't connect to its web interface for a peek. Then I couldn't connect to it via SSH. I could connect to the server's management port, both web and SSH, but couldn't log in from the management port to the running OS, either. So I told it to do a graceful restart in the web UI. After a bit, the management port returned, giving the "{0} ok" prompt, to which I responded "boot". It buried the "boot-read fail" message in a flurry of attempts to get an IP address. This flurry sent me down a path of checking the DHCP service on the router, restarting that (for no reason), and eventually scrolling back to the beginning of the flurry to see the message.
Facepalm. I've done this before. I thought I had made a note of it here, but I must have just buried the memory.
I told the management system to stop the server. Then I told it to start the server. Connecting to the OS console, it happily showed all of the POST and diagnostic messages, and eventually the same "{0} ok" prompt. I did a quick peek at some of the set-up, saw that it could indeed see all eight hard disks, and so I told it to boot again. Immediately it started with the OS splash and the few messages in its boot flurry before inviting me to log in. I logged in, and peeked at some logs. Nothing meaningful. It seems to have just stopped responding around 6 PM tonight. I hate it when that happens.
I started the database server, and it complained that it wasn't able to achieve a lock. Possibly something left over from the harsh shutdown. I "stopped" the database server, and after a moment of thinking (where it probably validated the locks and removed them), it told me it was already stopped. I started it again, and it came up fine. A quick connect and check, and nothing seems amiss.
Beneath the DB are both a heavily journaling ZFS and mirrored hard disks. It shouldn't have a problem recovering. Fingers crossed, now that I've issued that challenge.
I still mean to add another DB server, and join them in a cluster, so that should this happen again, the other can take over. I just need to find the time to do it, now that the big server has additional storage.