Server Fail Over Weekend
Turned out to be nothing, but it took two days to resolve.
Over this last weekend, there were some hiccups on my WiFi. I've got my ISP router, on which I've turned off their unsatisfactory WiFi, attached to my good and powerful WiFi router. Occasionally a conflict will occur, where the ISP router will get cranky, and the interaction between it and the WiFi router will leave the WiFi router CPU-pegged. It takes a couple of tries of resetting both of them to get them to play nice.
This happened on Sunday night, while trying to stream some football, no less.
When I reset the ISP router, the servers (this one, in particular) attached to it go off-line. This generates an alert from my Internet-based up-time monitor. Moments later, when the router finishes booting and the server returns, I get an alert letting me know the Internet can again see my server. This happened a handful of times as I turned the devices off and on. All was eventually fine.
Later, after going to bed, this server went off-line by itself. Another alert was generated. In the morning, I cleared the bevy of alerts and thought nothing of it. Later, I got another alert, but this time the seldom-seen "off-line for hours" alert. I checked, and while I could reach the network and some servers, I couldn't reach this one. When I got the first chance to sit at the server, the keyboard did nothing on the screen, the cursor waiting to accept the login name wasn't blinking, and the caps-lock and num-lock toggle tests failed. The server was unreachable from the local network by ping or SSH.
So I shut it down. First I did a kind ACPI kill (bonk the power button). Normally, the server springs into action, stopping services and scrolling text on the screen. I tried again, thinking maybe I hit the button wrong. I also tried the reset button, which really does the same thing but doesn't turn the server off at the end. Nothing worked. I figured whatever horrible thing went wrong, the CPUs were probably pegged or otherwise ignoring the interrupts, maybe the kernel crashed without a panic or whatever. So I long-pressed the power button and was rewarded with fan silence and a blank screen after a few seconds.
I turned it back on. It failed to POST. Firmware checksum failure and floppy disk failure messages, with an invitation to hit F1 to continue or DEL to enter the setup. There is no floppy disk on the system (I haven't had a system with a floppy for a dozen or more years), and it had previously been disabled. I also thought there were two hard drives and a DVD in the system (former desktop, otherwise no need for a DVD), so I thought maybe a hard disk had failed and caused all of the problems. I shrugged and figured a trip through the BIOS would maybe fix it, and give me a chance to get it to search for the missing disk. The keyboard still did nothing. I pulled its plug and tried a different USB port. Nothing. I tried several keyboards in several USB ports. Nothing. Now I figured the BIOS was borked, and the machine was likely scrap metal.
I turned to the new server. There's a post about the 128GB RAM machine with 8 quad-core CPUs that I have sitting idle while I try to re-imagine my server infrastructure. Kids, work, and a cluttered office steal all of that time and will.
Twice a day, a time-machine-like backup runs against my servers. Fancy rsync and some compression keep a week-long backup, with first-of-the-month milestones going back a year. I checked. The mail and web server data are all there. The mail and some other configurations are also there. The web server is a problem, though. The backup was configured to capture the old from-source web server configuration, but when I moved to a package-based installation, I failed to add the new configuration locations to it. More heavy sighing. It was a several-night effort to migrate all of those configurations. So I started over.
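For the curious, the heart of that kind of backup scheme is just rsync with hard-linked snapshots. Here's a minimal sketch with made-up paths and host names (the real script also handles the compression and the first-of-the-month milestones):

    #!/bin/sh
    # Hypothetical source and destination; not my actual layout.
    SRC="webserver:/srv/"
    DEST="/backup/webserver"
    SNAP="$DEST/$(date +%Y-%m-%d-%H%M)"

    # Hard-link unchanged files against the previous snapshot so each
    # run only costs the changed data.
    rsync -az --delete --link-dest="$DEST/latest" "$SRC" "$SNAP"

    # Repoint "latest" at the new snapshot.
    ln -snf "$SNAP" "$DEST/latest"

    # Expire snapshots older than a week (milestones live elsewhere).
    find "$DEST" -maxdepth 1 -name '20*' -mtime +7 -exec rm -rf {} +

Each snapshot looks like a full copy, but unchanged files are hard links, so a week of snapshots costs little more than one copy plus the churn.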
I copied a bunch of data to the new server. I copied a bunch of the old config files and started to make new configurations out of them. Then I paused and said, "for disaster recovery, let's go even simpler," and installed the old version of the web server so I could use the old configuration files. I installed it from packages, although I had the source and config files and could have built it. I tweaked the new copies of the old configuration to get the differences in the package installation to work with my new paths.
It wasn't the way I wanted to do it, with the web server installed at the root of the machine. I wanted to put the main web server in a container, turning it into a proxy, and separate the websites and Tomcat and other packaged web apps into different containers. I had all kinds of visions of the same kind of separation and portability and scalability that I do at work every day...where I get paid to have the time to do it. Instead, I did the dirty deed and installed the web app on the server, and put Java and Tomcat and PHP and all of the rest there, too. It made me sad, but it would work.
Then I tried to add a second network port to the server. The old server is multi-homed. One NIC connects to the LAN to talk to the other servers, where the database, some storage, and the backup server all live. The other NIC connects directly to the ISP router (which provides some DoS protection and simple firewalling; the much more powerful firewalling happens at the OS) and gets a static Internet address. Both of the routers forward the HTTP and HTTPS ports to other servers (both to protect the routers and to leverage my small IP pool), so I needed to assign the old server's IP to my new server, and I needed the same routing tricks so that it would know to originate traffic from the Internet IP and not the NAT LAN IP.
Addressing was easy enough, but I couldn't get the traffic to flow right. I could establish a connection to the server from outside my network (testing by using my phone as a hotspot), but I couldn't get the web server to bind or respond or whatever. Also, any traffic I tried to initiate from the server always went through the LAN. I spent hours trying to get the routing to work. A big difference between the old and new server was the OS, because I thought I'd choose one a little closer to containerization than the one the old server was running. To be fair, they both support the containerization I was planning to use, but the one seems to create smaller containers.
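For reference, the source-based routing I was chasing boils down to a second routing table plus a rule, something like the following with Linux iproute2. The addresses and interface name here are made up for illustration:

    # Hypothetical WAN details; substitute the real static IP and gateway.
    WAN_IF=eth1
    WAN_IP=203.0.113.10
    WAN_GW=203.0.113.1

    # One-time: register a named routing table for the WAN side.
    echo "100 wan" >> /etc/iproute2/rt_tables

    # That table's default route goes out the Internet-facing NIC.
    ip route add default via "$WAN_GW" dev "$WAN_IF" table wan

    # Anything sourced from the WAN address consults that table, so
    # replies leave the way they came in instead of out the LAN.
    ip rule add from "$WAN_IP" table wan

Without the rule, replies to Internet traffic follow the main table's default route out the LAN NIC, which is exactly the asymmetry I was fighting.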
I gave up, created a new bootable USB with the old server's OS, and installed it on the new server. I successfully addressed the machine, and I had the backed-up configuration, so except for a different LAN IP, the server was just as solidly routed and firewalled as before. Then I face-palmed, because in my rush I had forgotten to save the tweaked configurations. So I went to bed...with another heavy sigh.
The next day I pondered and researched and made a few script tweaks, connected from my tablet between meetings at work. I copied (a little less carefully) all of the last site backups, and prepared to recreate the web server as soon as I got home. So all of the data was on the new server, in a little bit better storage set-up, and arranged with an eye toward leveraging it from a container, although I had still installed the web server on the OS. This time, though, the drives on the server were partitioned so I could (when I found time) replace the main partition and create a containerized deployment, but leave the data safe...
When I got home, I still hated the decision, and decided maybe what I should do was pull the drives from the broken server and plug them into another server, so I could read the configurations. I'd make a fat container with all of the web servers inside of it (not my favorite solution), and then I could move forward more rapidly. So I unplugged the broken server (and accidentally unplugged my workstation, but it was no worse for wear when it rebooted), and opened the side panel to pull the drives. Visual inspection showed nothing dangling or burnt, so there was no obvious physical reason it shouldn't work. I was surprised to see just the one drive in there. I dug and found some notes, and sure enough, it's just a big, fat, couple-of-TB drive with some separate partitions (so I can replace the OS without losing the configuration or data).
Before taking it apart, I thought I'd give it one more whirl with the power. I dug up some power and video cords to reach the ports with the server on top of the desk, plugged the keyboard into the USB ports on the back instead of via the cable to the front, and turned it on. Same interaction. I moved the keyboard to different USB ports (there are six on the back), but none mattered. I thought one more test was warranted, now that the machine was out of its hole and the back was accessible. I scrounged and found an old, crusty keyboard with a PS/2 connector. That one worked! Whatever was borked in the BIOS must have stopped it before it could talk USB. I tripped through the BIOS settings, seeing nothing wrong, re-disabled the floppy disk, and rebooted the server.
It was back!
A flurry of journal messages flew by. It recognized the disk had been unhappily disengaged and encouraged a diagnostic run. I rebooted again and let it check and fix the disk. It restarted and began starting services. It made it to the login prompt, so I logged in. I checked the things I check, and everything was fine, except the WAN adapter wasn't completely addressed. Of course! I'd given its IP to the other server, and this one dropped the address in conflict. I told the server to reboot again, ran downstairs, and unceremoniously disconnected the cable from the router to the new server. By the time I returned, the server was back. I checked again, and everything was fine. Traffic was pouring into the machine as the paused e-mail from the world started flooding in (about 100K spam messages eaten a day...the downside of having the same e-mail address since 1995...), and web hits started up again, first with the robots (they never stop), then the occasional hack garbage (there is no admin.php at the root of my servers, guys), and then the occasional real traffic. I tried from my phone; all worked well.
I've left the new server unplugged, although I did readdress its Internet port. I probably should have given it its own address to begin with, but I didn't want to monkey with DNS for the sites and have propagation on top of what amounted to a 40-hour absence. The old server is still inconveniently located on my desk, with its side panel off and cords dragged across the room. I figure at the next opportunity I have for maintenance, I'll power down and re-close the old server, relocate it with the others in the basement, and turn it back on after what should be a few-minute outage.
Then I'll return to and complete the containerization efforts.