Recovery Coming Along
I just got this blog running again.
I mentioned previously that the big server was starting to show its age. In particular, some of my systems have trouble SSH-ing into its integrated management server. The OS running on it works fine, but the management server is necessary to boot or shut down the OS from the outside.
Last week I ran some usual OS updates. One of them required a reboot, which sometimes happens. So I rebooted and waited for the server to return.
And I waited.
And I waited more. I tried restarting it a few times, hoping it was something I might be able to catch in logs or on the screen, but I could not. For whatever reason, once the boot process starts, even before the GRUB menu, the integrated management console goes blank and no longer displays the screen. This used to allow me to log into a TTY interface from the management console, but apparently no more.
Although I couldn't see it, the device would boot just far enough to grab its static IP and respond to ping, but no further. No SSH, no web pages, no other services or connections. In addition to losing the TTY interface, it seems like something in the update mucked with either the firewall or network configuration, or was stopping some critical services from loading.
I created an updated USB thumb drive with the server OS installer on it, and was able to get it to boot to the installer's menu. From there I could enter the GRUB editor or escape to the very limited shell, and the TTY interface stayed active through the process. I was able to run the memory test and watch it, but any time I selected the other option, "install," the display would disappear. That is where I'd hoped to at least be able to enter the partition editor or, at worst, install fresh over the old and try to find the configuration file alterations I made, because I mount /home and /var, at least, on other partitions.
I poked around a little and found some GRUB entries to try that were supposed to help keep the TTY interface running, but to no avail.
While sitting at the server, I noticed one of the four disks wasn't even lighting up. Perhaps there's a wonky cable or failed drive in there. It would flicker briefly during the initial boot, but never turn on while the OS was trying to run.
I spent several days trying to get this server to boot, and then remembered the sites that it hosts had been offline for over a week at that point. I thought in a quick burst I might be able to get at least the static sites up.
I poked at a few network configuration options and decided to leverage the generous space on the Ryzen 7 server on which I've put my AI playground and where the backup runs. It took a few minutes to get the network configuration to allow the same firewall protections at the WAN firewall that the old server had, and then I exposed the new server via port-forwarding from the WiFi router. The Ryzen 7 only has one NIC, and has been an internal server for a while. I plan to change this in the future, but needed to get it going quickly, and it has a lot of extra RAM and CPU compared to the only other remaining server.
One thing this exposed for me is how much of a crutch I let Portainer become. I run almost everything in Docker (or Podman) containers, at least on the new servers. On the now-replaced server, the only thing running on bare metal was the iptables firewall, and everything else ran in a Docker container. The Ryzen 7 is a promoted (or demoted?) former desktop. I've removed the desktop packages, but it's got all my old desktop work on it. It was the backup server even then, because I put the big storage in there. But all of the server bits are running in containers. However, I did all of that through the Portainer interface, since it let me connect to all the servers through one Portainer instance. This made creating volumes, networks, and stacks easier than the command-line Docker options. Just easier, not better.
Also, since I'd done everything on the server through Portainer, I didn't have any command line or scripted backups of what it takes to get them running. Even if I had, they probably would have been in the shell history on the now defunct server.
Thankfully, many of the static sites have properly complete Dockerfiles in their GitLab repositories. There were a few modifications, like passwords, that are kept out of git, but they are few in the static sites, which are all single containers. A couple of the dynamic ones, like this blog, use Docker Compose instead, so their configurations are a touch more complex. Their configurations are also self-contained, though, so if I can't find a password, it won't matter, because the only thing that can reach this database, for example, is its partner web site.
I got the static sites running.
First I had to get the "ingress" server running. Docker doesn't have an ingress idea like Kubernetes does, so I use nginx-proxy to find other containers running on the server, as long as a few Docker environment variables are provided. Giving a container a VIRTUAL_HOST and having it expose port 80 or 443 (and a couple other variables) is enough for the nginx-proxy to add some rules to its configuration to reverse-proxy the requests. I got that running and was quickly greeted by its default Nginx welcome page.
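To make the VIRTUAL_HOST idea a little more concrete, here is a minimal Python sketch of the kind of rule generation the proxy does. This is not nginx-proxy's real code or its real template. The container names, the dictionary layout, and the render_server_block and render_config helpers are all invented for illustration. It only shows the principle: a container that declares VIRTUAL_HOST gets a reverse-proxy block, and a container that doesn't is left alone.

```python
from typing import Dict, List


def render_server_block(name: str, env: Dict[str, str]) -> str:
    """Render one nginx server block for a container, or "" if it opts out."""
    # VIRTUAL_HOST may list several names separated by commas.
    hosts = env.get("VIRTUAL_HOST", "").replace(",", " ").split()
    if not hosts:
        return ""  # no VIRTUAL_HOST: the proxy ignores this container
    port = env.get("VIRTUAL_PORT", "80")  # assume port 80 unless told otherwise
    return (
        "server {\n"
        "    listen 80;\n"
        f"    server_name {' '.join(hosts)};\n"
        "    location / {\n"
        f"        proxy_pass http://{name}:{port};\n"
        "        proxy_set_header Host $host;\n"
        "    }\n"
        "}\n"
    )


def render_config(containers: List[dict]) -> str:
    """Join the server blocks of every container that asked to be proxied."""
    blocks = [render_server_block(c["name"], c.get("env", {})) for c in containers]
    return "\n".join(block for block in blocks if block)


if __name__ == "__main__":
    # Hypothetical containers: one web site and one database.
    print(render_config([
        {"name": "static-site", "env": {"VIRTUAL_HOST": "example.com,www.example.com"}},
        {"name": "site-db", "env": {}},
    ]))
```

In this sketch the database container never shows up in the generated configuration, which matches the behavior described above: only containers that announce a host name get exposed.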
I have a default server that I run, which accepts requests for any domain that reaches the server but for which there isn't a VIRTUAL_HOST found, and I got that running pretty quickly. Now hitting the server returns the failure page instead of the Nginx welcome. If I change the DNS for a domain to point to the new IP, visitors will be greeted with this page now.
I made a routine where I'd hit my DNS server, find a domain (or a set of domains served by one site, like how this is where you end up if you use jekewa.com or jekewa.info or other things), find its corresponding container in GitLab, get the container running, and then change the DNS to point to the new server.
I forgot about the Let's Encrypt hoops I created for myself. I have one server that runs the Let's Encrypt update every day, which ensures that the certificates are renewed within their pre-expiration window. There are post-update scripts that also run to copy the new certificates to the necessary hosts. There is one that copies to the Ryzen 7, so the AI tools have their own certificates, but the certificates for the other domains were still being copied to the old server. I fixed that and was able to get the static sites that leverage SSL running!
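The certificate shuffle is easier to reason about as a table of "which domain's certificate goes to which host." The following Python sketch plans the copies from such a table. Everything specific in it is an assumption: the /etc/letsencrypt/live layout is the one certbot uses and may not match the client I run, and the destination directory, host names, and the plan_copies helper are hypothetical. It only builds the scp commands and does not run them.

```python
from pathlib import PurePosixPath
from typing import Dict, List

# Assumed file names, following certbot's "live" directory layout.
CERT_FILES = ("fullchain.pem", "privkey.pem")


def plan_copies(
    routes: Dict[str, str],
    live_dir: str = "/etc/letsencrypt/live",  # assumption: certbot-style layout
    dest_dir: str = "/srv/certs",             # hypothetical destination path
) -> List[List[str]]:
    """Return one scp argv per certificate file, per domain -> host route."""
    commands = []
    for domain, host in sorted(routes.items()):
        for filename in CERT_FILES:
            source = PurePosixPath(live_dir) / domain / filename
            target = f"{host}:{dest_dir}/{domain}/{filename}"
            commands.append(["scp", str(source), target])
    return commands


if __name__ == "__main__":
    # Re-pointing a domain at a different server is a one-line change here.
    for argv in plan_copies({"example.com": "ryzen", "example.org": "mail"}):
        print(" ".join(argv))
```

With the routes kept in one place, moving a site to a different server means changing one entry rather than hunting through several post-update scripts.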
I then put a database server back on the mail server. There's a SASL process running that allows a few people to relay mail through the server, based on credentials in that database. That database had been running on the older retired T5220, but moved to the now retired X4600. It's now running in a container on the mail server. Another database that collects e-mail sent to a SPAM collector was added, too, as there were plenty of errors in the logs about that. Finally, there's a web interface that uses the database to cache message headers as you search and sort through its interface, so I added that. I think there might be one or two more databases that need to return there, but I'll work those out as I go.
For the few sites like this, where there's a web server and database or more, I decided to move them to the bigger server. The mail server is screaming along fine in its quad-core server with 16GB of RAM, but I don't want to overwhelm it with a bunch of little database or web server instances. So I hit their docker-compose.yaml files one at a time to expose the database port to the network, so I could run the data import from the last backup for each database. This worked pretty well. It took a little longer to figure out the reverse-proxy, but that's because one is fronted not by the nginx-proxy but by an Apache server for SSL support. And another, like this one, isn't serving the root or whole website, but just some bits on specific paths.
I think I have that scheme worked out, and only have a couple more sites to finish fixing.
One thing this did reveal was some procrastination in fixing the backups. All of my database backups were working back when everything lived in one database server. I'd paused them some time ago when I retired that server, and I still have a Post-It on my monitor that says "fix db backup" on it. So the data restored is over a month old. I have corrected this so at least the mail server's backup is running again. I fixed the previous script so it takes parameters differently. It works on the mail server because that one exposes its database ports to the network, since the things that need access there aren't other containers, but I need to figure out a scheme for the containers that don't expose their database ports.
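One possible scheme for the databases that don't publish a port is to run the dump inside the container with "docker exec" instead of connecting over the network. The Python sketch below builds the command for either case. It assumes MySQL or MariaDB with mysqldump available, which is a guess about the setup, and it assumes credentials come from an option file or the environment rather than the command line. The dump_command helper and every name passed to it are hypothetical.

```python
from typing import List, Optional


def dump_command(
    database: str,
    user: str,
    host: Optional[str] = None,
    port: int = 3306,
    container: Optional[str] = None,
) -> List[str]:
    """Build a mysqldump argv for a networked or a container-only database."""
    if (host is None) == (container is None):
        raise ValueError("give exactly one of host or container")
    dump = ["mysqldump", "--single-transaction", f"--user={user}"]
    if container is not None:
        # No published port: run the dump inside the container itself.
        return ["docker", "exec", container] + dump + [database]
    # Published port: connect over the network as before.
    return dump + [f"--host={host}", f"--port={port}", database]


if __name__ == "__main__":
    print(dump_command("mail", "backup", host="mail.example"))
    print(dump_command("blog", "backup", container="blog-db"))
```

Either command writes the dump to standard output, so a backup script could run it with subprocess.run(cmd, stdout=open_file, check=True) and treat both kinds of database the same way.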
For this blog, and for the others like it, there are e-mail messages that send each post to the moderators. I had been frustrated that I couldn't turn them off, as I'm the only author and moderator, but today I'm thankful that so many of them were sent, as I was able to put most if not all of the posts back. I guess a gap in blogging has at least that advantage: there weren't too many to make it daunting. The most difficult part was remembering to paste the body copied from the e-mail into the "markup" view rather than the WYSIWYG view, so it would lose some of the e-mail highlighting. I also had to remember to set the date and time of the post, instead of letting it default to the
I do know I need to remember the cron job I set on the blog servers to run their periodic maintenance. Every minute or hour or whatever, a "docker exec" command would tickle a PHP script in the container and run through whatever maintenance it needs. One of those tasks is sending the moderation e-mails, so I'll have at least that back-up going on.
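Since I'll be re-creating that cron job from memory, here is a small Python sketch that assembles the crontab line so it can live in a repository next to the site. The schedule, container name, and script path are placeholders and not the real ones, and the cron_line helper is made up for this example.

```python
import shlex


def cron_line(schedule: str, container: str, script: str) -> str:
    """Build a crontab entry that runs a PHP script inside a container."""
    if len(schedule.split()) != 5:
        raise ValueError("expected five cron fields: minute hour day month weekday")
    # No -i or -t flags: cron has no terminal to attach to.
    command = ["docker", "exec", container, "php", script]
    return f"{schedule} {shlex.join(command)}"


if __name__ == "__main__":
    # Hypothetical values: every 15 minutes, container "blog-web".
    print(cron_line("*/15 * * * *", "blog-web", "/var/www/html/maintenance.php"))
```

The resulting line can be pasted into "crontab -e" on the host that runs the container.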
I also want to develop a little better documentation for getting some of the things running, especially the more complex things. It's all straightforward when it's fresh, but horrible when trying to remember later!
I'm going to finish those backup and documentation bits, and also look at low-power replacements for those big, fat servers I've now turned off.
But, for now, at least this is back.