Ubuntu 16.04 Upgrade
Ubuntu OS 16.04, named Xenial Xerus, was released.
Even before the release, as the release candidate was made available, I updated my VMs and eventually my main home workstation. I poked at the various things that were called out as changes, and made sure things continued working. It seemed pretty slick, in that nothing seemed to break on the desktop or the test suites I have in the VMs. I decided to upgrade the server next.
In the past I have had issues when upgrading, because upgrades tend to trample software with customized configuration files. Some software reads only from one main configuration file, so any customizations you make need to go in there. Some of the customizations are "trivial," like a server name or IP address. Some are less trivial, like configuration for related and external services. Apache HTTPD server does it well: most of the customizations are in external files created in a directory the main configuration file reads, so upgrades don't tend to affect those (although the change from 2.2 to 2.4 is a bit more intrusive). Postfix, as another example, is a mix, where some customizations are in separate files, but some are in maintained files. Usually the installer is good at recognizing these, but I've become prepared to fight through them.
Another issue I have is with some of the software I build from source rather than using the provided packages. Sometimes this is for historic reasons (like I've always done it that way), sometimes it's for specific compilation choices (which may have library options, but I haven't sought or found them), sometimes it's because the software isn't supported by the maintainers (usually because of license disagreement or age of the project).
I dutifully checked back-ups and threw some bits into maintenance mode or stopped servers to avoid complications as software was installed. I fired up the updater, answered the prompts I expected, and left my office to have dinner. I returned occasionally to find some additional prompt to merge or overwrite an expected configuration clash, and eventually to hit the "restart now" button at the end.
As has happened before, the GUI (it's a retired desktop turned server) didn't load. I recompiled and installed the (for licensing reasons) not-maintained video driver, but still it wouldn't load. That was only a little bothersome (the system sits logged in showing the performance monitor, but otherwise the GUI isn't really used), as I was ready to do the rest via SSH from the desktop anyway.
I started looking for the things that didn't start or finish installing correctly. HTTPD (compiled from source for historic and choice reasons) had failed with a seg fault; I recompiled it and some of the modules I use, like ModSecurity and PHP and others (also from source, because you can't blend the libs with the not-maintained version). It finally worked. One of the Tomcat servers didn't start; its start-up script curiously referenced an old and removed version of the JDK. With the path fixed, it started. Oddly, the SpamAssassin mail checker was failing to start; I re-ran the Perl-based installer and it still failed, so I went with the package version, and it seems to be running (I've not yet re-checked the system configs, and since it didn't complain about conflicts, either it didn't care or it's using a different config file).
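The Tomcat failure above is the kind of thing a quick scan can catch before a restart. Here is a minimal sketch, assuming the start-up script sets the JDK location through a JAVA_HOME or JRE_HOME assignment (the variables Tomcat's own scripts read; my script's actual contents aren't shown here, so treat the pattern as an assumption). The function name stale_java_paths and the sample paths are hypothetical.

```python
import os
import re

# Matches lines such as:  JAVA_HOME=/some/jdk   or   export JAVA_HOME="/some/jdk"
# Commented-out lines do not match because the pattern is anchored at line start.
JAVA_HOME_RE = re.compile(r'^\s*(?:export\s+)?(JAVA_HOME|JRE_HOME)=["\']?([^"\'\s#]+)')


def stale_java_paths(script_text, exists=os.path.isdir):
    """Return (variable, path) pairs whose absolute path no longer exists.

    `exists` is injectable so the check can be exercised without touching disk.
    """
    stale = []
    for line in script_text.splitlines():
        match = JAVA_HOME_RE.match(line)
        if not match:
            continue
        variable, path = match.group(1), match.group(2)
        # Only absolute paths are checked; values built from other variables are skipped.
        if path.startswith("/") and not exists(path):
            stale.append((variable, path))
    return stale
```

Feeding it the text of a start-up script lists any hard-coded JDK directories that have gone missing. It only reports; fixing the path is still a manual edit.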
Two troublesome beasts were that MySQL wouldn't complete its configuration and start, and the SMTP auth that I use for allowing relaying was failing every request. Having nailed it down to just these and the still-failing GUI, I started working on MySQL. Since the SMTP auth uses MySQL, I figured that might be the root cause of those failures.
MySQL is one of those that uses many different configuration files. If you have (as I do) an /etc/my.cnf, it should be the one loaded, overriding the one in /etc/mysql/blah. Alas, even though the docs and server would say this was the case, it didn't seem to really be the case. I would replace the symlink installed with the app with my configuration file, but the package manager kept replacing my file with the symlink. Eventually I just replaced the contents of the symlinked file with mine, and that seemed to work.
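When a file keeps turning back into a symlink, it helps to see what each candidate path really is. Here is a minimal sketch, assuming the candidates are /etc/my.cnf, /etc/mysql/my.cnf and ~/.my.cnf. These are among the default option file locations MySQL documents for Unix, but the list is an assumption and not necessarily complete for a given build. The function name describe_option_files is hypothetical.

```python
import os

# Assumed candidate option files; verify the real list for your build.
CANDIDATES = ["/etc/my.cnf", "/etc/mysql/my.cnf", os.path.expanduser("~/.my.cnf")]


def describe_option_files(paths):
    """Return (path, status) pairs: 'missing', 'file', or 'symlink -> <resolved target>'."""
    report = []
    for path in paths:
        if os.path.islink(path):
            # realpath follows the whole chain of links to the final target.
            report.append((path, "symlink -> " + os.path.realpath(path)))
        elif os.path.isfile(path):
            report.append((path, "file"))
        else:
            report.append((path, "missing"))
    return report


if __name__ == "__main__":
    for path, status in describe_option_files(CANDIDATES):
        print(f"{path}: {status}")
```

Running it before and after a package operation shows whether a plain file was swapped back for a symlink, and which file the symlink ends at. The server itself lists the option files it reads, in order, in the output of mysqld --verbose --help.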
That fixed, I expected more of the server to behave, and for the most part it did. The database-using apps mostly started working. A handful did fail, like this blog. The update changed the version of the MySQL server, and the new version is a little more strict about its query language. I didn't know, but this app was using a bad query that the old version of MySQL forgave. So I spent some time and updated this software on all of the sites that use it. That took too long, too, but finally worked.
I then turned my attention to the SMTP auth. It was failing because of a missing method in the library used to query databases for user credentials. Digging around showed that no one liked that the project had been stale for about six years. I think someone forgot a "b" because it was pretty stable software. So I poked around and found that the only alternative that used a database (or LDAP) required passwords to be stored in plain text. WTH? So I spent some time tweaking the query so that I could do an on-the-fly decryption; this also meant I had to encrypt the passwords when receiving them from the management app I'd written. Previously I was using the DB's PASSWORD function, which worked with the auth library. Finally got that fixed (and a long e-mail sent out).
I checked most of the other bits and pieces I directly use, and all looked well. I should have stopped, but the failing-to-start UI was bugging me.
I found an article that suggested using the tasksel app to de-select the desktop environment. The reviews and comments seemed OK, and a few other articles that I found when searching for that example also gave credibility to the solution. I cringed, but sadly didn't turn back, when the list of things to be uninstalled scrolled off my screen. Half or more of the apps, it seemed, were to be removed. I mulled this over, and recollected some of the comments about this, but they were touted as good things. I estimated in my head the chain of dependencies that would be necessary for running a graphical desktop and hit the enter button.
An hour later the machine sat at a prompt. I nervously poked around. The services I had left running were seemingly running. The log scroll I usually see was scrolling. I felt a little more relief. The server memory use had dropped by half, and the CPU was weirdly idle.
I tested by stopping a service and restarting it. It worked. Then another. It didn't work. I tried the mail server, and it failed. I tried the web server, and it failed...now I was a little nervous. I ran some "fixes" by re-installing the apps that weren't working, and some of the packages' extra dependencies got added back. I looked at some of the dependencies for my from-source stuff, and added some of those. Those services came back to life. I could stop and restart my web, app, database, mail, and authentication services. The intrusion detection would restart and continued logging. I felt a little better. I was dismayed that the app would remove some of the dependencies for things not in the GUI, and that re-installing didn't add GUI parts back, but chalked it up to upgrading after upgrading...
Then I crossed my fingers and did the prompted-for reboot.
Of course, I was rewarded with the list of versions to choose from, which was nice, but when I selected the new one I landed at a BusyBox prompt I'd never seen before. It seems to be the fallback shell that the early boot environment (the initramfs) drops to when it can't find what it needs, in this case the root filesystem. It was screaming about not finding the drives by UUID. A little research suggested rebooting and picking an earlier version or the recovery mode version. None of these things worked.
I had to build a recovery CD. I tried editing the boot menu to use devices (/dev/sda1) instead of UUIDs. Didn't help. The recovery disk kept failing at reinstalling the boot loader (its favorite task, it seems), so I skipped past that to the partition manager. I found none of the partitions had mount point labels. I checked the fstab and they were there. So I used the partition editor to add them back. There was a moment of pause when it said it would replace the contents of system folders, so I took a moment and backed up a bunch of stuff. It did indeed mess with those folders.
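For what it's worth, the "can't find the drive by UUID" complaint can at least be cross-checked from a recovery environment without changing anything. Here is a minimal sketch, assuming a standard whitespace-separated fstab and a /dev/disk/by-uuid directory of the kind udev-based Linux systems provide. The function names are hypothetical, and this is a read-only check, not a fix.

```python
import os


def fstab_uuids(fstab_text):
    """Collect (uuid, mount_point) pairs from UUID= entries, skipping comments and blanks."""
    entries = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("UUID="):
            uuid = fields[0][len("UUID="):].strip('"')
            entries.append((uuid, fields[1]))
    return entries


def missing_uuids(fstab_text, present):
    """Return the fstab entries whose UUID is not among the UUIDs the system can see."""
    present = set(present)
    return [(uuid, mount) for uuid, mount in fstab_uuids(fstab_text) if uuid not in present]


if __name__ == "__main__":
    by_uuid = "/dev/disk/by-uuid"  # assumed location of the by-UUID device links
    fstab = "/etc/fstab"           # point this at the mounted root's copy when on a recovery CD
    if os.path.isdir(by_uuid) and os.path.isfile(fstab):
        with open(fstab) as handle:
            for uuid, mount in missing_uuids(handle.read(), os.listdir(by_uuid)):
                print(f"fstab wants {uuid} at {mount}, but no such device is visible")
```

An empty result means every UUID in the fstab matches a visible partition, which would point the finger somewhere other than the partition table.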
I was so pissed. Some at myself for not heeding the warning, also for not having deeper back-ups, and for not finding the non-destructive tips before the destructive ones did their damage. I don't have any non-destructive tips to share, sadly. Just a lesson in pulling configuration out of my ears.
It's now two days later, and the server is back. Ultimately I ended up reformatting that partition and installing the server from scratch. It took a while to get the packages all added back, and even longer for the built-from-source stuff. The other partitions held the bulk of the data for the system, but /var/spool/mail went, and wasn't backed up, as did /var/lib/mysql, which contained the DBs, although I have frequent dumps I can restore. There are surely other helpful scripts I no longer have; I'm not sure whether I need them, or whether I can recreate them if I do. I'm digging through the backups next to see what I can salvage, if I need it. Crossing my fingers that I can reproduce what I need and don't have.