One of the big server disks started barking every few seconds. No noise, really, but a flood of messages on the screen, if one happened to look at it.
Watching the logs, or doing anything at the TTY console, I'd see a flurry of messages like this one every few seconds:
Jan 12 20:23:41 big128 kernel: [57741.437257] blk_update_request: critical medium error, dev sdb, sector 89868336 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
The odd bad sector is to be expected now and then. But there shouldn't be a steady flow of such alerts. Surely, something is failing on that disk. Sadly, that disk is where I'd mounted /var, which is where all of the containers write their configs and store their data, and where Docker maintains its cache of downloaded images. That meant it was inevitable that something would fail, and possibly badly.
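The flood is easy to quantify. As a rough sketch, the kernel messages can be tallied per device; the two log lines below are a throwaway sample modeled on the real error (the second line is invented for illustration), so the snippet runs anywhere:

```shell
# Tally "critical medium error" lines per device. The sample log below is
# fabricated for illustration; on a live box you'd read the kernel log
# (e.g. via journalctl -k) instead.
log=$(mktemp)
cat > "$log" <<'EOF'
Jan 12 20:23:41 big128 kernel: [57741.437257] blk_update_request: critical medium error, dev sdb, sector 89868336 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jan 12 20:23:47 big128 kernel: [57747.110158] blk_update_request: critical medium error, dev sdb, sector 89868344 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
EOF
# One line per device name, counted: a steady climb against a single
# device points at failing media rather than a one-off soft error.
counts=$(grep 'critical medium error' "$log" | sed 's/.*dev \([a-z]*\),.*/\1/' | sort | uniq -c)
echo "$counts"
rm -f "$log"
```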
Thankfully, I'd received the 146GB replacements for the 300GB drives that the server can't use. The server has hot-swappable drive bays, and I've unplugged and re-inserted the 300GB drives before, but since the firmware can't talk to them, the OS never noticed. I've done this on other servers, where the disks were part of a RAID configuration, and the BIOS added the new drives to the RAID without the OS caring, either. That's not the configuration I have here. I had never hot-swapped a drive on this server while the OS was running, so I shut it down.
I put the new disks in the slots and booted the system again. I planned to partition and format the disks while the server continued serving things, and then work out the swap of
/var later. Curiously, that simple plan didn't work.
Every time I tried to mkfs on one of the drives, I was met with an insufficient-size error, and the partition wouldn't work. This was disappointing. The other drive, the same model, worked brilliantly. After several attempts, I went for the blunt approach and restarted the server with a GUI bootable USB stick. I'd done this in the past to fix the troubled disk, since it couldn't be repaired while mounted, nor unmounted while the system ran, because it hosts a critical file system. While in the GUI, I partitioned and formatted both disks, even though one was working fine.
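For the record, the format-and-verify step looks roughly like this, demonstrated on a scratch image file instead of a real device so it's safe to run; the real target was the new drive's partition, and the mkfs.ext4/dumpe2fs pairing is my assumption of a reasonable sanity check, not the exact commands from that night:

```shell
# Format a filesystem and read back the size it reports, using a scratch
# image file in place of a real partition (the real partition is what
# kept failing with the insufficient-size error).
img=$(mktemp)
truncate -s 64M "$img"       # sparse 64 MB image
mkfs.ext4 -q -F "$img"       # -F: allow formatting a plain file
# Pull the block count out of the superblock to confirm the size took.
blocks=$(dumpe2fs -h "$img" 2>/dev/null | awk '/^Block count:/ {print $3}')
echo "block count: $blocks"
rm -f "$img"
```

A drive that formats but reports a nonsensical block count here is telling you the same thing the mkfs error was.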
Upon reboot, both disks reported the correct size, mounted well, and let me start copying. I used an rsync command to copy the /var data to the new disk. It worked fine, but barked at a lot of Docker files. I figured they were in use and locked by the running container system, so when I saw a folder with a bunch of failures, I excluded it and ran the rsync again. Eventually, I got it down to where re-running the rsync would copy only a small number of logs, undoubtedly updated by my activity or regular system actions. I poked around a few places I knew to care about, and all looked well. I stopped the Docker daemon and ran the rsync without the exclusions, and many more files in those folders copied successfully.
A little investigating revealed that the failures in /var/lib/docker/overlay2 were cached Docker information, and were likely safe to remove. I ran the rsync a couple more times until the only remaining failures were that same kind of invalid copy error, or the logs. I then edited /etc/fstab to mount the newly copied partition at /var, and pointed the old one at another mount point in case I needed to copy or review things. A few deep breaths, and then I rebooted the server entirely.
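The fstab change amounted to two lines along these lines; the UUIDs and the old disk's mount point are placeholders, not the real values:

```
# New partition takes over /var; the old one stays reachable for review.
# UUIDs below are placeholders.
UUID=aaaaaaaa-0000-0000-0000-000000000000  /var         ext4  defaults  0  2
UUID=bbbbbbbb-0000-0000-0000-000000000000  /mnt/oldvar  ext4  defaults  0  2
```

Mounting by UUID rather than /dev/sdX name avoids surprises if the kernel enumerates the drives in a different order after the swap.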
It takes a long time to count its RAM. It also takes a long time to skip each NIC's attempt to find a PXE server. And it takes some time to find no drives attached to the fiber card I keep meaning to pull out one of these times I have it shut down. Then the screen rolled with all of the OS start messages. I returned to the office to wait for the services to start, so I could SSH back in and see what was up.
It seems to be working now.
With the server none the wiser, the new /var works just like the old one. None of the same blk_update_request errors on the screen. Even though the failing disk is still installed and mounted, nothing is hitting those bad sectors at the moment.
I took some time and ran through all of the containers, refreshing settings and fetching a new image for each. That should fill in any concerning parts of that overlay2 folder. Everything turned green on my uptime monitors. All of my quick hits on the domains returned web pages, and the databases answered queries.
I'll keep my eye on it for the rest of the day, but so far, so good.