Server Suffered From Leap Second Failure
After repairing the power supply the other day, all seemed to be going well. Then a little curiosity got me, as I wasn't receiving as much e-mail as I thought I should. I eventually checked on the server (it was up and responding, so my curiosity hadn't reached concern) and found that sendmail had stopped responding due to the high load on the machine.
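As it turns out, that's by design: sendmail stops accepting new connections once the machine's load average climbs past a threshold, and the stock default is 12. Here's a rough Python sketch of roughly the check it's making, with that default hard-coded just for illustration:

```python
import os

# sendmail refuses new SMTP connections once the load average passes its
# RefuseLA setting; 12 is the stock default, used here for illustration.
REFUSE_LA = 12

one_min, _, _ = os.getloadavg()
if one_min > REFUSE_LA:
    print(f"load {one_min:.2f} is over {REFUSE_LA}: refusing new connections")
else:
    print(f"load {one_min:.2f} is under {REFUSE_LA}: accepting connections")
```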
I found and fixed this, raising the threshold so that sendmail would tolerate a higher load than the default of 12 before refusing connections. I then discovered the MySQL database and the web applications were using much more CPU than normal. Normally it's hard to catch a spike from the web apps; they'll sometimes hit a percent or two for a cycle of "top," but usually hover around zero percent. The database spikes run a little longer, but still don't last more than a cycle or two in "top." Alas, yesterday the CPU use for the database was bouncing off 50%, which is to say it was taking 100% of one of the server's cores (it is just dual-core). The web app servers were all hovering around 8%, except one, which was around 40%.
My first thought was to blame the one busy app, thinking it was hammering the database, which was then causing the other apps to suffer. I cycled that web app, then all of the web apps, then the web apps and the database, and finally the server. The problem didn't subside.
My fix to the mail server seemed to have worked, though, as mail started flowing. I shrugged off the CPU issue and hoped it would resolve itself once the database finished reindexing or whatever long-running task it had going on.
I returned today to find that the problem persisted. I had a nagging suspicion that there might be some relation to the leap second bug that took out Amazon the other day (which then took out Netflix, Pinterest, and a whole slew of other big apps). In my searching I came upon at least one article that attributed high CPU use by MySQL to the leap second bug.
The solution was trivial: set the time.
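For anyone chasing the same symptom: the workaround that was circulating at the time was simply to stop ntpd and set the clock to its current value, which clears the kernel's confused timer state. Here's that recipe as a small Python sketch (needs root; the "ntpd" service name is an assumption and varies by distro):

```python
import subprocess

# Stop ntpd so it doesn't immediately fight the manual clock set.
subprocess.run(["service", "ntpd", "stop"], check=False)

# Setting the clock, even to "now", is what actually clears the stuck
# state left over from the leap second.
now = subprocess.run(["date"], capture_output=True, text=True,
                     check=True).stdout.strip()
subprocess.run(["date", "-s", now], check=True)

subprocess.run(["service", "ntpd", "start"], check=False)
```

That's just the date -s "$(date)" one-liner from the write-ups, wrapped in Python.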
I re-stopped the services, and upon re-starting them, the CPU settled into a closer-to-normal 30%. Rather than wavering at all, though, the server seemed nailed to 30%. I restarted the server, thinking that there might be at least one other affected task, and it came back showing its normal "pulse" of about 20%. That baseline is there because the server runs a GUI desktop, with a GUI resource monitor running, and those apps take that chunk of CPU. (Yeah, I know, turn the GUI off...)
The network use seemed high, but upon investigation it turned out that sendmail had not been behaving well after all, and the queue had climbed to about 5,000 messages (the server handles about 50K messages a day, so 5K is roughly a couple of hours' worth). In no time sendmail and SpamAssassin ripped through them, and the queue is down to under 1K messages.
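Watching a backlog like that drain is just a matter of polling mailq, whose output normally ends with a "Total requests: N" summary line. A quick sketch that pulls the number out, assuming your mailq output includes that line:

```python
import re
import subprocess

# Grab the queue depth from sendmail's mailq, which normally ends with a
# summary line like "Total requests: 4937".
out = subprocess.run(["mailq"], capture_output=True, text=True).stdout
match = re.search(r"Total requests:\s*(\d+)", out)
print(match.group(1) if match else "couldn't find the summary line")
```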
Except for the remaining sendmail backlog, the server seems settled into its normal pops and spikes as traffic arrives at one of the web apps or mail comes in.