On Call The week is almost done, and Friday sees the start of the delightful tumble that ends in the weekend. Start your roll with a tale from those unfortunates forced to deal with the foolishness of others in The Register’s On Call feature.
Today’s story comes from a reader we will refer to as “Brad” in order to spare the blushes of those with whom he worked, and dates back to the early part of the century.
Brad was a Unix/Linux administrator at a government agency. “I had a pair of Solaris servers running a source-code repository in a fail-over cluster,” he told us.
In what might bring a tear to the eye of those now dependent on the vagaries of GitHub and its ilk, “that was the easy part,” he said. “They just ran.”
The agency was a huge NetWare shop back in the day and, as was so often the case, used GroupWise as its email platform of choice.
GroupWise, for those now unfamiliar with the granddaddy of collaboration platforms, was WordPerfect’s take on email, calendaring and scheduling before the corporation was snapped up by Novell in 1994 and WordPerfect Office was slapped with the GroupWise moniker.
WordPerfect would be sold on again, but Novell opted to keep GroupWise for itself.
It usually did not cause headaches for Brad: “Normally, GroupWise ran smoothly, even on single-core processors.”
However, on the day in question, things were not going smoothly and Brad found himself receiving calls for help, or at least calls for explanation. One of his servers was overloading the email system and could he please deal with it? The supervisor of the network team (“whom I seldom saw”) had even become involved.
Armed with the offending server’s name, Brad hurriedly logged on to find out what was happening, but drew a blank as to why it was spewing email like a teen discovering cider for the first time.
“I logged into the server and found it was running at about 50 per cent capacity (hey, it was SPARC hardware, not that crappy Intel stuff NetWare ran on. And I’m pretty sure the SPARC server was dual core).”
Sure enough, sendmail was chewing through prodigious amounts of CPU.
Brad ambled over to the team responsible for the application that ran on the server. It transpired that the gang was performing an upgrade, which had required a tweak to every source file, each of which had to be checked in.
Those managing modern-day pipelines might blanch at this point, as Brad went on: “The application was set up to notify the original poster, and often his/her team, that the artifact was being modified. So for every check-in, there were three to four emails being generated.
“This repository had been in use for years. The hardware itself was probably three to four years old. And we had something like five applications that had their source code stored in the repositories.
“So, a lot of emails.”
It transpired that someone had forgotten to turn off notifications before kicking things off. “They would see if they could turn it off in the middle of the process,” added Brad.
A simple “We’re upgrading!” would have sufficed rather than the GroupWise-choking tsunami that had been unleashed.
Brad stalked back to his desk and killed sendmail with extreme prejudice. “In about five minutes, the GroupWise admins notified me that their systems had returned to normal.”
He neither knew, nor (from what we can tell) really cared if the admins of the application got their act together. He did, however, find 3.5GB of mail waiting in the queue directory.
rm * was Brad’s friend that day and, after a subsequent restart, sendmail was blessedly silent.
Ever got The Call only to discover that somebody else was doing something silly? Or were you the cause of that cry for help? Share your story with an email to the vultures staffing the On Call desk. ®