Apologies for the long one, but it explains my lack of writing anything else this week.
Monday evening, as I was getting ready to take my youngest son out “trick or treating” with his friends for what was likely to be his last Halloween (he’ll be turning 12 in a little over a month, and 11 seems to be about the time that “kid stuff” starts losing its appeal), somebody pulled a trick on me that ruined my week.
Some history first though: We use a very nice mail server package called “Communigate Pro” by what used to be named “Stalker Software.” Communigate Pro (aka “CGP”) has a reputation for being fast, stable, and scalable. For the most part this has been true for us. We have had some issues with it though over the past four and a half years. We run CGP on several servers, since CGP has been used by several of the web hosting companies we have acquired over the years. The copy of it we bought for ourselves though has been the one that has caused us problems. It runs great for 50 weeks of the year, but for a week in August or September, and a week in December or January, it completely sucks rocks. The only way I can describe it is that interacting with CGP becomes like talking to a starfish.
I watched a show once that well illustrated at least one definition of the word “Relativity.” It showed how nature has made metabolism something of a clock, and that each species operates on a relative clock speed based on their metabolism. If you time-lapse film slow metabolism creatures like starfish, and then adjust the speed up to “match” our metabolic rate… the starfish look very active… zipping about the ocean floor, preying on urchins and other shellfish. Amazing really. Same goes in the other direction, slow down the film of a hummingbird and they start looking like any other bird. I guess to a Hummingbird, a human being looks like a starfish.
Well, for two weeks out of the year our CGP mail server’s metabolism slows to one of a starfish. It works, just at a truly GLACIAL pace. The Server and Operating system are fine (load is low, machine is responsive at the console, shell commands are fine, go figure.) This is obviously frustrating – for both us and our clients. The fact that it comes back like clockwork at certain times of the year is very odd. We eliminated all external causes (traffic, spam, etc) and Stalker support spent hours and hours trying to figure out what was wrong. The only suggestion they could ever come up with was “put a faster filesystem under it.” This error appeared in whatever version of CGP we ran, and I’m pretty sure that we tried them all, starting with 4.0.X, all the way up to 4.2.X (and this week, 4.3.X… but we’ll cover that later) but they all had that odd metabolism time shift appear twice a year.
Putting a faster file system under it usually cleared up the problem. As did switching platforms. We started on FreeBSD, moved to OS X (better threading), then up to OS X Server (on an Xserve); but also we jumped through all sorts of filesystem and bus technology switches, such as IDE, to SCSI, to various RAID setups, to eventually a 2Gb/s FibreChannel RAID array. Last summer when the starfish returned, on a whim (well, not a whim really, more a blind rage and pique of frustration since I wasn’t going to sink any more capital into filesystem improvements!!! Especially since they were seemingly NOT improving the situation!) I told my senior sysadmin to move the CGP directories to the internal IDE drive of the Xserve. Presto! The starfish vanished.
The server was back to its responsive, stable state. While I was happy about that, since our clients weren’t angry at us, I was LIVID because all those tens of thousands of dollars we’d spent on hardware were a placebo cure for a real software problem. Stalker (now calling themselves “Communigate Systems”… aka CGS) had no explanation for this, and just sort of slunk away.
There is another significant wrinkle to this story, which explains why I was unable and unwilling to ride Stalker/CGS harder and force the issue into some sort of resolution. In November of 2004, CGS, née Stalker, made significant changes to their software licensing model, and jacked their prices up to well over 5.5X their previous levels. Needless to say, it was a shock to their customers. Prior to this date, their software was “expensive” but a relatively good value. (IIRC we paid between $8000 and $16,000 for our CGP licenses in 2000 and 2001.) Up until 2004 the core customers for Stalker were Service Providers such as ourselves. CGP had become something of a darling in the Industry press for being a solid performer and a far better value than absurdly over-done and outrageously expensive “Messaging Platforms” such as Lotus Notes and Microsoft Exchange. I guess this attention went to the head of Stalker/CGS’ CEO and founder Vladimir Butenko, and he began transforming CGP into one of those over-done and outrageously expensive “Messaging Platforms”. Hey, in some ways I can’t blame the guy… his core market – ISPs – had gone from niche-market players to a total commodity market with NOBODY making very much money, if any. Just beyond his grasp, and seemingly within reach, was a cash-rich “Enterprise Market” with some dominant players showing real weakness. The astounding thing is the way he decided to get there: by actively pissing off his current customers and seeding them with confusion, fear and doubt. The existing customers, all ISPs, schools, and small businesses, were angry. Stalker/CGS left no option for a “mail only” (no calendaring, groupware, MAPI support, VOIP support, SIP/PBX functionality, etc) version, and any continued use, other than the VERSION YOU ORIGINALLY BOUGHT, would cost you a hefty sum in support and maintenance fees – 18% of purchase price, which in the new scheme was actually what you paid originally!
So it was like having to buy your software again every year. Customers were livid, and the Sturm und Drang on Stalker’s support mailing list was out of control. Stalker’s CEO, Vladimir Butenko, defended these new policies with characteristic Russian twisted logic and denial. I don’t know how to say “tough shit” in Russian, but that is what he said, albeit in far more diplomatic terms.
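For anyone who wants to check the math on “buying your software again every year,” here is a rough back-of-the-envelope sketch. It assumes the round figures quoted above (a 5.5X price increase and an 18% annual fee) and uses the low end of what we paid, $8000, purely as an illustration; the function and variable names are made up for this example.

```python
def annual_fee_as_share_of_original(price_multiplier, fee_rate):
    """Annual maintenance fee as a fraction of the ORIGINAL purchase price."""
    return price_multiplier * fee_rate


# Figures quoted in the post; treat them as approximate.
PRICE_MULTIPLIER = 5.5  # new list price relative to the old one
FEE_RATE = 0.18         # annual support and maintenance fee, as a share of the new price

share = annual_fee_as_share_of_original(PRICE_MULTIPLIER, FEE_RATE)

# Illustrative only: the low end of the $8000 to $16,000 range mentioned above.
original_price = 8000
yearly_fee = original_price * share

print(f"Annual fee is about {share:.0%} of the original price")
print(f"On a ${original_price:,} license that is roughly ${yearly_fee:,.0f} per year")
```

With those inputs the yearly fee works out to roughly 99% of the original purchase price, which is where the “buy it again every year” feeling comes from.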
What he didn’t tell anyone at the time was that he ensured compliance with his new licensing scheme and inflated prices by inserting a “time bomb” into Communigate Pro. If your server thought it wasn’t properly licensed, it would cease to run at midnight UTC on some arbitrary date, and then, if re-launched, would shut itself down every 15 or 20 minutes thereafter. No warning. No coherent error code. No reason why. Bang. Boom. Off. Dead.
This was done without any announcement or warning. To add insult to injury, none of us customers had any idea which versions of Communigate Pro had the timebomb code in them or what the dates for explosion were. It was truly “Russian Roulette.”
Up until 2005, the standard refrain from Stalker Tech Support for any issue was “Please Upgrade to the latest version of Communigate Pro.” The support and sales staff frequently touted the benefit of “free upgrades” of their software. You got your value and return on your initial investment by always being able to stay current and get your bug fixes. We had changed versions via upgrade countless times, as we obviously had at least ONE big ugly bug, which unfortunately was never fixed. I don’t recall what version of CGP we were running when the license change was announced, but I know that in February of 2005, when the (first of what I now assume are going to be many) CGP timebombs exploded, we were running a version we apparently weren’t licensed for… despite the fact that we probably upgraded to it two months before while troubleshooting our latest visit from the Communigate Pro Starfish Mode. CGP servers around the globe all blew up at midnight UTC on February 1st 2005, including one of ours. Predictably, the CGP support mailing lists, newsgroups, etc. also exploded with angry, frustrated customers. I called the guy at Stalker who we originally bought the software from and asked him flat out, “OK, tell me exactly what version of CGP we are allowed to run so that this timebomb won’t affect us again.” Bill, my senior sysadmin, downgraded us to that version on February 1st, and life went on.
Later in 2005 our CGP Starfish returned, and that is when we tried the “move to internal IDE disk” trick, which worked. I had not paid Stalker that hefty price for support and maintenance (or as they ironically call it in their emails to me, “S&M”) so I was in no position to demand that they admit this “starfish mode” bug exists and fix it. I was stuck at the version we were running in perpetuity. Such is the Kafka-esque world of software licensing. Instead I directed my staff to start evaluating alternatives to Communigate Pro. I didn’t want to be the victim of extortion to pay for the development of features for “Enterprise Customers” that we would NEVER use. Here is a great example: I was on the phone with a guy from Stalker/CGS and he was telling me how great their PBX/SIP/VOIP system was. I asked him “How do our customers call us if the mail server goes down?” I was answered by a very long silence… followed eventually by “Hmmm… never thought of that.” SMTP/POP/IMAP/Webmail… that is ALL I need, thank you. So we looked at the expanding pool of products that were filling the void being left by CGP as it ascended to “Enterprise” status. We had narrowed the field to a small handful by last week.
Then we lost at Russian Roulette again.
At 4pm PST on October 31st, which is Midnight UTC, three of the four Communigate Pro servers at our facility exploded. Their timebombs went off and they all shut themselves down. My wife had to fill in for me as the Halloween driver (we live in a rural area, so I had planned on taking my son and a few of his friends into town for trick-or-treating.) I spent the night hunched over my keyboard and on my VOIP phone (thankfully we don’t use Communigate Pro for our VOIP needs!) to my office, dealing with the crisis. Based on past events, we very quickly came to the conclusion that it was the infamous Communigate Pro Time Bomb, and not some other issue, since it happened at precisely the same time on more than one server, and we were not the only ones it was happening to. (Stalker’s mailing list, which is viewable on the web, was also exploding with angry customers.) To get us through the night we rolled the clocks back on the CGP servers, and restarted them. In the morning we started the work of figuring out how to deal with this. I emailed Stalker trying to find out why, when they had told us that THIS version was OK for us, it had still timebombed. I posted, and replied to others’ postings, on the CGP mailing list, but my account was in “moderated” mode, and the moderator was obviously not paying attention (easy to do, as that is a significant weakness of the CGP LIST module.) Vladimir Butenko appeared on the list, once again in his twisted Russian logic saying essentially ‘there is no timebomb, and besides you must be stealing my software since your server stopped working.’ Not exactly a confidence or trust building exercise in customer relations there, Vlad.
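A quick sanity check on that time conversion, as a minimal sketch. It assumes the year is 2005 (when Halloween fell on a Monday) and hard-codes Pacific Standard Time as a fixed UTC-8 offset rather than looking anything up in a time zone database; the variable names are just for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset for Pacific Standard Time (no daylight saving handling).
PST = timezone(timedelta(hours=-8), "PST")

# Midnight UTC at the start of November 1st (year assumed to be 2005).
utc_midnight = datetime(2005, 11, 1, 0, 0, tzinfo=timezone.utc)

# The same instant expressed in Pacific Standard Time.
local_time = utc_midnight.astimezone(PST)

print(local_time.strftime("%A %B %d, %I:%M %p %Z"))
```

Under those assumptions, midnight UTC on November 1st lands at 4pm on October 31st on the West Coast, right as the trick-or-treating was about to start. A trigger keyed to UTC also fires at the same instant everywhere, which fits the pattern of several servers dying at precisely the same moment.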
After careful reading of the CGP website, I finally decided that our only course of action was to downgrade to version 4.1.8, which seems to be the last of the “free upgrades” and should run on our license key obtained in 2000. Bill figured he could downgrade the software, and restart the CGP service without causing much disruption to our clients. 4.1.8 went on, we restarted, and suddenly, without warning…
The Starfish Returns!
Our mail server software is once again moving at the speed of a quaalude-soaked starfish taking a leisurely creep over the ocean floor. It is 7 weeks early, but the starfish is back… with a vengeance!
Great. Just what we need. A software vendor extorting us on one side, and clients angry at us for under-performing software on the other. My loyalty is with my clients, not the bastard that is holding the gun to my head, or the timebomb on my server as the case may be. I rally the staff and roll out a plan: we’ll build a new server from scratch, install a fresh OS and a new install of CGP 4.1.8 on it, move the data over to it and cut over the IP address. Based on our past experience, this should outwit the Starfish!
Thankfully a customer had just decommissioned a very nice Dual CPU/Dual Core Intel server with a built-in Ultra-SCSI RAID system, and we made him an offer on it that he accepted. The only problem with it was that the drives inside were low-capacity. Thankfully we have stacks of Sun StorEdge arrays in our backup system that were in an idle state, so we ripped out six 36GB LVD Ultra-SCSI drives from one and packed them in the server, installed FreeBSD on it, and started rsync on a cross-over cable between it and our production mail server. Oddly enough this went pretty fast; despite CGP being in “Starfish Mode,” the OS and filesystem were thankfully quite responsive. System load went from 0.10 to 0.34 on the production server while we were syncing… while talking to the Starfish was unbearably slow. For example, CGP’s web UI would take 15 minutes to click from page to page.
We cut over to the fresh box at around midnight on Tuesday/Wednesday, and things seemed ‘OK’… instead of talking to a starfish, it felt like talking to a sleepy dog. Movement was perceptible, but not exactly as swift as we had hoped. In past experience “starfish mode” would improve to reasonable performance in the wee hours of the night when the server was under lesser mail load. Since I was staying in my office and had nothing else to do, I vented about this situation to my online friends, and discussed it via phone with Russ Pagenkopf, the guy I run the Mac-Mgrs list with… ironically running on a Stalker-donated copy of CGP, which, also quite ironically, had timebombed too! Russ & I decided to cease running CGP on the Mac-Mgrs list server as soon as possible, and once he had it running again I posted to the list about that. I also answered people who were angry on the CGP list about what was going on with us, and some of them relayed to that list what I had said, both to them and on Mac-Mgrs. The PR backlash at Stalker/CGS was gaining momentum. I think I managed to get about 3 hours of sleep that night.
Sure enough, come Wednesday morning east-coast business hours our main server was back to moving like a starfish. I left my staff to handle the angry clients, while I swallowed my anger and called Stalker/CGS for tech support. I didn’t expect much, but luck was on my side and by chance a Director-level employee answered the phone. (When our tech support queue gets busy, I pick up the phone too!) I explained our situation with CGP 4.1.8 doing this “glacial slowdown” thing (I haven’t called it “starfish mode” with anyone at Stalker/CGS to date.) I asked him if my long-time contact was there, and he said, “yes, he just walked into the office,” so I told him to catch up with him, since he knew the full history of this almost 5-year-old problem and I didn’t have the energy to relate it all myself. After a few hours of troubleshooting (it took me 55 minutes just to get to the UI to change a password so Stalker support could access the server) I got a call from them. Three people, all director-level folks at Stalker/CGS, were on the phone and making me an offer. They would give us a 90-day License for CGP 4.3.9 to let us load that one up and see if it would fix the “Starfish Mode” bug. I was too exhausted to say anything but “it is worth a try”…. They promised me quotes for extending the 90-day license within a day.
License keys in hand, I woke up Bill, our over-worked and underslept senior sysadmin, and had him install the 4.3.9 version on our creeping starfish of a server and restart…. it seemed OK for about 30 seconds, then immediately tailspun back down to starfish mode once again. It is obvious that whatever this bug is, it has never been adequately addressed by Stalker’s coders and remains embedded deep within the current version, and probably in upcoming ones as well. The Stalker support guys were stumped, and fell back into random-mode troubleshooting again, suggesting courses of action which were either impossible to perform on such a slow moving system, or stuff they had suggested in the past – which we knew would not work.
I had a plan. It was a total “hail mary” play, but similar stunts had worked for us in the past with the Starfish. Nuke the box we had been running the mailserver on just days before… before the software timebomb exploded. Fresh install of this CGP upgrade, move the data over to it and cut over again. This may sound like what we just tried, and it is. Meanwhile I talked to MY director level guy and said, wherever we are with the proposed new mail system roll-out, hit the gas pedal and get ready to install and ramp it up ASAP! He brought me POs for gear and software, and I signed them. I wrote an apology to our clients about the situation, and posted it to our website. I grabbed my laptop and left my office for the first time in almost three days to get some fresh air and food. I had the laptop as it seems that open wireless networks are everywhere now, so if they needed me at the office I could probably get on AIM or whatnot easily.
Bill finished the install and rsync work, and we cut over to the “old” mailserver around 5 PM PST on Wednesday and….
It worked. The starfish was back in hibernation once again, and the server was behaving “normally.”
I finished up some client communications, and basically passed out on my office couch a few hours later. I slept 12 hours straight.
So, at the moment I have 90 days to get a better mail system rolled out and running. I think we can get that done. We’ll probably build a fresh, old CGP 4.1.8 system for any clients that can’t/won’t move to the new system, so we’ll stay in compliance with Stalker/CGS’ loony license scheme, and perpetually avoid the Russian Roulette with Software Timebombs present in CGP 4.2.X and who knows what subsequent versions. We’ll probably NEVER get a satisfactory answer about the causes of, or real cures for, Communigate Pro’s “Starfish Mode”… but here is my hope:
Someday, it will return. Not to *our* server, but to one of these “Enterprise Customers” that Stalker/CGS so desperately wants to trade their current customers for. Some multi-million dollar CGP “Messaging Platform” cluster installation. They’ll have hundreds of thousands of dollars invested in hardware, and of course CGP software. Their mighty cluster will slow to an inexplicable crawl. They’ll spend massive amounts of time, and eventually money, trying to cure it. Vladimir will log into it and tell them “Put a faster filesystem under it”, so they’ll blow wads and wads of cash on exotic SAN architectures or the like. VP-level guys like me will lose sleep, and in-the-trenches guys will lose even more, trying to fix the problem of wrestling with a starfish. Then, some geek in the organization will be google-surfing phrases like “CGP slow” or “glacial communigate” and stumble upon this blog entry from who knows how many years past. He’ll pass it up the chain, and somebody will gather up the guts to call me. I’ll chuckle and say “You spent HOW much money to buy this software from these idiots? What, are you NUTS?”
There, I just saved you the phone call.