site rank

Just so you know, this isn’t exactly a technorati realm blog… according to Netcraft, this site ranks as the 8,717,350th most popular website on the planet!

That is going by the name chuck.goolsbee.org; it ranks 8,789,401st under its other name, blog.goolsbee.org.

I stumbled upon this stat while doing some work investigating bandwidth usage by some of our clients. We have some folks who pull in a lot of traffic for reasons that are not readily apparent. Usually, though, the reasons are really obvious, such as:

Adam Engst’s TidBITs at 69,044
Glenn F’s isbn.nu site weighing in at 99,231
The MacSlash boys at 10,542 (a shockingly high rank, way to go guys)
Shawn King’s Your Mac Life at 67,059
John Rizzo’s MacWindows resource site at 24,534
The Steves, who have several sites high in the rankings such as BidNip at 23,828, and cheatcodes.com at 56,623.
Perennial d.f favorites Car*Toys at 74,450
And at the top of the heap, Neoseeker at 4,045

The one that caught me by surprise?

bbs.trailersailor.com at 21,007

Quoted in the Wall Street Journal.

I’ve waited until almost midnight on the west coast to say this (as I don’t really like to toot my own horn) but I was quoted in a technology story in the WSJ today. Pretty cool.

It was an interesting social experiment to see who among my “Internet friends” spotted it and emailed me today. A lot of people I assumed would be WSJ readers did not notice, and many whom I would not have picked as WSJ readers did. Interesting.

User Interface Stupidity

Apple’s QuickTime Streaming Server doesn’t have much of a UI, but there is a fatally stupid design flaw in what little it does have.

You run the UI in a web browser. I use Apple’s own Safari for my web browser (I actually use several different browsers, but for some reason I use Safari for administration tasks). To get fresh data, namely stats on the “server snapshot” page, you need to refresh the browser. But here, have a look at where the geniuses at Apple put the “Disable Server” button… note its proximity to the refresh button in Safari!

Tonight, while a good client was “live” and I was providing stats via iChat with their production staff… I accidentally clicked a few pixels too low and shut down the server. D’oh!

In my defence, I am test-driving a new mouse (an Apple Mighty Mouse btw) and my clicking is not as precise as normal, but still. They should put that “stop server” button somewhere else, don’t you think?

Murphy’s Law.

In 2002 we bought a company in the Bay Area… they were a Mac-only ISP that (typical of the day) did a little bit of everything… hosting, colo, access, development, dial-up, DSL, T1s, car washes, prostitution… *anything* to make a buck.

Peter Lalor (blog roll) ran this company. One of his friends owned an office building in one of those canyons north and west of San Rafael. Peter & friend came up with a money-making scheme for said office building:

1. Pull T1 line to building.
2. Wire office suites with 2-pair from “server closet”, kludge up DSL service
3. Profit!

The building was so far off the beaten path that margins were sky high. Tenants had NO choice but DSL from Peter for BIG$. (or Dial-up… yuck)
This is by far the most profitable piece of Peter’s portfolio of “everything + kitchen sink” businesses.

Peter’s business went near-death in the Great Collapsing NorthPoint DSL Disaster of 2001. Peter bails. Peter sells us the ISP part, and the dev biz goes to another company.

We really only want the hosting/colo, but promise to keep as much else as possible running. Over the months/years we scale back as clients move to other providers for access, dial-up, etc. I happily assist clients in these migrations as we are not really in the Internet Access business anymore. We pulled out of that right about the time when DSL was emerging. We look like geniuses in hindsight, but really we didn’t have the capital to acquire DSL infrastructure. We stopped offering Dial-up & T1 in 1997, and had frozen our ISDN business in 1999. We had decided to be JUST a hosting/colo operation by 2000. We knew how to do it, so we managed Peter’s customers’ Internet access for as long as it made sense, and they continued to pay for it.

This particular office building, however, is still in the DSL shadow of PacBell/SBC/ILEC-du-jour. We keep making money on This Building. Our CEO won’t let me migrate This Building to a $newprovider. I understand… my role is Operational, not Financial. So, every time I am in the area I visit and check on equipment, stay in touch with clients, etc. Shortly after the acquisition, we replace the PILE of Netopia routers in their “server closet” with a ($15 grand+ new!) Copper Mountain CE150 DSLAM that Peter had eBayed, but could not figure out how to deploy. Things are (mostly) good.

One year later….

Peter’s Friend sells the property.

Copper Mountain dies a deservedly miserable death.

Two years later…

Competing ISP puts wireless tower on ridge above canyon. Sends sales guy door to door in canyon offering wireless access at a fraction of the prices we charge. We lose 60% of our business in the space of two weeks. Our sales guy who managed the remaining accounts slashed prices to meet the competitor’s. We go from making Big$ to barely breaking even. I shop for cheaper T1, get it down by 25%, and we stay slightly above water.

Three years later….

Office moves, attrition, entropy… we’re close to break even.
I’m mentally prepared for the tipping point when we’ll make the call to shut it down.

Four years later….

NorCal has bad weather; a huge power spike/outage hits This Building over this past weekend. DSLAM is unresponsive.

Relay status to support staff.

I call the T1 provider, and they confirm the T1 is up. Hrm… I call the property manager and have him go to the “server closet” (which is a wall-mount rack over a sink in a dark, dank, smelly janitor’s closet!) and check the DSLAM. He says “it is fine” … but I can’t ping it, and I can’t telnet to it or get SNMP out of it. PM says “lights are on”… so I have the PM power-cycle the DSLAM. No joy.
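For the terminally curious, the remote checks I was running boil down to something like this. This is a minimal sketch, not our actual tooling: the management address, the “public” community string, and the use of the net-snmp snmpget command are all assumptions on my part.

```python
#!/usr/bin/env python3
# Minimal "is the DSLAM really alive?" check: one ICMP ping plus an SNMP poll
# of sysUpTime. The address and community string below are hypothetical
# placeholders; this also assumes the net-snmp command-line tools are installed.
import subprocess

DSLAM_IP = "192.0.2.10"              # hypothetical management address
COMMUNITY = "public"                 # hypothetical read-only community
SYSUPTIME_OID = "1.3.6.1.2.1.1.3.0"  # standard MIB-II sysUpTime

def ping(host):
    """Return True if the host answers a single ICMP echo request."""
    result = subprocess.run(["ping", "-c", "1", host],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

def snmp_uptime(host):
    """Return the device's sysUpTime string, or None if SNMP doesn't answer."""
    try:
        result = subprocess.run(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-t", "2", host, SYSUPTIME_OID],
            capture_output=True, text=True, timeout=15)
    except subprocess.TimeoutExpired:
        return None
    return result.stdout.strip() if result.returncode == 0 else None

if __name__ == "__main__":
    print("ping:", "OK" if ping(DSLAM_IP) else "FAILED")
    print("snmp:", snmp_uptime(DSLAM_IP) or "FAILED")
```

A link light on the front panel only tells you the NIC has power; an answer at layer 3 and via SNMP tells you the box is actually alive. Which, as it turned out, matters.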

Relay status to support staff.

Monday…
Call the consultant we’ve used locally for help. No answer.
Ask our bookkeeper to provide me with the cost/revenue analysis so I can make a judgment call about whether to just throw in the towel on this business.
Email MGT team that pulling out might be best if we are at or below break-even.
Speak to Cust Svc staff to start calling customers and relay the status of the situation.

Relay status to support staff.

Appeal to NANOG for remote hands. Get somebody almost instantly, almost walking distance away (!). He goes over, tells me DSLAM power supply is dead (PM was seeing LINK light on DSLAM NIC…sigh.)

Relay status to support staff

Scramble for replacement power supply for CMCE150. Rare as tits on a Bull. Find two whole units on eBay, but sellers don’t respond to email. Search takes hours. Find some on “telephone.com” for $500 each, but they are closed for day (EST). Manage to find a used hardware reseller in PST that has three! They want $750 each. Talk them down to $400 based on east coast option. Buy two, pay for overnight shipping to San Rafael.
(Murphy’s Law of Replacement Hardware: Buy one, it will fail. Buy two, the first will never fail.)

Relay status to support & CS staff.
Call PM on-site, tell him to expect FedEx, and supply tracking #.

Go to sleep happy.

Tuesday….
Wake up to email from reseller: “We missed FedEx shipping deadline.”

NOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO!!!!

Look at fax receipt and find out reseller is in Sacramento. (Actually, in a bit of irony, Roseville CA, location of company that built our current Seattle datacenter before going Tango Uniform in 2001.) While in shower consider flying to Bay Area, renting car, doing it myself. Run in-head cost/benefit and reject. Find way of getting stuff from Sacramento to San Rafael. Arrange courier, but as yet have not contacted reseller (they open at 9am PST).

Relay status to support & CS staff.

At least one client already gone. (A lawyer. There are NO worse clients than Doctors & Lawyers. grumble.)

Finally raise reseller at 9:50 (WTF!?) and tell them courier is coming. Please provide ship-from address. They email me the address… which is in SAN FRANCISCO! Arrgh!!??

Had I KNOWN that from the start I would/could have arranged pickup the PREVIOUS DAY! Grrrr.

Cancel Sacramento courier. Find courier in SF, arrange pickup, tell reseller to expect pickup within an hour.

Relay status to support & CS staff.

Call PM on-site, tell him to expect delivery by noon.

Have meeting (& lunch) with IT staff of a big client. While en route to post-meeting lunch, check office VM from cell… message from SF courier: “Ship-from site says FedEx already picked up package”(!) Manage to keep poker face on in front of important client, but consider murdering waitress with a fork to relieve near-explosive stress. Quietly relay the situation to the Sales VP… his face not so pokerish… more puckerish.

Get call during lunch from my #2 guy: “getting calls from The Building… status?” Me: “FedEx beat courier to package… down until tomorrow.” (while maintaining poker face for important client at nice eatery.) Muttering on other end of phone.

Important client leaves, and I talk with Sales VP on way back to office: “We’ll need to give these clients at least two free months to have any hope of keeping them.” He agrees.

Get back to office, find uber-apologetic emails from reseller. Play phone tag with them (I suspect they are using caller-ID to avoid me… letting all calls from WA area codes go to VM)

Relay status to support & CS staff.
Relay status to PM on-site.

Consider options for ritual suicide.

Bookkeeper FINALLY gets me cost/revenue numbers. With the Lawyer gone, we are now officially under water on cost/revenue.

Sigh.

Murphy’s Law: Whatever CAN go wrong, WILL go wrong.


Update: Wednesday, March 15, 2006

I call the property manager at The Building, and let him know that FedEx should be there at some point in the morning with the power supplies.

Relay status to support & CS staff.

At about 11 am I get a call from the property manager at The Building who tells me the FedEx guy just left, but did not deliver anything(!). Ahhh! I check the tracking on their website and it says “On Truck for Delivery 8:03 AM San Rafael”, which I relay to the PM. I hang up, start calling FedEx, and work my way through their Cust Svc system… making little or no headway.
Pull out some more hair… shave several more hours off my life…

At around 1 pm, the PM calls me. Says the boxes just arrived! We go through the replacement procedure, and the DSLAM powers up just fine. I am able to ping it from here, and my SNMP management console starts registering traffic… breathe big sigh of relief.

Relay status to support & CS staff.

All that remains now is:

1. Taking a pound of flesh from the reseller for botching the delivery on more than one level, and trying to recover some of my wasted courier and shipping costs.

2. Planning for the eventual decommission of this site, as we are losing money on it now. We’ll probably give the clients there 90 days to source a new access provider. Then we’ll finally be out of the access business entirely.

Screamin’ deal on an Xserve

A supplier we use has a bunch of Xserve Cluster Nodes: dual-CPU G5, 2 GB of RAM, 80 GB disk… 33% off retail.

Since you hardly ever see Apple gear at more than 10% to 15% off, this is a great buy. We are buying a bunch for ourselves, and if anyone is interested in grabbing one or more, let me know. Send email to: cg at forest dot net.

Delivery Boy

I shuttled a replacement server up to Vancouver yesterday. Our old DNS server “willow” finally died. Since I live halfway there I drove it up. Everything went well except for two things.

#1: I can’t find my keycard for the Peer1 facility.

No big deal, I call the NOC and a guy comes down to let me in. I walk around from the door I usually go in on the east side of the building down to the loading dock on the north side. Of course I am carrying this 40lb server. Ugh. Not good for my just barely healed back. Then the Peer1 NOC guy locks us out of the loading dock, so we trudge up the loading dock ramps, and around to the SE corner of the building… uphill all the way. My back was really hurting and by the time we got to the elevator inside the building lobby the Peer1 NOC guy must have noted the pain on my face and volunteered to carry the server for a while. We arrive in the datacenter and I’m still in my “work clothes” and a gore-tex jacket. It is HOT in the DC. I hand him the server back (we had swapped again as he unlocked doors) and stripped off the jacket. Thankfully our little server enclosure (a wire mesh “hockey locker”) has an HVAC vent right above it so while I’m working I have cold air blowing on me.

#2: The damn server doesn’t fit in the enclosure!

This trend of making servers 1U high and as long as an aircraft carrier is just completely out of control. This box is a Dell server, and it is about 1″ deeper than the rack it is in. I end up having to stand it on its nose. Plus I have to carve off the RJ-45 cable boot in order to thread the cable into the deeply recessed jack. I guess I’ll talk to Peer1 about exchanging our rack for a different one.

So now my back is hurt again, and our server is mounted vertically.

Russian Roulette …with bombs.

How one software author’s unwise decision ruined my week.

Apologies for the long one, but it explains my lack of writing anything else this week.

Monday evening, as I was getting ready to take my youngest son out for what was likely to be his last Halloween “trick or treat” with his friends (he’ll be turning 12 in a little over a month, and 11 seems to be about the time that “kid stuff” starts losing its appeal), somebody pulled a trick on me that ruined my week.

Some history first though: We use a very nice mail server package called “Communigate Pro” by what used to be named “Stalker Software.” Communigate Pro (aka “CGP”) has a reputation for being fast, stable, and scalable. For the most part this has been true for us. We have had some issues with it though over the past four and a half years. We run CGP on several servers, since CGP has been used by several of the web hosting companies we have acquired over the years. The copy of it we bought for ourselves though has been the one that has caused us problems. It runs great for 50 weeks of the year, but for a week in August or September, and a week in December or January, it completely sucks rocks. The only way I can describe it is that interacting with CGP becomes like talking to a starfish.

I watched a show once that well illustrated at least one definition of the word “Relativity.” It showed how nature has made metabolism something of a clock, and that each species operates on a relative clock speed based on its metabolism. If you time-lapse film slow-metabolism creatures like starfish, and then adjust the speed up to “match” our metabolic rate… the starfish look very active… zipping about the ocean floor, preying on urchins and other shellfish. Amazing really. The same goes in the other direction: slow down film of a hummingbird and it starts looking like any other bird. I guess to a hummingbird, a human being looks like a starfish.

Well, for two weeks out of the year our CGP mail server’s metabolism slows to that of a starfish. It works, just at a truly GLACIAL pace. The server and operating system are fine (load is low, the machine is responsive at the console, shell commands are fine, go figure). This is obviously frustrating – for both us and our clients. The fact that it comes back like clockwork at certain times of the year is very odd. We eliminated all external causes (traffic, spam, etc.) and Stalker support spent hours and hours trying to figure out what was wrong. The only suggestion they could ever come up with was “put a faster filesystem under it.” This problem appeared in whatever version of CGP we ran, and I’m pretty sure we tried them all, starting with 4.0.X and going all the way up to 4.2.X (and this week, 4.3.X… but we’ll cover that later), but they all had that odd metabolism time shift appear twice a year.

Putting a faster file system under it usually cleared up the problem, as did switching platforms. We started on FreeBSD, moved to OS X (better threading), then up to OS X Server (on an Xserve); we also jumped through all sorts of filesystem and bus technology switches, from IDE to SCSI to various RAID setups, and eventually to a 2 Gb/s FibreChannel RAID array. Last summer when the starfish returned, on a whim (well, not a whim really, more a blind rage and pique of frustration, since I wasn’t going to sink any more capital into filesystem improvements!!! Especially since they were seemingly NOT improving the situation!) I told my senior sysadmin to move the CGP directories to the internal IDE drive of the Xserve. Presto! The starfish vanished.

The server was back to its responsive, stable state. While I was happy with regard to that, since our clients weren’t angry at us, I was LIVID because all those tens of thousands of dollars we’d spent on hardware had been a placebo cure for a real software problem. Stalker (now calling themselves “Communigate Systems”… aka CGS) had no explanation for this, and just sort of slunk away.

There is another significant wrinkle to this story, which explains why I was unable and unwilling to ride Stalker/CGS harder and force the issue into some sort of resolution. In November of 2004, CGS née Stalker made significant changes to their software licensing model and jacked their prices up to well over 5.5X their previous levels. Needless to say it was a shock to their customers. Prior to this date, their software was “expensive” but a relatively good value. (IIRC we paid between $8000 and $16,000 for our CGP licenses in 2000 and 2001.) Up until 2004 the core customers for Stalker were Service Providers such as ourselves. CGP had become something of a darling in the industry press for being a solid performer and a far better value than absurdly over-done and outrageously expensive “Messaging Platforms” such as Lotus Notes and Microsoft Exchange.

I guess this attention went to the head of Stalker/CGS’ CEO and founder Vladimir Butenko, and he began transforming CGP into one of those over-done and outrageously expensive “Messaging Platforms”. Hey, in some ways I can’t blame the guy… his core market – ISPs – had gone from niche-market players to a total commodity market with NOBODY making very much money, if any. Just beyond his grasp, yet seemingly within reach, was a cash-rich “Enterprise Market” with some dominant players showing real weakness. The astounding thing is the way he decided to get there: by actively pissing off his current customers and seeding them with confusion, fear and doubt.

The existing customers, all ISPs, schools, and small businesses, were angry. Stalker/CGS left no option for a “mail only” (no calendaring, groupware, MAPI support, VOIP support, SIP/PBX functionality, etc.) version, and any continued use, other than of the VERSION YOU ORIGINALLY BOUGHT, would cost you a hefty sum in support and maintenance fees – 18% of the purchase price, which under the new pricing worked out to roughly what you had paid for the software originally! So it was like having to buy your software again every year. Customers were livid, and the sturm und drang on Stalker’s support mailing list was out of control. Stalker’s CEO, Vladimir Butenko, defended these new policies with characteristic Russian twisted logic and denial. I don’t know how to say “tough shit” in Russian, but that is what he did, albeit in far more diplomatic terms.

What he didn’t tell anyone at the time was that he ensured compliance with his new licensing scheme and inflated prices by inserting a “time bomb” into Communigate Pro. If your server thought it wasn’t properly licensed, it would cease to run at midnight UTC on some arbitrary date, and then, if re-launched, would shut itself down every 15 or 20 minutes thereafter. No warning. No coherent error code. No reason why. Bang. Boom. Off. Dead.

This was done without any announcement or warning. To add insult to injury, none of us customers had any idea which versions of Communigate Pro had the timebomb code in them or what the explosion dates were. It was truly “Russian Roulette.”

Up until 2005, the standard refrain from Stalker Tech Support for any issue was “Please upgrade to the latest version of Communigate Pro.” The support and sales staff frequently touted the benefit of “free upgrades” of their software. You got your value and return on your initial investment by always being able to stay current and get your bug fixes. We had changed versions via upgrade countless times, as we obviously had at least ONE big ugly bug, which unfortunately was never fixed. I don’t recall what version of CGP we were running when the license change was announced, but I knew that in February of 2005, when the first of what I now assume are going to be many CGP timebombs exploded, we were running a version we apparently weren’t licensed for… despite the fact that we had probably upgraded to it two months before while troubleshooting our latest visit from the Communigate Pro Starfish Mode. CGP servers around the globe all blew up at midnight UTC on February 1st, 2005, including one of ours. Predictably, the CGP support mailing lists, newsgroups, etc. also exploded with angry, frustrated customers. I called the guy at Stalker whom we originally bought the software from and asked him flat out, “OK, tell me exactly what version of CGP we are allowed to run so that this timebomb won’t affect us again.” Bill, my senior sysadmin, downgraded us to that version on February 1st, and life went on.

Later in 2005 our CGP Starfish returned, and that is when we tried the “move to internal IDE disk” trick, which worked. I had not paid Stalker that hefty price for support and maintenance (or as they ironically call it in their emails to me, “S&M”) so I was in no position to demand that they admit this “starfish mode” bug exists and fix it. I was stuck at the version we were running, in perpetuity. Such is the Kafka-esque world of software licensing. Instead I directed my staff to start evaluating alternatives to Communigate Pro. I didn’t want to be the victim of extortion, paying for the development of features for “Enterprise Customers” that we would NEVER use. Here is a great example: I was on the phone with a guy from Stalker/CGS and he was telling me how great their PBX/SIP/VOIP system was. I asked him, “How do our customers call us if the mail server goes down?” I was answered by a very long silence… followed eventually by “Hmmm… never thought of that.” SMTP/POP/IMAP/Webmail… that is ALL I need, thank you. So we looked at the expanding pool of products that were filling the void being left by CGP as it ascended to “Enterprise” status. We had narrowed the field to a small handful by last week.

Then we lost at Russian Roulette again.

At 4pm PST on October 31st, which is midnight UTC, three of the four Communigate Pro servers at our facility exploded. Their timebombs went off and they all shut themselves down. My wife had to fill in for me as the Halloween driver (we live in a rural area, so I had planned on taking my son and a few of his friends into town for trick-or-treating). I spent the night hunched over my keyboard and on my VOIP phone to my office (thankfully we don’t use Communigate Pro for our VOIP needs!) dealing with the crisis. Based on past events, we very quickly came to the conclusion that it was the infamous Communigate Pro Time Bomb, and not some other issue, since it happened at precisely the same time on more than one server, and we were not the only ones it was happening to. (Stalker’s mailing list, which is viewable on the web, was also exploding with angry customers.) To get us through the night we rolled the clocks back on the CGP servers and restarted them. In the morning we started the work of figuring out how to deal with this. I emailed Stalker trying to find out why, when they had told us that THIS version was OK for us, it had still timebombed. I posted, and replied to others’ postings, on the CGP mailing list, but my account was in “moderated” mode, and the moderator was obviously not paying attention (easy to do, as that is a significant weakness of the CGP LIST module). Vladimir Butenko appeared on the list, once again with his twisted Russian logic, saying essentially ‘there is no timebomb, and besides, you must be stealing my software since your server stopped working.’ Not exactly a confidence- or trust-building exercise in customer relations there, Vlad.

After careful reading of the CGP website, I finally decided that our only course of action was to downgrade to version 4.1.8, which seems to be the last of the “free upgrades” and should run on our license key obtained in 2000. Bill figured he could downgrade the software, and restart the CGP service without causing much disruption to our clients. 4.1.8 went on, we restarted, and suddenly, without warning…

The Starfish Returns!

Our mail server software is once again, moving at the speed of a quaalude-soaked starfish taking a leisurely creep over the ocean floor. It is 7 weeks early, but the starfish is back… with a vengeance!

Great. Just what we need. A software vendor extorting us on one side, and clients angry at us for under-performing software on the other. My loyalty is with my clients, not the bastard that is holding the gun to my head, or the timebomb on my server, as the case may be. I rally the staff and roll out a plan: we’ll build a new server from scratch, install a fresh OS and a new copy of CGP 4.1.8 on it, move the data over to it, and cut over the IP address. Based on our past experience, this should outwit the Starfish!

Thankfully a customer had just decommissioned a very nice dual-CPU/dual-core Intel server with a built-in Ultra-SCSI RAID system, and we made him an offer on it that he accepted. The only problem with it was that the drives inside were low-capacity. We have stacks of Sun StorEdge arrays in our backup system that were sitting idle, so we ripped six 36 GB LVD Ultra-SCSI drives out of one and packed them into the server, installed FreeBSD on it, and started rsync over a cross-over cable between it and our production mail server. Oddly enough this went pretty fast; despite CGP being in “Starfish Mode”, the OS and filesystem are thankfully quite responsive. System load went from 0.10 to 0.34 on the production server while we were syncing… while talking to the Starfish was unbearably slow. For example, CGP’s web UI would take 15 minutes to click from page to page.
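The sync itself was nothing exotic: repeated rsync passes of the mail store over the crossover link, then a final catch-up pass right before the cutover. Here is a rough sketch of the idea; the “newmail” hostname, the /var/CommuniGate path, and the pass counts are illustrative assumptions on my part, not our actual scripts.

```python
#!/usr/bin/env python3
# Sketch of the mail-store migration: warm rsync passes while the old server
# is still accepting mail, then one final pass before the IP cutover.
# "newmail" and /var/CommuniGate/ are hypothetical placeholders.
import subprocess
import sys

SRC = "/var/CommuniGate/"          # CGP base directory on the old box (assumed)
DST = "newmail:/var/CommuniGate/"  # new server, reached over the crossover cable

def rsync_pass():
    """One rsync pass preserving permissions, times, owners, and hard links."""
    cmd = ["rsync", "-aH", "--delete", SRC, DST]
    print("running:", " ".join(cmd))
    return subprocess.call(cmd)

# Warm passes: each one has less delta to move than the last.
for _ in range(3):
    rsync_pass()

# At this point you'd stop the CGP service on the old box, then do the final
# catch-up pass so nothing delivered in the meantime gets left behind.
if rsync_pass() != 0:
    sys.exit("final rsync pass failed; do not cut over")
```

Since rsync only moves the deltas on each pass, even with CGP itself crawling, the filesystem-level copy stays quick.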

We cut over to the fresh box at around midnight on Tuesday/Wednesday, and things seemed ‘OK’… instead of talking to a starfish, it felt like talking to a sleepy dog. Movement was perceptible, but not exactly as swift as we had hoped. In past experience, “starfish mode” would improve to reasonable performance in the wee hours of the night when the server was under a lighter mail load. Since I was staying in my office and had nothing else to do, I vented about this situation to my online friends and discussed it via phone with Russ Pagenkopf, the guy I run the Mac-Mgrs list with… a list ironically running on a Stalker-donated copy of CGP, which quite ironically had also timebombed! Russ & I decided to cease running CGP on the Mac-Mgrs list server as soon as possible, and once he had it running again I posted to the list about that. I also answered people on the CGP list who were angry about what was going on with us, and some of them relayed to that list what I had said, both to them and on Mac-Mgrs. The PR backlash at Stalker/CGS was gaining momentum. I think I managed to get about 3 hours of sleep that night.

Sure enough, come Wednesday morning east-coast business hours, our main server was back to moving like a starfish. I left my staff to handle the angry clients, while I swallowed my anger and called Stalker/CGS for tech support. I didn’t expect much, but luck was on my side and by chance a Director-level employee answered the phone. (When our tech support queue gets busy, I pick up the phone too!) I explained our situation with CGP 4.1.8 doing this “glacial slowdown” thing (I haven’t called it “starfish mode” with anyone at Stalker/CGS to date). I asked him if my long-time contact was there, and he said, “yes, he just walked into the office,” so I told him to catch up with him, since he knew the full history of this almost-five-year-old problem and I didn’t have the energy to relate it to him. After a few hours of troubleshooting (it took me 55 minutes just to get into the UI to change a password so Stalker support could access the server) I got a call from them. Three people, all director-level folks at Stalker/CGS, were on the phone making me an offer. They would give us a 90-day license for CGP 4.3.9 to let us load that one up and see if it would fix the “Starfish Mode” bug. I was too exhausted to say anything but “it is worth a try”…. They promised me quotes for extending the 90-day license within a day.

License keys in hand, I woke up Bill, our over-worked and under-slept senior sysadmin, and had him install the 4.3.9 version on our creeping starfish of a server and restart…. It seemed OK for about 30 seconds, then immediately tailspun back down into starfish mode once again. It is obvious that whatever this bug is, it has never been adequately addressed by Stalker’s coders and remains embedded deep within the current version, and probably in upcoming ones as well. The Stalker support guys were stumped, and fell back into random-mode troubleshooting again, suggesting courses of action which were either impossible to perform on such a slow-moving system, or stuff they had suggested in the past – which we knew would not work.

I had a plan. It was a total “hail mary” play, but similar stunts had worked for us in the past with the Starfish. Nuke the box we had been running the mailserver on just days before… before the software timebomb exploded. Do a fresh install of this CGP upgrade on it, move the data over, and cut over again. This may sound like what we just tried, and it does. Meanwhile I talked to MY director-level guy and said: wherever we are with the proposed new mail system roll-out, hit the gas pedal and get ready to install and ramp it up ASAP! He brought me POs for gear and software, and I signed them. I wrote an apology to our clients about the situation, and posted it to our website. I grabbed my laptop and left my office for the first time in almost three days to get some fresh air, and food. I had the laptop along since open wireless networks seem to be everywhere now, so if they needed me at the office I could probably get on AIM or whatnot easily.

Bill finished the install and rsync work, and we cut over to the “old” mailserver around 5 PM PST on Wednesday and….

It worked. The starfish was back in hibernation once again, and the server was behaving “normally.”

I finished up some client communications, and basically passed out on my office couch a few hours later. I slept 12 hours straight.

So, at the moment I have 90 days to get a better mail system rolled out and running. I think we can get that done. We’ll probably build a fresh CGP 4.1.8 system on which to leave any clients that can’t/won’t move to the new system, so we’ll stay in compliance with Stalker/CGS’ loony license scheme, and perpetually avoid the Russian Roulette with software timebombs present in CGP 4.2.X and who knows what subsequent versions. We’ll probably NEVER get a satisfactory answer about the causes of, or real cures for, Communigate Pro’s “Starfish Mode”… but here is my hope:

Someday, it will return. Not to *our* server, but to one of these “Enterprise Customers” that Stalker/CGS so desperately wants to trade their current customers for. Some multi-million-dollar CGP “Messaging Platform” cluster installation. They’ll have hundreds of thousands of dollars invested in hardware, and of course CGP software. Their mighty cluster will slow to an inexplicable crawl. They’ll spend massive amounts of time, and eventually money, trying to cure it. Vladimir will log into it and tell them “put a faster filesystem under it,” so they’ll blow wads and wads of cash on exotic SAN architectures or the like. VP-level guys like me will lose sleep, and in-the-trenches guys will lose even more, trying to fix the problem of wrestling with a starfish. Then some geek in the organization will be google-surfing phrases like “CGP slow” or “glacial communigate” and stumble upon this blog entry from who knows how many years past. He’ll pass it up the chain, and somebody will gather up the guts to call me. I’ll chuckle and say, “You spent HOW much money to buy this software from these idiots? What, are you NUTS?”

There, I just saved you the phone call.