PowerBook repair wrap-up & epilogue

I never did post a post-repair summary when Apple finished the work back in February. Today something happened that prompted me to finally write it up. My conclusion, basically, is that Apple has seriously dropped the ball with regard to the repair process at their retail stores.

From start to finish it was a BAD experience for me. Things that should take minutes stretched into HOURS. Today I found out I am not alone in that experience. So let me summarize the wait/queue/loiter times for you here, then I’ll follow up with a rant from a friend of mine (also a digital.forest client) who just had an experience very similar to my own with repair at an Apple Store.

1. Diagnosis and Drop-off: ~30-minute wait time.
I walked into the store, with a known recall issue, complete with a printout of the Apple web page outlining the recall. I had to turn away from the counter and make a reservation at another computer. It said I had a 15-minute wait to see the “Genius”.
The Genius wasn’t helping anyone at that moment. A few minutes later he called some names. Mine wasn’t one of them. None of the people who had reservations were there. Then he vanished into the back room for some time. Five minutes AFTER my assigned time, he re-appeared and called SOMEBODY ELSE’S name… who also wasn’t there. He fiddled around on a computer for a bit, then called my name. We then went through diagnosis/etc… documented elsewhere.

He was about to take the computer away when he asked if I had a backup of my data. I did, but it was a day old, so I decided to take it back to my office, make a new one, and bring it back the next day. Since all the paperwork was done, I assumed that the drop off would be easy. It wasn’t.

2. Drop-off, take two: ~90-minute wait time(!)
OK, so I returned the next night to drop off my PowerBook. For some reason I thought that since I had been through the entire process of diagnosis and filling out the paperwork/repair work order/etc., I could just walk in and hand it to them. Boy was I wrong. I walked in, and just like the night before they would not even talk to me until after I had made a reservation. I checked in on their computer reservation system only to find out they were booked for another 70+ minutes. I grabbed the next open slot, walked out to my car, grabbed a book, and went off to find something to eat elsewhere in the mall. I got a meal from Taco Time, sat and read my book for an hour, then ambled back to the Apple Store. Unlike an hour before, when the store was literally empty, it was now stuffed with humans. This is one of Apple’s “Mini stores”, so it was unimaginably chaotic with that many people. My reservation time came and went, and I was still loitering. FINALLY they called my name. I walked up, handed them the PowerBook, confirmed my name, turned around, and walked out.
Summary: a 30-second transaction that took an hour and a half, thanks to an inane process.

3. Repair Wait Time: Two Weeks. (documented elsewhere)

4. Pick-up: ~120-minute wait time. The Icing on the Cake of Bad Customer Experience.
They called me to let me know it was ready to pick up. I could not get there for almost 24 hours because I was 70+ miles away, so I showed up the next night. The store was moderately busy. I walked up to a store employee and told her that I had received a call to pick up my repaired PowerBook. Did she walk back and grab it? No. Her answer, as if she were an automaton: “You need to make a reservation.” “Just to pick it up?” I asked. “Yes sir,” she replied. Frustrated, I spun around, once again entered my name into the queue, and found that I was destined to wait at least 45 minutes. Sigh. I ambled out to the car and listened to music for a while… to soothe my savage breast, I guess. Thirty minutes went by and I headed back into the store.

It is a literal mob scene. There are at least 15 people loitering about the counter in the back of the store. Two people are being assisted, one with an iPod, another with a G5. There is a display on the back wall which alternates between “Mac Hints” and a status board. The status board is obviously broken since it just repeats the same thing over and over:

Next Customers:
Next open slot at about:

No data, mind you, JUST the text above, so I was ignorant of how many of these people were ahead of me. In the end it turned out to be ALL of them. I waited, and waited, and waited. I waited some more. They helped everyone in their turn. The crowded store slowly became less crowded. They called lots of names. Easily 90% of those people were not there. They would call a name and wait… it was like an ’80s replay: “Bueller? Bueller? …Bueller?” Then there would be a pause, and another name… and a pause, and another name. At one point I started counting the names without bodies… there were over 15 called, while at least four actual humans just stood there waiting to be called. I waited and waited and waited right along with them.

So in the end, I waited through everyone who was there when I arrived, and a whole bunch of people who weren’t even there(!), until it was down to me and one other person… and they called her name. Sigh. By now I had been standing around for well over an hour. All that time I had made direct eye contact with every single Apple Store employee on several occasions. Not a single one of them asked me if I needed help, or anything.

They finished up with the woman, and then called several other names(!). I was the ONLY person in the whole damn store other than staff. This is beyond absurd. It is surreal. Why the guy didn’t just speak directly to me is something I’ll never know.

The Genius (I guess they must call them that for their deductive reasoning skills!) finally said, “Then you must be Chuck.” I affirmed his less-than-brilliant deduction and told him I was there to pick up my repaired PowerBook.

There are times I wish I could capture a moment and hold it… spin it in my hand, carry it away, to be replayed for another person. This moment was one of them. It was obvious that this “Genius” felt a profound sense of embarrassment at that very moment. He knew how long I had been waiting. He had seen me patiently waiting, standing in the very same spot, for well over an hour. In fact he had seen me walk into the store almost two hours before and seen me get turned away. The whole surreal absurdity of this stupid reservation system, of forcing people to queue like soup-kitchen panhandlers or Soviet-era bread lines, finally collided with his retail reality. I wish I could have captured that moment so that I could transport it down to Cupertino and reveal it to the pinheads who thought up this insanity, and provide them with a clue about why it is antithetical to what a good service organization does for and with its customers. Unfortunately I couldn’t capture that moment… and these words cannot do it justice.

He vanished into the back and returned with my PowerBook and the store manager, who apologized for having me wait so long, thanked me for being so patient (yes… he was among the store staff whom I had looked right in the eye many times over the past two-plus hours, and no… he had never said anything to me up until this very moment), and handed me a 10% discount coupon good for anything in the store.

I was happy to have my PowerBook back. I’m still dismayed at how difficult it was just to drop it off and pick it up, though. As I said earlier, this experience pales in comparison to my previous experience with Apple repair, which, despite happening in the “bad old days,” was a fantastic customer service experience, especially in light of this recent one from an Apple Computer that is supposedly so much better than before. To sum it up:

1996 PowerBook Repair Time & Effort:
* 5 minutes of my time
* 2 days of Apple’s time

2006 PowerBook Repair Time & Effort:
* 4+ hours of my time (largely spent being actively ignored while in close proximity to Apple Store staff)
* 14 days of Apple’s time

The insult to injury: A 10% Discount, should I decide to reward this bizarre treatment with my money.

I suspect the store manager was sincere in his belief that he was somehow giving me something valuable in exchange for my time. I can’t fault him for that, but honestly I doubt I’ll ever walk into that store again… or ANY Apple Store, for that matter. I have been into two of them in search of assistance so far, and I can say without hesitation that I have had dental work that was a far more pleasurable experience. Root canal? Sure. Apple Store? Not unless they let me suck on nitrous oxide at the Genius Bar.


It took me almost two months to get around to writing the above. Mostly because, like with my other rants about bad experiences, I wanted time to cool off and look back without raw emotion clouding my judgement and making me sound like a raving lunatic. I work in a service business myself, and I know how tough it is to take a raving lunatic seriously. I’m a patient guy… probably too patient for my own good, since I could have created a scene at virtually any point in the above situations and accelerated the outcome. In hindsight perhaps I should have been a squeaky wheel… I’m just not that kind of guy though.

Today however, I heard from that friend and client of mine, Sam Crutsinger, who related a tale very similar to my experience with getting repair service from an Apple store. Unlike me, he wasn’t patient, and in fact gave up on dealing with Apple and found service elsewhere. Here is his story as told to the Your Mac Life mailing list:

Date: Sat, 25 Mar 2006 18:42:52 -0600
From: Sam Crutsinger
To: Your Mac Life
Subject: Crap Service at Apple Store

Riddle me this, Batman... I have a Mac that won't boot up. Its drives never
spin up. I've hit the mobo reset button, I've pulled the extra RAM and HD
and everything else but it's still not coming up. It lights up the power
light and the fans spin, but there's no startup chime, no HD whir, no caps
lock wink or anything that says "I'm alive!"

So I take it down to the Austin Apple Store at Barton Creek Mall to drop it
off. It's only a couple of months old but old enough that I'm cool with
repair instead of replace.

I get down there and say, "I need to drop this one off for repair. It won't
come up," to which I'm met with "Do you have an appointment?"

Sigh.

Could someone please tell me what nimrod corporate weasel came up with that
system?

At first I thought it was cute. It was like they were trying to make
computer repair seem like a shi-shi experience. Now it's gotten out of hand.
Today the system is so NOT cute that I very nearly made a very loud scene in
the middle of the Apple Store about it. The only thing that kept me from
going off was the fact that before things reached "absurd," I'd already put
in my name and email address to see when the next available reservation slot
was open. If I could have gotten out of there anonymously I would have made
a speech to the masses.

The funny thing was that there were so many store employees in the room that
the floor manager had just instructed the sales kiddies to spread out evenly
so they could be more effective at standing there with nobody asking them
any questions. That was just before I engaged him with a rousing game of
"Take it!" "No" "Take it!" "No."

I told him that I just wanted to leave the computer. "It's dead.
[troubleshooting play by play] So there's nothing to diagnose. I just want
to drop it off and get a warranty repair going."

"The geniuses are the only ones who can check in the computer for repair."

That's when my blood hit about 212°F (That's 100°C for you people with your
fancy metrics.)

"So let me get this straight. I have to come back in FOUR HOURS just to drop
this computer off for repair?"

"Yes sir. The computer has to be entered into the system and that has to be
done by one of the people at the bar."

"NONE of all these people standing around can type the numbers into the
system for a broken computer?! You're telling me that you, the MANAGER,
can't just put it in?"

"No, it has to be the Geniuses."

"So I can't just drop this off and you'll fix it later."

He went off on some justification about how they found that this system
worked better because somehow having a "genius" do data entry prevented
Apple from losing systems or some such crap.

I said that I don't want to talk to a genius. I don't want to wait for 4
hours. I don't want to waste my time going through all this BS. I just
wanted to drop off the computer.

He actually said something along the lines of "This is the way it works
everywhere. Where can you just walk into a store and drop off a computer?"

My friend and I listed off several choices off the top of our heads which
seemed to genuinely surprise this guy.

After pushing a bit more the guy started getting cute and telling me that I
was welcome to leave the computer sitting by the bar and leave but anybody
could just walk off with it and they wouldn't be held responsible for
it...and I'd still have to come back in 4 hours to fill out the paperwork.

What the hell is wrong with Apple support these days? There's no reason for
this sort of thing to be going down. I can understand where it could be good
for dealing with the riff raff just trying to set up their email or learn
how to make their iPod reset, but to lump all of it together is absurd.
Apple needs to have a fast track drop off where you can just leave your
contact info and leave the computer and they can get to it when they get to
it. I don't need a hand job from a "Genius" just to drop off a computer. All
you need is a high school drop-out with a computer who can read and maybe
operate a barcode scanner. Actually, the literacy bit is probably optional.

Oh, and don't get me started on the ProCare that they brought up more than
once. If there's nobody to help your clients then there's nobody to help
your clients. Are you saying that if I had a card, a space would suddenly
open up?

So after getting all pissed off and feeling like I'd just been bounced from
some trendy night club, I went to CompUSA, an authorized Apple service
center, and dropped off the computer. I walked straight up to the counter,
waited as the one guy ahead of me was showing the tech his problem for a
couple of minutes, and then the tech said "Let me get someone for you." and
he called for backup. Another tech came out and took my info and then handed
me some paperwork and I left. How screwed up is the world when **COMPUSA**,
the company with possibly the worst Mac track record in history, can just
take in a broken computer and send me on my way without a reservation?

A part of me would love to just find someone to make a reservation bot that
could go in and fill all their time slots every day with randomly generated
names and phone numbers. The reservation system needs to be either fixed or
destroyed.

--
Sam Crutsinger
Media Kingpin, TackyShirt
http://www.tackyshirt.com/
Training and Fun are NOT mutually exclusive


Ditto. Sam is right. Apple needs to take a look at the repair program inside their retail stores and fix it. The retail angle might be working for them in terms of sales, but in terms of SERVICE, it is just plain awful if this is par for the course. Technical support is a channel into repair, but not ALL repair requires a “Genius” and an appointment. If CompUSA can figure it out, Apple should be able to as well.

Ironically, I used to have all service work on my Apple gear (except for the specific PowerBook 5300 issue mentioned earlier) done by a local reseller, Westwind Computing, which unfortunately went out of business last year. The reason for their demise? Apple going into retail, of course. I could have walked in, dropped the machine off, had it fixed within days, and picked it up without delay. Their owner might even have taken me to lunch.

I don’t expect a lunch from the Apple Store, but I would hope that they would at least have their feces amalgamated.

User Interface Stupidity

Apple’s QuickTime Streaming Server doesn’t have much of a UI, but there is a fatally stupid design flaw in what little it does have.

You run the UI in a web browser. I use Apple’s own Safari (I actually use several different browsers, but for some reason I use Safari for administration tasks). To get fresh data, namely the stats on the “server snapshot” page, you need to refresh the browser. But here, have a look at where the geniuses at Apple put the “Disable Server” button… note its proximity to the refresh button in Safari!

Tonight, while a good client was “live” and I was providing stats via iChat with their production staff… I accidentally clicked a few pixels too low and shut down the server. D’oh!

In my defence, I am test-driving a new mouse (an Apple Mighty Mouse btw) and my clicking is not as precise as normal, but still. They should put that “stop server” button somewhere else, don’t you think?

Murphy’s Law.

In 2002 we bought a company in the Bay Area… a Mac-only ISP that (typical of the day) did a little bit of everything… hosting, colo, access, development, dial-up, DSL, T1s, car washes, prostitution… *anything* to make a buck.

Peter Lalor (blog roll) ran this company. One of his friends owned an office building in one of those canyons north and west of San Rafael. Peter & friend came up with a money-making scheme for said office building:

1. Pull T1 line to building.
2. Wire office suites with 2-pair from “server closet”, kludge up DSL service
3. Profit!

The building was so far off the beaten path that margins were sky high. Tenants had NO choice but DSL from Peter for BIG$ (or dial-up… yuck).
This was, by far, the most profitable piece of Peter’s portfolio of “everything + kitchen sink” businesses.

Peter’s business went near-death in the Great Collapsing NorthPoint DSL Disaster of 2001. Peter bails. Peter sells us the ISP part, and the dev biz goes to another company.

We really only want the hosting/colo, but promise to keep as much else as possible running. Over the months/years we scale back as clients move to other providers for access, dial-up, etc. I happily assist clients in these migrations as we are not really in the Internet Access business anymore. We pulled out of that right about the time when DSL was emerging. We look like geniuses in hindsight, but really we didn’t have the capital to acquire DSL infrastructure. We stopped offering Dial-up & T1 in 1997, and had frozen our ISDN business in 1999. We had decided to be JUST a hosting/colo operation by 2000. We knew how to do it, so we managed Peter’s customers’ Internet access for as long as it made sense, and they continued to pay for it.

This particular office building, however, is still in a DSL coverage shadow as far as Pacbell/SBC/ILEC-du-jour is concerned. We keep making money on This Building. Our CEO won’t let me migrate This Building to a $newprovider. I understand… my role is Operational, not Financial. So, every time I am in the area I visit and check on the equipment, stay in touch with clients, etc. Shortly after the acquisition, we replaced the PILE of Netopia routers in their “server closet” with a Copper Mountain CE150 DSLAM (over $15 grand new!) that Peter had eBayed but could never figure out how to deploy. Things are (mostly) good.

One year later….

Peter’s Friend sells the property.

Copper Mountain dies a deservedly miserable death.

Two years later…

A competing ISP puts a wireless tower on the ridge above the canyon and sends a sales guy door to door in the canyon offering wireless access at a fraction of the prices we charge. We lose 60% of our business in the space of two weeks. Our sales guy who manages the remaining accounts slashes prices to meet the competitor’s. We go from making Big$ to barely breaking even. I shop for a cheaper T1, get it down by 25%, and we stay slightly above water.

Three years later….

Office moves, attrition, entropy… we’re close to break even.
I’m mentally prepared for the tipping point when we’ll make the call to shut it down.

Four years later….

NorCal has bad weather; a huge power spike/outage hits This Building over this past weekend. The DSLAM is unresponsive.

Relay status to support staff.

I call the T1 provider; they confirm the T1 is up. Hrm… I call the property manager and have him go to the “server closet” (which is a wall-mount rack over a sink in a dark, dank, smelly janitor’s closet!) and check the DSLAM. He says “it is fine”… I can’t ping it, I can’t telnet to it or get SNMP out of it. PM says “lights are on”… have the PM power-cycle the DSLAM. No joy.
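
Aside: for anyone wondering what “can’t ping, telnet, or SNMP” boils down to in practice, here is a rough sketch of that kind of reachability check. The management address and community string below are placeholders, and it assumes the standard ping and net-snmp snmpget command-line tools are available.

```python
#!/usr/bin/env python3
# Minimal reachability probe -- illustrative only. The address and SNMP
# community below are placeholders, not a real device.
import socket
import subprocess

DSLAM_IP = "192.0.2.10"      # hypothetical management address
SNMP_COMMUNITY = "public"    # hypothetical read-only community string

def can_ping(host):
    """Send one ICMP echo request via the system ping command."""
    return subprocess.call(["ping", "-c", "1", host]) == 0

def can_telnet(host, port=23, timeout=5):
    """Can we even open a TCP connection to the telnet management port?"""
    try:
        socket.create_connection((host, port), timeout).close()
        return True
    except OSError:
        return False

def can_snmp(host, community):
    """Ask for sysUpTime.0 with net-snmp's snmpget; nonzero exit = no answer."""
    return subprocess.call(
        ["snmpget", "-v1", "-c", community, "-t", "2",
         host, "1.3.6.1.2.1.1.3.0"]) == 0

if __name__ == "__main__":
    print("ping:  ", "OK" if can_ping(DSLAM_IP) else "no response")
    print("telnet:", "OK" if can_telnet(DSLAM_IP) else "no response")
    print("snmp:  ", "OK" if can_snmp(DSLAM_IP, SNMP_COMMUNITY) else "no response")
```

None of these checks care whether the front-panel lights are on.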

Relay status to support staff.

Monday…
Call the consultant we’ve used locally for help. No answer.
Ask our bookkeeper to provide me with the cost/revenue analysis so I can make a judgement call about whether to just throw in the towel on this business.
Email MGT team that pulling out might be best if we are at or below break even.
Speak to Cust Svc staff to start calling down customers and relay status of situation.

Relay status to support staff.

Appeal to NANOG for remote hands. Get somebody almost instantly, almost walking distance away (!). He goes over, tells me DSLAM power supply is dead (PM was seeing LINK light on DSLAM NIC…sigh.)

Relay status to support staff

Scramble for replacement power supply for CMCE150. Rare as tits on a Bull. Find two whole units on eBay, but sellers don’t respond to email. Search takes hours. Find some on “telephone.com” for $500 each, but they are closed for day (EST). Manage to find a used hardware reseller in PST that has three! They want $750 each. Talk them down to $400 based on east coast option. Buy two, pay for overnight shipping to San Rafael.
(Murphy’s Law of Replacement Hardware: Buy one, it will fail. Buy two, first will never fail.)

Relay status to support & CS staff.
Call PM on-site tell him to expect FedEx, and supply tracking #.

Go to sleep happy.

Tuesday….
Wake up to email from reseller: “We missed FedEx shipping deadline.”

NOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO!!!!

Look at fax receipt and find out reseller is in Sacramento. (Actually, in a bit of irony, Roseville CA, location of company that built our current Seattle datacenter before going Tango Uniform in 2001.) While in shower consider flying to Bay Area, renting car, doing it myself. Run in-head cost/benefit and reject. Find way of getting stuff from Sacramento to San Rafael. Arrange courier, but as yet have not contacted reseller (they open at 9am PST).

Relay status to support & CS staff.

At least one client already gone. (A lawyer. There are NO worse clients than Doctors & Lawyers. grumble.)

Finally raise reseller at 9:50 (WTF!?) and tell them courier is coming. Please provide ship-from address. They email me the address… which is in SAN FRANCISCO! Arrgh!!??

Had I KNOWN that from the start I would/could have arranged pickup the PREVIOUS DAY! Grrrr.

Cancel Sacramento courier. Find courier in SF, arrange pickup, tell reseller to expect pickup within an hour.

Relay status to support & CS staff.

Call PM on-site tell him to expect delivery by noon.

Have meeting (& lunch) with IT staff of a big client. While en route to the post-meeting lunch, check office VM from cell… message from SF courier: “Ship-from site says FedEx already picked up package”(!) Manage to keep poker face on in front of important client, but consider murdering waitress with a fork to relieve near-explosive stress. Quietly relay situation to Sales VP… his face not so pokerish… more puckerish.

Get call during lunch from my #2 guy: “getting calls from The Building… status?” Me: “FedEx beat courier to package… down until tomorrow.” (while maintaining poker face for important client at nice eatery.) Muttering on other end of phone.

Important client leaves, and I talk with Sales VP on way back to office: “We’ll need to give these clients at least two free months to have any hope of keeping them.” He agrees.

Get back to office, find uber-apologetic emails from reseller. Play phone tag with them (I suspect they are using caller-ID to avoid me… letting all calls from WA area codes go to VM)

Relay status to support & CS staff.
Relay status to PM on-site.

Consider options for ritual suicide.

Bookkeeper FINALLY gets me cost/revenue numbers. With the Lawyer gone, we are now officially under water on cost/revenue.

Sigh.

Murphy’s Law: Whatever CAN go wrong, WILL go wrong.


Update: Wednesday, March 15, 2006

I call the property manager at The Building, and let him know that FedEx should be there at some point in the morning with the power supplies.

Relay status to support & CS staff.

At about 11 am I get a call from the property manager at The Building, who tells me the FedEx guy just left but did not deliver anything(!). Ahhh! I check the tracking on their website and it says “On Truck for Delivery 8:03 AM San Rafael”, which I relay to the PM. I hang up, start calling FedEx, and work my way through their Cust Svc system… making little or no headway.
Pull out some more hair… shave several more hours off my life…

At around 1 pm, the PM calls me. Says the boxes just arrived! We go through the replacement procedure, and the DSLAM powers up just fine. I am able to ping it from here, and my SNMP management console starts registering traffic… breathe big sigh of relief.

Relay status to support & CS staff.

All that remains now is:

1. Taking a pound of flesh from the reseller for botching the delivery on more than one level, and trying to recover some of my wasted courier and shipping costs.

2. Planning for the eventual decommission of this site, as we are losing money on it now. We’ll probably give the clients there 90 days to source a new access provider. Then we’ll finally be out of the access business entirely.

Delivery Boy

I shuttled a replacement server up to Vancouver yesterday. Our old DNS server “willow” finally died. Since I live halfway there I drove it up. Everything went well except for two things.

#1: I can’t find my keycard for the Peer1 facility.

No big deal, I call the NOC and a guy comes down to let me in. I walk around from the door I usually go in on the east side of the building down to the loading dock on the north side. Of course I am carrying this 40 lb server. Ugh. Not good for my just-barely-healed back. Then the Peer1 NOC guy locks us out of the loading dock, so we trudge up the loading dock ramps and around to the SE corner of the building… uphill all the way. My back is really hurting, and by the time we get to the elevator inside the building lobby the Peer1 NOC guy must have noted the pain on my face, because he volunteers to carry the server for a while. We arrive in the datacenter and I’m still in my “work clothes” and a Gore-Tex jacket. It is HOT in the DC. I hand him the server back (we had swapped again as he unlocked doors) and strip off the jacket. Thankfully our little server enclosure (a wire mesh “hockey locker”) has an HVAC vent right above it, so while I’m working I have cold air blowing on me.

#2: The damn server doesn’t fit in the enclosure!

This trend of making servers 1U high and as long as an aircraft carrier is just completely out of control. This box is a Dell server, and it is about 1″ deeper than the rack it is in. I end up having to stand it on its nose. Plus I have to carve off the RJ-45 cable boot in order to thread the cable into the deeply recessed jack. I guess I’ll talk to Peer1 about exchanging our rack for a different one.

So now my back is hurt again, and our server is mounted vertically.

Russian Roulette …with bombs.

How one software author’s unwise decision ruined my week.

Apologies for the long one, but it explains my lack of writing anything else this week.

Monday evening, as I was getting ready to take my youngest son out for what was likely to be his last Halloween “trick or treat” with his friends (he’ll be turning 12 in a little over a month, and 11 seems to be about the age when “kid stuff” starts losing its appeal), somebody pulled a trick on me that ruined my week.

Some history first though: we use a very nice mail server package called “Communigate Pro” from what used to be named “Stalker Software.” Communigate Pro (aka “CGP”) has a reputation for being fast, stable, and scalable. For the most part this has been true for us. We have had some issues with it, though, over the past four and a half years. We run CGP on several servers, since CGP has been used by several of the web hosting companies we have acquired over the years. The copy we bought for ourselves, though, has been the one that has caused us problems. It runs great for 50 weeks of the year, but for a week in August or September, and a week in December or January, it completely sucks rocks. The only way I can describe it is that interacting with CGP becomes like talking to a starfish.

I watched a show once that well illustrated at least one definition of the word “Relativity.” It showed how nature has made metabolism something of a clock, and that each species operates on a relative clock speed based on its metabolism. If you time-lapse film slow-metabolism creatures like starfish, and then speed the film up to “match” our metabolic rate… the starfish look very active… zipping about the ocean floor, preying on urchins and other shellfish. Amazing really. The same goes in the other direction: slow down film of a hummingbird and it starts looking like any other bird. I guess to a hummingbird, a human being looks like a starfish.

Well, for two weeks out of the year our CGP mail server’s metabolism slows to that of a starfish. It works, just at a truly GLACIAL pace. The server and operating system are fine (load is low, the machine is responsive at the console, shell commands are fine, go figure). This is obviously frustrating – for both us and our clients. The fact that it comes back like clockwork at certain times of the year is very odd. We eliminated all external causes (traffic, spam, etc.) and Stalker support spent hours and hours trying to figure out what was wrong. The only suggestion they could ever come up with was “put a faster filesystem under it.” This behavior appeared in whatever version of CGP we ran, and I’m pretty sure we tried them all, starting with 4.0.X and going all the way up to 4.2.X (and this week, 4.3.X… but we’ll cover that later), but they all had that odd metabolic time shift appear twice a year.
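
Aside: the “server fine, software glacial” distinction is easy to put numbers on. The sketch below is purely illustrative (the hostname is a placeholder): it compares the OS load average with how long the mail service takes just to answer a connection.

```python
#!/usr/bin/env python3
# Rough sketch: is the box slow, or just the mail software? Host is a placeholder.
import os
import poplib
import smtplib
import time

MAIL_HOST = "mail.example.com"   # hypothetical server name

def timed(label, fn):
    """Run fn() and report how long it took."""
    start = time.time()
    try:
        fn()
        status = "ok"
    except Exception as exc:
        status = "failed: %s" % exc
    print("%-12s %7.1f s  %s" % (label, time.time() - start, status))

if __name__ == "__main__":
    # OS-level health: 1-, 5-, and 15-minute load averages
    print("loadavg: %.2f %.2f %.2f" % os.getloadavg())

    # Application-level health: time-to-greeting for SMTP and POP3
    timed("SMTP banner", lambda: smtplib.SMTP(MAIL_HOST, 25, timeout=600).quit())
    timed("POP3 banner", lambda: poplib.POP3(MAIL_HOST, 110, timeout=600).quit())
```

A low load average next to a multi-minute banner time is exactly the starfish pattern described above.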

Putting a faster filesystem under it usually cleared up the problem. As did switching platforms. We started on FreeBSD, moved to OS X (better threading), then up to OS X Server (on an Xserve); we also jumped through all sorts of filesystem and bus technology switches: IDE to SCSI to various RAID setups, and eventually a 2Gb/s FibreChannel RAID array. Last summer, when the starfish returned, on a whim (well, not a whim really, more a blind rage and fit of frustration, since I wasn’t going to sink any more capital into filesystem improvements!!! Especially since they were seemingly NOT improving the situation!) I told my senior sysadmin to move the CGP directories to the internal IDE drive of the Xserve. Presto! The starfish vanished.

The server was back to its responsive, stable state. While I was happy about that, since our clients weren’t angry at us, I was LIVID because all those tens of thousands of dollars we’d spent on hardware had been a placebo cure for a real software problem. Stalker (now calling themselves “Communigate Systems”… aka CGS) had no explanation for this, and just sort of slunk away.

There is another significant wrinkle to this story, which explains why I was unable and unwilling to ride Stalker/CGS harder and force the issue to some sort of resolution. In November of 2004, CGS, née Stalker, made significant changes to their software licensing model and jacked their prices up to well over 5.5X their previous levels. Needless to say it was a shock to their customers. Prior to this date, their software was “expensive” but a relatively good value. (IIRC we paid between $8000 and $16,000 for our CGP licenses in 2000 and 2001.) Up until 2004 the core customers for Stalker were service providers such as ourselves. CGP had become something of a darling in the industry press for being a solid performer and a far better value than absurdly over-done and outrageously expensive “Messaging Platforms” such as Lotus Notes and Microsoft Exchange. I guess this attention went to the head of Stalker/CGS’ CEO and founder Vladimir Butenko, and he began transforming CGP into one of those over-done and outrageously expensive “Messaging Platforms”. Hey, in some ways I can’t blame the guy… his core market – ISPs – had gone from niche-market players to a total commodity market with NOBODY making very much money, if any. Just beyond his grasp, and seemingly within reach, was a cash-rich “Enterprise Market” with some dominant players showing real weakness. The astounding thing is the way he decided to get there: by actively pissing off his current customers and seeding them with confusion, fear and doubt.

The existing customers – all ISPs, schools, and small businesses – were angry. Stalker/CGS left no option for a “mail only” version (no calendaring, groupware, MAPI support, VOIP support, SIP/PBX functionality, etc.), and any continued use of anything other than the VERSION YOU ORIGINALLY BOUGHT would cost you a hefty sum in support and maintenance fees – 18% of the purchase price, which under the new pricing worked out to roughly what you had paid for the software in the first place (18% of a 5.5X price is essentially 100% of the old price). So it was like having to buy your software again every year. Customers were livid, and the Sturm und Drang on Stalker’s support mailing list was out of control. Butenko defended the new policies with characteristic Russian twisted logic and denial. I don’t know how to say “tough shit” in Russian, but that is what he did, albeit in far more diplomatic terms.

What he didn’t tell anyone at the time was that he ensured compliance with his new licensing scheme and inflated prices by inserting a “time bomb” into Communigate Pro. If your server thought it wasn’t properly licensed, it would cease to run at midnight UTC on some arbitrary date, and then, if re-launched, would shut itself down every 15 or 20 minutes thereafter. No warning. No coherent error code. No reason why. Bang. Boom. Off. Dead.

This was done without any announcement or warning. To add insult to injury, none of us customers had any idea which versions of Communigate Pro had the timebomb code in them or what the detonation dates were. It was truly “Russian Roulette.”

Up until 2005, the standard refrain from Stalker Tech Support for any issue was “Please Upgrade to the latest version of Communigate Pro.” The support and sales staff frequently touted the benefit of “free upgrades” of their software. You got your value and return on your initial investment by always being able to stay current and get your bug fixes. We had changed versions via upgrade countless times, as we obviously had at least ONE big ugly bug, which unfortunately was never fixed. I don’t recall what version of CGP we were running when the license change was announced, but I do know that in February of 2005, when the first of what I now assume are going to be many CGP timebombs exploded, we were running a version we apparently weren’t licensed for… despite the fact that we had probably upgraded to it two months before while troubleshooting our latest visit from the Communigate Pro Starfish Mode. CGP servers around the globe all blew up at midnight UTC on February 1st, 2005, including one of ours. Predictably, the CGP support mailing lists, newsgroups, etc. also exploded with angry, frustrated customers. I called the guy at Stalker we originally bought the software from and asked him flat out, “OK, tell me exactly what version of CGP we are allowed to run so that this timebomb won’t affect us again.” Bill, my senior sysadmin, downgraded us to that version on February 1st, and life went on.

Later in 2005 our CGP Starfish returned, and that is when we tried the “move to the internal IDE disk” trick, which worked. I had not paid Stalker that hefty price for support and maintenance (or, as they ironically call it in their emails to me, “S&M”), so I was in no position to demand that they admit this “starfish mode” bug exists and fix it. I was stuck at the version we were running in perpetuity. Such is the Kafka-esque world of software licensing. Instead I directed my staff to start evaluating alternatives to Communigate Pro. I didn’t want to be the victim of extortion to pay for the development of features for “Enterprise Customers” that we would NEVER use. Here is a great example: I was on the phone with a guy from Stalker/CGS and he was telling me how great their PBX/SIP/VOIP system was. I asked him, “How do our customers call us if the mail server goes down?” I was answered by a very long silence… followed eventually by “Hmmm… never thought of that.” SMTP/POP/IMAP/webmail… that is ALL I need, thank you. So we looked at the expanding pool of products that were filling the void being left by CGP as it ascended to “Enterprise” status. We had narrowed the field to a small handful by last week.

Then we lost at Russian Roulette again.

At 4pm PST on October 31st, which is midnight UTC, three of the four Communigate Pro servers at our facility exploded. Their timebombs went off and they all shut themselves down. My wife had to fill in for me as the Halloween driver (we live in a rural area, so I had planned on taking my son and a few of his friends into town for trick-or-treating). I spent the night hunched over my keyboard and on my VOIP phone (thankfully we don’t use Communigate Pro for our VOIP needs!) to my office, dealing with the crisis. Based on past events, we very quickly came to the conclusion that it was the infamous Communigate Pro Time Bomb and not some other issue, since it happened at precisely the same time on more than one server, and we were not the only ones it was happening to. (Stalker’s mailing list, which is viewable on the web, was also exploding with angry customers.) To get us through the night we rolled the clocks back on the CGP servers and restarted them. In the morning we started the work of figuring out how to deal with this. I emailed Stalker trying to find out why, when they had told us that THIS version was OK for us, it still timebombed. I posted, and replied to others’ postings, on the CGP mailing list, but my account was in “moderated” mode, and the moderator was obviously not paying attention (easy to do, as that is a significant weakness of the CGP LIST module). Vladimir Butenko appeared on the list, once again with his twisted Russian logic, saying essentially ‘there is no timebomb, and besides, you must be stealing my software since your server stopped working.’ Not exactly a confidence- or trust-building exercise in customer relations there, Vlad.

After careful reading of the CGP website, I finally decided that our only course of action was to downgrade to version 4.1.8, which seems to be the last of the “free upgrades” and should run on our license key obtained in 2000. Bill figured he could downgrade the software, and restart the CGP service without causing much disruption to our clients. 4.1.8 went on, we restarted, and suddenly, without warning…

The Starfish Returns!

Our mail server software is once again moving at the speed of a quaalude-soaked starfish taking a leisurely creep across the ocean floor. It is seven weeks early, but the starfish is back… with a vengeance!

Great. Just what we need. A software vendor extorting us on one side, and clients angry at us for under-performing software on the other. My loyalty is with my clients, not the bastard holding the gun to my head, or the timebomb on my server as the case may be. I rallied the staff and rolled out a plan: we’ll build a new server from scratch, install a fresh OS and a fresh copy of CGP 4.1.8 on it, move the data over, and cut over the IP address. Based on our past experience, this should outwit the Starfish!

Thankfully a customer had just decommissioned a very nice dual-CPU/dual-core Intel server with a built-in Ultra-SCSI RAID system, and we made him an offer on it that he accepted. The only problem with it was that the drives inside were low-capacity. Fortunately we have stacks of Sun StorEdge arrays in our backup system that were sitting idle, so we ripped six 36GB LVD Ultra-SCSI drives out of one, packed them into the server, installed FreeBSD on it, and started rsync over a crossover cable between it and our production mail server. Oddly enough this went pretty fast; despite CGP being in “Starfish Mode,” the OS and filesystem were thankfully quite responsive. System load went from 0.10 to 0.34 on the production server while we were syncing… while talking to the Starfish remained unbearably slow. For example, CGP’s web UI would take 15 minutes to get from page to page.
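
Aside: the data move itself was nothing exotic, just rsync. Something along the lines of the wrapper below captures the idea; the host, path, and flags are assumptions for illustration, not the exact invocation used.

```python
#!/usr/bin/env python3
# Illustrative rsync wrapper -- the host, path, and flags are assumptions,
# not the exact invocation used.
import subprocess
import sys

SRC = "root@10.0.0.1:/var/CommuniGate/"   # hypothetical: production box over the crossover link
DST = "/var/CommuniGate/"                 # hypothetical: same path on the new FreeBSD box

def sync_mail_store():
    """Mirror the mail store, preserving ownership, permissions, and links."""
    cmd = [
        "rsync",
        "-aH",            # archive mode, plus preserve hard links
        "--delete",       # make the destination an exact mirror of the source
        "--numeric-ids",  # don't remap UIDs/GIDs between the two systems
        SRC, DST,
    ]
    rc = subprocess.call(cmd)
    if rc != 0:
        sys.exit("rsync exited with status %d" % rc)

if __name__ == "__main__":
    sync_mail_store()
```

Typically you run a pass like this while the old server is still live, then one final pass with the mail service stopped right before the cutover, so only the deltas have to move.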

We cut over to the fresh box at around midnight on Tuesday/Wednesday, and things seemed ‘OK’… instead of talking to a starfish, it felt like talking to a sleepy dog. Movement was perceptible, but not exactly as swift as we had hoped. In past experience, “starfish mode” would improve to reasonable performance in the wee hours of the night when the server was under a lighter mail load. Since I was staying in my office and had nothing else to do, I vented about the situation to my online friends and discussed it via phone with Russ Pagenkopf, the guy I run the Mac-Mgrs list with… a list ironically running on a Stalker-donated copy of CGP, which, quite ironically, had also timebombed! Russ and I decided to stop running CGP on the Mac-Mgrs list server as soon as possible, and once he had it running again I posted to the list about that. I also answered people who were angry on the CGP list about what was going on with us, and some of them relayed to that list what I had said, both to them and on Mac-Mgrs. The PR backlash at Stalker/CGS was gaining momentum. I think I managed to get about 3 hours of sleep that night.

Sure enough, come Wednesday morning east-coast business hours, our main server was back to moving like a starfish. I left my staff to handle the angry clients while I swallowed my anger and called Stalker/CGS for tech support. I didn’t expect much, but luck was on my side and by chance a Director-level employee answered the phone. (When our tech support queue gets busy, I pick up the phone too!) I explained our situation with CGP 4.1.8 doing this “glacial slowdown” thing (I haven’t called it “starfish mode” with anyone at Stalker/CGS to date). I asked him if my long-time contact was there, and he said, “yes, he just walked into the office,” so I said to catch up with him, since he knew the full history of this almost five-year-old problem and I didn’t have the energy to relate it to him. After a few hours of troubleshooting (it took me 55 minutes just to get into the UI to change a password so Stalker support could access the server) I got a call from them. Three people, all director-level folks at Stalker/CGS, were on the phone and making me an offer: they would give us a 90-day license for CGP 4.3.9 to let us load that one up and see if it would fix the “Starfish Mode” bug. I was too exhausted to say anything but “it is worth a try”…. They promised me quotes for extending the 90-day license within a day.

License keys in hand, I woke up Bill, our over-worked and under-slept senior sysadmin, and had him install the 4.3.9 version on our creeping starfish of a server and restart… it seemed OK for about 30 seconds, then immediately tailspun back down into starfish mode once again. It is obvious that whatever this bug is, it has never been adequately addressed by Stalker’s coders and remains embedded deep within the current version, and probably in upcoming ones as well. The Stalker support guys were stumped, and fell back into random-mode troubleshooting again, suggesting courses of action that were either impossible to perform on such a slow-moving system, or things they had suggested in the past – which we knew would not work.

I had a plan. It was a total “Hail Mary” play, but similar stunts had worked for us in the past with the Starfish: nuke the box we had been running the mail server on just days before… before the software timebomb exploded. Do a fresh install of this CGP upgrade, move the data over to it, and cut over again. This may sound like what we had just tried, and it does. Meanwhile I talked to MY director-level guy and said, wherever we are with the proposed new mail system roll-out, hit the gas pedal and get ready to install and ramp it up ASAP! He brought me POs for gear and software, and I signed them. I wrote an apology to our clients about the situation and posted it to our website. I grabbed my laptop and left my office for the first time in almost three days to get some fresh air and food. I took the laptop because open wireless networks seem to be everywhere now, so if they needed me at the office I could probably get on AIM or whatnot easily.

Bill finished the install and rsync work, and we cut over to the “old” mailserver around 5 PM PST on Wednesday and….

It worked. The starfish was back in hibernation once again, and the server was behaving “normally.”

I finished up some client communications, and basically passed out on my office couch a few hours later. I slept 12 hours straight.

So, at the moment I have 90 days to get a better mail system rolled out and running. I think we can get that done. We’ll probably build a fresh, old CGP 4.1.8 system to leave behind for any clients that can’t or won’t move to the new system, so we’ll stay in compliance with Stalker/CGS’ looney license scheme and perpetually avoid the Russian Roulette of the software timebombs present in CGP 4.2.X and who knows what subsequent versions. We’ll probably NEVER get a satisfactory answer about the causes of, or real cures for, Communigate Pro’s “Starfish Mode”… but here is my hope:

Someday, it will return. Not to *our* server, but to one of those “Enterprise Customers” that Stalker/CGS so desperately wants to trade their current customers for. Some multi-million-dollar CGP “Messaging Platform” cluster installation. They’ll have hundreds of thousands of dollars invested in hardware, and of course CGP software. Their mighty cluster will slow to an inexplicable crawl. They’ll spend massive amounts of time, and eventually money, trying to cure it. Vladimir will log into it and tell them “put a faster filesystem under it,” so they’ll blow wads and wads of cash on exotic SAN architectures or the like. VP-level guys like me will lose sleep, and in-the-trenches guys will lose even more, trying to fix the problem of wrestling with a starfish. Then some geek in the organization will be google-surfing phrases like “CGP slow” or “glacial communigate” and stumble upon this blog entry from who knows how many years past. He’ll pass it up the chain, and somebody will gather up the guts to call me. I’ll chuckle and say “You spent HOW much money to buy this software from these idiots? What, are you NUTS?”

There, I just saved you the phone call.