Hope Is Not Enough (by Scott Turkow)

Author Profile - Scott Turkow has 8 years of experience in the Enterprise Software space, primarily in Operations and Sales Ops roles. Scott is the Senior Operations Manager at Integrien Corporation, the leading intelligent systems management company that enables the predictable operation of mission-critical applications. Prior to Integrien, Scott was with the Resource Management Software Group of EMC, which focused on the development and sale of automated network management products. A triathlete in training, Scott tries to be outdoors when he’s unshackled from his computer.
What would you do if your company’s most vital business service went down?
(A) Freak out, pull the fire alarm and run for the high hills.
(B) Calmly inform the executive team that it was a one-time event, an oddity, a once-in-a-lifetime occurrence that will never happen again because you and your team fixed it for good. They needn’t worry about customer complaints over service availability ever again.
(C) Confident in your ability to recruit superhuman IT professionals, the type that can map the human genome with little more than their brains, two hours, a pad, and a pencil, you calculate the additional headcount you think you need to speed problem identification and resolution in the future.
(D) Pour more money into your current system-monitoring tool set. You assume the siloed tools that are not designed to prevent problems, the same tools that gave you hundreds of alarms to sort through to “help” identify the problem, are your best option to invest in for your company’s future of improving application performance and availability.
(E) Stop and think. You realize your IT environment is not a simple plug-and-play shop (although all users seem to think so). IT complexity continues to compound and you need to find a new approach to managing critical business services.
If you’re an avid reader of LoveMyTool, you probably (hopefully) answered E. If you selected another answer, please start blogging immediately. I’d love to read how it all works out. Natural disaster films are big in Hollywood, so why can’t the tech industry feed our rubbernecking appetites? Or will it? And will you be looking away when that next app goes down?
Reading through this site, it’s clear that companies of all sizes and across many industries are focused on improving, or at least maintaining, business service performance and availability. Quotes from Commerzbank, Crayola, and Texas Instruments highlight this fact. However, a large majority of companies have not caught on.
In just the last year, several notable organizations experienced application and system outages despite having a solid team and monitoring system in place. These organizations pride themselves on satisfying their customers’ needs, but in the face of unforeseen issues associated with complex IT systems, they were not able to prevent these outages from occurring:
- Federal Aviation Administration (June 2007) – A cascading computer failure in the air-traffic control system caused severe flight delays and cancellations along the East Coast on June 8th. In response to the failure, the FAA rerouted the system’s functions to another computer in Salt Lake City which overloaded because of the increased volume of data. At New York's LaGuardia Airport, passengers experienced an average delay of four hours. If you’ve been to LaGuardia, you understand how painful this must have been for passengers.
- Research In Motion (2007 & 2008) – RIM’s BlackBerry wireless e-mail service has suffered at least three disruptions this year: on January 31st, February 10th, and February 20th. And don’t forget the lengthy outage in April of 2007. Your thumbs are thankful for the rest, I’m sure.
- XM Satellite Radio (May 2007) – Service was down for nearly two days after a problem occurred during the loading of software to a critical component of XM’s satellite broadcast system, which resulted in the loss of one of its satellites. The Navy is still trying to convince XM that it would be fun to fire a missile at the satellite.
- Google (March 2007) – Google Apps’ Gmail service, which includes a 99.99 percent uptime commitment, suffered significant availability problems that were not declared officially solved for all users until the next day. Kind of makes me wonder what 99.99% equals these days.
- NYSE (February 2007) – A massive stock sell-off resulted in a higher volume of trades than usual, which put a strain on NYSE's electronic trading system and caused a significant slowdown. As a result, the extra time traders took to complete transactions translated to billions of dollars of lost stock wealth.
These are just a handful of the high-visibility IT outages over the past year. If these companies, with all their resources and large IT budgets, are having trouble, what are you guys and gals doing with smaller budgets and fewer resources (both of which are probably shrinking)?
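As a quick aside on the 99.99 percent figure quoted in the Google item above, here is a rough back-of-the-envelope sketch of how little downtime that number leaves room for. It assumes the commitment is measured over a plain calendar window (a 30-day month or a 365-day year); the actual measurement terms of any vendor's agreement are not covered in this post, and the function name is purely illustrative.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted in a period at a given uptime percentage.

    Assumption: the period is a simple calendar window, which may not match
    how a real service agreement defines its measurement interval.
    """
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Hypothetical windows chosen for illustration only.
    for label, days in [("30-day month", 30), ("365-day year", 365)]:
        minutes = allowed_downtime_minutes(99.99, days)
        print(f"99.99% over a {label}: about {minutes:.1f} minutes of downtime")
```

Under those assumptions, "four nines" works out to a little over four minutes a month, or just under an hour a year, so an outage that lingers into the next day blows through the budget many times over.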
Sure, I’ve thrown a few clever comments into my writing, simply to keep you reading (please, no comments on that statement; allow me to keep believing that I’m at least mildly clever). Fact is, IT environments are becoming increasingly complex. System and application outages continue to occur and can result in a loss of revenue and a decrease in customer satisfaction. Yet it seems as if very few companies are taking a new approach to address this issue.
With that said, based on what I read on LoveMyTool, I’m optimistic others will catch on. It’s good to see smart companies looking at innovative software to solve their complexity issues. For the rest of you: identify your closest fire alarm and lace up those shoes. There’s a hilltop with your name on it.
