
Author Profile - Scott Turkow has 8 years of experience in the Enterprise Software space, primarily in Operations and Sales Ops roles. Scott is the Senior Operations Manager at Integrien Corporation, the leading intelligent systems management company that enables the predictable operation of mission-critical applications. Prior to Integrien, Scott was with the Resource Management Software Group of EMC, which focused on the development and sale of automated network management products. A triathlete in training, Scott tries to be outdoors when he’s unshackled from his computer.
I hope you had your morning coffee; I’m going to get into some heavy stuff right at the top. Ready? Uptime is a measure of the time a particular computer system has been “up and running.” Now stay with me here. Uptime is the opposite of downtime, which is when a system is not operational… Whew, wipe that sweat from your brow and give yourself a round of applause, you’ve survived my non-teachings for the day.
Uptime is typically measured in nines: 3 nines give you 99.9% availability, or about 8 hours and 46 minutes of downtime a year. The “gold standard” is 5 nines - 99.999% availability, which translates to a total downtime of just over 5 minutes per year. Gold standard uptime is usually reserved for telephone service, internet access, banking systems, and cable TV. And then there’s 6 nines, for military systems – not so fun when your defense system goes down, but even the military will accept 31.5 seconds of downtime per year.
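For anyone who wants to check the arithmetic behind the nines, here is a minimal Python sketch. It assumes a 365-day year, and the function names are made up for this illustration, not taken from any particular tool.

```python
# Yearly downtime budget for a given number of nines.
# Assumption: a 365-day (non-leap) year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def downtime_seconds_per_year(nines: int) -> float:
    """Return the allowed downtime per year, in seconds, for `nines` nines.

    3 nines means 99.9% availability, so 1/1000 of the year may be downtime.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return SECONDS_PER_YEAR / 10 ** nines


def format_duration(seconds: float) -> str:
    """Format a duration as hours, minutes and seconds (rounded to 1 s)."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(f"{n} nines: {format_duration(downtime_seconds_per_year(n))}")
```

Three nines come out to 8h 45m 36s per year, five nines to roughly 5m 15s, and six nines to about 31.5 seconds.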
Clearly the more 9’s the better, but there is a “cost per 9.” Since massive redundancy is required to support more reliable systems, the cost/value balance typically falls in favor of being down for an expected number of hours and minutes. This is the “good enough” approach.
So when we see a bunch of 9’s with a % following, we say “gee, there’s a whole bunch of 9’s, that’s as good as it gets.” And when downtime takes hold, everyone seems willing to grin and bear it. It’s part of the routine, and when a primary app goes down during peak usage we say – “gee, this must be that 0.1% they couldn’t figure out.”
After that exhaustive lead-in, on to my question - when is “Good Enough” not enough?
In short, when even a minute of downtime can equate to millions of dollars in lost revenue, embarrassing headlines, and several percentage points on the Dow. When you’re head of LAX customs and a single computer crash causes hours of delays for more than 17,000 airline passengers. When you’re a product manager for Google Apps and you have to credit users in an attempt to repair your product’s reputation.
Companies have long sought solutions to make their systems highly available; all have made purchases that get them closer to the kingdom of the 9’s. A far-off, magical place where downtime is no more than the 0.01-second difference between a gold medal winner (hero celebrated for decades on the cover of magazines and cereal boxes) and dead last (never heard of him) in a 100m Olympic sprint.
Even so, major websites, ecommerce apps, and email services go down. So why do we demand so little when it comes to business service performance and availability? And why do we assume 99.9% means something?
Most companies improve availability by implementing fault-tolerant mechanisms that mask or minimize the impact of failures in a service’s components and dependencies. In this method, fault tolerance is achieved by adding redundancy to components that are single points of failure. This is fine, if you’re still willing to accept downtime. For example, eBay is happy with its push for 99.9% uptime, but take a look at the 3-hour outage in June of 2007. You can’t tell me that didn’t make a few miniature doll house collectors angry (as angry as doll house collectors can get anyway) when they couldn’t make that extra bid on the hand-built 19th century Victorian reproduction (I almost had it, only 30 minutes of bidding left – then blackout).
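To put a number on what redundancy buys you, here is a rough Python sketch of the standard formula for redundant copies, where the service stays up as long as at least one copy is up. It assumes the copies fail independently, which real deployments rarely guarantee (shared power, shared network, shared bugs), so treat the result as a best case. The function name is hypothetical.

```python
def parallel_availability(component: float, copies: int) -> float:
    """Availability of `copies` redundant components when any one suffices.

    Assumption: failures are independent. The group is down only when
    every copy is down at the same time, i.e. (1 - component) ** copies.
    """
    if not 0.0 <= component <= 1.0:
        raise ValueError("component availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component) ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} x 99% component: {parallel_availability(0.99, n):.6%}")
```

Under that independence assumption, two copies of a 99% component reach about 99.99%. Each extra 9 costs another full copy, which is the “cost per 9” in practice.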
Why is it that when I talk to companies that have invested heavily in the idea of 99.9% (or greater) availability – with redundancies built on redundancies and dozens of monitors polling at 5-minute intervals – I still get someone asking the question: How can our applications be down if our servers, networks and software are all operating at 99.9% availability?
Let’s just think about the definition of uptime itself. If "up" means your server is running, even though your application is "down", your uptime stats take on a whole new significance - or lack thereof.
Most concerning are availability SLAs that are either not tied directly to business services or incorrectly assume that server uptime equals application availability. The threat to the availability of websites and applications is a breakdown in the dependencies between the components of a mixed-technology system. Most information technologies are individually reliable, but they depend on each other to perform. When handoffs fail or bottlenecks develop between software and hardware, networks and servers, or web services and the applications relying on them, services go down. From ERP to eCommerce, these interruptions put revenue and reputation at risk.
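The dependency problem shows up in simple arithmetic. When a business service needs every component in a chain to be up at once, the best-case availability of the service is the product of the component availabilities. The Python sketch below assumes independent failures and a purely serial chain; the five-component example and the function names are illustrative assumptions, not measurements from any real system.

```python
from math import prod
from typing import Iterable

HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def serial_availability(components: Iterable[float]) -> float:
    """Availability of a service that needs every component up at once.

    Assumption: component failures are independent, so the service is up
    only when all components are up, which is the product of the parts.
    """
    return prod(components)


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime, in hours, for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical chain: web server, app server, database, network, storage.
    chain = [0.999] * 5
    service = serial_availability(chain)
    print(f"service availability: {service:.4%}")
    print(f"expected downtime: {downtime_hours_per_year(service):.1f} h/year")
```

Five components at 99.9% each yield a service at roughly 99.5%, or close to 44 hours of expected downtime a year, even though every individual report still shows three nines.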
Business services, and the infrastructure that supports them, have become increasingly complex. Many companies that do not tie SLAs directly to business service availability instead monitor siloed segments of a broad and complex IT environment. This method does not provide a complete understanding of the health of the business service itself; it gives only a pin-hole view of a massive landscape. Many companies think that if they have a bunch of pin-holes (dozens of monitors) they’ll bring the picture into view. Unfortunately, traditional silo-based monitoring systems can’t provide a view into the health of the business service because they can’t automate (correlate) what they don’t understand (business service behavior). Hence, no true 99.9% availability.
What scares me is that some IT groups attempt to improve SLAs and their understanding of business service behavior by manual means. They manually search through large numbers of events to try to determine the abnormal precursors to problems that affect their major business services. If your environment is any bigger than 50 servers, it can’t be done via manual methods (cost-effectively, anyway).
Availability from a Business Perspective
Fortunately, solutions are available - solutions that automate the entire process. This software learns the normal behavior of every metric collected through your monitoring infrastructure, even accounting for seasonal and business changes automatically. In addition, it automatically correlates metrics and alerts and produces problem models that allow your team to be more proactive and actually get ahead of problems before they affect end users.
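To make “learning normal behavior” a little more concrete, here is a deliberately simple Python sketch. It is not how any commercial product works; it swaps in a basic per-hour mean and standard deviation as a stand-in for a learned, seasonal baseline. The function names, the sample data, and the 3-sigma threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, Iterable, Tuple

Baseline = Dict[int, Tuple[float, float]]


def learn_baseline(samples: Iterable[Tuple[int, float]]) -> Baseline:
    """Learn a (mean, stdev) pair per hour of day from (hour, value) samples.

    Keeping one baseline per hour is a crude way to account for daily
    seasonality: what is normal at 9 a.m. may be abnormal at 3 a.m.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    # stdev needs at least two data points, so skip hours with fewer.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_abnormal(baseline: Baseline, hour: int, value: float,
                threshold: float = 3.0) -> bool:
    """Flag a value more than `threshold` deviations from that hour's mean."""
    if hour not in baseline:
        return False  # no history for this hour, nothing to compare against
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) > threshold * sigma


if __name__ == "__main__":
    # Hypothetical request rates: busy at 9 a.m., quiet at 3 a.m.
    history = [(9, 100), (9, 110), (9, 90), (3, 10), (3, 12), (3, 8)]
    baseline = learn_baseline(history)
    print(is_abnormal(baseline, 9, 105))  # ordinary morning load
    print(is_abnormal(baseline, 3, 100))  # same load in the middle of the night
```

The same reading can be routine at one hour and a warning sign at another, which is why a fixed threshold per monitor tends to either miss problems or bury the team in false alarms.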
Sounds like nirvana, but software like this is already available and installed at some of the world’s biggest companies – none of which were mentioned in this article. If any of this speaks to you, it may be worth your while to review products that look at the business service as a whole, with the prime goal of ensuring availability, not just 99.9% of the time, but 100% of the time. Granted, problems are inevitable, but they don’t have to impact your business services. Good enough is only good enough until you have that significant event that makes you rethink everything. Then you spend more trying to incrementally improve your 9’s when you should be grabbing hold of 100%.








