Trust. It’s important. And a company typically places a huge amount of trust in its IT department, particularly if (as is often the case today) that IT department is responsible for the operation of systems (such as web sites) that contribute significantly to the bottom line.
I’ve been at the helm of several IT departments that operated the product of the company: social networking or other web sites. And that’s when I learned how vital it is to “air the dirty laundry” internally throughout the company about the stability of the key systems. Too often, I have seen IT staffs that were reluctant to announce that the system was down, in the hopes that it wouldn’t be such a big deal if they could only get it fixed quickly. Of course, often a quick fix wasn’t in the cards, and users started to a) notice the outage; and b) wonder if anyone in IT was even aware of it.
My response to this is to insist on the airing of dirty laundry. Less euphemistically: timely, comprehensive, broad-audience notification of all system outages and impairments. Specifically, emphasize the following:
- Point out to everyone that broad public visibility of every blemish is what most effectively leads to a dogged “no tolerance” attitude (on everyone’s part) towards outages. Sometimes that visibility exposes plain old human error, but more often it exposes areas where we’ve (collectively) under-resourced or failed to plan appropriately.
- If few people are aware of the outages, there will be less understanding of why the “fix the root cause” action is needed, and that action may involve money or opportunity cost.
- “Out of sight, out of mind,” as the saying goes. Exposing the pain actually helps to get the pain fixed. Pretending it’s not there, for whatever reasons, tends to allow the pain to continue.
- Asking the operations group to adhere closely to a strict outage notification template encourages that team not only to provide full information, but also to actively assess customer impact, root cause, and process improvement for each and every outage.
- Every outage is a huge opportunity to identify the root cause, fix it, and lower the likelihood of that outage occurring again.
- Recording outages is the first step to tracking outages, discerning patterns, and ultimately fixing things that might get lost in the shuffle.
But why, you may wonder, is there reluctance to do this? Aside from the sheer work pressure (yet one more thing to have to do while embroiled in a crisis), it’s because no one really likes to announce their own failures to a broad audience. IT personnel hear (and can even intellectually agree with) all of the above reasons for full disclosure on outages, yet they often persist in, well, sweeping them under the carpet. It’s time for you, the IT leader, to be directive, and to put the following sort of written policy into effect:
Policy:
All unplanned outages must be reported immediately to the ‘outages’ email distribution list. The communication should provide the information listed below in this document. If not all of the required information is available at the time the outage message is sent, indicate that a follow-up message, providing all of the required information, will be sent within 4 hours during normal business hours (M-F, 8-5). If an outage is not resolved within 30 minutes, a follow-up communication must be sent with an updated status. Such updates must continue every 30 minutes until the outage is resolved.
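To make the 30-minute update rule concrete, here is a minimal sketch of how the due times for status updates might be computed. It rests on assumptions that the policy itself does not spell out: the function name `status_update_times` is hypothetical, and the clock is assumed to start at the beginning of the outage rather than at the first notification.

```python
from datetime import datetime, timedelta

# Interval taken from the policy text: an update every 30 minutes.
UPDATE_INTERVAL = timedelta(minutes=30)


def status_update_times(outage_start, resolved_at):
    """Return the times at which follow-up status messages were due.

    Assumption: the 30-minute clock starts when the outage begins.
    An update is due at each 30-minute mark at which the outage is
    still unresolved; an outage fixed within 30 minutes needs none.
    """
    due = []
    next_due = outage_start + UPDATE_INTERVAL
    while next_due < resolved_at:
        due.append(next_due)
        next_due += UPDATE_INTERVAL
    return due


# Example: an outage from 09:00 to 10:10 needs updates at 09:30 and 10:00.
start = datetime(2024, 1, 8, 9, 0)
updates = status_update_times(start, datetime(2024, 1, 8, 10, 10))
```

In practice the resolution time is not known in advance, so a real tool would more likely schedule the next reminder each time an update goes out; the sketch only illustrates what the rule demands.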
Unplanned Outage Communication Format:
- Issue/Action: What is (or will be) down (from a user perspective), when the outage began, and for how long, or an estimated restore time if the service is still down.
- Customer Impact: What was/will be noticed by the customer.
- Internal Impact: What was/will be noticed by internal personnel.
- Resolution: What was done to restore the service.
- Root Cause: What was the true cause of the problem. This should be identified whenever possible, as soon as possible.
- Preventative Measures / Process Improvements: What steps are being taken to ensure this type of problem does not recur, or to allow quicker response if it does.
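If your team sends these messages from a script or a ticketing tool, the format above could be captured in a small structure so that no field gets skipped. What follows is only a sketch under stated assumptions: the `OutageNotice` class, the `PENDING` placeholder text, and the sample outage are all hypothetical, and only the field labels come from the format itself.

```python
from dataclasses import dataclass

# Hypothetical placeholder for fields not yet known when the first message goes out.
PENDING = "To follow"


@dataclass
class OutageNotice:
    """One unplanned-outage message, with one field per item in the format."""
    issue_action: str
    customer_impact: str
    internal_impact: str
    resolution: str = PENDING
    root_cause: str = PENDING
    preventative_measures: str = PENDING

    def is_complete(self) -> bool:
        # Complete only once every later-stage field has been filled in.
        later_fields = (self.resolution, self.root_cause, self.preventative_measures)
        return PENDING not in later_fields

    def render(self) -> str:
        lines = [
            f"Issue/Action: {self.issue_action}",
            f"Customer Impact: {self.customer_impact}",
            f"Internal Impact: {self.internal_impact}",
            f"Resolution: {self.resolution}",
            f"Root Cause: {self.root_cause}",
            f"Preventative Measures / Process Improvements: {self.preventative_measures}",
        ]
        if not self.is_complete():
            # The policy asks for an explicit promise of a follow-up message.
            lines.append(
                "A follow-up message with the remaining information will be "
                "sent within 4 hours during normal business hours."
            )
        return "\n".join(lines)


# Made-up example of a first message sent while the outage is ongoing.
notice = OutageNotice(
    issue_action="Web site login unavailable since 09:00; restore time unknown",
    customer_impact="Users cannot log in",
    internal_impact="Support staff cannot view customer accounts",
)
message = notice.render()
```

Making the first three fields mandatory and the last three optional mirrors the policy: you must say what is down and who is affected right away, while resolution, root cause, and preventative measures can arrive in the follow-up.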
Once you put this policy into effect, expect to spend a fair amount of time fine-tuning the communications. The most common problems you’ll see are difficulty in keeping the language business-oriented (e.g., don’t do things like include traceback logs!) and difficulty in drilling down to the true root cause of the problem. Equally, expect at some point to hear the observation from your peers that “gee, the system seems to be down a lot more lately.” Increased visibility of system problems can get interpreted as an increase in frequency. You’ll need to stave this off from the start, by making sure that your business stakeholders are aware of what you’re mandating and why.
It’ll potentially be a battle to recalibrate your team’s mindset towards greater notification, but it’ll be worth it in the long run. Remember what matters: focus everyone on the search for root cause and identification of process improvement on every outage.