Once, when I was but a wee IT Professional, I visited the company data center to perform some standard maintenance on one of the servers. In preparation, I brought over the “crash cart”. It was a standard crash cart with a monitor, a keyboard, and a mouse.
I plugged the monitor into the rack’s (very old) UPS so I could get started. My heart stopped. The whole thing had gone dark, and I could hear the internal fans slowing to a stop.
I had just, unwittingly, overloaded the UPS. The entire network was out of commission for the rest of the day, and well into the evening.
I unplugged the monitor and reset the UPS. I was so relieved when the lights blinked on and the fans spun back up. Then I got started on the long and painful task of bringing everything back up.
The worst part was that this was only my first week on the job.
When I finally made it back to the office a few hours later, the company’s CIO was irate. That was the first and only time that I’ve had a boss yell and swear at me while literally banging his hands on the table. It was intense. I was sure I was going to get fired. And yet…
The next day, he was let go. And I was not. In fact, I received compliments for my speed and dedication in getting everything back up so “quickly”.
For a while I felt bad, and didn’t really understand what had happened. After all, I was the one who’d plugged in that damn monitor and caused the massive network disruption.
Later on I realized that that wasn’t the real issue.
You see, the UPS that I’d plugged into was old enough that there were no management interfaces or visual displays. There was no way to track the load that the UPS was under, and the company’s unspoken “policy” had been to continue adding devices, assuming that the UPS could handle as many devices as they could fit.
There were a lot of managerial mistakes made, but one of the significant ones was a lack of appropriate risk management.
Understanding Risk and Developing a Risk Register
What are risks? They are potential events, with some probability of occurring, that would have a negative impact on your project, initiative, department, or organization.
There are a number of ways that you can deal with risks. One of them is to record each risk in a Risk Register, which is the subject of this post.
The Risk Register is a document that outlines various elements of a risk as well as your planned response. Specifically:
- A description of the risk
- The potential impact of the risk
- The likelihood of occurrence
- The degree to which you would be impacted
- A specific action that could trigger the risk
- The team member responsible for addressing or responding to the risk
- The response plan
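To make the list above concrete, here is a minimal sketch of a single register entry as a Python dataclass. The class name, the field names, and the sample values (including the Medium/High ratings, the owner, and the response steps) are my own illustrative assumptions, not a standard format; only the shorthand name and description come from the examples in this post.

```python
from dataclasses import dataclass, field


@dataclass
class RiskEntry:
    """One entry in a risk register. Field names are illustrative assumptions."""
    name: str                  # shorthand name of the risk
    description: str           # detailed description
    impact: str                # processes, functions, stakeholders affected
    likelihood: str            # e.g. "Low", "Medium", "High"
    degree: str                # how bad it would be, on the same scale
    triggers: list[str] = field(default_factory=list)       # warning signs
    owner: str = "unassigned"                                # who responds
    response_plan: list[str] = field(default_factory=list)  # step-by-step actions


# Sample entry; the ratings, owner, and steps are made up for illustration.
ups_overload = RiskEntry(
    name="Data center power outage",
    description="The data center servers and network devices lose power "
                "due to a localized event.",
    impact="Every service hosted in the rack is unavailable until power returns.",
    likelihood="Medium",
    degree="High",
    triggers=["UPS load approaches its rated capacity"],
    owner="Data center administrator",
    response_plan=[
        "Unplug the most recently added device",
        "Reset the UPS",
        "Bring services back up in dependency order",
    ],
)
```

A spreadsheet with one column per field works just as well. The point is that every risk gets the same set of fields, so none of them is silently skipped.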
A Description of the Risk
Describing the risk can be in two parts:
- A shorthand name (e.g. “Data center power outage”)
- A detailed description (e.g. “The data center servers and network devices lose power due to a localized event.”)
The Potential Impact of the Risk
For this section, you’ll want a detailed description of which processes, functions, projects, stakeholders, etc. will be affected if the risk comes to fruition. For example, a loss of an essential server may impact payroll, HR, or billing processes. These should be enumerated in as much detail as possible.
The Likelihood of Occurrence
When estimating the likelihood of a risk’s occurrence, you can use a qualitative measure (e.g. Low, Medium, High) or a quantitative one (e.g. 25% chance). Choose whichever measure you can estimate accurately in your context.
The Degree to Which You Would be Impacted
The degree to which you would be impacted states how “bad” it would be. For example, having an essential server go down unexpectedly would probably have a much greater impact than a printer running out of paper.
This too can be measured qualitatively or quantitatively. When measuring quantitatively you may use a number of metrics. The most popular metric is probably an associated dollar cost.
The Likelihood of Occurrence and Degree of Impact measures can be combined to help you sort risks. You can create a two-dimensional matrix to help you sort from High-Likelihood and High-Degree to Low-Likelihood and Low-Degree. Doing this helps you prioritize and determine your risk mitigation strategies.
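As a rough sketch of how the two measures might be combined, the snippet below scores risks both ways. Multiplying ordinal levels (Low = 1, Medium = 2, High = 3) is one common convention that I am assuming here, not the only option. The quantitative version multiplies probability by dollar cost. The ratings, the sample risks, and the dollar figure are hypothetical.

```python
# Assumed ordinal scale for the qualitative matrix.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}


def priority_score(likelihood: str, degree: str) -> int:
    """Qualitative score: position in the likelihood x degree matrix."""
    return LEVELS[likelihood] * LEVELS[degree]


def expected_cost(probability: float, cost_if_realized: float) -> float:
    """Quantitative score: probability of occurrence times dollar impact."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * cost_if_realized


# (name, likelihood, degree); all ratings are made up for illustration.
risks = [
    ("Printer runs out of paper", "High", "Low"),
    ("Essential server goes down", "Medium", "High"),
    ("Data center power outage", "Low", "High"),
]

# Highest-priority risks first.
ranked = sorted(risks, key=lambda r: priority_score(r[1], r[2]), reverse=True)

for name, likelihood, degree in ranked:
    print(f"{priority_score(likelihood, degree)}  {name}")

# A 25% chance of a hypothetical $40,000 outage.
print(expected_cost(0.25, 40_000))
```

Either score gives you a single number to sort by. The qualitative one is quicker to fill in, while the dollar figure is easier to defend in a budget conversation.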
The Specific Action(s) that Could Trigger the Risk
This section should answer the questions:
- How do we know that the risk is about to be realized?
- How do we know that the risk has been realized?
Part of the problem with the “risk mitigation” story at the beginning of this post was that the company did not have any warning systems in place. There was no way to tell when the UPS was reaching capacity, much less a system for proactively alerting administrators to the fact.
A proper Risk Management strategy would have had those systems established, and used the reporting and alerting mechanisms to answer the two questions posed above.
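Here is a minimal sketch of what such a check might look like for the UPS in my story. It assumes you can get a load reading in watts from somewhere (the old UPS could not provide one, which was exactly the problem). The function name, the status labels, and the 80% warning threshold are my own assumptions, so tune them to your equipment.

```python
def ups_load_status(load_watts: float, capacity_watts: float,
                    warn_at: float = 0.8) -> str:
    """Classify UPS load against its rated capacity.

    "warning"    -> the risk is about to be realized
    "overloaded" -> the risk has been realized
    The 0.8 default threshold is an assumption, not a vendor recommendation.
    """
    if capacity_watts <= 0:
        raise ValueError("capacity_watts must be positive")
    ratio = load_watts / capacity_watts
    if ratio >= 1.0:
        return "overloaded"
    if ratio >= warn_at:
        return "warning"
    return "ok"


# Hypothetical readings for a UPS rated at 1000 W.
for reading in (700, 850, 1100):
    print(reading, ups_load_status(reading, 1000))
```

Hook the "warning" state up to an email or a pager and the first question is answered well before anyone plugs in one monitor too many.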
Who is Responsible for Addressing or Responding to the Risk?
You can be proactive and combine this part of the risk register with the previous one. Determine who is monitoring for the risk, and who will address it if it occurs. Sometimes this may be the same person, but many times it may not be.
One of my early IT jobs was working in a large data center full time. One of my job duties was to monitor the states of thousands of servers, and to alert the appropriate contact point if an event occurred. Very rarely was I responsible for responding to the event personally.
The Response Plan
There are a number of ways that you can respond to risks. These methods, and brief descriptions, are:
- Mitigation – This means that you are prepared for the risk and you are able to reduce its impact. For example, having redundancy across several servers that host your essential LOB applications.
- Prevention – This means that you reduce the likelihood of the risk occurring. In response to the fiasco at the beginning of this post, the company implemented a policy that technicians couldn’t work on data center servers during business hours. That’s an example of both risk prevention and mitigation.
- Transfer – Risk transference involves putting off the risk to another party. Insurance is an example of this. Having adequate contracts is another. For example, if you have a service contract with a company that they will respond to your issue within 2 hours, then you have transferred a portion of risk to them. They are obligated to help you resolve issues, and if they are tardy or careless in meeting that obligation, there could be financial repercussions for them.
- Acceptance – This means just dealing with the risk if it occurs. This is generally reserved for risks with either an extremely low likelihood of occurring, a very low impact, or both.
When creating the response plan, don’t just state the way that you will deal with it. List specific actions that you (or the responsible party) will take. Specificity is key here. Provide step-by-step instructions. My suggestion is to be so meticulous and organized that the least-informed team member could carry it out without any difficulty.
The very worst way to deal with a risk is to ignore it. It’s much better – for the company and for your peace of mind – to create a comprehensive plan for identifying and responding to risks. That way, when they occur (because they will at some point) you will be ready for them.