Understanding risk

Understanding risk is an important part of driving change.

Throughout my career, I have tried to take a risk-based approach to actions and decision-making. Understanding risk became more important to me as I advanced from systems administrator to IT manager and IT director, then to CIO in higher education and government, and now to CEO at Hallmentum.

Understanding risk is an important part of driving change. If you understand the risks, you can decide which risks you can accept and which you cannot. In doing so, you avoid "analysis paralysis," where you continually evaluate options without actually choosing a direction. When you take a risk-based approach, some options become possible, and others become obvious.

The risk model

My standard model for evaluating risk is based on likelihood and impact, and I often reference this model when considering system risk. Let's apply it to a simple model of an IT system that supports a business function. Consider these two dimensions:

Likelihood

How likely is this system to experience a problem?

As an example, consider a computer system that is supported by only one server, under someone's desk, running on building power, and without a backup. That system has a high likelihood of failure. If you find a server like this in your organization, you should be very nervous, because its failure is a matter of when, not if.

In contrast, if you have a computer system that is supported by multiple servers in parallel, running in different data centers, using redundant power and cooling, with multiple levels of backups—that system has a low likelihood of failure. I usually don't worry about these systems.

Impact

If the system does fail, what's the impact on the organization? The answer depends on what the system does, including whether a failure could expose private data or damage the reputation of the organization.

Let's say you had a database that tracked when public benches were last repainted. Maybe your facilities department uses this information to know when to schedule a touch-up job or a complete repaint of the bench. If you lost this data, the organization wouldn't be impacted that much. The facilities folks would have to re-evaluate the benches. With many benches, this would take a while, but the organization would otherwise continue uninterrupted.

On the other hand, if your HR database of employees and salaries was irretrievably lost, your organization would be severely impacted. Someone in HR might be able to reconstitute the data from other sources, but it wouldn't be perfect. And depending on your industry, this is probably private data. Your organization could face damages and fines for the loss of salary information.

Building a risk matrix

The likelihood and impact feed into a risk matrix. I prefer to color code the matrix with red as the highest risk, blue as moderate risk, and green as the lowest risk:

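Likelihood v Impact (3×3)
(H = high risk, M = moderate risk, L = low risk)

                      Low impact   Moderate impact   High impact
High likelihood       M            H                 H
Moderate likelihood   L            M                 H
Low likelihood        L            L                 M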
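
The 3×3 matrix can also be written as a small lookup table. This is a minimal sketch, not a standard implementation: the level names "low", "moderate", and "high" and the function name `risk_rating` are illustrative assumptions, and the cell values are copied from the matrix above.

```python
# Ordinal levels used on both axes (names are illustrative assumptions).
LEVELS = ("low", "moderate", "high")

# Keys are likelihood; each tuple lists the rating for low, moderate,
# and high impact. Values mirror the 3x3 matrix:
# L = low risk, M = moderate risk, H = high risk.
RISK_MATRIX = {
    "low":      ("L", "L", "M"),
    "moderate": ("L", "M", "H"),
    "high":     ("M", "H", "H"),
}


def risk_rating(likelihood: str, impact: str) -> str:
    """Return the risk rating for a likelihood/impact pair."""
    if likelihood not in RISK_MATRIX or impact not in LEVELS:
        raise ValueError(f"levels must be one of {LEVELS}")
    return RISK_MATRIX[likelihood][LEVELS.index(impact)]
```

For example, `risk_rating("high", "high")` returns "H" for a fragile server that holds critical data, while `risk_rating("low", "low")` returns "L".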

For other risk scenarios, you might also include a "critical" risk, such as when loss of life is possible:

Likelihood v Impact (4×4)
C C C C
M H H C
L M H C
L L M C
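
The 4×4 matrix happens to follow a simple rule, which may be easier to maintain than a table. The sketch below assumes the fourth level is called "critical" on both axes, which the matrix does not state. Anything critical is rated C, and otherwise the rating depends on the combined ordinal score of likelihood and impact. The names `SCORES` and `risk_rating_with_critical` are hypothetical.

```python
# Ordinal scores for the three ordinary levels; "critical" is handled separately.
SCORES = {"low": 0, "moderate": 1, "high": 2}


def risk_rating_with_critical(likelihood: str, impact: str) -> str:
    """Return L, M, H, or C for a likelihood/impact pair."""
    for level in (likelihood, impact):
        if level != "critical" and level not in SCORES:
            raise ValueError(f"unknown level: {level!r}")
    # In the 4x4 matrix, the entire critical row and column are rated C.
    if "critical" in (likelihood, impact):
        return "C"
    total = SCORES[likelihood] + SCORES[impact]
    # A combined score of 0 or 1 is L, 2 is M, and 3 or 4 is H,
    # which reproduces the 3x3 core of the matrix.
    if total <= 1:
        return "L"
    if total == 2:
        return "M"
    return "H"
```

If your own matrix is not symmetric, a plain lookup table is the safer choice, because this rule treats likelihood and impact as interchangeable.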

I use a similar risk matrix when I need to understand the risk of making a change. Not all changes carry the same risk. I know from my early career as a systems administrator that you can make certain changes to a running system without worrying too much. But other changes require more attention. Again, consider the likelihood that a system change will result in a problem, and the impact that problem would have on the business.

Likelihood v Impact

Have a backout plan            Consider carefully
Consider timing                Coordinate across teams
Have a remediation plan        Have management support

Probably a standard change     Talk to others
Just do it                     Evaluate the timing
Lowest risk                    Consider knock-on effects

But when considering the risk of business applications, you might need a different risk matrix. Should you continue to run that business application? Does it really provide value? Is it reliable? These questions feed into a different matrix that helps you decide what to do with business systems.

Stability v Business Value

                  Low business value   High business value
High stability    Tolerate             Invest/Continue
Low stability     Retire               Upgrade
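
The same decision can be expressed as a short function. This sketch assumes the rows of the matrix are stability (high on top) and the columns are business value (low on the left), which is the reading that makes the four labels sensible. The function name `application_action` and its boolean parameters are hypothetical.

```python
def application_action(stable: bool, high_value: bool) -> str:
    """Suggest what to do with a business application.

    Assumes a two-level scale on each axis, as in the 2x2 matrix.
    """
    if stable:
        # Reliable systems: keep investing if valuable, otherwise tolerate.
        return "Invest/Continue" if high_value else "Tolerate"
    # Unreliable systems: upgrade if valuable, otherwise retire.
    return "Upgrade" if high_value else "Retire"
```

For example, `application_action(stable=False, high_value=True)` returns "Upgrade": the application matters to the business, but it is not reliable enough to leave alone.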

Risk in the organization

At every level in an organization, you need to understand risk. With a method to evaluate risk and make risk-based decisions, you can quickly move to a decision. Some options are obvious; others will require careful consideration and coordination with others.

Consider leading an exercise to explore the risk in your organization. I find it helpful to start with common definitions. Jot down some examples of what defines high/moderate/low impact and high/moderate/low likelihood. Also define what you mean by "high business value" or "low stability." Get buy-in on these definitions from the key stakeholders, then work with them to place applications or systems in each box. Don't worry about relative placement within the box. If you find yourself saying "This is a moderate likelihood, but it's on the high end of moderate," then just put it in "Moderate" and move on. The most important factor isn't that the likelihood is "moderate" but what you do afterwards to reduce the risk.
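
If you want to record the results of such an exercise, a simple grouping by box is enough. The sketch below is one possible way to do it. The system names and their placements are hypothetical examples, not real assessments, and `place_systems` is an illustrative name.

```python
from collections import defaultdict

# Agreed levels for both axes (an assumption; use your own definitions).
LEVELS = ("low", "moderate", "high")


def place_systems(assessments):
    """Group system names by their (likelihood, impact) box.

    Each assessment is a (name, likelihood, impact) tuple. Order within
    a box is just the order of entry; relative placement is ignored.
    """
    boxes = defaultdict(list)
    for name, likelihood, impact in assessments:
        if likelihood not in LEVELS or impact not in LEVELS:
            raise ValueError(f"{name}: levels must be one of {LEVELS}")
        boxes[(likelihood, impact)].append(name)
    return dict(boxes)


# Hypothetical workshop results, for illustration only.
assessments = [
    ("bench repaint database", "high", "low"),
    ("HR database", "moderate", "high"),
    ("campus website", "moderate", "high"),
]
boxes = place_systems(assessments)
```

Here `boxes[("moderate", "high")]` lists both the HR database and the campus website, with no attempt to rank one above the other inside the box.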

What can you do to reduce the risk of your systems? Technologists usually start with likelihood. What can you do at a system level to reduce the likelihood of a failure? Moving systems into a data center is one way to do this. Also add redundancy where possible, such as redundant power supplies on different power feeds backed by different uninterruptible power supplies, or move important data off single disks to a RAID or a SAN.

At the same time that you address likelihood, also consider the impact. How can you reduce the impact of a failure? For example, do you really need to store all of that data on the system? Is the extra data increasing your risk unnecessarily? If you can remove unneeded data from a system, you might reduce your impact.

Over time, you should repeat this exercise, and you should be able to demonstrate your organization reducing its risks. In the Likelihood v Impact chart, your applications and systems should move from red to blue, and from blue to green.
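
To demonstrate that movement, you can compare the ratings from two rounds of the exercise. This is a minimal sketch under stated assumptions: ratings are recorded as "L", "M", or "H" per system, the data shown is hypothetical, and `risk_movement` is an illustrative name.

```python
# Order the ratings so they can be compared.
RANK = {"L": 0, "M": 1, "H": 2}


def risk_movement(before, after):
    """Classify how each system's rating changed between two rounds."""
    movement = {}
    for name, old in before.items():
        if name not in after:
            continue  # system was retired or not re-assessed
        new = after[name]
        if RANK[new] < RANK[old]:
            movement[name] = "reduced"
        elif RANK[new] > RANK[old]:
            movement[name] = "increased"
        else:
            movement[name] = "unchanged"
    return movement


# Hypothetical ratings from two rounds of the exercise.
last_year = {"HR database": "H", "bench repaint database": "M", "email": "L"}
this_year = {"HR database": "M", "bench repaint database": "L", "email": "L"}
movement = risk_movement(last_year, this_year)
```

In this example, the HR database and the bench repaint database are both reported as "reduced," and email as "unchanged."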