There has been a lot of debate recently about Impact Tolerance. A friend of mine was baffled by the fuss – surely it is simple? The inner consultant instantly answers – Yes. And no.
And of course, it is conceptually simple. Impact Tolerance is like Risk Tolerance in that you are trying to define what an allowable variation in an outcome is, but in this case the outcome is the impact on customers and clients that disruption to your service provision could create, and probability is not a consideration – you assume that disruption happens. The big question is – how much pain are you, or rather your clients and customers, prepared to put up with?
The financial regulators in the UK came up with the concept, and published their thoughts in their well-received discussion paper DP01/18 in July 2018. They want financial firms to apply impact tolerance limits for disruption to their critical service lines – and to demonstrate that they are able to stay within these limits. In the backs of their minds they have the highly publicised IT problems that RBS suffered in 2012 and TSB suffered in 2018 (amongst others).
But there are plenty of reasons why people have difficulty in making sense of this. First of all – what is a Service Line? It is not, contrary to what some IT people tell us, a cluster of systems, nor is it a product line like ‘Current Accounts’, nor a channel like ‘Mobile Banking’, nor an internal process like ‘Payroll’. A service line is something that delivers an outcome for a customer – in other words, something the customer recognises as a need being serviced – an “I want…” like “I want to withdraw cash”, or “I want to complete the purchase of my house”, or “I want to make a payment”. Product groupings and channels are simply different views of these service lines. It is surprising how many businesses get bogged down at this first hurdle. The key is to focus on customer need, and not to align service lines to IT or functional structures.
The next question is: what makes a service line critical? The regulators tell us that we must consider the impact of disruption on our customers, not just on markets (which had been the main focus of the PRA, and of the FSA before that), and not just on the firm’s profits. Mostly, potential impact on customers is measured by the length of disruption, but the FCA has become increasingly proactive in getting firms to think about the harm caused to customers, especially vulnerable customers. Meanwhile, firms themselves have long since realised that the financial bottom line isn’t necessarily their real bottom line – reputation is. A firm can often survive crippling losses, but not the loss of reputation. In assessing whether a service is critical we need an impact matrix that is consistent with (or is) the firm’s risk impact matrix. That matrix needs categories of impact that cover disruption to customers and markets, harm to individuals (which can be broken down further and measured by more granular criteria such as stress, inconvenience, personal financial loss, or inability to access critical resources or support), as well as damage to the company’s reputation and finances.
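To make the idea of an impact matrix a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the category names, the 1 to 5 rating scale, the threshold of 4 and the example ratings are hypothetical, not taken from any regulator or firm. It simply shows one way a service line might be flagged as critical when any impact category is rated high enough.

```python
from dataclasses import dataclass

# Hypothetical category names, loosely following the ones listed above.
CATEGORIES = (
    "customer_disruption",
    "market_disruption",
    "individual_harm",
    "reputation",
    "financial",
)


@dataclass
class ServiceLine:
    name: str
    ratings: dict  # category -> rating from 1 (negligible) to 5 (severe)


def critical_categories(service, threshold=4):
    """Categories whose rating meets or exceeds the (assumed) threshold."""
    return [c for c in CATEGORIES if service.ratings.get(c, 0) >= threshold]


def is_critical(service, threshold=4):
    """A service line is treated as critical if any category qualifies."""
    return len(critical_categories(service, threshold)) > 0


# Illustrative ratings only; a real assessment would come from the business.
house_purchase = ServiceLine(
    name="I want to complete the purchase of my house",
    ratings={
        "customer_disruption": 3,
        "market_disruption": 1,
        "individual_harm": 5,
        "reputation": 4,
        "financial": 2,
    },
)

print(is_critical(house_purchase))
print(critical_categories(house_purchase))
```

The categories that come back are the ones that made the service critical, which is a useful thing to keep hold of when deciding what to measure later.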
We need to identify and prioritise service lines and then, once prioritised, map the key process steps and critical underlying dependencies. Why? Because the purpose of the exercise is to make sure that the service line is resilient, and that means making sure that the assets on which the service depends are resilient. The point of prioritisation is to reduce the work you have to do and focus on what really matters. This is something people need to remember when they are undertaking this work because it is all too easy to get lost in inflexible and unnecessary analysis.
Coming back to Impact Tolerance: this should be set at Service Line level. It is serious stuff. A breach of Impact Tolerance should be akin to a breach of Risk Appetite, reported at Board level and visible to the regulator. Lower-level warning triggers, drawing on risk (and impact precursor) indicators, will be important management metrics to help avoid potential breaches.
Impact Tolerance measures need to be set for those impact categories that made the service critical. Up to a point this is not too much of a problem – most firms these days monitor negative media and complaints, for example, and it would be a poor show if a firm could not make a good estimate of the losses it suffers as a result of a disruption.
It is also relatively straightforward to measure the duration of a disruption (although the downtime of systems is not necessarily the same thing). The Bank of England’s examples of impact tolerance and their informal discussions encourage this, saying that ‘Time’ is the critical factor. From their point of view, it is. They are concerned with disruption to markets, where the longer it goes on the worse it gets. Because of this, and because suitable metrics usually already exist, many firms plan to set their impact tolerances in terms of duration.
However, there is a problem. The duration of a disruption is only a proxy for inconvenience, harm and irritation caused to customers. In some cases, it is not a matter of how long a service is down but that you don’t want the disruption at all. There’s a big difference between a cash point being down for a day and a house purchase failing to complete in time. Time is a poor proxy for real harm and distress, especially to vulnerable customers, which is more the focus of the FCA.
Ideally, we would directly measure distress and harm, but this is not at all easy. That doesn’t mean we shouldn’t try, and at least work towards that goal. We can infer harm, but first we need to understand who is using which services over what channel, and to make some assumptions about the demographics of users, particularly potentially vulnerable ones – for instance, taking into account the financial vulnerability of customers of payday loans, or the preference of the elderly for branch support. This may require some creative thinking about how the impact is defined. For example, how will you measure the stress felt by a customer? Can a proxy be developed using a combination of outage time, number of customers affected, customer profile, and the type of customer need that is affected? Leveraging the approach and thinking behind enterprise risk assessment criteria may be a source of insight here.
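As a thought experiment, a proxy of that kind might be sketched as below. This is not an established formula: the multiplicative form, the weight given to vulnerable customers, the severity weights and all the example numbers are assumptions made up for illustration. The point is only that a short disruption to a high-stakes need can score worse than a long disruption to a routine one.

```python
from dataclasses import dataclass


@dataclass
class Disruption:
    outage_hours: float
    customers_affected: int
    vulnerable_share: float  # assumed fraction of affected customers, 0 to 1
    need_severity: float     # assumed weight for the type of customer need


def harm_proxy(d, vulnerable_weight=3.0):
    """Hypothetical harm score: duration x weighted customers x severity.

    Each vulnerable customer counts vulnerable_weight times as much as
    any other customer. All weights are illustrative assumptions.
    """
    weighted_customers = d.customers_affected * (
        1 + (vulnerable_weight - 1) * d.vulnerable_share
    )
    return d.outage_hours * weighted_customers * d.need_severity


# Made-up figures: a cash point down for a day versus house purchases
# that fail to complete during a working day.
cash_point = Disruption(
    outage_hours=24, customers_affected=200,
    vulnerable_share=0.25, need_severity=1.0,
)
house_purchase = Disruption(
    outage_hours=8, customers_affected=30,
    vulnerable_share=0.25, need_severity=40.0,
)

print(harm_proxy(cash_point))
print(harm_proxy(house_purchase))
```

With these assumed weights the house purchase failure scores higher despite being shorter and touching far fewer people. Whether the weights are right is exactly the sort of judgement the business, not the code, has to make.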
A final challenge is where to set the thresholds. We need to understand what normal levels of service look like, and what happens when there are disruptions. By definition, the levels at which impact tolerance is breached are rarely reached; this is the final threshold that must not be crossed, but it is part of a hierarchy of metrics and indicators baselined at levels that equate to an acceptable target level of performance and/or impacts. Some careful thought is needed to get this right and aligned to Risk Appetite. The pay-off is better early warning surveillance and reporting capability focused on customer impact, rather than reporting after the proverbial has hit the fan, when media and complaints monitoring can come too late to help.
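The hierarchy described above could be represented along the following lines. This is a sketch under assumptions: the three levels (target, early warning, impact tolerance), the status labels and the example figures are hypothetical, and a real firm would likely have more levels and more than one metric per service line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    target: float            # acceptable level of performance or impact
    early_warning: float     # lower-level trigger for management attention
    impact_tolerance: float  # final threshold that must not be crossed

    def __post_init__(self):
        # The levels only make sense if they escalate in order.
        if not (self.target <= self.early_warning <= self.impact_tolerance):
            raise ValueError("thresholds must be in ascending order")


def classify(value, t):
    """Map a metric value to a status label (labels are assumptions)."""
    if value > t.impact_tolerance:
        return "breach"
    if value > t.early_warning:
        return "warning"
    if value > t.target:
        return "above target"
    return "normal"


# Illustrative only: hours of disruption to a payments service line.
payments = Thresholds(target=1, early_warning=4, impact_tolerance=8)

for hours in (0.5, 2, 6, 10):
    print(hours, classify(hours, payments))
```

A "breach" here is the event that would be reported at Board level; the "warning" band is where the early warning indicators earn their keep.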
To be honest – I don’t find all this as easy as it ought to be. Most businesses and business metrics are organised around functions, and what the regulator has asked us to do is to prioritise and measure in line with the customer’s most pressing needs. All the components seem familiar, but it needs a pivot in our thinking, and we have to be careful not to assume that what we are familiar with is the answer. But what I am already seeing in the places where I have worked is that there is the potential to generate much better management information and smarter resilience by getting this right.