Thoughts on Resilience

Failover Conf: My Intro to Resilience Engineering

I attended the all-virtual Failover Conf a few weeks ago and spent the day learning a ton about resilience engineering concepts. The conference was informative and a valuable introduction to the field. While many of the concepts were familiar to me from both my software engineering and security backgrounds, some (especially those rooted in cognitive theory) were new to me. Resilience is a normal topic for me, as I regularly remind my customers of the importance of testing their backups and validating their disaster recovery plans. It’s normal for me to spend much of my day introducing folks to some of the worst possible scenarios that could happen to their critical business operations. In general, it’s easy to just never consider how long it would take to restore operations if the building we work in suddenly becomes unsafe to occupy or if an earthquake impedes access to the road out of town. We tend to avoid thinking about these things at all because they aren’t pleasant. It feels like we’re doing too much, spending too much time planning for what should never happen.

No One Remembers the Crisis Averted

Heidi Waterhouse stated:

No one remembers the crisis averted.

When we do resilience very well (because resilience is something you do - I agree with J. Paul Reed on this!), the services we provide may degrade, but they hopefully degrade in a way that minimally affects the users and systems relying on them. Ideally, services fail in a way that is undetectable to consumers, failing over gracefully to new instances with minimal or no downtime. This kind of graceful failure isn’t achievable without automation.
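To make that idea a little more concrete, here is a minimal sketch of the kind of health-check automation that lets traffic quietly shift to a standby instance without a human in the loop. This is not from any of the talks; the endpoint names are hypothetical, and in a real system this decision would live in a load balancer or orchestrator rather than a standalone script.

```python
# Toy sketch: prefer a primary endpoint, fail over to a standby when the
# primary's health check stops answering. All URLs are hypothetical.
import time
import urllib.request

PRIMARY = "https://primary.internal.example/healthz"   # hypothetical
STANDBY = "https://standby.internal.example/healthz"   # hypothetical

def healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers its health check in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def pick_active_endpoint() -> str:
    """Prefer the primary, but fail over to the standby automatically."""
    if healthy(PRIMARY):
        return PRIMARY
    print("primary unhealthy; failing over to standby")
    return STANDBY

if __name__ == "__main__":
    while True:
        active = pick_active_endpoint()
        # In practice this would update a load balancer or service registry;
        # here we just report the decision.
        print(f"routing traffic to: {active}")
        time.sleep(30)
```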

Automation

Much of the automation front-loading out there is “glue work.” I get the impression that some folks feel that glue work is a waste of time. “We could be implementing ✨features✨.” To me, it’s an investment: we will be able to implement those features and all the others more safely and efficiently if we do this work upfront. In addition to automation, thorough testing and accurate documentation also contribute to efficient deployments, yet they are often ignored or don’t get nearly the time and attention they deserve. It is worth mentioning that automation isn’t magic, and it adds to the complexity of the system. That complexity means different failures than the types we knew about before we implemented the automation.

In the same way that we don’t see a crisis that was averted, I suppose once automated deployments and recoveries are humming along, we don’t really notice the wasted time and headaches we’re avoiding. Amy Tobey mentioned in her talk that “glue work” is undervalued, yet so much work can’t happen without it. Whether it’s SecOps or DevOps, the glue is important. I think the glue might be the special sauce that some teams are missing. There’s nothing like that liberating feeling when you realize the vendor system that lacks an official integration with your SIEM has a usable (yes, API usability does matter!) REST API you can leverage to wrangle the data and get it into your SIEM.
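As an illustration of what that glue can look like, here is a minimal sketch that pulls alerts from a hypothetical vendor REST API and forwards them to a generic SIEM HTTP ingest endpoint. The URLs, tokens, query parameters, and event fields are all placeholders; any real vendor or SIEM will have its own paths, auth scheme, and schema.

```python
# Sketch of "glue work": fetch events from a vendor REST API and forward
# them to a SIEM. Every URL, token, and field name here is hypothetical.
import requests

VENDOR_API = "https://vendor.example.com/api/v1/alerts"   # hypothetical
SIEM_INGEST = "https://siem.example.com/ingest"           # hypothetical
VENDOR_TOKEN = "redacted"
SIEM_TOKEN = "redacted"

def fetch_vendor_alerts(since: str) -> list:
    """Fetch alerts created since the given ISO-8601 timestamp."""
    resp = requests.get(
        VENDOR_API,
        headers={"Authorization": f"Bearer {VENDOR_TOKEN}"},
        params={"since": since},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def send_to_siem(events: list) -> None:
    """Forward lightly normalized events to the SIEM in one batch."""
    payload = [{"source": "vendor-x", "raw": event} for event in events]
    resp = requests.post(
        SIEM_INGEST,
        headers={"Authorization": f"Bearer {SIEM_TOKEN}"},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    alerts = fetch_vendor_alerts(since="2020-05-01T00:00:00Z")
    if alerts:
        send_to_siem(alerts)
```

In practice a script like this would run on a schedule and track the last timestamp it processed, but even this small amount of glue is what turns an “unsupported” data source into usable telemetry.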

Embracing Failure

Something that dawns on me as I write this is that DevOps seems to have a more forward-thinking view of the value of glue work, especially testing and automation. There are specific roles carved out to do this work. Time is set aside regularly to enhance the systems and processes that support how work is accomplished. I think that in the security space (more specifically, security operations and incident response), we’re still learning how to enable faster feedback, better telemetry, and smoother incident response. I see a focus on learning from security incidents (or at least filling out the required post-incident documentation) but less of a focus on reviewing failures that end up as near misses, or on taking a high-level view of incidents over time to discover failure patterns and other details that only become apparent in the aggregate. As an industry, we’ve come a long way toward understanding what really enables us to do our jobs as defenders, but I think we still have work to do toward valuing continuous improvement of our processes and systems, embracing experimentation, and learning from failure.

One of the tools for managing organizational trauma that Matt Stratton mentioned in his talk was using something like “Failure Fridays” or another vehicle for regularly occurring, planned failure injections. As a security person, I absolutely love this. Tabletop exercises are a fun, low-pressure way to build muscle memory, but at some point we have to validate that the procedures outlined in our playbooks actually work in the real world. Lots of things can change between the time a playbook is written and the time it’s actually needed. Accurate, easy-to-follow documentation is a solid foundation. Amy mentioned this in her talk in the context of “cognitive capacity”: incident response playbooks shouldn’t be written for folks who are at their best. Remember that a sleepy analyst might have to follow those procedures at 3 AM. Instructions should be clear and concise, and they shouldn’t leave the responder guessing at what they mean in the middle of an incident.

Socio-technical Systems

Amy’s talk also opened my eyes to the concept of socio-technical systems, and the idea made perfect sense to me. People, processes, and technologies come together to affect how we actually deliver value from these systems! Amy makes the bold statement that root causes don’t really exist. We may use that language because it’s comfortable to us, but typically what we refer to as the “root cause” of an outage is actually just the last thing to fail. Her stance on human error is similar. When we get to a place where one person pushing the wrong button can bring down a system, what really got us there? Is the button poorly designed? Is it 4 AM and the button pusher is responding to a call after being up late with a sick child? What’s really happening? The underlying issue is more likely to be a failure in process or design than a failure in the human. Heidi Waterhouse said the following in her talk, and I tagged it in my notes as MOST important:

I don’t want to have default behaviors that put people in peril.

That statement hit me really hard. Heidi makes the extremely valid point that we need to practice both risk reduction and harm mitigation whenever we build systems. We should do all we can to reduce risk. Where we cannot, we must do our best to reduce the harm that will befall people if bad things happen. One thing we can do to fizzle a disaster into something less terrible is to implement safeguards that keep issues like runaway APIs or service outages from overloading or crashing our systems. Another capability might be automatic recovery, so an engineer doesn’t have to get paged at 2 AM to restart a failed service but can instead review the details of the failure and recovery in the morning. I love this approach because it safeguards the availability of the service as well as the well-being of the human responsible for maintaining it.
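One way to picture that automatic-recovery idea is a small watchdog sketch like the one below. It assumes a hypothetical service managed by systemd, restarts it when its health check fails, and records the event for review during work hours instead of paging anyone. Most teams would lean on their init system’s or orchestrator’s own restart policies for this; the sketch just makes the trade-off visible.

```python
# Sketch: restart a failed service automatically and log the details for
# morning review, rather than paging a human at 2 AM. The service name and
# log path are hypothetical.
import datetime
import subprocess
import time

SERVICE = "report-worker"                  # hypothetical service name
RECOVERY_LOG = "/var/log/auto-recovery.log"

def is_running(service: str) -> bool:
    """Ask systemd whether the service is active (exit code 0 means yes)."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", service], check=False
    )
    return result.returncode == 0

def restart(service: str) -> None:
    """Attempt a restart and append the details for later human review."""
    subprocess.run(["systemctl", "restart", service], check=False)
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(RECOVERY_LOG, "a") as log:
        log.write(f"{stamp} restarted {service} after failed health check\n")

if __name__ == "__main__":
    while True:
        if not is_running(SERVICE):
            restart(SERVICE)
        time.sleep(60)
```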

“Root Cause”

I mentioned “the last thing to fail.” In the context of defining disaster, Heidi says in her talk that we often hear “it was all going fine until that last thing.” Again, we may believe that we really found the “root cause.” Unless we dig deep and examine the whole socio-technical system, we are probably just referring to the straw that broke the camel’s back. It might have been the last update to the application or an influx of traffic due to popularity. Whatever it is, we “fix” it. We fix that straw, but the next straw is going to come along and topple the system again if we don’t address the underlying issues contributing to its fragility. A related bit of terminology I hadn’t heard before: dark debt, the debt we take on through the complex, often invisible interconnections and dependencies of our systems. This kind of debt could, I imagine, be magnified significantly by a lack of documentation or transparency regarding system dependencies. We have to plan for outages and issues of all kinds and build our systems to flex under the weight of the unexpected, whether that is a major increase in traffic or the outage of an external API. Matt also mentioned that when failures do occur, we shouldn’t stop at blameless postmortems. The next step is to ensure that we are sharing postmortems across teams and discussing them broadly. The storytelling is part of the value.

Adaptive Capacity

J. Paul Reed’s talk, “The Halo of Resilience Engineering,” delivered some great terminology and concepts, including adaptive capacity, which was new to me and naturally came up in multiple talks. Adaptive capacity is the ability to adapt to change. Where robustness describes the characteristics that make a system resistant to known failures, resilience describes being able to deal with unknown failures. From what I understand, adaptive capacity would be the key thing to look at when we want to measure resilience.

Closing Thoughts

Matt’s talk set the tone of the day for me because it got me thinking about how important people are to resilience. Before Amy’s talk, I didn’t have the language to describe some of these ideas, such as socio-technical systems. Amy’s two reading recommendations were The Field Guide to Understanding ‘Human Error’ by Sidney Dekker and the paper “How Complex Systems Fail.” (I just read the paper; it was short, sweet, and to the point, and a great follow-up to Failover Conf’s content.) A constant theme I heard in almost every talk was a subtle emphasis on setting expectations, something that could take place in a number of areas.

TLDR

From Heidi’s talk, this was my biggest takeaway and possibly the most important message of the entire conference:

Failure is inevitable. Disaster is not.

We can’t prevent every failure, but we can prevent the types of failures we know about from turning into out-of-control dumpster fires when they occur.


Resources:

📚 Here are some resources I am reading, have read, or plan to read to learn more about resilience engineering: