What is the history of incident management?
If you’re an SRE, you may be so involved in the day-to-day work of reliability management and incident response that you don’t take the time to step back and ask this question. This is a shame because SREs did not invent incident management concepts and strategies on their own.
On the contrary, the ways in which SREs think about incident response, the structure of incident management teams, and the prioritization of incidents owe a lot to incident management strategies developed in the real world decades ago. To fully understand what it means to be an SRE today, you need to appreciate this deep history of incident response.
So, let’s take a look at this history and examine where modern concepts of incident response came from.
Historic Incident Management Problems
Societies have always had accidents, of course. Fires, floods, infrastructure collapses, and similar crises have been occurring for thousands of years.
But for most of history, humans have lacked an effective, purposeful way to manage these kinds of accidents. Response efforts were ad hoc, and their effectiveness owed more than a little to sheer luck.
Specific challenges included:
- Lack of effective and consistent communication between stakeholders.
- Divergent organizational structures that made it difficult to identify leaders, coordinate response efforts, and delegate tasks.
- Inconsistent response strategies.
- Varying methods for assessing the priority of accidents.
Historically, organizations might have been able to handle incidents well enough if incidents required a response from only one small group. But the more stakeholders were involved, the more difficult it was to respond quickly and effectively.
Extinguishing Fires: The Birth of ICS
Things began to change when stakeholders started thinking of better ways to put out fires – quite literally.
By the 1960s, California fire chiefs realized that they were struggling to respond effectively to the wildfires that broke out each summer. Each year the fires were worse than the year before, with more land burned and more buildings lost. The 1970 Laguna fire brought matters to a head and sparked a new approach to incident response for fire agencies.
After assessing what went wrong, the fire chiefs concluded that the problem wasn’t a shortage of equipment or people. It was poor coordination between the different agencies that responded to the fires. In the absence of a clear chain of command and an organized approach to fighting fires, agencies struggled to deploy their resources quickly and effectively.
To fix the problem, California fire chiefs developed what became known as the Incident Command System, or ICS. The ICS defines an incident response hierarchy, with the incident commander at the top. It also defines several functional areas of incident response, including operations, planning, logistics, and finance. Finally, it establishes a consistent set of terms that stakeholders can use to describe their actions during incident response, which facilitates clear communication.
Although the ICS was initially designed for firefighting, it has become the de facto standard for organizing incident response strategies of all kinds.
From ICS to NIMS
The history of incident response does not end with ICS. A new chapter began in the early 2000s when the US federal government developed a more comprehensive approach to incident management called the National Incident Management System, or NIMS.
NIMS was born in the aftermath of the September 11, 2001 terrorist attacks, which underscored the importance of effective communication not only between different agencies of the same type (such as fire departments), but also between completely separate organizations. To achieve this, NIMS expanded on the principles of ICS.
In addition to adopting most of the incident command principles and practices defined in ICS, NIMS includes standards for coordinating resource allocation. It also adopts the concept of an emergency operations center, which in some ways is similar to a network operations center.
NIMS also resembles a compliance framework, although it is not one. It includes fourteen management principles, which are similar to compliance controls, that organizations must implement in order to manage incidents using the NIMS approach.
Conclusion: Incident Management Today
Extinguishing wildfires and responding to terrorist attacks are obviously very different from dealing with a data center failure or a buggy application deployment. ICS and NIMS were not designed specifically for site reliability engineering or IT teams.
However, the impact of ICS and NIMS on the way SREs think is clear enough. Terms such as “incident commander” come from these frameworks. So do concepts such as shared accountability for incident response operations and the importance of involving all stakeholders – not just technical teams – in incident response.
ICS and NIMS may not be familiar acronyms for most SREs. But they should be, because they are the historical sources of the incident management philosophies that underlie SREs’ work today.