Incident management: Roles to accelerate teamwork
Incidents and outages are inevitable when working as an engineer. Whether you work at a small startup or on large-scale infrastructure, something will eventually trigger a high-pressure emergency. Managing incidents can be especially difficult when you are new to the team. However, depending on only a few experienced engineers eventually becomes an obstacle to the growth of the development team.
There is also the problem that, in an emergency, engineers tend to focus on resolving the issue and pay less attention to communicating and leaving a record. A lack of communication can delay decision-making, and without a record of what happened, the team loses the chance to review its actions.
The ability to manage an incident directly affects the business. Whether you are an experienced member or a newcomer, it is important to make sure that everything is covered during the incident. The following are common roles, introduced in Practical Monitoring by Mike Julian and in Site Reliability Engineering, that many team members can take on as part of incident management.
Communication liaison
This role communicates status updates to stakeholders, whether they are internal or external. In a sense, they are the sole communication point between people working on the incident and people demanding to know what’s going on.
Julian, Mike. Practical Monitoring (p. 63). O’Reilly Media.
The communication liaison is in charge of keeping people who are not working on the incident up to date on its status (for example, stakeholders, managers, and engineers on other teams). This role is also responsible for preventing external parties from interfering with the engineers who are actually working to resolve the incident. It is important to let the engineers focus on their work, but also to share the facts with others so that they can decide what can be done on their end.
The facts that commonly need to be reported are listed below.
- Summary of what is happening
- Impact (user, revenue, security)
- Trigger of the incident
- Root cause
- Time the incident occurred (or duration)
- Current status
- Estimated recovery time
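One way to keep status updates consistent is to fill in the same fields every time. The following is a minimal Python sketch under that assumption; the `StatusUpdate` class, its field names, and the sample values are hypothetical and only mirror the list above. Facts that are not yet known (the trigger and root cause often are not, early on) are reported as unknown instead of being left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusUpdate:
    """One status report from the communication liaison (hypothetical structure)."""
    summary: str
    impact: str                               # user, revenue, security
    current_status: str
    started_at: str                           # when the incident occurred
    trigger: Optional[str] = None             # often unknown early on
    root_cause: Optional[str] = None
    estimated_recovery: Optional[str] = None

    def render(self) -> str:
        """Format the update as plain text, marking unknown facts explicitly."""
        def show(value: Optional[str]) -> str:
            return value if value else "Unknown (under investigation)"

        lines = [
            f"Summary: {self.summary}",
            f"Impact: {self.impact}",
            f"Trigger: {show(self.trigger)}",
            f"Root cause: {show(self.root_cause)}",
            f"Started at: {self.started_at}",
            f"Current status: {self.current_status}",
            f"Estimated recovery time: {show(self.estimated_recovery)}",
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative values only; they do not describe a real incident.
    update = StatusUpdate(
        summary="Checkout requests are failing",
        impact="Some users cannot complete payment",
        current_status="Investigating",
        started_at="10:00 UTC",
    )
    print(update.render())
```

Sending every update in the same shape lets readers compare it with the previous one and see at a glance what has changed.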
Scribe
The scribe’s job is to write down what’s going on. Who’s saying what and when. What decisions are being made? What follow-up items are being identified? Again, this role should not be performing any investigation or remediation.
Julian, Mike. Practical Monitoring (p. 63). O’Reilly Media.
The scribe is responsible for keeping track of everything related to the incident. In some cases, the person acting as scribe may also act as the communication liaison, but the scribe's main responsibility is to write down every detail as evidence of each action. This document is especially useful when engineers need to check what has already been done, and it feeds the postmortem after the incident is resolved. Write down everything in detail with an accurate timeline (who, what, when, how). If the immediate team is working in multiple groups, ask each group what happened, or remind them to write down what they did in the document.
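The who, what, when, and how of each action can be captured in a simple structured log. The sketch below is one possible shape for it in Python, not a prescribed tool; `TimelineEntry`, `IncidentLog`, and the sample entries are hypothetical names and values. Entries are sorted by time when the timeline is rendered, so notes collected afterwards from several groups still end up in order.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    when: datetime   # timezone-aware timestamp of the action
    who: str         # person or group that acted or spoke
    what: str        # the action, decision, or observation
    how: str = ""    # optional detail, e.g. the command that was run


@dataclass
class IncidentLog:
    """Scribe's record of an incident (hypothetical structure)."""
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, who: str, what: str, how: str = "",
               when: Optional[datetime] = None) -> TimelineEntry:
        """Add an entry; defaults to the current time in UTC."""
        entry = TimelineEntry(
            when=when or datetime.now(timezone.utc),
            who=who, what=what, how=how,
        )
        self.entries.append(entry)
        return entry

    def timeline(self) -> str:
        """Render all entries in chronological order, one per line, in UTC."""
        ordered = sorted(self.entries, key=lambda e: e.when)
        lines = []
        for e in ordered:
            stamp = e.when.astimezone(timezone.utc)
            line = f"{stamp:%Y-%m-%d %H:%M} UTC | {e.who} | {e.what}"
            if e.how:
                line += f" | {e.how}"
            lines.append(line)
        return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative entries only, recorded out of order on purpose.
    log = IncidentLog()
    log.record("alice", "Restarted the API service", how="service restart",
               when=datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc))
    log.record("bob", "Alert fired for high error rate",
               when=datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc))
    print(log.timeline())
```

The sketch assumes timezone-aware timestamps throughout; mixing them with naive ones would make the sort fail.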
Supporting subject matter experts
Subject matter experts (SMEs) are the experts who understand the system and have the knowledge to decide what needs to be done to recover. SMEs will be the ones doing most of the actual work. They are required to work quickly and accurately, so they need support to avoid human error. One way to support SMEs is to follow which logs and code they are looking into and what kind of changes they are making. Be the one who understands the situation and can double-check their work or review their code.
Facilitate postmortem meetings
What is a postmortem?
According to Site Reliability Engineering, “A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.” (Chapter 15, Postmortem Culture: Learning from Failure). A postmortem is key to sharing ideas for improving the system and its operation. After working on the incident, it is essential that the team looks back on its actions and doesn’t leave anything unchecked. Facilitating a meeting to write the postmortem is an effective way to help the team.
Prep for a postmortem meeting
Write an outline of the postmortem. The postmortem should be written by the engineers who actually did the work, since they know exactly what happened and what they discovered during the incident. However, it always helps to have an outline with all the obvious facts already written down. That way, the engineers can focus on writing what they actually did and what they were thinking during the process. Make sure everyone has the same understanding of the facts. Laying out all the actions on a whiteboard is also useful: engineers (and other participants) can use it to discuss and analyze how they managed the incident. An outline can include sections such as the following.
Title
Date
Authors
Status
Summary
Impact
Root Causes
Trigger
Resolution
Detection
Action Items
---
Lessons Learned
- What went well
- What went wrong
- Where we got lucky
---
Timeline
---
Supporting information
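Preparing the outline can be partly automated. The sketch below, in Python, generates a skeleton with the sections listed above and fills in whatever facts are already known, leaving the rest as TODO for the engineers. The function name `build_outline`, the `TODO` placeholder, and the sample title are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Section names taken from the outline above.
SECTIONS: List[str] = [
    "Title", "Date", "Authors", "Status", "Summary", "Impact",
    "Root Causes", "Trigger", "Resolution", "Detection", "Action Items",
    "Lessons Learned", "Timeline", "Supporting information",
]
LESSONS: List[str] = ["What went well", "What went wrong", "Where we got lucky"]


def build_outline(known_facts: Optional[Dict[str, str]] = None) -> str:
    """Return a plain-text postmortem skeleton.

    known_facts maps a section name to text that is already agreed on.
    Sections without a known fact are marked TODO.
    """
    facts = known_facts or {}
    unknown = set(facts) - set(SECTIONS)
    if unknown:
        raise ValueError(f"Unknown sections: {sorted(unknown)}")

    lines: List[str] = []
    for section in SECTIONS:
        lines.append(section)
        if section == "Lessons Learned":
            # Always list the three prompts; they are filled in at the meeting.
            lines.extend(f"- {item}" for item in LESSONS)
        else:
            lines.append(facts.get(section, "TODO"))
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Illustrative value only.
    print(build_outline({"Title": "Checkout API outage"}))
```

Rejecting unknown section names keeps a typo in a pre-filled fact from silently disappearing from the outline.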
Facilitate
Incidents and outages can be an opportunity to examine system vulnerabilities and how the team operates. As a facilitator hosting a postmortem meeting, make sure discussions focus on learning from the events and not on blaming someone. Ask questions that lead to new ideas.
- What went well and what went wrong?
- Where did we get lucky?
- What was the trigger?
- Can we change any operations?
- Do we need to change our design?
- At which point did our system recover?
- What can we do to recover our system faster?
- What was the user impact?
- What was the revenue impact?
- What was the impact on the team?
Encourage members to list all ideas, including ones that are difficult to implement or that can only be done over the long term. After all the ideas are out, select the ones to act on and add them to the postmortem.
Conclusion
Incidents can be daunting, but they are also a valuable opportunity to learn. System vulnerabilities and operational errors are more likely to be discovered during an incident. Incident management is not just about resolving the issue. Ongoing investment in cultivating operations and a postmortem culture can lead to fewer outages, more trust in the product, and happier teammates.