This is a semi-unstructured brain dump on my approach to leading incidents in a big company. In a sentence: compassionate, focused authority that provides direction and clarity.
What
“Incident commander” is the role given during an incident to the person responsible for ensuring the incident is handled correctly. In smaller incidents, they may also be the one fixing the problem and leading communication. In bigger incidents, those are split into separate roles.
After the incident, the incident commander will typically also be the one who leads the postmortem.
Why
During small incidents, it is usually obvious who should fix the problem and follow up on it. If a company has an on-call rotation, it’ll usually be the person on-call. Otherwise, it’s the person who broke the code. They know what broke, so they’ll know how to fix it.
During a larger incident, many different pieces might be moving at once. There may be dozens or hundreds of people in the incident channel. Coordinating their efforts into a good plan of action takes energy. The incident commander is there to pull the right people together to solve the problem.
There may be too many people in the incident channel. The collective audience may sit back and assume that someone else will solve it. In these cases, the incident commander is there to assign specific actions to people and avoid the bystander effect.
Even when multiple people are actively working on solving the problem, having an assigned incident commander can be useful.
In a big incident in a big company, multiple teams will probably be both affected and contributing. The incident commander can connect the dots and ensure the right people are involved.
How
Incidents tend to start broad. You know the problem, but not the solution. As the incident evolves, more problems may occur. There may be multiple solutions. As an incident commander, your role is to bring all this information together to solve the problem quickly without further compromising the service.
Evaluate the impact on business.
Is the problem critical to the operation of your company? If it’s not, then it could probably wait for a normal bug fix.
Be aware of the resources you have available.
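How many people are actively looking into the problem? If you need to pull in more people, do so. If you need to get your hands dirty, do so.
Be direct in the actions you want specific people to do.
Use the shape <what> <why> <who>: what needs doing, why it matters, and who should do it. A hypothetical example: "Roll back the last deploy, because errors started right after it, Sam."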
Ask questions.
If you don’t know something, or think something is being missed, ask in the incident channel. Questions will direct others towards finding a solution.
Be available to answer questions.
You do not have to answer them yourself, but you should make sure the questions have an answer from someone.
Gather information, and summarize it.
Make sure that everyone has the same information. Avoid crossing wires.
Keep all communication about fixing the problem in the incident channel, unless it involves security or user data.
When discussions move to direct messages or side channels, it becomes harder to collectively understand and address the problem.
If something is difficult to explain over text, jump on a call together.
But keep calls short-lived so that you can focus on coordinating.
Overcommunicate, and do it regularly.
Uncertainty about the actions taken or the information available leads to confusion and chaos.
Show compassion for those helping.
Some people enjoy fixing things, regardless of the source of the problem. I’m one of those. Others find it stressful and demanding. Thank people, and make sure the tasks are divided.
Acknowledge your own limitations, and raise them when appropriate.
Sometimes a limitation is:
A lack of knowledge about a system, in which case the incident commander should lean on others to provide the knowledge.
Being tired, in which case someone else should take over so you can rest.
Having other obligations (e.g. picking up your kids), in which case someone else should take over if the incident is still active.
Useful skills
Knowing the organisation, the teams, and who might be useful to fix the problem means it’s easier to get the right people involved. I approach this during peaceful times by getting to know people. Then, during incidents, it’s easier to reach out to people when you know their expertise.
A calm, focused, yet forceful approach can cut through chaos.
Don’t fear code. I often look at codebases I’ve never seen before, in languages I don’t typically use. Languages are mostly the same, and you’re mostly trying to identify how the systems connect.
Use Git history. Often systems break due to recent code merges. Find the merge that is the likely culprit, then revert it. Having a locally running environment or a staging environment is very handy for trying this out.
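As a rough illustration of the Git tip, here is a minimal Python sketch that builds the two commands I would typically reach for: one to list recent merges, one to revert a chosen merge. It assumes the repository uses real merge commits (squash or rebase merges would not show up under `--merges`). The helper names `recent_merges_cmd`, `revert_merge_cmd` and `run` are made up for this example; the Git flags themselves are standard.

```python
import subprocess
from typing import List


def recent_merges_cmd(since: str = "2 days ago") -> List[str]:
    """Build a `git log` command listing merge commits newer than `since`."""
    return [
        "git", "log", "--merges",
        f"--since={since}",
        "--date=iso",
        "--pretty=format:%h %ad %an %s",  # short hash, date, author, subject
    ]


def revert_merge_cmd(sha: str, mainline: int = 1) -> List[str]:
    """Build a `git revert` command for a merge commit.

    `-m 1` keeps the first parent (usually the main branch) as the baseline.
    That choice is an assumption; check it against your own branching model.
    """
    return ["git", "revert", "--no-edit", "-m", str(mainline), sha]


def run(cmd: List[str], cwd: str = ".") -> str:
    """Run a command inside the repository at `cwd` and return its stdout."""
    result = subprocess.run(
        cmd, cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout


# Example use from inside a repository (not executed here):
#   print(run(recent_merges_cmd("6 hours ago")))
#   run(revert_merge_cmd("<sha of the suspect merge>"))
```

Keeping the command construction separate from execution means you can print the command and paste it into the incident channel before anyone runs it.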
After
Take the learnings forward. Part of a mature incident handling process involves taking steps to prevent the same problem happening again. But it’s also important to improve the incident handling process itself. Errors, bugs, and mistakes are always going to happen. Handling the unexpected well is a sign of a company’s technical competence.
Thank the people involved. It is so easy for those behind the scenes to be forgotten. So make sure to thank them. An incident is rarely fixed by one person. Together, we build things. Together, we break things. And together, we fix things.