We had a post mortem triple-header yesterday. It took all of forty minutes. We don’t strive to have incidents, nor do we strive to review them so quickly. But it’s important to work through what caused incidents and how we could avoid them in the future (preferably by eliminating error paths outright, but also by making checklists for the next time). The temptation is always to address problems “when we have time.” That time never comes! So I’m glad we squeezed these in even though we’re busy simplifying and stabilizing Repl.it.
In software, on the teams I’ve been on, a post mortem is a review of what led to a problem affecting a lot of people. We have a template that asks a few key questions. What was the problem? How did we detect it? What events led up to the problem and what steps did we take to fix it (the timeline)? What was the root cause? What went well? What could’ve gone better? What steps can we take now to reduce the chances of similar problems happening in the future? A single person starts writing the post mortem doc (really, a markdown file sitting in a repl) but anyone involved in the incident can contribute. We then meet for about half an hour to walk through the timeline and the takeaways and figure out what steps we still need to take.
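For anyone who wants to start with something similar, here is a minimal sketch of how the questions above could be turned into a markdown skeleton. The section headings, the `TEMPLATE_QUESTIONS` list, and the `render_template` helper are all my own assumptions for illustration; this is not our actual template, just the same questions laid out as a starting point.

```python
# Hypothetical sketch: headings and helper names are assumptions, built only
# from the questions listed in the paragraph above.
TEMPLATE_QUESTIONS = [
    ("Problem", "What was the problem?"),
    ("Detection", "How did we detect it?"),
    ("Timeline", "What events led up to the problem, and what steps did we take to fix it?"),
    ("Root cause", "What was the root cause?"),
    ("What went well", "What went well?"),
    ("What could have gone better", "What could've gone better?"),
    ("Follow-up steps", "What steps can we take now to reduce the chances of similar problems happening in the future?"),
]


def render_template(title: str) -> str:
    """Return a markdown skeleton with one heading per template question."""
    lines = [f"# Post mortem: {title}", ""]
    for heading, question in TEMPLATE_QUESTIONS:
        # Each section starts with the prompting question in italics,
        # followed by a blank line where contributors write their notes.
        lines += [f"## {heading}", "", f"_{question}_", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_template("example incident"))
```

Whoever starts the doc can paste the output into a fresh file and let everyone involved fill in the sections before the meeting.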
We don’t blame people during an incident or its aftermath. Usually when people make clear mistakes that lead to downtime, they’ll apologize profusely. I know I do. Everyone at Repl.it has been really good at assuring folks they shouldn’t feel bad. Sid likes to say “it could have been any of us.” That’s helped us trust each other during tough stretches. Beyond that, though, it’s allowed us to learn from our mistakes. When you’re busy it’s always tempting to say “we screwed up but we’ll be more careful next time.” This puts a lot of pressure on clearly fallible humans to not err. If you go out of your way to not blame individuals, then you’re well on the way to not blaming the team or people in general. What made it easy to make this particular mistake? Can we configure our systems to not allow this path in the future? If not, can we at least develop clear habits or checklists to avoid this path?
I certainly don’t enjoy having incidents and I’d love for us to have fewer of them. But I’ve come to look forward to post mortems, because I know they’ll make us better. The most interesting and helpful post mortems examine our decision-making. Why, during an incident, did we not take a particular step (rolling back, cutting off traffic to a particular route, adding a database follower) sooner? Repeatedly asking these questions has led us to respond more decisively to problems, particularly by following the incredibly short checklist at outage.party.
The next step for us is to start sharing what we learn publicly. Not that anyone stays awake at night wondering about our production configuration (though maybe they do; I haven’t directly asked anyone). It’s more that everyone who depends on Repl.it staying up (more and more folks these days) should know when we’ve hit a routine speed bump versus something completely new to us.