033: Postmortems

This week we talk about what you can do after a negative event happens at your company, like downtime or a security breach. The first priority is solving the problem. The second priority is making sure it doesn’t happen again, and the first step toward that is a postmortem.

Postmortems

00:36 A project postmortem is a process, usually performed at the conclusion of a project, to determine and analyze elements of the project that were successful or unsuccessful. Project postmortems are intended to inform process improvements which mitigate future risks and to promote iterative best practices. Postmortems are often considered a key component of, and ongoing precursor to, effective risk management. – Wikipedia, Postmortem Documentation

In tech, it’s a term that has come to mean a write-up or procedure for when something bad happens. If something breaks, fix it, but after that, look at what happened.

We’ve done this for a long time at CodePen. When we have downtime, we write up a postmortem report about the bug or failure in our process behind it.

A Recent Postmortem

2:05 The most recent downtime happened when Alex went to run a migration but forgot that there was already a migration set to run, one that would add a column to the Pens table (which takes about 5 hours). He tried to stop it, but once you start updating the database, it’s not really possible to stop.

When you’re going to start a migration, there’s a command to run that says “please catch me up with all the migrations”. Alex forgot to run that command. On top of that, because the migration took so long and he started it at about 1am, he forgot to restart the web servers after it finished.

The site itself didn’t go down, but people weren’t able to save new pens for a few hours after that, and team CodePen was sleeping during that time.

We normally have different sleep schedules, so there’s not usually a large amount of time that we aren’t watching the servers, but the downtime fell into the worst time frame possible.

What Happened?

4:57 Alex was so focused on running the migration that he completely forgot that the servers would need to be restarted. That’s a failure in our process: instead of running the migration, he should run a system that runs the migration, and then immediately restarts the web servers.

5:51 So that’s one of the things we looked at in our postmortem: what we could have done better to avoid this problem.

What should have happened was:

Migrations should be run when at least two members of the team are awake, so we can check on things that may have unexpectedly gone wrong.
There are three steps for updating the database; every one of those steps is currently manual. We should have a process in place that automatically does all the steps required.
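As a rough illustration of the second point, here is a minimal Python sketch of a runner that performs every step in order and stops at the first failure, so the restart can’t be forgotten. This is not CodePen’s actual tooling: the step names and the placeholder commands are assumptions made up for the example, and you would swap in your own migration and restart commands.

```python
import subprocess
import sys
from collections.abc import Sequence

# A step is a (name, command) pair. Names and commands are hypothetical.
Step = tuple[str, Sequence[str]]


class StepFailed(RuntimeError):
    """Raised when a step exits with a non-zero status."""

    def __init__(self, step: str, returncode: int, completed: list[str]):
        super().__init__(
            f"step {step!r} exited with code {returncode}; "
            f"completed before it: {completed}"
        )
        self.step = step
        self.returncode = returncode
        self.completed = completed


def run_steps(steps: Sequence[Step]) -> list[str]:
    """Run each step in order and stop at the first one that fails."""
    completed: list[str] = []
    for name, command in steps:
        print(f"running: {name}")
        result = subprocess.run(command)
        if result.returncode != 0:
            raise StepFailed(name, result.returncode, completed)
        completed.append(name)
    return completed


def placeholder(message: str) -> list[str]:
    """Build a harmless stand-in command that only prints a message."""
    return [sys.executable, "-c", f"print({message!r})"]


if __name__ == "__main__":
    # Assumed steps for illustration; replace the placeholders with real commands.
    run_steps([
        ("catch up on pending migrations", placeholder("pending migrations applied")),
        ("run the new migration", placeholder("new migration applied")),
        ("restart the web servers", placeholder("web servers restarted")),
    ])
```

Stopping at the first failure is a design choice in this sketch: if a migration fails, someone should look at it before the web servers are restarted.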

7:38 Part of the discussion centered around, “Why does the web server need to be restarted to run the migration?” Most of the other migrations we run don’t require that. So that’s another one of the questions that comes out of a postmortem.

Downtime Alerts

8:16 As a result of this downtime, we’re looking at getting a process that runs code in the browser and attempts to save a pen. If we had code running every 5 minutes that tried to save a pen, and would send an alert if there was an error, that would have saved us.

9:22 This problem is unique to running in production. It would have been really nice to have those smoke tests to alert us before our users had to tell us about the problem.
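The save-a-pen check described above could be sketched roughly like this in Python. It is only an outline under assumptions: `save_pen` and `send_alert` are hypothetical callables you would supply yourself (for example, a script that drives a browser and a function that sends a text message), and the 300-second default simply mirrors the “every 5 minutes” idea.

```python
import time
from typing import Callable, Optional


def check_once(save_pen: Callable[[], None],
               send_alert: Callable[[str], None]) -> bool:
    """Try to save a pen once; alert and return False if it raises."""
    try:
        save_pen()
    except Exception as error:
        send_alert(f"Smoke test failed: could not save a pen ({error})")
        return False
    return True


def run_smoke_test(save_pen: Callable[[], None],
                   send_alert: Callable[[str], None],
                   interval_seconds: float = 300,
                   max_checks: Optional[int] = None,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Repeat the check on an interval and return the number of failures.

    With max_checks=None the loop runs until the process is stopped.
    """
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not check_once(save_pen, send_alert):
            failures += 1
        checks += 1
        # Only wait if another check is coming.
        if max_checks is None or checks < max_checks:
            sleep(interval_seconds)
    return failures
```

Passing `save_pen` and `send_alert` in as arguments keeps the loop independent of how the pen is actually saved or how the team gets notified, and it makes the check easy to try out with fake functions first.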

How We Found Out

10:12 Chris is usually the first one awake due to time zones, and he woke up to a whole bunch of emails and tweets. Everything from “You’re a bunch of idiots” to “Did I break something?”

The first thing Chris did was try to recreate the problem, and then he reached out to Alex through email, Slack, and text message.

11:46 Within minutes, the team was online, fixing the problem. We got it fixed pretty quickly, but we still had hours of downtime.

The “Blameless” Postmortem

“Having a “blameless” postmortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

what actions they took at what time,
what effects they observed,
expectations they had,
assumptions they had made,
and their understanding of timeline of events as they occurred.

…and that they can give this detailed account without fear of punishment or retribution.” – John Allspaw, Blameless PostMortems and a Just Culture

12:56 The last thing you want to do is discourage people from identifying what went wrong and how to fix it moving forward. A blameless postmortem encourages those responsible for the problem to identify the cause of the issue, and find ways to prevent it from happening in the future.

14:30 We’ve all had chances to break the site. We’ve all touched code that could accidentally take the system down, or break something. We all share the success, and the blame when something goes down.

15:37 Another recent postmortem was an issue we had with Redis. Something happened to the box that Redis was running on, and it (essentially) shut down. This started a chain reaction that caused the site to go down. Tim was out snowboarding, and he didn’t have his laptop with him, so he wasn’t able to fix the problem. He tried to talk Alex through the fixes required, but their attempts weren’t successful. So a conversation came out of that about having the proper tools with you at all times (if possible), and how to solve Redis problems in the future.

How to Write a Postmortem Report

20:10 A postmortem can be just talking, which is fine, but the next step is to write a report so that you can reference it later. Writing things down forces you to think through every step, and you’ll internalize the experience better.

You don’t have to write a whole novel about what went wrong; a simple document with bullet points will suffice.

It’s really nice to know in detail why things happened the way they happened. A postmortem report is a great way for the person responsible for the bug to thoroughly explain (in writing) to the other team members what exactly happened, and hopefully prevent that same problem from happening in the future.

Show Links:

John Allspaw – Blameless PostMortems and a Just Culture

If you have a job opening at your company, post it on the CodePen Job Board! It only takes a few minutes and you’ll be reaching huge communities of potential candidates.

If you’re enjoying this show, please take a minute to leave us a review in iTunes. We really appreciate it, and thanks to everyone who has already left a review! (We read all of them)