Postmortem: Discuss outage (2021-11-12)

At ~8 PM, @wolf reported via Slack that Discuss was serving 500 errors.

After some looking around, I noticed that the root filesystem was out of free space.

I manually deleted an old Discuss backup, 1.2 GB in size, and Discuss started working again. I manually copied that backup off of the server before deleting it.

I also noticed that Scuttlebutt was a very large user of disk space: ~40 GB, about a third of total space. I stopped Scuttlebutt, then backed up the data and removed it.
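For anyone who wants to repeat this kind of check, here is a rough sketch of how the biggest directories under a given parent could be found. It is not what I actually ran during the outage; the function names `dir_size_bytes` and `largest_subdirs` are made up for illustration, and the parent path you pass in is up to you.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    # onerror swallows unreadable directories instead of raising
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if not os.path.islink(file_path):
                    total += os.lstat(file_path).st_size
            except OSError:
                # File vanished or is unreadable; ignore it
                pass
    return total


def largest_subdirs(parent, top=5):
    """Return up to `top` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size_bytes(entry.path), entry.path))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    # "/var" is only an example starting point
    for size, path in largest_subdirs("/var"):
        print(f"{size / 1e9:8.2f} GB  {path}")
```

This counts apparent file sizes, so the numbers can differ a little from what `du` reports on disk.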

I think that moving forward, we should make sure to put each service inside its own container with a quota, so that we can prevent one service from taking out another. Also, if we get enough interest, we can start enabling some active monitoring to raise awareness of system health issues, like running out of disk space.
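As a starting point for the monitoring idea, a minimal sketch of a disk space check could look like the following. It only uses the Python standard library; the 90% threshold, the function names, and the idea of running it from cron are my assumptions, not something we have set up.

```python
import shutil
import sys


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def check_disk(path="/", threshold=90.0):
    """Return (alert, percent); alert is True when usage is at or above threshold."""
    percent = disk_usage_percent(path)
    return percent >= threshold, percent


if __name__ == "__main__":
    alert, percent = check_disk("/", threshold=90.0)
    print(f"/ is {percent:.1f}% full")
    # A non-zero exit code lets cron or another scheduler flag the problem
    sys.exit(1 if alert else 0)
```

Run on a schedule, something like this would have warned us before the root filesystem filled up. A real setup would also need to send the alert somewhere people will see it.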

Thanks Dave. I can try to help with some of these tasks too as I have done a bit of server management.

