I caught this fast enough that I wonder if anyone even noticed, but since this is the first unplanned downtime for the instance, I'll go ahead and report on it anyway.
The instance has been running at roughly 75-80% memory usage since it launched. This morning (in my local time zone) the server experienced a memory spike, which pushed the kernel into heavy swapping and in turn drove CPU usage to 100%.
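For anyone curious what that failure mode looks like as a check, here is a minimal sketch in Python using psutil. It is not the actual monitoring the instance runs, and the thresholds are just illustrative:

```python
# Minimal sketch of the condition behind the outage: RAM exhausted,
# swap churning, CPU pegged. Requires the psutil package; thresholds
# are illustrative, not the instance's real alert rules.
import psutil

MEM_LIMIT = 95.0   # % of RAM in use
SWAP_LIMIT = 50.0  # % of swap in use
CPU_LIMIT = 95.0   # % CPU utilization

def check() -> list[str]:
    alerts = []
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    cpu = psutil.cpu_percent(interval=1)  # sample CPU over one second
    if mem.percent >= MEM_LIMIT:
        alerts.append(f"memory at {mem.percent:.0f}%")
    if swap.percent >= SWAP_LIMIT:
        alerts.append(f"swap at {swap.percent:.0f}%")
    if cpu >= CPU_LIMIT:
        alerts.append(f"CPU at {cpu:.0f}%")
    return alerts

if __name__ == "__main__":
    problems = check()
    if problems:
        print("ALERT: " + ", ".join(problems))
```

Run from cron every minute or so, something like this would flag a swap spiral about as quickly as this incident was noticed.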
Timeline of Events
- 2023-06-14 17:11 UTC: Memory hit 100%, causing excessive swapping and 100% CPU usage
- 2023-06-14 17:14 UTC: Issue was noticed thanks to monitoring tools
- 2023-06-14 17:15 UTC: Shut down node, initiated node resize to double RAM and CPU
- 2023-06-14 17:17 UTC: Server back up and running
Evaluation
I was already a little concerned about running the server so close to the edge of memory, and this event proves that it's not tenable. With the new node we now have 4GB of RAM and 2 CPU cores, compared to 2GB and 1 core before. We are currently sitting at 36% memory usage and ~6% CPU. In general, Lemmy is CPU-bound at large user counts, so this server size should support us up to several hundred users. I expect memory will not be a concern for the foreseeable future.
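To put that in absolute terms: 36% of 4GB leaves roughly 2.5GB of headroom, versus only about 0.4-0.5GB free on the old 2GB node at 75-80% usage.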
As for what caused the event in the first place, I'm not sure. Server logs around the time of the event don't look unusual, but it's not practical to log all inbound activity, since every comment and every vote from every instance we're subscribed to arrives as a separate HTTP request. It's possible the spike was the result of a burst of federation messages from lemmy.ml, which has been struggling under load for the last couple of days.
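If I wanted to test that theory without logging every activity, one option would be to count inbound federation requests per minute from the reverse-proxy access log and see whether a burst lines up with the spike. A rough sketch, assuming an nginx access log in the default combined format and that federation activities arrive as POSTs to an /inbox path (adjust both for your setup):

```python
# Rough sketch: count inbound federation POSTs per minute from an nginx
# access log to see whether a burst lines up with the memory spike.
# Assumes the default combined log format and that activities arrive as
# POSTs to an /inbox path; both are assumptions, adjust for your setup.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # assumed location
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
)

per_minute = Counter()
with open(LOG_PATH) as f:
    for line in f:
        m = LINE_RE.match(line)
        if not m:
            continue
        if m["method"] == "POST" and m["path"].endswith("/inbox"):
            # Timestamps look like 14/Jun/2023:17:11:02 +0000; keep up to the minute.
            per_minute[m["time"][:17]] += 1

# Print the ten busiest minutes.
for minute, count in sorted(per_minute.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{minute}  {count} inbound activities")
```

Grouping by the captured remote address instead of by minute would show whether any burst came from a single instance such as lemmy.ml.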
I feel pretty confident about the new server hardware for now. I will continue to monitor system performance diligently, but I don't expect this particular issue to recur.