
by malux85 3626 days ago

Room for a DevOps story?

I was managing a very busy 14-node Cassandra cluster. The machines had 16GB of RAM and, as a very horrible hack for OOM problems, 8GB of swap space too. With Cassandra, you always have to keep at least as much disk space free as the size of your largest table, because during a "major compaction" the table may have to be rewritten.

One of these machines was desperately running out of disk space, so I turned the swap off to reclaim the extra 8GB. (I just had to get the compaction out of the way; after that the machines would be upgraded and none of this would be a problem anymore.)

After turning off the swap, I could see the kernel moving data back into RAM, and I was also watching the disk fill up from the compaction. Both were progressing at a fairly linear rate, but the system was going to run out of disk space BEFORE the swapfile was released.
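The race described above is simple arithmetic if both rates really are linear. A rough sketch, where the function names and the numbers in the usage note are purely illustrative and not measurements from this cluster:

```python
def seconds_until(amount, rate_per_s):
    """Time for a linear process to consume `amount` at `rate_per_s`."""
    if rate_per_s <= 0:
        raise ValueError("rate must be positive")
    return amount / rate_per_s


def required_pause_seconds(free_disk, fill_rate, swap_used, drain_rate):
    """How long the disk-filling process must be held back, in total,
    so that the swap is fully drained before the disk fills up.

    Assumes both rates are constant and that draining the swap keeps
    going while the disk-filling process is paused.
    """
    t_disk_full = seconds_until(free_disk, fill_rate)
    t_swap_free = seconds_until(swap_used, drain_rate)
    # No pause is needed if the swap is released first anyway.
    return max(0.0, t_swap_free - t_disk_full)
```

With made-up figures of 3000 MB free filling at 10 MB/s and 6600 MB of swap draining at 20 MB/s, `required_pause_seconds(3000, 10, 6600, 20)` gives a 30 second shortfall, the same order as the gap in the story.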

But not by much: the system would run out of disk space about 30 seconds before the swap space would be free. Now, I knew that this Cassandra configuration was set so that a node wouldn't be considered "dead" unless it was out of communication with the cluster for longer than 10 seconds. So I used kill (SIGSTOP to pause, SIGCONT to resume) to freeze the Cassandra process a few times, but never for longer than 10 seconds.

I was able to freeze the process for long enough that the swap space was freed before the disk space ran out, and the node communicated with the cluster often enough to remain "active".
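For the curious, the freeze-and-resume trick above can be sketched roughly like this. It assumes the freezing was done with SIGSTOP and SIGCONT, which is the usual way to pause a process with kill. The function name, the 8 second burst, and the 2 second breather are all made up for illustration; the only number from the story is the 10 second limit the bursts have to stay under.

```python
import os
import signal
import time


def pause_in_bursts(pid, total_pause_s, max_burst_s=8.0, resume_s=2.0,
                    kill=os.kill, sleep=time.sleep):
    """Freeze process `pid` for `total_pause_s` seconds in total, split
    into bursts no longer than `max_burst_s`, letting it run for
    `resume_s` seconds between bursts.

    `kill` and `sleep` are injectable so the logic can be dry-run
    without touching a real process.
    """
    paused = 0.0
    while paused < total_pause_s:
        burst = min(max_burst_s, total_pause_s - paused)
        kill(pid, signal.SIGSTOP)      # freeze the process
        try:
            sleep(burst)
        finally:
            kill(pid, signal.SIGCONT)  # always resume, even on error
        paused += burst
        if paused < total_pause_s:
            sleep(resume_s)            # let it talk to the cluster again
    return paused
```

Keeping each burst comfortably under the failure-detection window is the whole point; the try/finally is there so an interrupted script never leaves the process frozen.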

Lesson Learned: These machines were severely underspec'd!