

by Szpadel 895 days ago

I was setting up a fairly important database with the Zalando pg operator, and after good first impressions it went downhill. After about a month of use, the WAL files used for point-in-time recovery started failing to offload to the dedicated nodes and kept growing on the database pods, filling up all the space. At first I assumed there might not be enough space for some scheduled work (I don't really know the details of how this process works; I assumed the operator should handle all the implementation details for me), but even after upscaling the database 2.5x it kept failing with full storage and requiring manual recovery to bigger storage, most of which was taken up by WAL files.

HA didn't handle this case at all; the whole cluster went into a crash loop.

There was also an issue with huge pages causing crashes, and no easy way to disable them without some dirty injection of config files at runtime.

Some of this could be my fault, a misconfiguration on my side, but I wasn't able to figure out anything better from the docs.