Hacker News new | ask | show | jobs
by bufferoverflow 1659 days ago
I still don't get what caused 100GB of logs. Idle events every 5 minutes should not generate that much data, so what am I missing?
1 comments

when turning on logical replication postgres keeps all wal files until they are also applied to the downstream consumer. if you don't use logical replication they get cleaned up by setting wal_keep_segments and max_wal_size (which have a sane value so there is nothing to bother)

btw. a good idea is to keep these wal files or at least archive them with a basebackup, because it enables PITR recovery (https://www.postgresql.org/docs/9.3/continuous-archiving.htm...) via wal-g or wal-e.

btw. the bigger mistake here is that a lot of people do not monitor their database size, which is a huge problem.

edit: ah and wal files are always 16mb (unless compiled differently)

> There will always be at least one WAL segment file, and will normally not be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 or checkpoint_segments + wal_keep_segments + 1 files. Each segment file is normally 16 MB (though this size can be altered when building the server). You can use this to estimate space requirements for WAL. Ordinarily, when old log segment files are no longer needed, they are recycled (renamed to become the next segments in the numbered sequence). If, due to a short-term peak of log output rate, there are more than 3 * checkpoint_segments + 1 segment files, the unneeded segment files will be deleted instead of recycled until the system gets back under this limit.

Yes, but this system seems to have generated WAL at a very high rate for something that is a test system with no load.

It went from effectively 0 to 97 GB in less than 6 days; which is 673 MB/hr; 11MB /min, or 58 MB /checkpoint timer.

If we consider that most, if not all, of this traffic would be based on checkpoint-related full-page writes, that's still some 7000 distinct pages being modified every 5 minutes. Even when I consider pg_stat_statements recording and writing out all activities, modifying ~7000 blocks is a lot for what I can expect to be repeating and templated maintenance tasks.

So this was 21 days of idle chatter? 1 WAL per 5 minute event, 16MB per WAL, 100GB/16MB = 6250 * 5m = 31,250m/1440m/day = 21ish days.

Is that what happened?

In the post there is actually a graph showing that Transaction Logs filled the database in just 6 days (11/24 through 11/30). This doesn't add up to ~100 GB if we assume that each event takes up 16MB of space, but maybe each event took up even more storage than that.
probably. well in his timeline it was only two weeks less, but without more information (which is probably lost) it would be hard if the problem was made worse and other stuff accumulated. but yeah wal file are probably the culprint here. btw. most of the time it's not just aws rds, as somebody else stated a few things in postgres do happen to write data into wal files.

there is a lot more to it and depending on the config and replication it can behave a little bit differntly i.e. under normal use these files can get reused depending on min_wal_size/max_wal_size,etc... it's quite a complex thing and normally do not need to be fine tuned.

I'm also pretty sure the behavior changed in more recent versions (which probably made it worse in this scenario, but more default safety for everbody.)