by ololobus 2338 days ago
I dare to promote a StackOverflow question [1] about pandas that I investigated and answered [2] half a year ago. I was rather horrified by the internal complexity after digging into the pandas source :)

The OP was wondering why pandas hits a strange overhead on every 100th iteration in one very specific case. Someone suggested Python's GC as the cause, but that was far from clear.

Eventually I dug into pandas and found a hard-coded constant of 100 (!) for the number of internal data storage blocks. Once that count is reached, pandas runs some consolidation routines [3], and they consume a lot of memory, even to the point of crashing with a memory error.

What was even more surprising: changing this constant to some large value (1000000, which effectively disables consolidation) reduces memory consumption dramatically! Consolidation seems intended to reduce storage and memory consumption, so I still do not know why the opposite happens here, or why it works well in all other cases.
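
To make the periodic overhead concrete, here is a toy model in plain Python. It is not pandas code: `ToyBlockStore`, `BLOCK_LIMIT`, and the list-based blocks are names and structures made up for illustration. It only shows why a "merge once there are more than 100 blocks" rule fires on roughly every 100th insert; it does not reproduce the memory blow-up itself.

```python
# Toy model, NOT pandas internals: each inserted column becomes its own
# "block", and once the block count exceeds a limit all blocks are merged.
BLOCK_LIMIT = 100  # stands in for the hard-coded constant (assumption)


class ToyBlockStore:
    def __init__(self, limit=BLOCK_LIMIT):
        self.limit = limit
        self.blocks = []          # each block is a list of columns
        self.inserts = 0          # total number of insert() calls
        self.consolidations = []  # insert counts at which a merge ran

    def insert(self, column):
        self.inserts += 1
        self.blocks.append([list(column)])
        if len(self.blocks) > self.limit:
            self._consolidate()

    def _consolidate(self):
        # Copy every column into one new block. The old blocks and the
        # new copy exist at the same time until the assignment below,
        # which is where a real implementation would see a memory spike.
        merged = [col[:] for block in self.blocks for col in block]
        self.blocks = [merged]
        self.consolidations.append(self.inserts)


store = ToyBlockStore()
for i in range(350):
    store.insert([i, i + 1, i + 2])

print(store.consolidations)  # [101, 201, 301]
```

Constructing the store with `ToyBlockStore(limit=1_000_000)` means the merge never runs for this loop, which mirrors the effect of raising the constant.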

[1] https://stackoverflow.com/questions/56690909/python-is-facin...

[2] https://stackoverflow.com/a/56705419/978424

[3] https://github.com/pandas-dev/pandas/blob/761bceb77d44aa63b7...