| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by luispedrocoelho 3053 days ago

`isin` is worse in terms of performance as it does linear iteration of the array.

Reading in chunks is not bad (and you can just use `chunksize=...` as a parameter to `read_csv`), but pandas `read_csv` is not so efficient either. Furthemore, even replacing `isin` with something like `df['id'].map(interesting.__contains__)` still is pretty slow.

Btw, deleting `interesting` (when it goes out of scope) might take hours(!) and there is no way around that. That's a bona fides performance bug.

In my experience, disk IO (even when using network disks) is not the bottleneck for the above example.

1 comments

proto-n 3052 days ago

Ok, I said I wasn't sure about the implementation, so I looked it up. In fact `isin` uses either hash tables or np.in1d (for larger sets, since according to pandas authors it is faster after a certain threshold). See https://github.com/pandas-dev/pandas/blob/master/pandas/core...

link