by luispedrocoelho
3053 days ago
`isin` is worse in terms of performance, as it iterates linearly over the array. Reading in chunks is not bad (and you can just pass `chunksize=...` as a parameter to `read_csv`), but pandas' `read_csv` is not that efficient either. Furthermore, even replacing `isin` with something like `df['id'].map(interesting.__contains__)` is still pretty slow. Btw, deleting `interesting` (when it goes out of scope) might take hours(!) and there is no way around that. That's a bona fide performance bug. In my experience, disk IO (even when using network disks) is not the bottleneck for the above example.
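For reference, here is a minimal sketch of the chunked approach mentioned above, combining `chunksize=...` with the `map(interesting.__contains__)` filter. The function name `filter_csv_in_chunks`, the `id`/`value` columns, and the tiny in-memory CSV are all made up for illustration; the original example presumably read a much larger file from disk, and this says nothing about how fast it would be there.

```python
import io

import pandas as pd


def filter_csv_in_chunks(source, interesting, column="id", chunksize=100_000):
    """Return the rows of a CSV whose `column` value is a member of `interesting`.

    `filter_csv_in_chunks` is a hypothetical helper, not part of pandas.
    """
    kept = []
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows
    # instead of loading the whole file at once.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Set membership test per element, as suggested in the comment above.
        mask = chunk[column].map(interesting.__contains__).astype(bool)
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)


# Tiny in-memory stand-in for a large CSV file (assumed layout: id,value).
csv_text = "id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n"
interesting = {2, 5}

result = filter_csv_in_chunks(io.StringIO(csv_text), interesting, chunksize=2)
print(result)
```

Passing a file path instead of the `io.StringIO` object works the same way. Keeping `interesting` as a `set` is what makes each membership test constant time on average, though as noted above the per-element Python call is still slow on large inputs.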