    import pandas as pd

    interesting = set(line.strip() for line in open('interesting.txt'))
    total = 0
    for c in chunks:  # I'm too lazy to actually write it
        df = pd.read_csv('data.txt', sep='\t', skiprows=c.start, nrows=c.length, names=['id', 'val'])
        total += df['val'][df['id'].isin(interesting)].sum()

`isin` is worse in terms of performance, as it iterates linearly over the array.

Reading in chunks is not bad (and you can just pass `chunksize=...` as a parameter to `read_csv`), but pandas `read_csv` is not so efficient either. Furthermore, even replacing `isin` with something like `df['id'].map(interesting.__contains__)` is still pretty slow.
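For reference, a minimal sketch of the `chunksize=` variant with the `map(interesting.__contains__)` mask. The in-memory `data` buffer, the tiny `interesting` set, and `chunksize=2` are made-up stand-ins for the real files, and `dtype={'id': str}` is my assumption so that the ids compare as strings against the set:

```python
import io

import pandas as pd

# Hypothetical stand-ins for 'data.txt' and 'interesting.txt'.
data = io.StringIO("a\t1\nb\t2\nc\t3\na\t4\nd\t5\n")
interesting = {"a", "c"}

total = 0
# With chunksize set, read_csv returns an iterator of DataFrames
# (here at most 2 rows each) instead of one big frame.
for df in pd.read_csv(data, sep="\t", names=["id", "val"],
                      dtype={"id": str}, chunksize=2):
    # Membership test via the set's own __contains__, in place of isin.
    mask = df["id"].map(interesting.__contains__).astype(bool)
    total += int(df.loc[mask, "val"].sum())

print(total)
```

This only removes the hand-written `skiprows`/`nrows` bookkeeping; it does nothing about the speed of `read_csv` or of the membership test.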

Btw, deleting `interesting` (when it goes out of scope) might take hours(!) and there is no way around that. That's a bona fide performance bug.

In my experience, disk IO (even when using network disks) is not the bottleneck for the above example.