| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by aldanor 3058 days ago

@ the OP - not to sound hostile, but you write code (like in the example here [1]) that is bound to be slow, just from a glance at it. vstacking, munging with pandas indices (and pandas in general), etc; in order for it to be fast, you want pure numpy, with as little allocations happening as possible. I help my coworkers “make things faster” with snippets like this all the time.

If you provide me with a self-contained code example (with data required to run it) that is “too slow”, I’d be willing to try and optimise it to support my point above.

Also, have you tried Numba? It maybe a matter of just applying a “@jit” decorator and restructuring your code a bit in which case it may get magically boosted a few hundred times in speed.

[1] https://git.embl.de/costea/metaSNV/blob/master/metaSNV_post....

4 comments

luispedrocoelho 3058 days ago

That is the _FAST_ version of the code (people keep saying "of course, it's slow", when it's the fast version).

Here is an earlier version (intermediate speed): https://git.embl.de/costea/metaSNV/commit/ff44942f5f4e7c4d0e...

It's not so easy to post the data to reproduce a real use-case as it's a few Terabytes :)

Here's a simple easy code that is incredibly slow in Python:

    interesting = set(line.strip() for line in open('interesting.txt'))
    total = 0
    for line in open('data.txt'):
        id,val = line.split('\t')
        if id in interesting:
           total += int(val)

This is not unlike a lot of code I write, actually.

link

proto-n 3058 days ago

I've also found that loops with dictionary (or set) lookups are a pain point in python performance. However, this example strikes me as a pretty-obvious pandas use-case:

    interesting = set(line.strip() for line in open('interesting.txt'))
    total=0
    for c in chunks: # im lazy to actually write it
        df = pd.read_csv('data.txt', sep='\t', skiprows=c.start, nrows=c.length, names=['id','val'])
        total += df['val'][df['id'].isin(interesting)].sum()

I'm not exactly sure, but pretty sure that isin() doesn't use python set lookups, but some kind of internal implementation, and is thus really fast. I'd be quite surprised if disk IO wasn't the bottleneck in the above example.

link

luispedrocoelho 3058 days ago

`isin` is worse in terms of performance as it does linear iteration of the array.

Reading in chunks is not bad (and you can just use `chunksize=...` as a parameter to `read_csv`), but pandas `read_csv` is not so efficient either. Furthemore, even replacing `isin` with something like `df['id'].map(interesting.__contains__)` still is pretty slow.

Btw, deleting `interesting` (when it goes out of scope) might take hours(!) and there is no way around that. That's a bona fides performance bug.

In my experience, disk IO (even when using network disks) is not the bottleneck for the above example.

link

proto-n 3057 days ago

Ok, I said I wasn't sure about the implementation, so I looked it up. In fact `isin` uses either hash tables or np.in1d (for larger sets, since according to pandas authors it is faster after a certain threshold). See https://github.com/pandas-dev/pandas/blob/master/pandas/core...

link

aldanor 3058 days ago

Could you give a hint of how the data ("sample1", "sample2") looks like, or how to randomly generate it in order to benchmark it sensibly? I guess these are similarly-indexed float64 series where the index may contain duplicates? Maybe you could share a chunk of data (as input to genetic_distance() function) as an example if it's not too proprietary and if it's sufficient to run a micro benchmark.

There's also code in genetic_distance() function that IIUC is meant to handle the case when sample1 and sample2 are not similarly-indexed, however (a) you essentially never use it, since you only pass sample1 and sample2 that are columns of the same dataframe (what's the point then?), and (b) your code would actually throw an exception if you tried doing that.

P.S. I like the part where you've removed the comment "note that this is a slow computation" :)

link

onuralp 3058 days ago

Have you checked out scikit-allel? It is fairly comprehensive in terms of calculating basic population stats, and the developer is highly active.

scikit-allel: http://scikit-allel.readthedocs.io/en/latest/index.html

scikit-allel example: http://alimanfoo.github.io/2015/09/21/estimating-fst.html

zarr: https://github.com/zarr-developers/zarr

link

BerislavLopac 3058 days ago

The speed could possibly be improved by using map. Also, not related to speed if this is all of the code, but might affect it in a larger programs: you should make sure your file pointers are closed. Something like:

    with open('interesting.txt') as interesting_file:
        interesting = {line.strip() for line in interesting_file}
    with open('data.txt') in data_file:
        total = sum(int(val) for id, val in map(lambda line: line.split('\t'), data_file) if id in interesting)

link

jkabrg 3058 days ago

`map` is not going to make it faster. `map` is a loop. Only vectorized code is faster.

link

deckiedan 3058 days ago

Have you tried using Cython to compile code like the above? Python's sets / maps / reading data etc should be fairly optimised, so Cython might let you bypass boxing counter variables instead using native C ints or whatever.

Also, if the data you're reading is numeric only - or at least non-unicode / character data - you might be able to get a speed boost reading the data as binary not as python text strings.

link

peterwaller 3058 days ago

> you write code (like in the example here [1]) that is bound to be slow, just from a glance at it > [1] https://git.embl.de/costea/metaSNV/blob/master/metaSNV_post.....

Given his code you referenced, could you elaborate on what makes it look slow at a glance, and how you might speed it up? :)

link

anaptdemise 3058 days ago

ln 221:

    if snp_taxID not in samples_of_interest.keys():#Check if Genome is of interest

Tracing through, looks like samples_of_interest is a dict. `snp_taxID not in samples_of_interest` would make membership check constant time.

link

ihnorton 3058 days ago

> Also, have you tried Numba?

Numba does not support dictionaries and has limited support for pandas dataframes (only underlying arrays, when convertible to NumPy buffers, if I understand correctly). This limits usefulness for many non-array situations, as well as some existing code-bases (the dictionary is fundamental in Python and typically used everywhere -- often for performance).

link

pbowyer 3058 days ago

Brian Moore's quip [0] about mod_rewrite comes to mind every time I use Numba:

"Despite the examples and docs, Numba is voodoo. Damned cool voodoo, but still voodoo"

0. https://httpd.apache.org/docs/2.0/rewrite/

link