| Seconding this. A couple of things jump out at me as immediately non-optimal, and fixing them together would probably give an order of magnitude speedup. - Defining compute_diversity inside a double for loop - `sample1.ix[sample1.index[sample1.index.duplicated()]]` appears overengineered (I think you can just remove the `sample1.index` here (edit: you can't, but I think you could refactor to remove the indexing, reindexing, and index resetting, and then you could)) - Depending on the data size, swapping from `[` to `(` everywhere (i.e. list comprehensions to generator expressions) would give a nice speedup just because you no longer need to store everything in memory/swap to disk, whereas in Haskell the list comprehensions would be lazy by default. (edit: seeing as the databases downloaded are 12 and 33 GB, and Pandas generally requires 2-3x the data size in RAM, it's likely that there's swapping happening somewhere. I'd bet that using generators would be a big speed boost) - Overall I think genetic_distance can be significantly simplified; a lot of the index-massaging doesn't look necessary. I could be wrong, but this looks sloppy, and sloppy often implies slower than necessary. Unfortunately, the provided data files are big enough that I can't easily benchmark on my computer. I can't even fit the dataset in memory! |
The problem is that libraries that work fine when everything fits in RAM start breaking down if you aren't careful. It's not really a Python speed issue, but you lose some of the tools you relied on previously.
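
For what it's worth, here is a minimal sketch of the `[` to `(` idea from the quoted comment, using only the standard library. It is not the original script: the `read_rows` and `mean_length` helpers, the `id`/`length` columns, and the in-memory sample are all made up for illustration. The point is just that a generator pipeline holds one row at a time, so nothing has to fit in RAM at once.

```python
import csv
import io


def read_rows(handle):
    """Yield one parsed row at a time instead of loading the whole file."""
    for row in csv.DictReader(handle):
        yield row


def mean_length(rows):
    """Streaming mean of the (hypothetical) `length` column."""
    total = 0
    count = 0
    for row in rows:
        total += int(row["length"])
        count += 1
    return total / count if count else 0.0


# Tiny in-memory stand-in for a file that is too large to load at once.
sample = io.StringIO("id,length\na,10\nb,20\nc,30\n")

# Generator expression (parentheses, not brackets): rows are filtered
# lazily as mean_length pulls them, so no intermediate list is built.
long_rows = (row for row in read_rows(sample) if int(row["length"]) >= 20)

result = mean_length(long_rows)
print(result)
```

With a real file you would pass an open file handle instead of the `StringIO` object. The catch is the one mentioned above: once the data is a stream rather than a DataFrame, you give up the indexing conveniences and have to write the aggregation yourself.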