by hidenotslide
3054 days ago
I don't find this a very compelling argument. The author doesn't mention any attempts to profile or speed up the code. With pandas specifically, I've found that if you aren't careful you can end up doing a lot of unnecessary copying. Not sure if that's what is going on here, but cProfile can help find the bottlenecks.
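For anyone who hasn't used it, here is a minimal sketch of what profiling with cProfile looks like. `slow_sum` and `main` are made-up stand-ins, not functions from the article's code; the point is only to show how the report surfaces where the time goes.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Hypothetical hot spot: materializes a full list before summing it.
    return sum([i * i for i in range(n)])


def main():
    # Hypothetical driver that calls the hot spot repeatedly.
    return [slow_sum(50_000) for _ in range(20)]


# Profile only the call we care about.
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Write the report into a string, sorted by cumulative time per function.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf).sort_stats("cumulative")
stats.print_stats(10)  # show at most the 10 most expensive entries
report = buf.getvalue()
print(report)
```

For a whole script, `python -m cProfile -s cumulative script.py` gives the same kind of report without touching the code.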
|
- `compute_diversity` is defined inside a double for loop, so the function object gets rebuilt on every iteration
- `sample1.ix[sample1.index[sample1.index.duplicated()]]` appears overengineered (I think you can just remove the `sample1.index` here (edit: you can't, but I think you could refactor to remove the indexing and reindexing and index resetting, and then you could))
- Depending on the data size, swapping from `[` to `(` everywhere (list comprehensions to generator expressions) would give a nice speedup, just because you no longer need to store everything in memory or swap to disk, whereas in Haskell the list comprehensions would be lazy by default. (edit: seeing as the downloaded databases are 12 and 33 GB, and pandas generally requires 2-3x the data size in RAM, it's likely that there's swapping happening somewhere. I'd bet that using generators would be a big speed boost)
- Overall I think genetic_distance can be significantly simplified; a lot of the index-massaging doesn't look necessary. I could be wrong, but this looks sloppy, and sloppy often implies slower than necessary.
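To illustrate the `[` vs `(` point, here is a small sketch comparing a list comprehension with the equivalent generator expression. The function names and the size `n` are invented for the example and have nothing to do with the article's data. Note that this mostly shows the memory difference; whether it turns into a speedup depends on whether the list version was actually pushing the machine into swap.

```python
import sys


def squares_list(n):
    # List comprehension: all n results exist in memory at once.
    return [i * i for i in range(n)]


def squares_gen(n):
    # Generator expression: results are produced one at a time on demand.
    return (i * i for i in range(n))


n = 1_000_000
as_list = squares_list(n)
as_gen = squares_gen(n)

# getsizeof reports the container only (not the ints it points to),
# but that is already enough to show the gap.
list_bytes = sys.getsizeof(as_list)
gen_bytes = sys.getsizeof(as_gen)
print(f"list: {list_bytes} bytes, generator: {gen_bytes} bytes")

# Both give the same answer when consumed by an aggregate like sum().
total_from_list = sum(as_list)
total_from_gen = sum(as_gen)
```

One caveat: a generator can only be consumed once, so this swap only works where the result is iterated a single time and never indexed.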
Unfortunately, the provided data files are big enough that I can't easily benchmark on my computer. I can't even fit the dataset in memory!