by hidenotslide 3054 days ago
I don't find this a very compelling argument. The author doesn't mention any attempts to profile or speed up the code.

Specifically with pandas I've found if you aren't careful you can do a lot of unnecessary copying. Not sure if that's what is going on here, but cProfile can help find the bottlenecks.
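For anyone who hasn't used it, here is a minimal sketch of what a cProfile run looks like. The functions `slow_sum` and `workload` are hypothetical stand-ins, not the code from the article; the point is only that the report sorted by cumulative time shows where the time goes.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Hypothetical hot function: builds a throwaway list on every call.
    return sum([i * i for i in range(n)])


def workload():
    # Hypothetical workload that calls the hot function many times.
    return [slow_sum(2000) for _ in range(200)]


profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)  # show at most the top 10 entries
report = stream.getvalue()
print(report)
```

For a whole script, `python -m cProfile -s cumulative script.py` gives the same kind of report without touching the code.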


Seconding this, there are a couple of things that jump out at me as immediately non-optimal, and which together would probably give an order of magnitude speedup.

- Defining compute_diversity inside a double for loop

- `sample1.ix[sample1.index[sample1.index.duplicated()]]` appears overengineered. I think you can just remove the `sample1.index` here. (Edit: you can't, but I think you could refactor to remove the indexing, reindexing, and index resetting, and then you could.)

- Depending on the data size, swapping from `[` to `(` everywhere (list comprehensions to generator expressions) would give a nice speedup, just because you no longer need to store everything in memory or swap to disk, whereas in Haskell the list comprehensions would be lazy by default. (Edit: seeing as the downloaded databases are 12 and 33 GB, and pandas generally requires 2-3x the data size in RAM, it's likely that there's swapping happening somewhere. I'd bet that using generators would be a big speed boost.)

- Overall I think genetic_distance can be significantly simplified; a lot of the index-massaging doesn't look necessary. I could be wrong, but this looks sloppy, and sloppy often implies slower than necessary.
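To illustrate the `[` to `(` point, here is a small sketch with made-up data (not the article's): the list comprehension materializes every element before `sum` sees any of them, while the generator expression hands them over one at a time.

```python
import sys

n = 1_000_000

# List comprehension: all n results are stored before anything consumes them.
squares_list = [i * i for i in range(n)]

# Generator expression: results are produced lazily, one per iteration.
squares_gen = (i * i for i in range(n))

# getsizeof counts only the container itself (the list's pointer array),
# not the integers it refers to, so the real gap is larger than this shows.
list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)

total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)  # consumes the generator; it cannot be reused

print(list_bytes, gen_bytes, total_from_list == total_from_gen)
```

The catch is that a generator is single-pass, so this only helps where the intermediate result is consumed once.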

Unfortunately, the provided data files are big enough that I can't easily benchmark on my computer. I can't even fit the dataset in memory!

Good point about pandas dataframes taking up extra space, and the solution of using chunking/generators. 33 GB is what Wes McKinney would call "medium data": https://twitter.com/wesmckinn/status/413159516096585729

The problem is that libraries that work fine when everything fits in RAM start breaking down if you aren't careful. It's not really a Python speed issue, but you lose some of the tools you relied on previously.
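A rough sketch of the chunking idea, using only the standard library so it runs anywhere. The helper `iter_chunks`, the column names, and the in-memory "file" are all hypothetical; with pandas the analogous approach would be passing `chunksize` to `read_csv` and aggregating per chunk.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable of rows."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical in-memory stand-in for a large CSV file on disk.
fake_file = io.StringIO(
    "sample,count\n" + "".join(f"s{i % 3},{i}\n" for i in range(10))
)

# Only one chunk of rows is held in memory at a time; the running totals
# are the only state that survives between chunks.
totals = {}
for chunk in iter_chunks(csv.DictReader(fake_file), chunk_size=4):
    for row in chunk:
        totals[row["sample"]] = totals.get(row["sample"], 0) + int(row["count"])

print(totals)
```

This works for aggregations that can be updated incrementally; anything that needs the whole dataset at once (a global sort, say) needs a different strategy.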

You are commenting on the variant of the code that is fast enough that it doesn't matter.
While that may be true, my point is that it is almost certainly possible to make your code go faster than it is already, and also become more readable in the process.

And so saying that python is either slow or ugly and unreadable is perhaps an unfair characterization. I may be wrong here. I haven't benchmarked the code in question, but I think that even for the algorithm you're trying to do, with the special casing, that function could be significantly simplified.

Edit: I'd be curious to see example data that is passed into this function.

That may be the case. However, my point is that we started with a rather direct implementation of a formula in a paper. This was very easy to write but took hours on a test set (which we could extrapolate to taking weeks on real data!).

Then, I spent a few hours and ended up with that ugly code that now takes a few seconds (and is dominated by the whole analysis taking several minutes, so it would not be worth it even if you could potentially make this function take zero time).

Maybe with a few more hours, I could get both readability and speed, but that is not worth it (at this moment, at least).

*

The comment about the benchmark data being large is exactly my point: as datasets are growing faster than CPU speed, low-level performance matters more than it did a few years ago (at least if you are working, as I am, with data this large).

Right, and my point is that you could probably

1. Have gotten similar performance boosts elsewhere, meaning that you wouldn't have needed to refactor this function in the first place (the implication of a 10000x speedup means that may not be true, but I can absolutely see the potential for 100x speedups in this code, depending on exactly what the input data is)

2. It's likely that there are much more natural, idiomatic ways to implement the function you have in pandas. These would be clearer and likely equally fast, possibly faster. (Heck, there are even ways to refactor the code you have to make it look a lot like the direct-from-the-paper implementation.)

In other words, this isn't (necessarily) a case of Python having weak performance, it's a case of unidiomatic Python having weak performance. This is true in any language though. You can write unidiomatic code in any language, and more often than not it will be slower than a similar idiomatic method (for example, repeatedly applying `foldl` in Haskell). I'm not enough of an expert in pandas multi-level indexes to say that for certain, but I'd bet there are more efficient ways to do what you're doing from within pandas that look a lot less ugly and run similarly fast.

Granted, there's an argument to be made that the idiomatic way should be more obvious. But "uncommon pandas indexing tools should be more discoverable" is not the same as "python is unworkably slow".

1. No, that function was the bottleneck, by far, and I can tell you that >10,000x was what we got between the initial version and the final one.

2. I don't care about faster at this point. The function is fast enough. Maybe there is some magic incantation of pandas that will be readable and compute the same values, but I will believe it when I see it. What I thought was more idiomatic was much slower.

I think this is more of a case of "the problem does not fit numpy/pandas' structure (because of how the duplicated indices need to be handled), so you end up with ugly code."

Well, I just wanted to use pandas to load a 4 GB CSV file. After it used 32 GB of my RAM and 4 GB of swap, I gave up. I just loaded all that data into Postgres and ran a couple of queries. That way I stopped using pandas altogether.
I found that pandas is great for data exploration and for data that you know is small (a few hundred MB). Other than that, Python builtins and numpy arrays are a better alternative.
I hardly use pandas at this point besides read_csv, which is very good once you know the syntax for parsing strings/dates, skipping rows, dropping columns, etc.

After that I usually just keep the numpy array since all I need is floats. I guess the index groupby stuff is cool, but I never really needed it. Postgres is fine but if you're just doing numerics it doesn't help much.

It helps by keeping the RAM requirement smaller. And I have group by and materialized indices, which help a lot for preserving huge modified datasets.