by aldanor 3295 days ago
So many people don't realize pandas can be horribly slow if you use it "wrong" -- i.e., for computations that don't vectorize in the way that's native to pandas. Also, working with dataframes that contain millions of rows is like playing Russian roulette -- there are usually many ways to do the same thing in pandas. If you guess right, you'll wait a minute or two until the computation is done; if you guess wrong, it'll run out of RAM, segfault, or never finish.
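
To illustrate what "wrong" can look like, here's a minimal sketch. The column names `price` and `qty` are made up for the example; the point is only that a row-wise `apply` calls a Python function once per row, while the column expression runs as a single vectorized operation.

```python
import pandas as pd

# Toy data; column names are hypothetical.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Slow pattern: a Python-level function is invoked once per row.
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Fast pattern: one vectorized multiplication over whole columns.
fast = df["price"] * df["qty"]
```

Both give the same numbers; on millions of rows the second form is typically much faster.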

For big datasets, I stopped using pandas myself a few years back for anything other than printing dataframes, datetime-indexed series, quick plots, or tiny/toy datasets -- in favor of numpy structured/record arrays. It's kind of the same thing, without all the groupby/index fluff, but very fast.
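
For anyone who hasn't used them, a rough sketch of what a structured array looks like. The field names and values here are invented for illustration, not taken from any real dataset.

```python
import numpy as np

# Hypothetical record layout: one fixed-size row per trade.
trade_dtype = np.dtype([("symbol", "U8"), ("price", "f8"), ("qty", "i8")])

trades = np.array(
    [("AAPL", 10.0, 1), ("MSFT", 20.0, 2), ("AAPL", 30.0, 3)],
    dtype=trade_dtype,
)

# Field access returns a plain ndarray view, so arithmetic is vectorized.
notional = trades["price"] * trades["qty"]

# Boolean masks stand in for simple filters; there is no index or groupby.
aapl = trades[trades["symbol"] == "AAPL"]
aapl_total = float((aapl["price"] * aapl["qty"]).sum())
```

You get named, typed columns in one contiguous buffer, and anything like grouping has to be written by hand (e.g. with masks or `np.unique`).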

Just last week, I helped a colleague speed up her code (a numerical solver for financial data) by more than 100x; the biggest part of that was ditching pandas entirely and using numpy.

2 comments

So I've been learning Pandas after mostly using either standard Python, R or VB to do our analysis, and I'm glad I read this because I thought I was going crazy.

I have a data set of about 4 million rows that I routinely analyze. I have 32 GB of memory on my desktop, and the only time I've really run out is when I write incredibly poor code. In the short while I've been trying to use Pandas, I've run out of memory and been killed by the OOM killer, or completely frozen my system for half an hour, while processing what I thought were simple operations.

I was honestly beginning to believe I was way worse at programming than I thought, given all the issues I was having. I wasn't even doing anything particularly complex; I was just loading a dataframe from a SQL query and playing around with basic manipulation.

I'm glad you are sharing this. I've had the same experience: in our code, we ditched Pandas entirely for structured arrays. We also used numpy record arrays at first but found them to be significantly slower than structured arrays for some reason, and since the former just add syntactic sugar on top of the latter, we're now running entirely on structured numpy arrays.
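
A small sketch of the relationship between the two, in case it's useful. The field names are hypothetical, and the timing at the end is only a way to check the difference on your own machine, not a benchmark result. One plausible source of overhead is that `np.recarray` routes field access through extra Python-level lookup code, but that is an assumption, not something we profiled in detail.

```python
import timeit

import numpy as np

# Plain structured array with two hypothetical fields.
structured = np.zeros(1000, dtype=[("x", "f8"), ("y", "i8")])
structured["x"] = np.arange(1000, dtype="f8")

# A record array is a view on the same memory that adds attribute access.
rec = structured.view(np.recarray)

same_data = np.array_equal(rec.x, structured["x"])
shares_buffer = np.shares_memory(rec, structured)

# Compare field-access cost yourself; the numbers depend on machine and numpy version.
t_structured = timeit.timeit(lambda: structured["x"], number=1000)
t_recarray = timeit.timeit(lambda: rec.x, number=1000)
```

Since both views share one buffer, switching from `rec.x` to `structured["x"]` is a change in access style only, with no copying involved.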