by aldanor
3295 days ago
So many people don't realize pandas can be horribly slow if you use it "wrong" -- i.e., for computations that don't vectorize in the way that's native to pandas. Also, working with dataframes that contain millions of rows is like playing Russian roulette -- there are usually many ways to do the same thing in pandas; if you guess right, you'll wait a minute or two until the computation's done, and if you guess wrong, it'll run out of RAM, segfault, or never finish. For big datasets, I stopped using pandas a few years back for anything other than printing dataframes, datetime-indexed series, quick plots, or working with tiny/toy datasets -- in favor of numpy structured/record arrays. They're kind of the same thing, without all the groupby/index fluff, but very fast. Just last week, I helped my colleague speed up her code (a numerical solver for financial data) by more than 100x; the biggest part of it was ditching pandas entirely and using numpy.
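To make the "wrong way vs. native way" point concrete, here's a rough sketch rather than anything definitive. The column names (`price`, `qty`) and the toy data are made up for illustration. It computes the same product three ways: a per-row `apply`, a vectorized pandas expression, and a numpy structured array with named fields.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and sizes are assumptions for illustration.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "price": rng.uniform(1.0, 100.0, n),
    "qty": rng.integers(1, 1_000, n),
})

# The "wrong" way: a Python-level function called once per row.
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# The native pandas way: one vectorized operation over whole columns.
fast = df["price"] * df["qty"]

# The same table as a numpy structured array: named, typed fields, no index.
rec = np.empty(n, dtype=[("price", "f8"), ("qty", "i8")])
rec["price"] = df["price"].to_numpy()
rec["qty"] = df["qty"].to_numpy()
notional = rec["price"] * rec["qty"]
```

All three give the same numbers; the difference is only in how much Python-level work happens per row. How big the gap is depends on the data and the operation, so time it on your own workload before drawing conclusions.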
|
I have a data set of about 4 million rows I routinely analyze. I have 32 GB of memory on my desktop, and the only time I've really run out is when I write incredibly poor code. In the short while I've been trying to use pandas, I've run out of memory and been killed by the OOM killer, or completely frozen my system for half an hour, while processing what I thought were simple operations.
I was honestly beginning to believe I was way worse at programming than I thought, given all the issues I was having. I wasn't even doing anything particularly complex; I was just loading a dataframe from a SQL query and playing around with basic manipulation.
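For the "load a dataframe from a SQL query" case, here's a minimal sketch of one way to keep memory in check, not a claim about what went wrong above. It assumes an in-memory SQLite table called `trades` (hypothetical, standing in for the real database). It reads the result in chunks with the `chunksize` argument of `pd.read_sql_query` and tracks how big each chunk is with `memory_usage(deep=True)`.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory table standing in for the real database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trades (id INTEGER, price REAL)")
con.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    [(i, i * 0.5) for i in range(10_000)],
)

total_rows = 0
peak_chunk_bytes = 0
price_sum = 0.0

# With chunksize set, read_sql_query yields DataFrames of at most that many rows
# instead of building one big frame.
for chunk in pd.read_sql_query("SELECT id, price FROM trades", con, chunksize=2_000):
    total_rows += len(chunk)
    # deep=True also counts the memory held by object (e.g. string) columns.
    peak_chunk_bytes = max(peak_chunk_bytes, int(chunk.memory_usage(deep=True).sum()))
    price_sum += float(chunk["price"].sum())

con.close()
```

This only helps when the computation can be done chunk by chunk (sums, counts, filters). Whether the database driver itself streams rows or buffers the full result is a separate question that depends on the driver.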