
by kbelder 1878 days ago

I've been doing a lot of data analysis in Pandas recently. I started off thinking that, for efficiency's sake, I should do as much initial processing in the DB as possible and use Pandas just for the higher-level functions that are difficult to do in SQL.

But after some trial and error, I find it much faster to pull relatively large, unprocessed datasets and do everything in Pandas on the local client. It's faster both in total analysis time and in DB cycles.

It seems that running a couple of simple queries like "select * from cars" and "select * from drivers where age < 30", then doing all the joining, filtering, and summarizing on my machine, is often less burdensome on the DB than doing it up front in SQL.
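
Here is a minimal sketch of that workflow, not a benchmark. The comment only names the tables, so the columns (driver_id, car_id, make, mileage), the sample rows, the mileage filter, and the in-memory SQLite database are all assumptions made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical schema and data: only the table names "cars" and "drivers"
# and the "age < 30" condition come from the comment above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE drivers (driver_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    CREATE TABLE cars (car_id INTEGER PRIMARY KEY, driver_id INTEGER, make TEXT, mileage REAL);
    INSERT INTO drivers VALUES (1, 'Ann', 25), (2, 'Bob', 41), (3, 'Cy', 29);
    INSERT INTO cars VALUES
        (10, 1, 'Honda', 30000.0),
        (11, 1, 'Ford', 50000.0),
        (12, 2, 'Honda', 80000.0),
        (13, 3, 'Ford', 40000.0),
        (14, 3, 'Honda', 10000.0);
    """
)

# Two simple pulls: the database does no joining or aggregating.
cars = pd.read_sql_query("SELECT * FROM cars", conn)
drivers = pd.read_sql_query("SELECT * FROM drivers WHERE age < 30", conn)
conn.close()

# Join, filter, and summarize on the local machine.
merged = cars.merge(drivers, on="driver_id", how="inner")
summary = (
    merged[merged["mileage"] >= 25000]  # assumed filter, for illustration only
    .groupby("make", as_index=False)
    .agg(car_count=("car_id", "count"), mean_mileage=("mileage", "mean"))
)
print(summary)
```

The inner merge drops cars whose driver was not in the pulled drivers frame, so it does the same work a SQL join would, just on the client.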

Of course, this can change depending on the specific dataset, how big it is, how you're indexed, and all that jazz. Just wanted to mention how my initial intuition was misguided.