Hacker News
by 2devnull 911 days ago
> "How to prevent this? The obvious way is to just not use dataframes, at least not while doing aggregation. Rather than allocating a huge dataframe and loading our partial results into its columns bit by bit, we can just store our partial results in a plain list."

A lot to be said for not defaulting to data frames, in both R and Python. Or, if you must use them because you don't think in other data structures easily or just want the convenience, reach for something like R's data.table or Python's polars.
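For anyone wondering what the "plain list" approach from the quote looks like in practice, here is a minimal Python sketch. The chunk layout and the `summarize_chunk` helper are made up for illustration; the article's actual aggregation step may look quite different.

```python
def summarize_chunk(chunk):
    """Reduce one chunk of raw values to a small partial result.

    Hypothetical helper: stands in for whatever per-chunk work is needed.
    """
    return {"n": len(chunk), "total": sum(chunk)}


def aggregate(chunks):
    # Append each partial result to a plain list instead of writing it
    # into the columns of a preallocated dataframe.
    partials = []
    for chunk in chunks:
        partials.append(summarize_chunk(chunk))

    # Combine the partial results once, at the end.
    n = sum(p["n"] for p in partials)
    total = sum(p["total"] for p in partials)
    return {"n": n, "mean": total / n if n else float("nan")}


# Made-up example data: three chunks of unequal size.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
result = aggregate(chunks)
```

If a data frame is still wanted for the final report, it can be built once from the combined result rather than grown piece by piece.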

2 comments

> A lot to be said for not defaulting to data frames, in both R and Python

I would even add "especially in Python." The main issue I have found is that pandas-heavy code is just not as easy to integrate with other Python tools, features, and abstractions as code that does the vast majority of its work with numpy, dictionaries, and comprehensions.

As a heavy pandas user for several years, I decided about a year ago to not import pandas by default and instead treat most data problems like regular Python problems. I've been genuinely surprised at how much easier it is to create useful abstractions with the code I've been writing, and also how much easier it's been to onboard non-DS devs into the code base.

There are a few obvious cases where pandas is very helpful, and I'll pull it out in those places, but I've been able to do a tremendous amount of data work in the last year while using very little pandas. The result is that I have an actual codebase to work with now rather than a billion broken notebooks.
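To make "treat most data problems like regular Python problems" concrete, here is a rough sketch of a group-and-average step using only a dataclass, a dictionary, and a comprehension. The `Reading` type and the sensor data are invented for the example and are not from anyone's actual codebase.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    # Hypothetical record type; a real project would model its own domain.
    sensor: str
    value: float


def mean_by_sensor(readings):
    """Group readings by sensor and return the mean value per sensor."""
    groups = defaultdict(list)
    for r in readings:
        groups[r.sensor].append(r.value)
    # One dict comprehension replaces the groupby-then-mean step.
    return {sensor: sum(vals) / len(vals) for sensor, vals in groups.items()}


readings = [Reading("a", 1.0), Reading("b", 10.0), Reading("a", 3.0)]
means = mean_by_sensor(readings)
```

Because `mean_by_sensor` takes and returns ordinary Python objects, it can be unit tested and reused like any other function, which is the kind of abstraction the comment above is describing.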

> The result is that I have an actual codebase to work with now rather than a billion broken notebooks.

This is the biggest part. Giving yourself permission to make real abstractions, rather than forcing yourself to go directly from data-on-disk to pandas (or whatever), makes it that much easier to test, repeat, modify, and extend whatever analysis you're working on.

In what cases have you found it worthwhile to use pandas?
Resampling, regularizing, binning and forward/backward filling time series data is an absolute pain in the ass using only SQL and/or vanilla python. It does its thing well, there.

(Note that in general, I'm the biggest pandas hater I know)
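As a small illustration of the time series case mentioned above, this sketch regularizes an irregular series onto a one-minute grid and forward fills the gaps. The timestamps and values are made up, and it assumes pandas is installed.

```python
import pandas as pd

# Made-up irregular samples: readings at minutes 0, 2 and 5.
ts = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 00:02", "2024-01-01 00:05"]
    ),
)

# Put the series on a regular 1-minute grid, carrying the last
# observed value forward into the minutes that had no sample.
regular = ts.resample("1min").ffill()
```

Doing the same thing by hand means generating the grid, walking both sequences in step, and tracking the last seen value, which is the sort of bookkeeping that gets painful in SQL or vanilla Python.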

It can be nice for groupby-aggregate logic. And it feeds into plotnine.
For sure! Definitely a good thing to know for an R newbie like me who is handling large datasets naively.

Thanks for mentioning polars, I hadn't heard of it before but it looks neat.