|
|
|
|
|
by 2devnull
911 days ago
|
|
>” How to prevent this? The obvious way is to just not use dataframes, at least not while doing aggregation. Rather than allocating a huge dataframe and loading our partial results into its columns bit by bit, we can just store our partial results in a plain list.” A lot to be said for not defaulting to data frames, in both r and python. Or, if you must, using something like r data.table or python’s polars if you don’t think in other data structures easily or just want convenience. |
|
I would even add especially in Python. The main issue I have found is that pandas heavy code is just not as easy to integrate into other Python tools/features/abstractions as code using mostly numpy, dictionaries and various comprehensions to do the vast majority of your work.
As a heavy pandas user for several years, I decided about a year ago to not import pandas by default and instead treat most data problems like regular python problems. I've been genuinely surprised as how much easier it is to create useful abstractions with the code I've been writing, and also how much easier it's been to onboard non-DS devs into the code base.
There are a few obvious cases when Pandas is very helpful, and I'll pull it out in those places, but I've been able to do a tremendous amount of data work in the last year and used very little pandas. The result is that I have an actual codebase to work with now rather than a billion broken notebooks.