| > reviewing the code is a breeze I have the opposite opinion. In a previous codebase I fought hard to use dataclasses & type hinting where possible over dictionaries, because with dictionaries you'd never know what type anything was, or what keys were present. That worked nicely and it was much easier to understand the codebase. Now I've been put on a Pandas project and it's full of mysterious df = df[df["thing"] == "value"]
I just feel like we've gone back to the unreadability of dictionaries.Everything's just called "df", you never know what type anything is without going in and checking, the structure of the frames is completely opaque, they change the structure of the dataframe halfway through the program. Type hinting these things is much harder than TypedDict/dataclass, at least doing it correctly & unambiguously is. It's practically a requirement to shove this stuff in a debugger/REPL because you'd have no chance otherwise. Sure, the argument is that I'm just in a bad Pandas codebase, and it can be done much better. However what I take issue with is that this seems to be the overwhelming "culture" of Pandas. All Pandas code I've ever read is like this. If you look at tutorials, examples online, you see the same stuff. They all call everything the same name and program in the most dynamic & opaque fashion possible. Sure it's quick to write, and if you love Pandas you're used to it, but personally I wince every time I look in a method and see this stuff instead of normal code. Personally I only use Pandas if I absolutely need it for performance, as a last resort. |
Like if we have a dataclass:
Or SQL: We don’t know what the types are, either, without looking it up.The fact the dataframe often changes halfway through the program is, I think, more to do with the task at hand, that often pandas is being used to perform data transformation (the T in ETL), where some raw data is read in, and the goal is literally to change the structure to clean it up and normalize it, so the data can be ingested into a SQL table in a consistent form with other data points.
But if transformation is not what you are doing, then yes, that might not be the right use of dataframes.