by RadiozRadioz 586 days ago
> reviewing the code is a breeze

I have the opposite opinion. In a previous codebase I fought hard to use dataclasses & type hinting where possible over dictionaries, because with dictionaries you'd never know what type anything was, or what keys were present. That worked nicely and it was much easier to understand the codebase.
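
To illustrate the difference (a minimal sketch; the `User` record and its fields are made up for this example, not taken from that codebase):

```python
from dataclasses import dataclass

# With a plain dict, the keys and value types are implicit:
# you only find out what is in there by reading the code that built it.
raw_user = {"name": "Ada", "age": 36}


# With a dataclass, the structure is stated once, in one place.
@dataclass
class User:
    name: str
    age: int


def describe(u: User) -> str:
    # The signature tells the reader (and a type checker) what `u` contains.
    return f"{u.name} is {u.age}"


user = User(name=raw_user["name"], age=raw_user["age"])
summary = describe(user)
```

The class definition doubles as the reference for what fields exist and what types they have.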

Now I've been put on a Pandas project and it's full of mysterious

    df = df[df["thing"] == "value"]

I just feel like we've gone back to the unreadability of dictionaries.

Everything's just called "df", you never know what type anything is without going in and checking, the structure of the frames is completely opaque, and that structure gets changed halfway through the program. Type hinting these things is much harder than TypedDict/dataclass, at least if you want to do it correctly & unambiguously. It's practically a requirement to shove this stuff in a debugger/REPL because you'd have no chance otherwise.

Sure, the argument is that I'm just in a bad Pandas codebase, and it can be done much better. However, what I take issue with is that this seems to be the overwhelming "culture" of Pandas. All Pandas code I've ever read is like this. If you look at tutorials or examples online, you see the same stuff. They all call everything the same name and program in the most dynamic & opaque fashion possible. Sure it's quick to write, and if you love Pandas you're used to it, but personally I wince every time I look in a method and see this stuff instead of normal code.

Personally I only use Pandas if I absolutely need it for performance, as a last resort.

Is it just the generic, non-descriptive naming, or what do you think is the root of your distaste for pandas?

Like if we have a dataclass:

    obj.thing == value

Or SQL:

    SELECT * FROM table WHERE thing = 'value'

We don’t know what the types are, either, without looking it up.

The fact that the dataframe often changes halfway through the program has, I think, more to do with the task at hand: pandas is often being used to perform data transformation (the T in ETL), where some raw data is read in and the goal is literally to change its structure, cleaning it up and normalizing it so the data can be ingested into a SQL table in a form consistent with other data points.

But if transformation is not what you are doing, then yes, that might not be the right use of dataframes.

With the dataclass I can look at the class definition, with SQL I can look at the database schema in git, or at the very least log in and `DESCRIBE table`. With Pandas I can find where the dataframe is defined, but then I need to walk through any structural transformations made to it and keep track of its structure in my head. Alternatively I can run the Pandas program in a debugger, set a breakpoint and inspect the dataframe.

With all three you need to do some work, but I find the Pandas one more involved because you don't have an authoritative "reference", just an initial state and then some transformations. With the Pandas example I have to run the program (in my head or for real). The program might need to pull in test data (hopefully some has been provided). The worst is when the structure of the DF is derived from the source data rather than stated in code (e.g. reading in a CSV). It's much more to do than looking at a class definition or declarative database schema; there's a "sequence" to it, there are transformation steps happening to the DF that I need to keep track of.

As for the transformation thing, I'm totally on board with the need to transform data. What I'm specifically objecting to is the pattern of changing a variable's type during the program, which is extremely common in Pandas code. That is, reassigning the same variable with a value that has a different structure or types.

Here's a really common example where we select a subset of fields:

    df = ...
    df = df[["field1", "field2"]]

The DF has been transformed to have fewer columns than it did previously. Represented as types, it went from List[Dict[Literal["field1","field2","field3"], int]] to List[Dict[Literal["field1","field2"], int]]. We can no longer rely on field3 existing in the DF. Because this one variable has two possible types depending on where you are in the program, it's much harder to reason about.
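
Written out as a rough sketch, with plain lists of dicts standing in for the DF and the row shapes expressed as `TypedDict`s (the names `FullRow`, `LimitedRow` and `limit_fields` are hypothetical, invented for this illustration), the two structures look like this:

```python
from typing import List, TypedDict


class FullRow(TypedDict):
    # Shape of each row before the column selection.
    field1: int
    field2: int
    field3: int


class LimitedRow(TypedDict):
    # Shape of each row after the column selection: field3 is gone.
    field1: int
    field2: int


def limit_fields(rows: List[FullRow]) -> List[LimitedRow]:
    # Stand-in for df[["field1", "field2"]] on a list of dicts.
    return [LimitedRow(field1=r["field1"], field2=r["field2"]) for r in rows]


rows: List[FullRow] = [{"field1": 1, "field2": 2, "field3": 3}]
limited = limit_fields(rows)
```

With two named types, a type checker can flag an access like `limited[0]["field3"]`. With a single reassigned `df`, there is nothing for it to check against.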

This is a totally valid way to transform the data, but the manner in which the transformation is happening, I find, makes the code harder to reason about. And this is the manner of transformation I find most commonplace in Pandas practice. We could instead do the following, but I don't see it much:

    df = ...
    df_limited_fieldset = df[["field1", "field2"]]

And even in this case, to infer the structure of df_limited_fieldset, you need to determine the structure of df and then apply a transformation to it, unless you explicitly document its new structure somehow. With dataclasses, df_limited_fieldset would contain instances of an entirely new dataclass, stating its new type.
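
A minimal sketch of what I mean by the dataclass version, assuming all three fields are ints as in the type example (`Record`, `LimitedRecord` and `to_limited` are names invented here):

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    field1: int
    field2: int
    field3: int


@dataclass(frozen=True)
class LimitedRecord:
    # The reduced structure is stated explicitly, not inferred from a transformation.
    field1: int
    field2: int


def to_limited(records: List[Record]) -> List[LimitedRecord]:
    # The return type documents the new shape at the point of transformation.
    return [LimitedRecord(field1=r.field1, field2=r.field2) for r in records]


records = [Record(1, 2, 3), Record(4, 5, 6)]
limited_fieldset = to_limited(records)
```

Here you can read `LimitedRecord` to learn the structure of `limited_fieldset` without tracing how `records` was built.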

None of this is to say that abuse of dynamic types doesn't happen in normal Python, it totally does, but I've found these patterns to be so ingrained in Pandas culture / common practice that I'm comfortable characterising them as part of the nature of the tool.

Do we work at the same company?

You put it much better than I could have. Do you know if Polars solves the problem of having opaque, mutable objects everywhere at all? I feel like there's a good market for a dataframe library that's easier to reason about in your editor. It could even be a wrapper around pandas that adds rich typing, sort of the way FastAPI does with Pydantic for Starlette.

With Polars you use `df.select()` or `df.with_columns()` which return "new" DataFrames - so you don't have mutable objects everywhere.

There is an SO answer[1] by the Polars author which may have some relevance.

[1]: https://stackoverflow.com/questions/73934129/