Hacker News
by halfcat 586 days ago
Is it just the generic, non-descriptive naming, or what do you think is the root of your distaste for pandas?

Like if we have a dataclass:

    obj.thing == value
Or SQL:

    SELECT * FROM table WHERE thing = 'value'
We don’t know what the types are, either, without looking it up.

The fact that the dataframe often changes halfway through the program has more to do, I think, with the task at hand. Pandas is often used for data transformation (the T in ETL): some raw data is read in, and the goal is literally to change its structure, cleaning and normalizing it so it can be ingested into a SQL table in a form consistent with other data points.

But if transformation is not what you are doing, then yes, that might not be the right use of dataframes.

2 comments

With the dataclass I can look at the class definition; with SQL I can look at the database schema in git, or at the very least log in and run `DESCRIBE table`. With Pandas I can find where the dataframe is defined, but then I need to walk through any structural transformations made to it and keep track of its structure in my head. Alternatively I can run the Pandas program in a debugger, set a breakpoint and inspect the dataframe.

With all three you need to do some work, but I find the Pandas one more involved because you don't have an authoritative "reference", just an initial state and then some transformations. With Pandas I have to run the program (in my head or for real). The program might need to pull in test data (hopefully some has been provided). The worst is when the structure of the DF is derived from the source data rather than stated in code (e.g. reading in a CSV). It's much more work than looking at a class definition or a declarative database schema; there's a "sequence" to it, a series of transformation steps applied to the DF that I need to keep track of.
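To illustrate the CSV case, here's a rough sketch of the difference between a structure that is derived from the data and one that is stated in code. The CSV contents and column names are made up for the example; `usecols` and `dtype` are existing `pd.read_csv` parameters, but this is only one way to pin the structure down, not something I see done consistently.

```python
import io

import pandas as pd

# Hypothetical CSV contents, inlined so the example is self-contained.
raw_csv = "field1,field2,field3\n1,2.5,x\n3,4.0,y\n"

# Structure derived from the source data: columns and dtypes are whatever
# pandas infers from the file, so the code alone doesn't tell you what df is.
inferred = pd.read_csv(io.StringIO(raw_csv))

# Structure stated in code: the columns and dtypes are visible at the call
# site, and a value that can't be converted raises an error at read time.
declared = pd.read_csv(
    io.StringIO(raw_csv),
    usecols=["field1", "field2"],
    dtype={"field1": "int64", "field2": "float64"},
)

declared_columns = list(declared.columns)
declared_dtypes = {name: str(dtype) for name, dtype in declared.dtypes.items()}
```

Even with the second form you only know the starting structure; any later transformations still have to be followed by hand.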

As for the transformation thing, I'm totally on board with the need to transform data. What I'm specifically objecting to is the pattern of changing a variable's type during the program, which is extremely common in Pandas code. That is, reassigning the same variable with a value that has a different structure or types.

Here's a really common example where we select a subset of fields:

    df = ...
    df = df[["field1", "field2"]]
The DF has been transformed to have fewer columns than it did previously. Represented as types, it went from List[Dict[Literal["field1","field2","field3"], int]] to List[Dict[Literal["field1","field2"], int]]. We can no longer rely on field3 existing in the DF. Because this one variable has two possible types depending on where you are in the program, it's much harder to reason about.
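A small runnable sketch of what that looks like in practice, assuming a made-up three-column frame (the values are placeholders):

```python
import pandas as pd

# Hypothetical starting frame with three integer columns.
df = pd.DataFrame({"field1": [1, 2], "field2": [3, 4], "field3": [5, 6]})
columns_before = list(df.columns)

# Same variable name, narrower structure from this line onwards.
df = df[["field1", "field2"]]
columns_after = list(df.columns)

# Code further down that still expects field3 only fails at runtime.
try:
    df["field3"]
    field3_available = True
except KeyError:
    field3_available = False
```

Nothing at the point of use tells you which of the two shapes `df` has; you have to know which side of the reassignment you are on.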

This is a totally valid way to transform the data, but the manner in which the transformation is happening, I find, makes the code harder to reason about. And this is the manner of transformation I find most commonplace in Pandas practice. We could instead do the following, but I don't see it much:

    df = ...
    df_limited_fieldset = df[["field1", "field2"]]
And even in this case, to infer the structure of df_limited_fieldset, you need to determine the structure of df and then apply a transformation to it, unless you explicitly document its new structure somehow. With dataclasses, df_limited_fieldset would contain instances of an entirely new dataclass, stating its new type.
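For comparison, a minimal sketch of the dataclass version. The class and function names here (`FullRecord`, `LimitedRecord`, `limit_fields`) are hypothetical, picked just for the example:

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FullRecord:
    field1: int
    field2: int
    field3: int


@dataclass(frozen=True)
class LimitedRecord:
    # The narrower structure is written down as its own type.
    field1: int
    field2: int


def limit_fields(records: list[FullRecord]) -> list[LimitedRecord]:
    """Select a subset of fields, returning instances of a new type."""
    return [LimitedRecord(field1=r.field1, field2=r.field2) for r in records]


records = [FullRecord(1, 2, 3), FullRecord(4, 5, 6)]
limited = limit_fields(records)
limited_field_names = [f.name for f in fields(LimitedRecord)]
```

The structure of `limited` can be read straight off the `LimitedRecord` definition and the function signature, without replaying the transformation.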

None of this is to say that abuse of dynamic types doesn't happen in normal Python, it totally does, but I've found these patterns to be so ingrained in Pandas culture / common practice that I'm comfortable characterising them as part of the nature of the tool.