by RadiozRadioz 586 days ago

As for the transformation thing, I'm totally on board with the need to transform data. What I'm specifically objecting to is the pattern of changing a variable's type during the program, which is extremely common in Pandas code. That is, reassigning the same variable with a value that has a different structure or types.

Here's a really common example where we select a subset of fields:

    df = ...
    df = df[["field1", "field2"]]

The DF has been transformed to have fewer columns than it did previously. Represented as types, it went from List[Dict[Literal["field1","field2","field3"], int]] to List[Dict[Literal["field1","field2"], int]]. We can no longer rely on field3 existing in the DF. Because this one variable has two possible types depending on where you are in the program, it's much harder to reason about.
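To make that type shift concrete, here's a minimal sketch that models the frame as a plain list of dicts, matching the type notation above, rather than as a real DataFrame. The names WideRow, NarrowRow and select_fields are made up for illustration.

```python
from typing import List, TypedDict


class WideRow(TypedDict):
    field1: int
    field2: int
    field3: int


class NarrowRow(TypedDict):
    field1: int
    field2: int


def select_fields(rows: List[WideRow]) -> List[NarrowRow]:
    # Hypothetical stand-in for df[["field1", "field2"]]: keep two of three fields.
    return [{"field1": r["field1"], "field2": r["field2"]} for r in rows]


rows: List[WideRow] = [{"field1": 1, "field2": 2, "field3": 3}]
has_field3_before = "field3" in rows[0]

# Same name, different structure. A static checker would normally flag this
# reassignment because the declared type of `rows` no longer matches.
rows = select_fields(rows)  # type: ignore[assignment]
has_field3_after = "field3" in rows[0]
```

With explicit row types the reassignment is something a checker can complain about. A variable annotated only as a DataFrame carries no such signal, since both values are DataFrames.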

The transformation itself is totally valid; it's the manner in which it happens that I find makes the code harder to follow. And this is the manner of transformation I find most commonplace in Pandas practice. We could instead do the following, but I don't see it much:

    df = ...
    df_limited_fieldset = df[["field1", "field2"]]

And even in this case, to infer the structure of df_limited_fieldset, you need to determine the structure of df and then apply a transformation to it, unless you explicitly document its new structure somehow. With dataclasses, df_limited_fieldset would contain instances of an entirely new dataclass, stating its new type.
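Here's a rough sketch of what I mean by the dataclass version. Record, LimitedRecord and limit_fields are hypothetical names, and the int fields are assumed from the type notation earlier.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    field1: int
    field2: int
    field3: int


@dataclass(frozen=True)
class LimitedRecord:
    # The reduced structure is written down once, here, instead of being
    # inferred by replaying the transformation in your head.
    field1: int
    field2: int


def limit_fields(records: List[Record]) -> List[LimitedRecord]:
    # The signature documents both the input and the output structure.
    return [LimitedRecord(r.field1, r.field2) for r in records]


records = [Record(field1=1, field2=2, field3=3)]
limited = limit_fields(records)
```

The original records are untouched and still have field3, while limited holds a separate type that never had it, so accessing limited[0].field3 is an error a type checker can catch before the program runs.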

None of this is to say that abuse of dynamic types doesn't happen in normal Python, it totally does, but I've found these patterns to be so ingrained in Pandas culture / common practice that I'm comfortable characterising them as part of the nature of the tool.