|
Does anyone else find the Polars syntax kind of clunky and ambiguous? For example, from the link, here's how Polars and Pandas handle manipulating data in a subset of a dataframe:
f = pl.DataFrame({'a': [1,2,3,4,5], 'b':[10,20,30,40,50]})
# Polars
f.with_column(
    pl.when(pl.col("a") <= 3)
    .then(pl.col("b") // 10)
    .otherwise(pl.col("b"))
)
# Pandas
f.loc[f['a'] <= 3, "b"] = f['b'] // 10
It's not clear in the Polars approach that the column "b" is being modified. An additional minor nitpick here is the use of when/then/otherwise for their conditional logic. Aren't these just if/else-if/else conditions? It seems more in line with mathematical/python convention to use if/else... am I missing something? The Pandas equivalent, on the other hand, is much more concise and more explicit. It also seems more mathematical to me. Polars mutates the dataframe, whereas in Pandas a function is applied to a dataframe indexed like a matrix. Pandas also benefits from its reliance on symbolic notation, which makes everything visually clearer, whereas in Polars the use of pl.col("b") and similar methods leads to multiple nested brackets and redundant naming calls, which makes the code harder to interpret. I know a lot of thought has been put into Polars, so I assume I'm missing some of the advantages of the Polars approach, and would appreciate anyone who can shed some light on it. I do understand, and partially agree with, the idea that indexing in Pandas leads to a lot of bugs. But in the example above, Pandas isn't really using indexing; it's using a boolean map to "index" the values from the same dataframe, so it should be fairly robust. Is there a reason why Polars is trying to avoid this kind of filtering in the row/column indices? |
> Aren't these just if/else-if/else conditions? It seems more in line with mathematical/python convention to use if/else... am I missing something?
Yes, they are. But look at the pandas version: with `f['a'] <= 3`, a boolean mask is created eagerly, on the fly. Pandas has zero chance to do anything clever here.
And yes, `when.then.otherwise` is exactly `if else`, but `if` and `else` are already keywords in Python, so we cannot use them as method names. `when`, `then`, and `otherwise` are close synonyms.
The benefit of using the `when().then().otherwise()` expression is that it is lazy. We don't do anything until we need to materialize the result. Then the optimizer has a chance to see the query as a whole and determine whether the `mask` can be reused, is not needed, should be computed somewhere else, etc.
> Polars mutates the dataframe,
Almost all Polars methods are pure. No dataframe is mutated; a new dataframe is created instead.
> Is there a reason why Polars is trying to avoid this kind of filtering in the row/column indices.
Yes, there is: ambiguity. I want things to be explicit. So the method names should make clear that you are selecting rows:
`df.filter`
or selecting columns:
`df.select`
or slicing:
`df.slice`
In pandas this can all be done with bracket notation. I often read code like `df[foo] = bar` and wondered what kind of datatype was stored in `foo`.
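A short sketch of why the type of `foo` matters in pandas: the same bracket syntax selects a column, several columns, rows by mask, or rows by position, depending on what is passed in. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})

col = df["a"]                    # a string selects one column (a Series)
cols = df[["a", "b"]]            # a list of strings selects columns (a DataFrame)
rows_by_mask = df[df["a"] <= 3]  # a boolean Series selects rows
rows_by_slice = df[1:3]          # an integer slice selects rows by position

print(type(col).__name__)           # Series
print(rows_by_mask["a"].tolist())   # [1, 2, 3]
print(rows_by_slice["a"].tolist())  # [2, 3]
```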
Indexes have the same reading complexity. I often saw queries that showed a different outcome after a `reset_index` call. I like things to be more explicit. This may cost some keystrokes, but future me/us can more easily understand what is going on.
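One way the `reset_index` effect can show up, sketched with made-up data: pandas aligns on index labels when a Series is assigned as a column, so the same assignment gives different results before and after the index is reset.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
extra = pd.Series([100, 200, 300])  # default index labels 0, 1, 2

# Filtering keeps the original labels 2, 3, 4.
subset = df[df["a"] > 2].copy()
subset["c"] = extra  # aligned by label: only label 2 has a match
before = subset["c"].tolist()

# After reset_index the labels are 0, 1, 2, so every row finds a match.
subset = subset.reset_index(drop=True)
subset["c"] = extra
after = subset["c"].tolist()

print(before)  # [300.0, nan, nan]
print(after)   # [100, 200, 300]
```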