by saeranv 1260 days ago
Does anyone else find the Polars syntax kind of clunky and ambiguous?

For example, from the link, here's how Polars and Pandas handle manipulating data in a subset of a dataframe:

  f = pl.DataFrame({'a': [1,2,3,4,5], 'b':[10,20,30,40,50]})
  # Polars
  f.with_column(
      pl.when(pl.col("a") <= 3)
      .then(pl.col("b") // 10)
      .otherwise(pl.col("b"))
  )
  # Pandas
  f.loc[f['a'] <= 3, "b"] = f['b'] // 10
It's not clear in the Polars approach that the column "b" is being modified. An additional minor nitpick here is the use of when/then/otherwise for their conditional logic. Aren't these just if/else-if/else conditions? It seems more in line with mathematical/Python convention to use if/else... am I missing something?

The Pandas equivalent, on the other hand, is much more concise and more explicit. It also seems more mathematical to me. Polars mutates the dataframe, whereas in Pandas a function is applied to a dataframe indexed like a matrix. Pandas also benefits from its reliance on symbolic notation, which makes everything visually clearer, whereas in Polars the use of pl.col("b") and similar methods leads to multiple nested brackets and redundant naming calls, which hurts interpretability.

I know there's a lot of thought that's been put into Polars, so I assume I'm missing some of the advantages of the Polars approach, and would appreciate anyone who can shed some light on it.

I do understand, and partially agree with, the idea that indexing in Pandas leads to a lot of bugs. But in the example above, Pandas isn't really using indexing; it's using a boolean map to "index" the values from the same dataframe, so it should be fairly robust. Is there a reason why Polars is trying to avoid this kind of filtering in the row/column indices?

5 comments

Polars author here.

> Aren't these just if/else-if/else conditions? It seems more in line with mathematical/Python convention to use if/else... am I missing something?

Yes, they are. But if you look at pandas, `f['a'] <= 3` creates a boolean mask eagerly, on the fly. Pandas has zero chance to do anything clever here.

And yes, `when.then.otherwise` is exactly `if else`, but `if` and `else` are already keywords in Python, so we cannot use them. `when, then, otherwise` are close synonyms.

The benefit of using the `when().then().otherwise()` expression is that it is lazy. We don't do anything until we need to materialize the result. Then the optimizer has a chance to see the query as a whole and determine if the `mask` can be reused, is not needed, should be computed somewhere else, etc.
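
To make the laziness concrete, here is a minimal pure-Python sketch of a deferred `when/then/otherwise` expression. It is not how Polars is implemented: the classes `Expr`, `Col`, `BinOp`, `Conditional` and the `evaluate` method are hypothetical names made up for illustration. Building the expression only records a small tree; data is touched only when `evaluate` is called, which is the point where a real optimizer could inspect or rewrite the whole plan first.

```python
import operator


class Expr:
    """Base class for toy expression nodes. Building a node does no work."""

    def __le__(self, other):
        return BinOp("<=", operator.le, self, other)

    def __floordiv__(self, other):
        return BinOp("//", operator.floordiv, self, other)


class Col(Expr):
    """Reference to a column by name; holds no data."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, data):
        return list(data[self.name])

    def __repr__(self):
        return f"col({self.name!r})"


class BinOp(Expr):
    """Column-versus-scalar operation, recorded but not yet computed."""

    def __init__(self, symbol, fn, left, right):
        self.symbol, self.fn, self.left, self.right = symbol, fn, left, right

    def evaluate(self, data):
        # The right-hand side is always a plain scalar in this sketch.
        return [self.fn(value, self.right) for value in self.left.evaluate(data)]

    def __repr__(self):
        return f"({self.left!r} {self.symbol} {self.right!r})"


class Conditional(Expr):
    """Element-wise if/else over three sub-expressions."""

    def __init__(self, cond, value, fallback):
        self.cond, self.value, self.fallback = cond, value, fallback

    def evaluate(self, data):
        mask = self.cond.evaluate(data)
        yes = self.value.evaluate(data)
        no = self.fallback.evaluate(data)
        return [y if m else n for m, y, n in zip(mask, yes, no)]

    def __repr__(self):
        return f"when({self.cond!r}).then({self.value!r}).otherwise({self.fallback!r})"


class when:
    """Builder mirroring the when/then/otherwise chain (hypothetical)."""

    def __init__(self, cond):
        self.cond = cond

    def then(self, value):
        return _Then(self.cond, value)


class _Then:
    def __init__(self, cond, value):
        self.cond, self.value = cond, value

    def otherwise(self, fallback):
        return Conditional(self.cond, self.value, fallback)


# Building the expression touches no data; it is just a small tree.
expr = when(Col("a") <= 3).then(Col("b") // 10).otherwise(Col("b"))
plan = repr(expr)  # the plan can be inspected before anything runs

# Work happens only when the expression is evaluated against data.
data = {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]}
result = expr.evaluate(data)
print(plan)
print(result)
```

Because `expr` is an ordinary object, it can be printed, compared, or rewritten before any column is read, which an eagerly computed boolean mask does not allow.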

> Polars mutates the dataframe,

Almost all polars methods are pure. No dataframe is mutated; a new dataframe is created instead.

> Is there a reason why Polars is trying to avoid this kind of filtering in the row/column indices.

Yes there is. Ambiguity. I want things to be explicit. So the method names should make clear that you are selecting rows:

`df.filter`

or selecting columns:

`df.select`

or slicing

`df.slice`

In pandas this can all be done with bracket notation. I often read code something like this

`df[foo] = bar` and wondered what kind of datatype was stored into `foo`.

Indexes have the same reading complexity. I often read/saw queries that showed a different outcome after a `reset_index` call. I like things to be more explicit. This may cost some keystrokes, but future me/us can more easily understand what is going on.

> Yes, they are. But if you look at pandas, `f['a'] <= 3` creates a boolean mask eagerly, on the fly. Pandas has zero chance to do anything clever here.

Isn't this just an implementation detail? It seems like it wouldn't be tough to turn this into syntactic sugar rather than a forced eager evaluation. I.e., `f['a'] <= 3` could just as easily evaluate into a computation graph rather than the evaluation of that graph. For example, I could imagine something like so:

```
from polars.dataframe import LazyDataFrame, DataFrame

def fn():
  ...
  ldf = LazyDataFrame(df)
  # this mutates the computation graph but doesn't evaluate
  ldf.loc[f['a'] <= 3, "b"] = f['b']
  df = DataFrame(ldf)
  return df
```

This is a toy example so I'm not sure if the part around evaluation makes complete sense, but it seems like how pandas eagerly evaluates the frame is a shortcoming of its implementation and model, rather than the syntactic sugar itself.
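
The idea can be sketched in pure Python to show that bracket syntax and lazy evaluation are not mutually exclusive. Everything here is hypothetical and made up for illustration (`ToyLazyFrame`, `ColumnExpr`, `_LocRecorder`, `collect`); it is neither Polars nor pandas API. The comparison and the `.loc` assignment only record a step, and work happens when `collect()` is called.

```python
class ColumnExpr:
    """Deferred column expression: a column name plus a per-value transform."""

    def __init__(self, name, transform=lambda value: value):
        self.name = name
        self.transform = transform

    def __le__(self, threshold):
        # Return a description of the comparison, not a list of booleans.
        return ColumnExpr(self.name, lambda value: self.transform(value) <= threshold)

    def __floordiv__(self, divisor):
        return ColumnExpr(self.name, lambda value: self.transform(value) // divisor)

    def compute(self, data):
        return [self.transform(value) for value in data[self.name]]


class _LocRecorder:
    """Turns `frame.loc[mask, column] = value` into a recorded step."""

    def __init__(self, frame):
        self._frame = frame

    def __setitem__(self, key, value):
        mask, column = key
        self._frame.steps.append((mask, column, value))


class ToyLazyFrame:
    def __init__(self, data):
        self._data = {name: list(values) for name, values in data.items()}
        self.steps = []  # the recorded "computation graph" (a plain list here)
        self.loc = _LocRecorder(self)

    def __getitem__(self, name):
        return ColumnExpr(name)

    def collect(self):
        """Run every recorded step on a copy and return plain columns."""
        out = {name: list(values) for name, values in self._data.items()}
        for mask, column, value in self.steps:
            keep = mask.compute(out)
            new = value.compute(out)
            out[column] = [n if k else o for k, n, o in zip(keep, new, out[column])]
        return out


ldf = ToyLazyFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})
# Pandas-style syntax, but this only records a step; nothing is evaluated.
ldf.loc[ldf["a"] <= 3, "b"] = ldf["b"] // 10
pending = len(ldf.steps)
# Evaluation happens here, when the result is requested.
result = ldf.collect()
print(pending, result)
```

The sketch says nothing about whether such an API would be less ambiguous to read; it only shows that the eager mask is a property of the implementation, not of the bracket syntax.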

To be even more specific, this is the way SQLAlchemy does it. You could have something like this:

```
from models import Contact

def fn():
  ...
  # doesn't evaluate; could trivially be done as Contact[Contact.name == 'John']
  filtered_contact_exp = Contact.filter(Contact.name == 'John')
  # actually evaluates
  filtered_contacts = filtered_contact_exp.all()
  return filtered_contacts
```

And SQLAlchemy knows not to actually trigger the evaluation until you do something like `.all()`. Why not adopt this kind of pattern with Polars?

> Does anyone else find the Polars syntax kind of clunky and ambiguous?

I’ve used pandas a lot, but I’ve come to the opposite conclusion.

In my experience, these pandas expressions end up being bracket soup, and become increasingly fragile to hold in your head while you try and figure out just which n rows and columns you’re looking at.

Couple that with pandas’ opaqueness around copy-vs-view and the blurring of lines between APIs for selection and APIs for mutation, and you get an unpleasant experience.

This particular pandas example is simpler, but it doesn’t take much IME for pandas df’s to end up far more unreadable.

I’ll gladly take polars’ saner API if it means I don’t have to play “data frame lisp bracket-matching” games ever again.

Started exploring machine learning in Python 6 months ago. Despite all the resources for learning Pandas, I could never get to a point where it seemed coherent. It felt like a grab bag of tricks that accomplished various different jobs. Polars, on the other hand, felt really consistent and logical. Instead of having to google how to do something in Pandas, I could generally just figure out how to do it by combining the simpler operations that Polars provides.

Yeah, tried Polars a couple of times: the API seems worse than Pandas to me too. E.g., the decision to support only autoincrementing integer indexes seems like it would make debugging "hmmm, that answer is wrong, what exactly did I select?" bugs much more annoying. The Polars docs have "blazingly fast" written all over them, but I doubt that is a compelling point for people using single-node dataframe libraries. It isn't for me.

Modin (https://github.com/modin-project/modin) seems more promising at this point, particularly since a migration path for standing Pandas code is highly desirable.

To me it seems both Pandas and Polars sacrifice API quality for performance, just using different approaches to achieve that performance and thus ending up with differently bad APIs. There's obviously some amount of tradeoff there and no shame in tilting the scale in one direction, though it would be refreshing to be upfront and honest with users about that.

Additionally, Pandas seems to be an organically grown API. These days, with more experience and more data frame implementations to learn from, it should be possible to do better, something I only partially see when looking at Polars.