by ritchie46 1260 days ago
Polars author here.

> Aren't these just if/else-if/else conditions? It seems more in line with mathematical/python convention to use if/else... am I missing something?

Yes, they are. But if you look at pandas, `f['a'] <= 3` creates a boolean mask eagerly, on the fly. Pandas has zero chance to do anything clever here.

And yes, `when.then.otherwise` is exactly `if else`, but `if` and `else` are already keywords in Python, so we cannot use them. `when`, `then`, and `otherwise` are close synonyms.

The benefit of using the `when().then().otherwise()` expression is that it is lazy. We don't do anything until we need to materialize the result. The optimizer then has a chance to see the query as a whole and determine whether the `mask` can be reused, is not needed, should be computed somewhere else, etc.

> Polars mutates the dataframe,

Almost all Polars methods are pure. No dataframe is mutated; a new dataframe is created instead.

> Is there a reason why Polars is trying to avoid this kind of filtering in the row/column indices.

Yes there is. Ambiguity. I want things to be explicit. So the method names should make clear that you are selecting rows:

`df.filter`

or selecting columns:

`df.select`

or slicing:

`df.slice`

In pandas this can all be done with bracket notation. I often read code like this:

`df[foo] = bar` and wondered what kind of datatype was stored in `foo`.

Indexes have the same readability problem. I often saw queries that produced a different outcome after a `reset_index` call. I like things to be more explicit. This may cost some keystrokes, but future me/us can more easily understand what is going on.

1 comment

> Yes, they are. But if you look at pandas, `f['a'] <= 3` creates a boolean mask eagerly, on the fly. Pandas has zero chance to do anything clever here.

Isn't this just an implementation detail? It seems like it wouldn't be tough to turn this into syntactic sugar rather than a forced eager evaluation. I.e., `f['a'] <= 3` could just as easily evaluate to a computation graph rather than to the result of evaluating that graph. For example, I could imagine something like so:

```
# hypothetical API for illustration; these classes are not actual Polars
from polars.dataframe import LazyDataFrame, DataFrame

def fn():
  ...
  ldf = LazyDataFrame(df)
  # this mutates the computation graph but doesn't evaluate
  ldf.loc[f['a'] <= 3, "b"] = f['b']
  df = DataFrame(ldf)
  return df
```

This is a toy example so I'm not sure if the part around evaluation makes complete sense, but it seems like how pandas eagerly evaluates the frame is a shortcoming of its implementation and model, rather than the syntactic sugar itself.
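
To make the idea concrete, here is a minimal pure-Python sketch of operator overloading that builds an expression node instead of a mask. All names here (`Col`, `Compare`, `LazyFrame`, `collect_mask`) are hypothetical and belong to neither pandas nor Polars:

```python
import operator


class Col:
    """A reference to a column; creating it reads no data."""

    def __init__(self, name):
        self.name = name

    def __le__(self, other):
        # Record the comparison instead of performing it.
        return Compare(self, operator.le, other)


class Compare:
    """A deferred comparison node in the expression graph."""

    def __init__(self, col, op, value):
        self.col = col
        self.op = op
        self.value = value

    def evaluate(self, data):
        return [self.op(x, self.value) for x in data[self.col.name]]


class LazyFrame:
    """Toy frame whose bracket access yields expressions, not data."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, name):
        return Col(name)

    def collect_mask(self, expr):
        # Evaluation only happens when explicitly requested.
        return expr.evaluate(self._data)


f = LazyFrame({"a": [1, 2, 3, 4, 5]})
mask_expr = f["a"] <= 3            # a Compare node, not a list of booleans
mask = f.collect_mask(mask_expr)   # the comparison runs here
```

The bracket-and-operator syntax stays the same as in pandas; only the point at which the work happens moves.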

To be even more specific, this is the way SQLAlchemy does it. You could have something like this:

```
from models import Contact

def fn():
  ...
  # doesn't evaluate; could trivially be done as Contact[Contact.name == 'John']
  filtered_contact_exp = Contact.filter(Contact.name == 'John')
  # actually evaluates
  filtered_contacts = filtered_contact_exp.all()
  return filtered_contacts
```

And SQLAlchemy knows not to actually trigger the evaluation until you do something like `.all()`. Why not adopt this kind of pattern with Polars?