
by brahbrah 1214 days ago

(Taken from an old comment of mine)

If you were to say “pandas in long format only” then yes, that would be correct, but the power of pandas comes from its ability to work in either a long relational style or a wide ndarray style. Pandas was originally written to replace Excel in financial/econometric modeling, not as a replacement for SQL. Models written solely in the long relational style are nearly unmaintainable when they are constantly evolving, have hundreds of data sources and thousands of interactions, and are being developed and tuned by teams of analysts and engineers. For example, this is how some basic operations would look.

Bump prices in March 2023 up 10%:

    # pandas
    prices_df.loc['2023-03'] *= 1.1

    # polars (assumes `import polars as pl` and `from datetime import datetime`)
    polars_df.with_columns(
        pl.when(pl.col('timestamp').is_between(
            datetime(2023, 3, 1),
            datetime(2023, 3, 31),
            closed='both'
        )).then(pl.col('val') * 1.1)
        .otherwise(pl.col('val'))
        .alias('val')
    )

Add expected temperature offsets to the base temperature forecast at the state/county level:

    # pandas
    temp_df + offset_df

    # polars
    (
        temp_df
        .join(offset_df, on=['state', 'county', 'timestamp'], suffix='_r')
        .with_columns(
            (pl.col('val') + pl.col('val_r')).alias('val')
        )
        .select(['state', 'county', 'timestamp', 'val'])
    )

Now imagine thousands of such operations, and you can see the necessity of pandas in models like this.
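
For anyone who wants to run the pandas side, here is a minimal sketch of the wide layout the two one-liners above rely on. The frame contents, dates, and county names are made up for illustration; the only real assumptions are a `DatetimeIndex` on the rows and one column per (state, county) pair.

```python
import pandas as pd

# Hypothetical wide-format data: dates on the rows, one column per (state, county).
idx = pd.date_range("2023-02-27", "2023-03-02", freq="D")
cols = pd.MultiIndex.from_tuples(
    [("TX", "Travis"), ("TX", "Harris")], names=["state", "county"]
)

# Bump prices in March 2023 up 10%: partial-string indexing on the
# DatetimeIndex selects every row that falls in that month.
prices_df = pd.DataFrame(100.0, index=idx, columns=cols)
prices_df.loc["2023-03"] *= 1.1

# Add offsets to a base forecast: `+` aligns on both row and column labels,
# so the offsets may arrive with their columns in a different order.
temp_df = pd.DataFrame([[70.0, 80.0], [71.0, 81.0]], index=idx[:2], columns=cols)
offset_df = pd.DataFrame([[2.0, 1.0], [2.0, 1.0]], index=idx[:2], columns=cols[::-1])
forecast_df = temp_df + offset_df
```

The point of the sketch is that neither operation names a key column: the labels on the index and columns play the role that the explicit join keys play in the polars version.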

3 comments

wenc 1214 days ago

Point taken, but most data wrangling these days, especially at scale, is of the long and thin variety (also known as 3rd normal form or tidy format, which actually allows for more flexibility if you think in terms of coordinatized data theory), where aggregations and joins dominate column operations. (Pandas also allows array-like column operations thanks to its index, but there are other ways to achieve the same thing.)

I typically do the type of column operation in your example only on subsets of data, and I usually do it in SQL using DuckDB. Interop between Polars and DuckDB is virtually zero-cost, so I move seamlessly between the two. To be honest, I don’t remember the last time I needed to do this, but that’s just the nature of my work and not a generalized statement.

But yes, if you are still in a world where you need to perform Excel-like operations, then I agree.


ritchie46 1214 days ago

This is far more elegant in pandas due to the implicit behavior of the index.

But you can move the explicitness of polars behind a function. A more explicit API should not hurt maintainability if we structure our code right.


brahbrah 1214 days ago

So something like this?

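    def add(df1, df2, meta_cols, val_cols=None):
        # default to all non meta cols if val_cols is None
        val_cols = val_cols or [c for c in df1.columns if c not in meta_cols]
        # join on meta cols, then add each val col to its '_r' counterpart
        joined = df1.join(df2, on=meta_cols, suffix='_r')
        summed = [(pl.col(c) + pl.col(c + '_r')).alias(c) for c in val_cols]
        # return df with all meta and val cols selected
        return joined.select([pl.col(meta_cols)] + summed)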

In theory I think that's fine. The problem is that in practice this will cause a lot of visual noise in your models, since for every operation you would need to specify, at least, your meta columns, and potentially value columns too. If you change the dimensionality of your data, you would need to update everywhere you've specified them. You could get around this a bit by defining the meta columns in a constant, but that's really only maintainable at a global module level. Once you start passing dfs around, you'll have to pass the specified columns as packaged data around with the df as well. There's also the problem that you'd need to use functions instead of standard operators.

One thing that would be nice to do is set an (and forgive me, I understand the aversion to the word "index") index on the polars dataframe. Not a real index, just a list of columns that are specified as "metadata columns". This wouldn't actually affect any internal state of the data, but what it would do is affect the path of certain operations. Like if an "index" is set, then `+` does the join from above, rather than the current standard `+` operation.

In any case, I realize this is a major philosophical divergence from the polars way of thinking, so I'm more just shooting the shit than offering real suggestions.


tehf0x 1214 days ago

Now imagine the other side of this equation, where pandas seems too clunky. Behold YOLOPandas (https://pypi.org/project/yolopandas/), i.e. `df.llm.query("What item is the least expensive?")`
