by devin-petersohn 1646 days ago
This was my PhD focus. We identified a core "dataframe algebra"[1] that encompasses all of pandas (and R/S data.frames): 16 operators in total that cover all 600+ pandas operators. What you describe was exactly our aim. It turns out that many operators are easy to support and make fast, and those get you about 60% of the way to supporting all of pandas. Then there are the really complex operators, which may alter the schema in a way that cannot be determined before the operation is carried out (think of a row-wise or column-wise `df.apply`). The flexibility that pandas offers is something we were able to express mathematically, and with that math we can start to optimize the dataframe holistically, rather than chipping away at the small parts of pandas that are embarrassingly parallel.
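
To make the schema point concrete, here is a rough sketch in plain Python rather than pandas itself. The names `apply_rows`, `expand`, and `output_schema` are made up for illustration, and a list of dicts stands in for a dataframe. It only shows why the output columns of a row-wise apply cannot be known without running the function on the data.

```python
# Hypothetical stand-in for a dataframe: a list of row dicts.
# None of these helpers are pandas APIs; they are illustrative only.

def apply_rows(rows, func):
    """Row-wise apply: call func on every row and collect the results."""
    return [func(row) for row in rows]


def output_schema(rows):
    """Collect the column names of the result in first-seen order."""
    cols = []
    for row in rows:
        for name in row:
            if name not in cols:
                cols.append(name)
    return cols


def expand(row):
    # The columns this function produces depend on a *value* in the row,
    # so the output schema is unknown until the data has been processed.
    if row["kind"] == "point":
        return {"x": row["a"], "y": row["b"]}
    return {"label": str(row["a"])}


points_only = [{"kind": "point", "a": 1, "b": 2}]
mixed = points_only + [{"kind": "text", "a": 3, "b": 4}]

# Same function, same input schema, different output schemas.
schema_points = output_schema(apply_rows(points_only, expand))
schema_mixed = output_schema(apply_rows(mixed, expand))
print(schema_points)  # ['x', 'y']
print(schema_mixed)   # ['x', 'y', 'label']
```

Both inputs have the same columns (`kind`, `a`, `b`), yet the result has two columns in one case and three in the other, which is the kind of operator an optimizer cannot plan around from the schema alone.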

Most dataframe libraries cannot architecturally support the entire dataframe algebra and data model because they are optimized for specific use cases (which is not a bad thing). This can be frustrating for users, who may have no idea what a given tool can actually do just because it is called a "dataframe", but I don't know how to fix that.

[1] https://arxiv.org/pdf/2001.00888

2 comments

Awesome work, thanks!
This is really cool! Thx for sharing