Hacker News new | ask | show | jobs
by valw 1465 days ago
Btw, am I alone in thinking that DataFrame abstractions in OOP languages (like Pandas in Python) are oftentimes simply inferior to relational algebra? I'm not sure that many Data Scientists are aware of the expressive power of SQL.
4 comments

There are loads of things that are not possible or are very cumbersome to write in SQL, but that pandas and many other dataframe systems allow. Examples are dropping null values based on some threshold, one-hot encoding, covariance, and certain data cleaning operations. These are possible in SQL but very cumbersome to write. There are also things that are outright impossible in a relational database related to metadata manipulation.

SQL is super expressive, but I think pandas gets a bad rap. At it's core the data model and language can be more expressive than relational databases (see [1]).

I co-authored a paper that explained these differences with a theoretical foundation[1].

[1] https://arxiv.org/abs/2001.00888

Thanks for sharing this. I believe we essentially agree: chaining method calls is inexpressive compared to composing expressions in an algebraic language.
I'm not defending Pandas but just want to point out that the inability to conveniently compose expressions is one of the biggest problems with SQL, since it was designed to be written as a sort of pseudo-English natural language, in an era when people imagined that it would be used by non-programmers. To be clear, that's a problem with SQL, not with the idea of a language based on relational algebra. There are various attempts to create SQL-alternatives which behave like real programming languages in terms of e.g. composability. This blog post makes the point better than I can:

https://opensource.googleblog.com/2021/04/logica-organizing-...

I absolutely agree - one of the biggest shortcomings of SQL is that its primary programming interface is based on text and intended for human, instead of being based on data structures and intended for programs.
SQL does not exactly implement relational algebra in its pure form.

SQL implements a kind of set theory with relational elements and a bunch of practical features like pivots, window functions etc.

Pandas does the same. Most data frame libraries like dplyr etc. implement a common set of useful constructs. There’s not much difference in expressiveness. LINQ Is another language around manipulating sets that was designed with the help of category theory, and it arrives at the same constructs.

However SQL is declarative, which provides a path for query optimizers to parse and create optimized plans. Whereas with chained methods, unless one implements lazy evaluation one misses out on look aheads and opportunities to do rewrites.

> There’s not much difference in expressiveness

> However SQL is declarative

Pick one :) the way I see it, if declarativeness is not a factor in assessing expressiveness, then expressiveness reduces to the uninteresting notion of Turing-equivalence.

Expressiveness and declarativeness are different things, no?

Are you talking about aesthetics? I’ve used SQL for 20 years and it’s elegant in parts but it also has warts. I talk about this elsewhere but SQL gets repetitive and requires multi layer CTEs to express certain simple aggregations.

Agree. I've completed data pipelines for several projects and have found that the cleanest, and often fastest solution is to use SQL to structure the data as needed. This is anecdotal and I'm not an expert with SQL, but I haven't come across a situation where R or Pandas dataframes worked better than a well written query for data manipulation. This has the benefit of simplifying collaboration across teams because within my company not everyone uses the same toolset for analysis, but we all have access to the same database. Other tools are better suited to analysis or expansion of the data with input from other sources, but within our own data SQL wins.
Often -- yes. Always -- no.

For example let's try changing/fixing sampling rate of a dataset (.resample() in Pandas).

Or something like .cumsum() -- easy with SQL windowing functions, but man they are cumbersome.

Or quickly store the result in .parquet.

But all the above doesn't matter, because I feel like 99% of Pandas work involves quickly drawing charts on the data look at it or show to teammates.