Hacker News new | ask | show | jobs
by bradford 746 days ago
I view SQL right now similar to the way I viewed C++ in early 2000s:

I hate it, but there's little point in complaining because it's so ubiquitous.

More robust criticism is provided here (https://carlineng.com/?postid=sql-critique#blog), which pulls on an interview here (https://www.red-gate.com/simple-talk/opinion/opinion-pieces/...) The quote I usually drag out is from Chris Date, who helped pioneer relational DBs:

"At the same time, I have to say too that we didn’t realize how truly awful SQL was or would turn out to be (note that it’s much worse now than it was then, though it was pretty bad right from the outset)."

As an example of a language that does it better, I think kusto-query-language (KQL, https://learn.microsoft.com/en-us/azure/data-explorer/kusto/...) has been a dream to work with. (disclaimer, Kusto is a Microsoft product, and I'm a Microsoft employee).

4 comments

I kind of disagree; the problem with SQL is that fundamentally it's actually pretty good. So alternatives either tend to be too radical (throwing out the baby with the bathwater) or simply not enough of an improvement to gain any momentum.

I feel like rational database querying is effectively solved and there's little point in re-litigating it. But still I'd be happy to switch to the perfect replacement if someone develops it.

What does "good" mean in this context? SQL is not modular, most features are highly context-dependent and there numerous handguns.

Sql might be ok for trivial things, as in OLTP that programmers tend to work with.

But anything even slightly more advanced is... not nice.

And the standard is unique in its uselessness.

The underlying relational algebra model is brilliant thought.

> SQL is not modular, most features are highly context-dependent...

Examples?

> SQL might be OK for trivial things, as in OLTP...

What is the threshold for triviality? I've seen understandable fairly complex queries, but they're not mind-twisters by any means; if you know what you need, and understand your data, >>and are not a layperson regarding databases<< it's doable without much sweat.

> But anything even slightly more advanced is ... not nice

Again, what is the threshold, or at least what is your threshold, for triviality vs. non-trivial?

You say it's "unique in its uselessness" but "the underlying relational algebra model is brilliant", can you explain a bit further what you mean by that?

Good means it gets the job done in a fairly logical and readable way. SQL queries are not giant programs and shouldn't be. I've written some very advanced queries with plenty of common table expressions, subselects, etc. Could it be more modular? Sure. Could the syntax be better? Yes. But would that radically change how queries are written? Not really.

The worst SQL I've ever seen is when someone attempts to program it imperatively. It takes a different mentality to write SQL then to write imperative code.

> But would that radically change how queries are written? Not really.

Rust didn't radically change how applications are written as compared to C, but that didn't stop us. Nor should it. Any improvement is worthwhile. It doesn't need to be radical.

But, like another commenter points out, SQL is like Javascript. Both having ecosystems so horrendously conceived that they have ensured there is no good path to replacing/augmenting them.

Rust hasn't swept the world yet and it helps prevent real bugs and security issues. A better query language may make it fractionally easier to write database queries but you have to toss out a half-century of experience. It's not worth it for marginal gains. This isn't even opinion, this is the reason it has never happened.

I think I agree that SQL is like JavaScript. If JavaScript wasn't as expressively powerful as it is, it would have been replaced a long time ago. But it's actually good enough, despite it's quirks, that there doesn't exist a language better enough to make it worth replacement. It's possible such a language might never exist. And both SQL and JavaScript continue to improve sometimes directly stealing ideas from potential competitors.

> If JavaScript wasn't as expressively powerful as it is, it would have been replaced a long time ago.

No, it wouldn't matter how terrible or in-expressive the language is - once it got rolled out to the web browser, then it would never be replaced. That is why you have WASM nowadays - you can add new stuff but never replace stuff.

> Rust hasn't swept the world yet and it helps prevent real bugs and security issues.

As the inventor of the relational model has written about extensively, and as you have no doubt came to realize yourself if you've used SQL for more than a few minutes, a different query language could have prevented a whole lot of real bugs too. SQL also has its fair share of security problems that are only prevented by telling developers to be careful.

Rust will never sweep the world, of course, because there is no reason to choose a single language in the application space. Something sweeping the world tells that you royally screwed up the execution environment. But it is a viable contender, despite being no different than C in any meaningful way (clearly you don't see bug/security issue prevention as being meaningful).

> despite it's quirks, that there doesn't exist a language better enough to make it worth replacement.

As you know, Postgres went in the opposite direction, eventually switching to SQL. A DMBS – one which is probably the second most popular DMBS in existence at that – completely upending what query language it supports is not without precedent. What do you think it was about SQL that made it substantially better to justify the change?

SQL isn't the nicest language to write, but I do find it one of the most readable languages out there.

> The worst SQL I've ever seen is when someone attempts to program it imperatively. It takes a different mentality to write SQL then to write imperative code.

Use to any me when my team lead who was a fairly shitty programmer used to describe queries to me in an imperative way. "IF, blah blah blah, THEN". You need to think of your queries as "Return this to me WHEN". Once you get your head around this, there is something quite elegant about the relational model.

All of this is true.

"Being ok for trivial things... that programmers tend to work" with is precisely why SQL has won over the years. SQL has never been a great general-purpose language, but is fantastic at the things programmers tend to work with. That SQL (and associated databases) keeps getting extended to meet new data storage/retrieval problems has really helped SQL stick with us.

Honestly for OLAP use cases, I think dataframe apis are superior.
I gained a a few years of experience in SQL-based OLAP systems at my current job. In this time I developed a strong appreciation for SQL, especially for its composability. Recently, I started a project in Google Colab, gluing together queries from several systems with Pandas DataFrames. I can honestly say that I've never been more frustrated learning an API than I have with Pandas.

Need some window function like LAG() or LEAD()? Too bad, I hope you like writing Python "for i in range(...):" loops. My notebook is littered with ".reset_index()" calls, ".replace(np.nan, None)", "axis='columns'", "foo.assign(bar=lambda df: df.apply(lambda row: ...))". groupby is especially confusing to me, as a Pandas GroupBy is difficult to compose with a normal DataFrame until you call .reset_index(). Compare this to SQL, where a subquery is a subquery, whether or not it has a GROUP BY clause.

The Pandas documentation also leaves a lot to be desired. Take the documentation of pandas.NaT[1] for example. "pandas.NaT: alias of NaT". Ok? That still doesn't tell me what NaT is, nor does it link to the thing that it aliases. The groupby documentation[2] also caused me some headaches, as it covers only the simplest aggregation use-cases.

Pandas is clearly better for some use-cases, but mostly for simple operations that are well-supported by the API (perhaps numeric operations that are implemented with native numpy routines). But if I'm doing some interactive OLAP stuff, I'll reach for SQL. Perhaps the problem is I'm trying to use Pandas like it's SQL, when it's not. But for manipulating data, I'd rather use a language than a library.

[1] https://pandas.pydata.org/docs/reference/api/pandas.NaT.html [2] https://pandas.pydata.org/docs/user_guide/groupby.html

edit: half a sentence

I should've specified polars, ibis, dplyr, pyspark, etc.

Pandas is easily the worst dataframe api.

I'll never go back to SQL from polars, it's far superior in both composability and readability imo.

Not to mention complex transforms can be version controlled and unit tested, and then you can compose these together.

It also maps to/from SQL quite naturally.

I think the fundamental problem is that if you want to talk to databases, you have to speak SQL. Like Javascript on the web, you are stuck with what the platform provides. Both languages have significant deficiencies, yet is/was the only game in town.

Sure, there are some languages which can compile down to SQL, but like Typescript, every once and a while, you find some edge case where the transpilation fails you and you might as well be an expert in SQL.

I love all of the ideas of PRQL, but I do not know if I would be painting myself into a corner adopting something that will go the way of CoffeeScript.

I love SQL, I think it's fantastic, it's also the one skill I learned 30 years ago that still applies today and that I, personally, have used across at least six different databases.

The critiques in that link show a fundamental lack of understanding, eg. it poses that the following query should be allowed and wonders why sql complains about fetching more than one row since avg requires more than one row, but the reality is that what they are describing is avg(rows of rows) rather than avg(rows) which obviously doesn't make sense:

   SELECT
     AVG(
       SELECT SUM(amount) FROM purchases GROUP BY customer
     )
But even worse, the article says that the only solution is using a CTE(!?) and doesn't mention the obvious use of a subquery:

  SELECT
    AVG(total)
  FROM (
    SELECT
      SUM(amount) total_per_customer
    FROM
      purchases
    GROUP BY customer
  );
In my experience I've seen that usually the people that find it difficult to understand SQL are also the people that will be writing .fetch_by_id functions instead of .fetch_by_filters, ie. they consider database tables and data in general as 1d arrays in imperative programming and only think in those terms rather than the 2d sets they are. I'd wonder how their mind would melt if they'd tried to understand querying for 3d data, eg. temporal databases.

Also I fail to see how KQL is any different to SQL, it's pretty much the same thing with the exception that the selection comes afterwards, is that the factor that makes you hate SQL?

From the link you posted:

  StormEvents 
  | where StartTime between (datetime(2007-11-01) .. datetime(2007-12-01))
  | where State == "FLORIDA"  
  | count 
vs

  FROM   StormEvents
  WHERE  StartTime BETWEEN '2007-11-01' AND '2007-12-01'
  AND    State = 'FLORIDA'
  SELECT count(*)
and

  StormEvents
  | where DamageCrops > 0
  | summarize
      MaxCropDamage=max(DamageCrops), 
      MinCropDamage=min(DamageCrops), 
      AvgCropDamage=avg(DamageCrops)
      by EventType
  | sort by AvgCropDamage
vs

  FROM     StormEvents
  WHERE    DamageCrops > 0
  SELECT   max(DamageCrops) MaxCropDamage, 
           min(DamageCrops) MinCropDamage, 
           avg(DamageCrops) AvgCropDamage
  GROUP BY EventType
  ORDER BY AvgCropDamage
To be clear, I don't find SQL difficult to understand. I've used it for 20+ years and I can always get the query to generate my desired output. But I often find that the language is a hindrance, and that I can more efficiently reach my desired output using modern languages.

for example, here's the KQL equivalent to the 'average of sums' query:

  purchases
  | summarize total_per_customer=sum(amount) by customer
  | summarize avg(total_per_customer)
I find this more elegant, and I'd prefer authoring it over any of the equivalent SQL solutions previously mentioned.
I only ever need to know KQL when something isn't working well, which turns 1 problem into 2: the original problem, and how to express exactly what I need in a language that, when things are going well, I forget quickly.
I've definitely had my troubles with KQL, but I still find it more memorable than SQL. Interesting that some folks feel the opposite.

My most recent favorite KQL trick for debugging from logs is `autocluster`. Maybe you'll also find it useful when you're in trouble :)