Hacker News new | ask | show | jobs
by obi1kenobi 2169 days ago
I wrote a thread on this on Twitter: https://twitter.com/PredragGruevski/status/12639165990625402...

I feel that SQL aimed to be Python and became x86 assembly instead. It's no longer a simple "just works" query language the moment you have to worry about predicate flattening, join decomposition, CTEs that introduce optimization barriers, and "IN()" being faster than equivalent "JOINs".

As a result, I started a project that allows you to write read-only database-agnostic queries called GraphQL compiler: https://graphql-compiler.readthedocs.io/ https://github.com/kensho-technologies/graphql-compiler

The core idea of the project is to get us the convenience of specifying the "what question I want answered," but without the inconvenience of "how is the answer computed / with which specific set of queries / where did the data come from?" -- unless you want to peek under the hood, of course. All the visibility into the nitty-gritty details available on demand, but without the tedium of having to hand-optimize queries and know all the "magic" ways in which queries get faster or slower for each individual kind of database.

2 comments

> The core idea of the project is to get us the convenience of specifying the "what question I want answered," but without the inconvenience of "how is the answer computed / with which specific set of queries / where did the data come from?"

So...exactly like SQL, then?

The post you're replying to directly addresses that. When you write SQL:

> It's no longer a simple "just works" query language the moment you have to worry about predicate flattening, join decomposition, CTEs that introduce optimization barriers, and "IN()" being faster than equivalent "JOINs".

Though it doesn't seem to address how to optimize things using the GraphQL compiler, when there's a need, without massaging the queries, as with SQL.

Just like how GCC and Clang/LLVM know all the quirks of various CPUs and can optimize accordingly, GraphQL compiler aims to know the quirks of various databases (down to individual database versions: e.g., in Postgres 12 certain kinds of CTEs are no longer an optimization barrier) and optimize accordingly.

This is clearly a massive challenge, but one made easier by the fact that GraphQL compiler queries (unlike SQL queries) operate at a much higher level of abstraction. In SQL, you write "here's a CTE, now recursively JOIN X.foo to Y.bar" where X and Y could be just about anything, even something where a recursive JOIN is nonsensical. If you want to put a WHERE clause, you have to decide whether it goes in the recursive CTE itself, in a separate CTE that is ordered before the recursive CTE, or if you want to wrap the recursive CTE into another SELECT and put the WHERE there. The correct answer varies from database to database, and as a function of the size, layout, and index coverage of your data.

In GraphQL compiler, your queries are much more declarative in comparison: your query would say "find all subgraphs where vertex A's field 'foo' has value 123, and where A has a recursively-expanded edge (i.e. 0+ hops along that edge) to a vertex with field 'bar' with value 456". It's then the compiler's job to figure out which of the many equivalent SQL statements (or other db language queries, if you aren't using SQL) is going to be the best way to compute the result you asked for.

Here's an example from our test suite: input query: https://github.com/kensho-technologies/graphql-compiler/blob...

Microsoft SQL Server-flavored compiled SQL output: https://github.com/kensho-technologies/graphql-compiler/blob...

I'm writing a blog post about this with more detail, follow me on Twitter if you'd like to see it when it comes out.

If you know all of the quirks of the various databases, why aren't you hacking on their optimizers? Why write a compiler that knows what's slow and what's not when you can just fix what's slow?
It's an "and" rather than an "either-or" :)

When the database is open-source, and I spot something that's broken that I know how to fix, I try to fix it. Here's a fix for a severe database query planner correctness bug I contributed to an open-source database called OrientDB: https://github.com/orientechnologies/orientdb/pull/7015

Unfortunately, Microsoft SQL Server, Oracle, and many other databases are not open-source, and I can't hack on their query planners. And even if they were, SQL is an absolutely massive language (the complete spec is 100,000+ pages). The GraphQL compiler query language is tiny in comparison, the spec is maybe 10 pages: https://graphql-compiler.readthedocs.io/en/latest/language_s...

It's a lot easier to intelligently map a small language to a big one than it is to optimize the big language outright.

In a sense, SQL is just not designed to be easy to optimize — it's too broad, and there are too many equivalent-ish ways of doing the same thing. This is why even after incredibly smart people cumulatively spent engineer-millennia on the query execution and optimization systems in SQL databases, we still keep having issues and there are still plenty of areas for improvement.

More info and more concrete examples of "why not just write SQL" in my upcoming blog post!

> In a sense, SQL is just not designed to be easy to optimize — it's too broad, and there are too many equivalent-ish ways of doing the same thing. This is why even after incredibly smart people cumulatively spent engineer-millennia on the query execution and optimization systems in SQL databases, we still keep having issues and there are still plenty of areas for improvement.

The main reason is the inherent difficulty of cardinality estimation, as the paper says.

Not every optimization is worth having. There is typically a distributed cost, paid in extra planner cycles for queries that don't benefit from the optimization. This is one of the main reasons why it's hard to contribute new optimizations to the Postgres planner. Naturally, it's possible that a marginal optimization will be incredibly important to one particular query or user. It's a value judgement in each case.

Frankly, I find the suggestion that SQL is not designed to be easy to optimize baffling.

I think the parent's point might have been: SQL tried with mixed success. What makes the GraphQL compiler different?
The user you're replying to presumably understands a bit about the cited inconveniences, as well as SQL, in general

> user: petergeoghegan > about: PostgreSQL major contributor, committer.

It’s a leaky abstraction, yes. In the ideal world, SQL query planners always would find the optimal plan in zero seconds and DDL would allow you to express non-functional requirements for various kinds of queries (“retrieving this record by ID should take at most 1ms”, “looking up related records in that table by order number should…”), but the world isn’t ideal. We have to manually create indices, split tables, move different tables to different disks, etc. for performance.

“The GraphQL compiler turns read-only queries written in GraphQL syntax to different query languages” is useful for GraphQL ‘fans’, but I don’t see how that is going to solve that.

Designing a user friendly query language for relational data isn’t the hard part. Executing such queries efficiently is.

For SQL, there’s half a century of research on that. This paper is part of it, and indicates that, at this moment in time, effort is better spent on methods for keeping statistics on data up to date than on making cost models more fine grained.

> Designing a user friendly query language for relational data isn’t the hard part. Executing such queries efficiently is.

I couldn't agree more with this part :)

To be clear, GraphQL compiler isn't trying to "solve" SQL itself, just merely the fact that by the time all the table splitting, sharding across disks and machines, and similar required maintenance operations are done, your SQL query has grown impractically complex and entirely unreasonable to write. At sufficient scale, and with sufficient additional non-SQL databases in play, writing adequate queries becomes wildly impractical.

In the GraphQL compiler world, all your databases (SQL and non-SQL) are represented in one unified schema against which you write database-agnostic queries, and GraphQL compiler handles the nitty-gritty details of "which query runs where." GraphQL compiler is not a toy project I build for fun — it's a core piece of data infrastructure that Kensho (company where I work) has been happily using in production for over 3 years now.

I'm writing a blog post about exactly this, and I hope to publish it very soon! Follow me on Twitter if you'd like to see it when it comes out.