Hacker News new | ask | show | jobs
by rtpg 2603 days ago
I really feel like we need a lower level expression language than SQL, one based on actual query plans rather than a declarative thing.

Having that, and helping people learn that, would make it much harder to not “get” how joins and the like end up working

3 comments

We have SQL above that level and relational algebra below it; the challenge with your proposal is it's super-dependent on non-trivial implementation. Upper-level database courses deal with this, but in the same way compilier or operating systems do: you get to build functioning implementations but not realistic, performant ones. IMO opinion this is outside the scope of comp sci grad students or advanced undergrads not in a DB specialization.
Perhaps, as a learning aid, and definitely to help fix erroneous query plans, but you'd lose the ability for the DB to adapt to new situations such as added data, data distributions changing, the addition/removal of indexes etc.

Also it's downright hard to make a good query plan for a nontrivial query, and it gets exponentially harder.

Honestly the optimiser is in a better position to do the work.

On re-reading I think you're talking only about pedagogy in which case disregard the list-of-cons in my post. And I think it would be a damn good idea! I really like it.

Would you also force users to decide on the implementation of the join operator? Because that is one of the benefits of a declarative query, as well as placement of other operators and join ordering if multiple relations are involved.
Would that come with the benefit of not getting nasty surprises when the cantankerous scheduler suddenly decides to mess with a previously efficient solution? Or even protect me from myself in the future accidentally breaking an efficient query through an ostensibly innocent change?

I would appreciate a way to encode execution efficiency parameters in the semantics of the query itself. Specifying which indexes to use, or the runtime complexity of certain operations. This would also help me figure out which indexes I need to create, in the first place. Today I can add a clause in an order which makes using an index impossible, rendering a previously efficient query suddenly extremely expensive. The failure mode is very obscure: it all works, the DB just works overtime. I might not have a large enough DB yet to actually notice this, until one day I see these slow queries and have to backtrack all the way to that one commit that caused it, six months ago.

I have encountered this scenario a few odd times. It is not always obvious to someone changing a query that it is meant to use an index, and it is very hard for someone writing it to specify so in a unit test.

Flexibility for the scheduler to choose efficient strategies cuts both ways. I'd appreciate the ability to leave more explicit annotations about expected runtime characteristics, than mere comments offer.

This talk ("How Modern SQL Databases Come up with Algorithms that You Would Have Never Dreamed Of by Lukas Eder" https://www.youtube.com/watch?v=wTPGW1PNy_Y) is not particularly gripping, but he does claim that modern database engines will ignore indexes if they estimate it will be faster to read all the data than it is to access the index first then read the relevant data. And that Oracle can have multiple execution plans and switch between them mid-query if the first one isn't working out as expected.

Could this be a rare case of the mythical sufficiently smart compiler having more information at runtime than the programmer has at coding time, and that if you could hint which index to use, the trade off would be that you'd instead be troubleshooting queries which end up slower because of it?

> Could this be a rare case of the mythical sufficiently smart compiler having more information at runtime than the programmer has at coding time [...]

Mythical? Isn't this the core concept behind optimizing JITs?

> I would appreciate a way to encode execution efficiency parameters in the semantics of the query itself. Specifying which indexes to use,

That's a mistake. You can in MSSQL but you'd likely regret it (this from experience). Using an index can be a mistake; sometimes a table scan can be more efficient. And which, full scan or index, is more efficient is down to the data in the table and the query predicate.

(Edit, so which to do, scan or index, can be different for exactly the same query depending on the test:

  select * 
  from people 
  where surname = ?
can be either depending on whether surname is 'smith' (vey common) or 'hepsibah' (very rare) ).

> I'd appreciate the ability to leave more explicit annotations about expected runtime characteristics, than mere comments offer.

Interesting. Can you elaborate?