Hacker News new | ask | show | jobs
by oaiey 1368 days ago
After seeing one diagram: Why the hack are we still talking to databases with SQL strings and not directly specifying the Query-AST? Admins, sure (a fancy UI could help there as well) but why in our code?
13 comments

I believe most sql engines cache the query plans for parameterized queries, which would cover the majority of requests.

Caching the query plan is also going to go further in performance optimisations than just “precompiling” the SQL to a AST.

Because query parsing time is totally insignificant compared to query IO?

I mean, I get it but the chances that it makes a noticeable difference are zero in almost every case. Also you'd have to change a lot of the existing tooling, at which point you might as well send a compiled agent or use stored procs?

> Because query parsing time is totally insignificant compared to query IO?

The problem there is that the SQL query string is not parsed at compile time of the host program, so things that could be caught at compile-time are not, and things like appending strings to SQL strings in an unsafe way are much too easy to do.

That's why there are query parameters (see https://www.sqlite.org/lang_expr.html#varparam for the comprehensive SQLite implementation) and automatic escaping. Not to mention tests and code reviews.
Yes, those are other ways to try and solve this problem.
The irony is that the SQL Standard actually specifies two different generally viewed as obsolete methods for handing this.

One is "SQL/CLI", which envisions that you provide a "module" of queries (parameterized and static), which you then compile with some database compiled tool, resulting in an object file you can link into your codebase that exports function calls for these queries.

The other is "Embedded syntax", which is basically "embed RAW SQL statements into your program, run a database provided pre-processor to convert it to normal ADA, C, COBOL, FORTRAN, MUMPS, Pascal, or PL/I code, which is then compiled normally. In theory, this is supposed to generate a "client module" like SQL/CLI, and insert the needed code to call into that module in place of the SQL.

For both of these the queries are considered to be "prepared" at compilation time, but this mostly amounts to syntax checking. It could theoretically also provide warnings about possible execution errors based on the current state of the database (for example, if you are trying to select some column that does not currently exist in some given table), but it could not treat these as errors since DML statements run by this or other programs may change the table before the query is actually executed.

The sql standard calls every form of passing a string version of a query to a database as "dynamic sql" (Not just say SQL constructed in say a store procedure and called with EXECUTE). This is because the database cannot statically analyze those queries at program compilation time, so you may need to link with additional libraries that include the query parser, etc. And supporting "dynamic sql" at all is an optional feature.

We are... Spark's DataFrame is essentially a relational algebra AST-builder. Microsoft's LINQ interprets SQL directly in at compile-time. All of these, however, run queries more or less directly in the system in which they're specified.

It helps to think of SQL strings as an untrusted wire format. Yes, parsing is a pain, but it comes with two main benefits: (i) The wire format is human writable/interpretable, with all the accompanying benefits, and (ii) The wire format is easily extensible in a predictable way.

That latter one is particularly useful in keeping SQL's ecosystem open. Take a front-end library like SQLAlchemy or ScalikeJDBC for example. It's not practical for any one such library to support every extension provided by every database engine. SQL provides a fall-back for when you need a back-end feature that hasn't been implemented in any given front-end.

C# LINQ does pass an expression tree into the abstraction before it will then serialize it SQL and then the database deserialize it into an AST. LINQ-to-Objects is in memory and works on the AST directly.

Also both LINQ language syntax and library methods are a builder paradigm for the expression tree. Valid, but still far from ideal representation of an AST.

> Why the hack are we still talking to databases with SQL strings and not directly specifying the Query-AST?

The same reason language servers took off. Instead of one to one mapping, SQL enables one to many mapping with minor tweaks, allowing everyone to do whatever they want over a well known, well defined, mature abstraction.

In the same spirit, I may ask why we're not writing assembly or even machine code, and we have programming languages? Testers, sure, abstraction means clarity up to an extent, but why the developers themselves still use programming languages?

You’re missing the point if you think that that is “in the same spirit”.

SQL-as-strings and SQL-as-AST are still the same thing. What is being proposed it not to write procedural code for record retrieval instead of declarative SQL.

Isn't that what a parameterized query does? It separates the sql logic from the inputs so it can cache the query and then it accepts inputs separately. Safer and more optimal at the same time, the engine doesn't have to re-optimize the same query again for the life of the connection. If my understanding is wrong somebody please correct me, it's kind of hard to get good information on what's going on under the hood with these things.
A SQL query is an AST, but represented in a compact portable form. It also supports functions, procedures, and parameterization for flexible and safe query construction.

Your code would get incredibly large and complicated if you had to specify any serious SQL query as a raw AST.

Check out prql[1], it might be a conceptual model you'd like.

[1] https://prql-lang.org/

That is just modern SQL with horrible LINQ memories ;)
Most queries to kdb+ do exactly what you are asking for.
Can you give an example of what you mean, and what we'd gain from it?
When you're accessing a SQLite database in code you have to generate a query string. Parameters ameliorate that somewhat, but in many cases you still have to regenerate a new string for each new query. It's inefficient to translate your query into a string only to have it parsed back into something structured by SQLite.
Are you suggesting an alternate syntax for SQL-like queries, which is more compact?

Or a specific one for SQLite?

I'd be very surprised if generating the SQL query string and parsing it again was more than a trivial percentage of the query execution time, but happy to be proven wrong.

Probably, but it's still inefficient in other ways. There's memory issues and the code overhead of dealing with strings. It'd be much nicer to have a binary interface, but admittedly it would be much more complex.

The real solution is to include an alternative to SQL that looks and works like Datalog including things like variables. That would make SQLite 100x more productive for programmers. But that will never happen.

Wouldn't that then just be a prepared statement with variables?
As I said in another part of the thread, I was in a scenario where we were performed millions of inserts into a table of four integers, one row at a time. Generating the string and parsing it again wound up being enough to blow our 10µs time budget.
But with parameters you don't need to do that, or in any other scenario where only (literal) values change.

On the other hand, if you were importing data and each line specified which table and column it had to go into you'd probably have to write a new SQL statement each time.

What was the time constraint due to, out of curiosity?

What did you do for a solution?

It was a neutron detector that stored the the position, energy, and time stamp into the database. The design spec was 10,000 neutrons per second, so that was the origin of the time limit.

Of course, half those neutrons arrived within a 20ms window, so we had a buffer to handle the load. However, however, if the average remained above the limit, the buffer would fill up. There was a brief discussion of just ignoring events when the buffer was full, but that could introduce systematic errors in the data that would be impossible to detect, so it was better for the database to just crash.

The solution was to tighten the slits on the neutron beam to lower the count rate to the point that we never filled the buffer. Granted, we were testing a high flux technique that, so that was a bit of a disappointment. Everything else in the instrument could handle an order of magnitude more neutrons, except this database.

By the way, to be fair to the database designers, they were working with Clinton era tech and were probably told that the flux rate would never be a SUSTAINED 10,000 events per second.

SQLite was released in the year 2000; it is 22 years old.

Because of its age, it does carry deprecated API components that are maintained solely for backward compatibility.

Notice how many _v2 and _v3 variants are present, denoting reworked aspects of the API:

https://sqlite.org/c3ref/funclist.html

A product of this age was designed (and redesigned) for specific needs. Unfortunately, your use case is not among them.

Is that a fact or an intuition?
Not the OP, but I can give two gains. First off, by passing an AST, instead of just an SQL string, we cut out a huge number of possibly SQL injection attacks. Second, in most of the projects where I've used of SQL, there's been some kind of database object that builds the actual query, which it then converts to a string. The database then takes that string and parses it into an AST. There's some performance gains to be made by skipping the middle man and just creating the AST directly.

Many years ago, I was on a project that needed to add rows to a database with a hard 10µs limit. Each rows was just four integers, so the writing part was trivial. However, allocating the string, formatting the integers to strings, then parsing the resulting string often put us over the time limit. Every time the time limit was breached, we lost about five grand. Why we were using an SQL database for this at all is a story for a different time.

You should be using prepared statements and variable bindings. You should never have to parse any statement more than once (many dynamic language interfaces even cache statements for you, and many SQLite libraries also offer an interface for this out of the box). You should also never be formatting integers into strings, but simply binding them to the prepared statement. I really hope you weren't interpolating them and formatting each statement with the literal values in the string. That would be horridly inefficient, and a bad misuse of any modern SQL database. I also hope you were properly using transactions to speed up your batches.

See: https://www.sqlite.org/c3ref/stmt.html

Having application build a string and pass it to a library which parses the string into an AST cannot be as efficient as just building the AST, right?
It's a lot more efficient for the human who is used to reading SQL to read the SQL (or even SQL-producing code) than AST
Type safety at compile time to begin with.
I’ve been trying to do exactly this. It must be possible.
PREPARE?
Didn't Microsoft LINQ try something like that? Was not particularly successful.
LINQ stands for Language-Integrated Query and is incredibly successful at its purpose of providing powerful querying functionality and extensions baked into the C#/.NET language space itself.

This querying framework is what powers translations and compilation into SQL and several other languages (depending on the datastore provider used).

EntityFramework is one of the most advanced ORMs out there and is supremely productive because of LINQ.

LINQ is stunningly successful, LINQ Query syntax (what you probably are referring to) with LINQ to SQL less so.

And yes, as usual, we have the amazing confusion of Microsoft Naming.

But query syntax is essentially just a way to use an ORM that looks closer to SQL but is strongly typed.

LINQ creates an AST in the .NET land, however, before passing it to an actual SQL database, it serializes the expression tree to SQL.

So no, it does not talk AST to the database.

No, it never tried that.
Why not?
Same reason you don't read files by concating string commands into another different language and then posting them to the OS.