After seeing one diagram: Why the hack are we still talking to databases with SQL strings and not directly specifying the Query-AST? Admins, sure (a fancy UI could help there as well) but why in our code?
Because query parsing time is totally insignificant compared to query IO?
I mean, I get it but the chances that it makes a noticeable difference are zero in almost every case. Also you'd have to change a lot of the existing tooling, at which point you might as well send a compiled agent or use stored procs?
> Because query parsing time is totally insignificant compared to query IO?
The problem there is that the SQL query string is not parsed at compile time of the host program, so things that could be caught at compile-time are not, and things like appending strings to SQL strings in an unsafe way are much too easy to do.
That's why there are query parameters (see https://www.sqlite.org/lang_expr.html#varparam for the comprehensive SQLite implementation) and automatic escaping. Not to mention tests and code reviews.
The irony is that the SQL Standard actually specifies two different generally viewed as obsolete methods for handing this.
One is "SQL/CLI", which envisions that you provide a "module" of queries (parameterized and static), which you then compile with some database compiled tool, resulting in an object file you can link into your codebase that exports function calls for these queries.
The other is "Embedded syntax", which is basically "embed RAW SQL statements into your program, run a database provided pre-processor to convert it to normal ADA, C, COBOL, FORTRAN, MUMPS, Pascal, or PL/I code, which is then compiled normally. In theory, this is supposed to generate a "client module" like SQL/CLI, and insert the needed code to call into that module in place of the SQL.
For both of these the queries are considered to be "prepared" at compilation time, but this mostly amounts to syntax checking. It could theoretically also provide warnings about possible execution errors based on the current state of the database (for example, if you are trying to select some column that does not currently exist in some given table), but it could not treat these as errors since DML statements run by this or other programs may change the table before the query is actually executed.
The sql standard calls every form of passing a string version of a query to a database as "dynamic sql" (Not just say SQL constructed in say a store procedure and called with EXECUTE). This is because the database cannot statically analyze those queries at program compilation time, so you may need to link with additional libraries that include the query parser, etc. And supporting "dynamic sql" at all is an optional feature.
We are... Spark's DataFrame is essentially a relational algebra AST-builder. Microsoft's LINQ interprets SQL directly in at compile-time. All of these, however, run queries more or less directly in the system in which they're specified.
It helps to think of SQL strings as an untrusted wire format. Yes, parsing is a pain, but it comes with two main benefits:
(i) The wire format is human writable/interpretable, with all the accompanying benefits, and
(ii) The wire format is easily extensible in a predictable way.
That latter one is particularly useful in keeping SQL's ecosystem open. Take a front-end library like SQLAlchemy or ScalikeJDBC for example. It's not practical for any one such library to support every extension provided by every database engine. SQL provides a fall-back for when you need a back-end feature that hasn't been implemented in any given front-end.
C# LINQ does pass an expression tree into the abstraction before it will then serialize it SQL and then the database deserialize it into an AST. LINQ-to-Objects is in memory and works on the AST directly.
Also both LINQ language syntax and library methods are a builder paradigm for the expression tree. Valid, but still far from ideal representation of an AST.
> Why the hack are we still talking to databases with SQL strings and not directly specifying the Query-AST?
The same reason language servers took off. Instead of one to one mapping, SQL enables one to many mapping with minor tweaks, allowing everyone to do whatever they want over a well known, well defined, mature abstraction.
In the same spirit, I may ask why we're not writing assembly or even machine code, and we have programming languages? Testers, sure, abstraction means clarity up to an extent, but why the developers themselves still use programming languages?
You’re missing the point if you think that that is “in the same spirit”.
SQL-as-strings and SQL-as-AST are still the same thing. What is being proposed it not to write procedural code for record retrieval instead of declarative SQL.
Isn't that what a parameterized query does? It separates the sql logic from the inputs so it can cache the query and then it accepts inputs separately. Safer and more optimal at the same time, the engine doesn't have to re-optimize the same query again for the life of the connection. If my understanding is wrong somebody please correct me, it's kind of hard to get good information on what's going on under the hood with these things.
A SQL query is an AST, but represented in a compact portable form. It also supports functions, procedures, and parameterization for flexible and safe query construction.
Your code would get incredibly large and complicated if you had to specify any serious SQL query as a raw AST.
When you're accessing a SQLite database in code you have to generate a query string. Parameters ameliorate that somewhat, but in many cases you still have to regenerate a new string for each new query. It's inefficient to translate your query into a string only to have it parsed back into something structured by SQLite.
Are you suggesting an alternate syntax for SQL-like queries, which is more compact?
Or a specific one for SQLite?
I'd be very surprised if generating the SQL query string and parsing it again was more than a trivial percentage of the query execution time, but happy to be proven wrong.
Probably, but it's still inefficient in other ways. There's memory issues and the code overhead of dealing with strings. It'd be much nicer to have a binary interface, but admittedly it would be much more complex.
The real solution is to include an alternative to SQL that looks and works like Datalog including things like variables. That would make SQLite 100x more productive for programmers. But that will never happen.
As I said in another part of the thread, I was in a scenario where we were performed millions of inserts into a table of four integers, one row at a time. Generating the string and parsing it again wound up being enough to blow our 10µs time budget.
But with parameters you don't need to do that, or in any other scenario where only (literal) values change.
On the other hand, if you were importing data and each line specified which table and column it had to go into you'd probably have to write a new SQL statement each time.
It was a neutron detector that stored the the position, energy, and time stamp into the database. The design spec was 10,000 neutrons per second, so that was the origin of the time limit.
Of course, half those neutrons arrived within a 20ms window, so we had a buffer to handle the load. However, however, if the average remained above the limit, the buffer would fill up. There was a brief discussion of just ignoring events when the buffer was full, but that could introduce systematic errors in the data that would be impossible to detect, so it was better for the database to just crash.
The solution was to tighten the slits on the neutron beam to lower the count rate to the point that we never filled the buffer. Granted, we were testing a high flux technique that, so that was a bit of a disappointment. Everything else in the instrument could handle an order of magnitude more neutrons, except this database.
By the way, to be fair to the database designers, they were working with Clinton era tech and were probably told that the flux rate would never be a SUSTAINED 10,000 events per second.
Not the OP, but I can give two gains. First off, by passing an AST, instead of just an SQL string, we cut out a huge number of possibly SQL injection attacks. Second, in most of the projects where I've used of SQL, there's been some kind of database object that builds the actual query, which it then converts to a string. The database then takes that string and parses it into an AST. There's some performance gains to be made by skipping the middle man and just creating the AST directly.
Many years ago, I was on a project that needed to add rows to a database with a hard 10µs limit. Each rows was just four integers, so the writing part was trivial. However, allocating the string, formatting the integers to strings, then parsing the resulting string often put us over the time limit. Every time the time limit was breached, we lost about five grand. Why we were using an SQL database for this at all is a story for a different time.
You should be using prepared statements and variable bindings. You should never have to parse any statement more than once (many dynamic language interfaces even cache statements for you, and many SQLite libraries also offer an interface for this out of the box).
You should also never be formatting integers into strings, but simply binding them to the prepared statement. I really hope you weren't interpolating them and formatting each statement with the literal values in the string. That would be horridly inefficient, and a bad misuse of any modern SQL database. I also hope you were properly using transactions to speed up your batches.
LINQ stands for Language-Integrated Query and is incredibly successful at its purpose of providing powerful querying functionality and extensions baked into the C#/.NET language space itself.
This querying framework is what powers translations and compilation into SQL and several other languages (depending on the datastore provider used).
EntityFramework is one of the most advanced ORMs out there and is supremely productive because of LINQ.
Caching the query plan is also going to go further in performance optimisations than just “precompiling” the SQL to a AST.