Hacker News new | ask | show | jobs
by aleph_minus_one 1082 days ago
> - generate SQL queries at the speed of mouth

Because of the points this is the nearest to the work that some colleagues do, I anakyze this point (but you could ask similar questions about many of the other points):

In my experience, writing correct SQL queries (which often tend to be quite non-trivial because of the internal complexity of the projects) typically involves a lot of knowledge about the whole system that my colleagues and I work on. Even if I could copy-paste this information, written down once, into the AI chat window:

- I seriously doubt that any of these AI chat bots would be able to generate a remotely decent SQL query based on this information, if only because these SQL queries look really different from what you would see in typical CRUD web applications (for a very instructive example think into the direction of ETL for unifying historically separated lines of business where you often have lots of discussions with the respective colleagues to clear up very subtle details what the code is actually supposed to do in some strange boundary cases that exist because of some historical reasons (which one wants to get rid of))

- even explaining what the SQL query is supposed to do would in my opinion take more time than simply writing it down. Even ignoring the previous point: it is very typical that explaining in sufficient detail what the code is supposed to do would take far more time than simply writing it. A lot of programming work is not writing some scaffolding of some CRUD app or implementing a textbook algorithm.

2 comments

Have you tried this at all?

I find that many of the generative AI models (GPT-4, 3.5, even MPT-30B running on my laptop) are really shockingly good at SQL.

Paste in a query and ask it for a detailed explanation. I've genuinely not seen it NOT provide a good result for that yet.

Generating new SQL queries is a bit harder, because of the context you need to provide - but I've had very strong results from that as well.

I've had the best results from providing both the schema and a couple of example rows from each table - which helps it identify things like "the country column contains abbreviations like US and GB".

If you've found differently I'd love to hear about it.

> Paste in a query and ask it for a detailed explanation. I've genuinely not seen it NOT provide a good result for that yet. [...] If you've found differently I'd love to hear about it.

I have not directly tried it (the employer does not allow AI chatbots for any application intended for production (i.e. more sensitive stuff), but only for doing experiments), but working on the code I very rarely had the problem that I could not understand what some single (SQL) line of code does in the "programming sense".

The central problem that rather occurs often is understanding why this line does exist and why things are implemented the way they are.

Just to give an example: to accelerate some queries, I thought some index would make sense (colleagues principally agreed; it would likely accelerate a particular query that I had in mind). But there exists a good reason why there exists no index at this table (as the respective colleague explained to me). This again implies that for ETL stuff involving particular tables, one should make use of temporary tables where possible instead of JOINs; this is the reason why the code is organized as it is. This is the kind of explanation that I need, which surely no AI can deliver.

Or another example: why does some particular function (1) have a rights check for a "more powerful" role and a related one (2) does not need one? The reason is very interesting: principally having this check (for a "more powerful" role) does not make a lot of sense, but for some very red-tape reasons auditors requested that only a particular group of roles shall be allowed to execute (1), but they were perfectly fine with a much larger group of users being allowed to execute (2). Again something that no AI will be able to answer.

Those two questions require a whole bunch of additional context. Has it ever been written down anywhere, or does it exist only in the heads of members of staff who understand those decisions? If the latter then yeah, there's clearly no way an AI could ever guess those things.

Part of the trick of making good use of LLMs is having a good instinct as to what kind of questions they will be able to ask and what kind of questions they are likely to mess up.

As a quick demo, I threw quite a gnarly real-world SQL query from this notebook - https://observablehq.com/@simonw/blog-to-newsletter#sql - into GPT-4 just now and gave it the following prompt:

> Explain this SQL query four times: first, as a high level explanation of what it does. Secondly, as some pithy highlights as to clever tricks it uses. Thirdly, as a step-by-step guide to exactly what each piece of the query does. Finally, provide some advice on how the query could be improved.

Here's what it produced in response: https://chat.openai.com/share/47b2f1c5-6b88-4ff2-b6d1-8f58da...

The high level explanation is, I think, excellent. I wrote that query and I'd forgotten the detail about how it sorts with entries first.

The tips for improvements are the weakest part, since they make assumptions that don't hold for this particular query (for one thing, SQLite doesn't have the ability to run stored procedures).

Write some fake table. Paste it into ChatGPT, then come back to the discussion.

So far you've said that things like "This is the kind of explanation that I need, which surely no AI can deliver." but have not actually tried the system?

As GP asked: have you tried ChatGPT or similar LLMs? If not, go do it .. you may be surprised.

> This is the kind of explanation that I need, which surely no AI can deliver.

But have you tried? GPT-4 is great for coming up with hypotheses of why thing are the way they are.

One approach I've had a lot of success with: always ask for multiple options.

In this case I might try a prompt along the lines of:

"Here is the schema for a table: (schema here). Here is a query that runs against it: (query here). Provide several suggestions for potential indexes that might speed up the query, and for each of those suggestions provide several hypothetical reasons that the index might be a bad idea."

I tried a version of that here, the results look pretty good to me: https://chat.openai.com/share/dc5d7007-cf05-4e5c-89c4-fc669b...

I approach things a bit obliquely. I create a custom made DSL (starting from scratch in each conversation, often) that allows me to model my query the way I want. Then, I write a traditional SQL builder on that DSL (or more like, ask GPT to do it for me). Then, I generate DSL statements that match my current domain, and more importantly, modify existing ones.

So, at each step, I do almost trivial transformations.

One key ingredient is that the DSL should include many "description" fields that incorporate english language, because that helps the model "understand" what the terser DSL fields are for.

Straight SQL is a crapshoot, and as you said, more often than not, either obviously or subtly broken or for another database. Which makes sense, considering how much different flavors of SQL it has in its training corpus and how much crappy SQL is out there anyway.

Another thing that helps is use extremely specific "jargon" for the domain you want to write queries for. Asking for "accrual revenue" and "yoy avg customer value" (yes, yoy, not year over year) often tends to bring back much higher quality than just asking for "revenue" or "customer value".