Hacker News new | ask | show | jobs
by suzzer99 38 days ago
I’ve had to do a ton of SQL stuff lately, which I haven’t really worked with since the late 90s. ChatGPT has been a godsend, not just for me, but for our only coworker who knows SQL well, whom I’d probably be bugging several times a day at my wits’ end.

But no one cares about those kinds of productivity gains. Just the ones that will completely replace us.

3 comments

I find SQL and data(bases) in general to be LLM’s Achilles’ heel. Databases are rarely under version control, so the training data only has one half of the knowledge.

My comments are more in the context of OLAP queries and other non-normalised data often queried via SQL.

I train non-LLM transformer models on (older and rarer) datasets, and automating the ingestion of sprawling datasets with hundreds of columns, often in a variety of local languages and different naming conventions adopted over decades, with quite a few duplicated columns…. The LLMs perform badly, it’s nigh impossible to test (for me as a user in prod) and it’s nearly impossible for the LLM companies to test (in training) to RLVR and RLHF this.

That's interesting - SQL is one of the places I find them the strongest - I think there must be an insane amount of training data out there for SQL. But mostly I'm asking them for ad hoc report queries. Nobody cares if they're bad SQL, they just want to know how many signups there were in March that didn't tick the marketing box. Sounds like you're pushing their capabilities a lot further than I am though - I just want to perform arbitarily complex queries on 3NF data.
Yeah not sure what this guy is talking about, LLMs excel with queries because the SQL language is pretty small in scope and its easy to test the output. Table structure and relationships are easy to feed to the AI.

> I train non-LLM transformer models on (older and rarer) datasets, and automating the ingestion of sprawling datasets with hundreds of columns, often in a variety of local languages and different naming conventions adopted over decades

All of this sounds like basic data processing

"Nobody cares if they're bad SQL"

Laid off your DBAs I see.

Ok, ok. Nobody who matters cares if they're bad SQL ;)
Just use an LLM to make a good knowledge base for the databases. Based on schema info and production queries. An agent can use that to write queries that work.
I'm the old school type who writes out a document that explains what I plan on doing in markdown even if it's generic like "a window with x and y buttons" and the logic flow and then use that to have ai write a plan with me before I send it off to execute it. This has worked super well.

I do enjoy giving the frontier models wacky projects that I can't even find examples of how to do online but I don't expect any results or need them and some have done really well with it while others fall on their face (models)

I'm always amazed by those comments. Why couldn't you buy a book on SQL[0], and spend a week on it? Or just go over to YouTube for a refresher?

[0]: Like https://www.oreilly.com/library/view/sql-queries-for/9780134...

I'm amazed you think that instead of using an LLM that someone will go buy a book and spend a week learning something that, judging by the fact that they last used it 30 years ago, likely won't be relevant for them soon.
It's not only that I rarely use it, it's also that it's ugly. It's Relational Cobol. It's as loveable as Oracle. The vendor specific dialects don't even agree on how to do recursive queries do they?

Unfortunately I am very good at forgetting things I resented having to learn, and SQL is definitively one of them.

So you don’t understand what you generate with ai and think that it will be a solution for a problem you can only solve using sql.
No, it's easy enough to understand the query once the AI has generated it. I have looked up how to do it many times after all.
Yes
If the AI's query pulled what I intended to pull, why should I care to understand the SQL any more than I should understand the Query Plan or the Machine Code?
There's nothing wrong with using SQL only when you know in advance exactly which records you wish to query.

But if you ever need to query unknown data, then probably you should learn SQL a bit deeper.

As with regex, querying is about not getting what you don't want as much as it is about getting what you want. And the former of the two is much more difficult to verify.
SQL is (was?) one of my strongest skills, I enjoy it a lot, and I still reach for the LLM. It's just faster than me, and when it goes wrong (rarely) I can correct it in plain English.
This is fine for a moderately sized query. When your queries start taking in 8 joins and 20 fields per table because you're running queries on Presto with 5 TB of data, not only is it drastically better at writing (because it doesn't mess up the fields), you can ask it to try the query 5 different ways to help you land on the most optimal.
That's exactly where I would expect it to fail somewhere, changing some part of the query every time it writes one.
In my experience, Claude (at least Opus and Sonnet) is pretty good about not misremembering itself.

I think you may be describing the experience of 6-12 months ago.

This is a great example of AI tech-debt and fragility.

An eight-join query is going to be nigh on unmaintainable should the requirements change, leading to a change-break-change-break spiral as your preferred coding agent tries to fix its previous fixes.

Maybe the wise way to use AI would be to sort out the schema.

This feels wrong. 8 joins is almost certainly reporting stuff, not transactional. Contrary to what some SQL-averse devs think, 300 lines of SQL is actually more maintainable than the equivalent ~1000 lines of application code. It's also much faster. And I do think that's the real conversion, because SQL is a much higher level language than currently available application languages. It's also declarative in nature, which helps maintainance.

A highly normalized DB can easily end up with 8 joins required for some function. That's really not out of the question. "Sorting out" the schema then would be... denormalization, which is a thing, but you need to know why you're doing it. And I think 8 joins isn't enough of a reason.

Yes but developers (or at least web backend developers, who are the ones I interact with the most) are extremely averse to SQL and normalization.
I think that's what was meant by "reporting stuff, not transactional".
When you have a general idea of what smells bad vs what's okay...why?

I'd rather get it from the LLM and review

Simple, because books don't earn OpenAI and Anthropic a dime.
A book on .... SQL? What is this, the 1970s?
Extremely weird take.