Hacker News new | ask | show | jobs
by larodi 418 days ago
Having done my masters on the topic of grammar-assisted text2sql let me add some additional context here:

- first of all local inference can never beat cloud inference for the very simple reason that costs go down with batching. it took me two years to actually understand what batching is - the LLM tensors flowing through transformer layers has a dimension designed specifically for processing data in parallel. so no matter if you process a 1 sequence or 128 sequences the costs are the same. i've read very few articles overstating this, so bear in mind - this is the primary stopper for competing local inference with cloud inference.

- second, and this is not a light one to take - LLM-assisted text2sql is not trivial, not at all. you may think it is, you may expect cutting-edge models to do it right, but there are ...plenty of reasons models fail so badly at this seemingly trivial task. you may start with arbitrary article such as https://arxiv.org/pdf/2408.14717 and dig the references, sooner or later you will stumble on one of dozens overview papers by mostly Chinese researchers (such as https://arxiv.org/abs/2407.10956) where overview of approaches is summarized. Caution: you may feel both inspired AI will not take over your job, or you may feel miserable how much effort is spent on this task and how badly everything fails in real-world scenarios

- finally, something we agreed with a professor advising a doctorate candidate whose thesis surprisingly was on the same topic. basically given GraphQL and other structured formats such as JSON, which LLMs are much better leaned on than the complex grammar of SQL which is not a regular grammar, but context-free one, which takes more complex machines to parse it and also very often recursion.

- which brings us to the most important question - why commercial GPTs fare so much better on it than local models. well, it is presumed top players, not only use MoEs but they also employ beam search, perhaps speculative inference and all sorts of optimizations on the hardware level. while this all is not beyond comprehension for a casual researcher at a casual university (like myself) you don't get to easily run this all locally. I have not written an inference engine myself, but I imagine MoE and beam search is super compled, as beam search basically means - you fork the whole LLM execution state and go back and forth. Not sure how this even works together with batching.

So basically - this is too expensive. Besides atm (to my knowledge) only vllm (the engine) has some sort of reasonably working local beam search. I would've loved to see llama.cpp's beam search get a rewrite, but it stalled. Trying to get beamsearch working with current python libs is nearly impossible for commodity hardware, even if you have 48gigs of ram, which already means a very powerful GPU.

3 comments

Sounds like an interesting masters thesis. Is your masters thesis available online somewhere?
Well, not sure about the final doc that went to the university, but this is the almost final draft.

https://docs.google.com/document/d/e/2PACX-1vSyWbtX700kYJgqe...

Since its in Cyrillic you should perhaps use a translation service. There are some screens showing results, though as I was really on a tight deadline, and its not a PHD but masters thesis, I decided to not go into in-depth evaluation of the proposed methodology against SPIDER (https://yale-lily.github.io/spider). Even though you can find the simplifed GBNF grammar, also some of the outputs. The grammar, interestingly it benefits/exploits a bug in llama.cpp which allows some sort of recursively-chained rules. Bibliography is in English, but really - there is so much written on the topic, by no means comprehensive.

Sadly no open inference engine (at time of writing) was both good enough in beam search, and grammars, so this whole things needs to perhaps be redone in pytorch.

If I find myself in a position to do this for commercial goals, I'd also explore the possibility of having human-catered SQLs against the particular schema, in order to guide the model better. And then do RAG on the DB for more context. Note: I'm already doing E/R model reduction to the minimal connected graph which includes all entities of particular interest to the present query.

And finally, since you got that far - the real real problem with restricting LLM output with grammars is the tokenization. Because all parsers work reading one char at a time, and tokens are very often few chars, so the parser in a way needs to be able to "lookahead", which it normally does not. I believe OpenAI wrote they realized this also, but I can't really find the article atm.

Thanks. Took a quick look and definitely needed to use Google Translate but seems to have worked to get the gist of it.
There's local applications of parallel processing; your average chatbot wouldn't use it, but a research bot with multiple simultaneous queries will, for example.

Better local beamsearch would be really nice to have, though.

I do wonder if recursion is particularly hard for LLMs, given that they have a hard limit on how much they can loop for a given token. (Absent beam search, reasoning models, and other trickery.)
Given a prolog (not problog, but the non-stochastic one) source is a parametric grammar, we can perhaps* argue the inference on the programming logic level can be unfolded by using a pen and pencil. think L-systems, they are self-similar, and recursively defined. The catch is that the whole sequence gets rewritten on each step. If you can get the LLM to do this as it progresses with generation - you get recursion. Question is whether you can get the LLM rewrite the context window, and my bet would be someone is already working on it.

* I say perhaps, because PROLOG engines normally don't rewrite strings like crazy while doing inference, so my statement may be somewhat off.