| Having done my master's on grammar-assisted text2sql, let me add some context here: - First of all, local inference can never beat cloud inference on cost, for the very simple reason that costs go down with batching. It took me two years to actually understand what batching is: the tensors flowing through the transformer layers have a dimension designed specifically for processing sequences in parallel. Each decoding step has to read the model weights from memory regardless of how many sequences are in the batch, so as long as that read dominates, processing 1 sequence or 128 sequences costs roughly the same. I've read very few articles that stress this, so bear in mind: this is the primary obstacle to local inference competing with cloud inference. - Second, and this is not one to take lightly: LLM-assisted text2sql is not trivial, not at all. You may think it is, you may expect cutting-edge models to get it right, but there are... plenty of reasons models fail so badly at this seemingly trivial task. You may start with an arbitrary article such as https://arxiv.org/pdf/2408.14717 and dig through the references; sooner or later you will stumble on one of the dozens of overview papers, mostly by Chinese researchers (such as https://arxiv.org/abs/2407.10956), that summarize the approaches. Caution: you may feel either reassured that AI will not take over your job, or miserable about how much effort is spent on this task and how badly everything fails in real-world scenarios. - Finally, something I agreed on with a professor advising a doctoral candidate whose thesis, surprisingly, was on the same topic: LLMs are much better at GraphQL and other structured formats such as JSON than at SQL, whose complex grammar is not regular but context-free, so parsing it takes a more powerful machine and very often recursion. - Which brings us to the most important question: why do commercial GPTs fare so much better at this than local models?
Well, it is presumed that the top players not only use MoEs but also employ beam search, perhaps speculative inference, and all sorts of optimizations at the hardware level. While none of this is beyond the comprehension of a casual researcher at a casual university (like myself), you don't get to run it all locally that easily. I have not written an inference engine myself, but I imagine combining MoE and beam search is super complex, as beam search basically means you fork the whole LLM execution state and go back and forth. Not sure how this even works together with batching. So basically, this is too expensive. Besides, atm (to my knowledge) only vllm (the engine) has some sort of reasonably working local beam search. I would've loved to see llama.cpp's beam search get a rewrite, but it stalled. Getting beam search to work with the current Python libs is nearly impossible on commodity hardware, even if you have 48 gigs of RAM, which already means a very powerful GPU. |
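To put rough numbers on the batching point from the first bullet, here is a back-of-envelope sketch in Python. It is only a toy roofline model: the parameter count, memory bandwidth, and peak FLOPs are hypothetical placeholders, and it ignores attention and KV-cache traffic entirely. The one assumption it encodes is that a decoding step reads the weights once no matter how many sequences are in the batch, while the arithmetic grows with the batch.

```python
# Back-of-envelope roofline model for one decoding step of a dense model.
# All hardware and model numbers are hypothetical placeholders, not measurements.

def decode_step_seconds(batch_size, n_params=7e9, bytes_per_param=2,
                        mem_bandwidth=1e12, peak_flops=2e14):
    """Estimate the duration of one decoding step.

    Weights are read from memory once per step regardless of batch size,
    while arithmetic grows linearly with the number of sequences.
    """
    weight_read_time = n_params * bytes_per_param / mem_bandwidth
    # Roughly 2 FLOPs (multiply + add) per parameter per generated token.
    compute_time = 2 * n_params * batch_size / peak_flops
    # The step takes as long as the slower of the two.
    return max(weight_read_time, compute_time)


def cost_per_sequence(batch_size, **kwargs):
    """Step time divided evenly among the sequences in the batch."""
    return decode_step_seconds(batch_size, **kwargs) / batch_size


if __name__ == "__main__":
    for b in (1, 8, 32, 128, 512):
        step_ms = decode_step_seconds(b) * 1e3
        per_seq_ms = cost_per_sequence(b) * 1e3
        print(f"batch={b:4d}  step={step_ms:7.2f} ms  per-seq={per_seq_ms:7.3f} ms")
```

With these made-up numbers the step time is identical for a batch of 1 and a batch of 128, so the per-sequence cost drops by a factor of 128. Only at larger batches does compute take over and the step itself get slower. A single local user sits at batch size 1 and pays the full weight read alone.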
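For anyone who has not seen beam search spelled out, here is a minimal sketch of the "fork the state" idea in plain Python. The `TOY_MODEL` table is a hypothetical stand-in for an LLM (it only looks at the last token, and the token sequences are not meant to be valid SQL); in a real engine the forked state would also include the KV cache of every beam, which is where the memory cost comes from.

```python
import math

# Hypothetical stand-in for an LLM: last token -> next-token probabilities.
# The sequences it produces are toy strings, not meant to be valid SQL.
TOY_MODEL = {
    "<s>": {"SELECT": 0.6, "WITH": 0.4},
    "SELECT": {"*": 0.5, "name": 0.5},
    "WITH": {"cte": 1.0},
    "*": {"FROM": 1.0},
    "name": {"FROM": 1.0},
    "cte": {"AS": 1.0},
    "FROM": {"users": 0.9, "orders": 0.1},
    "AS": {"</s>": 1.0},
    "users": {"</s>": 1.0},
    "orders": {"</s>": 1.0},
}


def next_token_probs(tokens):
    """Toy 'forward pass': depends only on the last token."""
    return TOY_MODEL[tokens[-1]]


def beam_search(num_beams=2, max_steps=6):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                # Finished hypotheses are carried over unchanged.
                candidates.append((score, tokens))
                continue
            for tok, p in next_token_probs(tokens).items():
                # "Forking": every extension gets its own copy of the sequence.
                candidates.append((score + math.log(p), tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
        if all(tokens[-1] == "</s>" for _, tokens in beams):
            break
    return beams


if __name__ == "__main__":
    for score, tokens in beam_search(num_beams=2):
        print(f"p={math.exp(score):.2f}  {' '.join(tokens)}")
```

With `num_beams=1` this degenerates to greedy decoding, which commits to `SELECT` because it looks best locally and ends with a sequence of probability 0.27. With two beams the search keeps the `WITH` branch alive and ends with a sequence of probability 0.4. As for batching, one common way to reconcile the two is to treat each live beam as one more row in the batch, so a step over `num_beams` hypotheses is a single batched forward pass. I have not checked how any particular engine does it.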