|
|
|
|
|
by newhouseb
1033 days ago
|
|
I think this is likely a consequence of a couple of factors: 1. Fancy token selection w/in batches (read: beam search) is probably fairly hard to implement at scale without a significant loss in GPU utilization. Normally you can batch up a bunch of parallel generations and just push them all through the LLM at once because every generated token (of similar prompt size + some padding perhaps) takes a predictable time. If you stick a parser in between every token that can take variable time then your batch is slowed by the most complex grammar of the bunch. 2. OpenAI appears to work under the thesis articulated in the Bitter Lesson [i] that more compute (either via fine-tuning or bigger models) is the least foolish way to achieve improved capabilities hence their approach of function-calling just being... a fine tuned model. [i] http://www.incompleteideas.net/IncIdeas/BitterLesson.html |
|