by lmeyerov
849 days ago
Yes, but not for the reasons you're thinking.

- If you have a fixed time budget and increase the GPU memory+compute available, you can directly query a bigger model. Raw models are basically giant lookup functions, and without the extra memory+compute, they'll spill to slower layers of your memory hierarchy, e.g., GPU RAM -> CPU RAM -> disk. Likewise, with MoE models, there are multiple expert sub-models being queried concurrently.

- Most 'good' LLM systems are not just direct model calls, but code-based agent frameworks on top that call code tools, analyze the results, and decide to edit+retry things. For example, if doing code generation, they may decide to run lint analysis & type checking on a generated output and, if there are issues, ask the LLM to try again. In Louie.AI, we will even generate database queries and run GPU analytics & visualizations in on-the-fly Python sandboxes. These systems will do backtracking, retries, etc., and > 50% of the quality can easily come from these layers: LLM leaderboards like HumanEval increasingly report both the raw model and the agent framework running on top of it. All this adds up and can quickly become more expensive than the LLM. So better systems can enable more here too.
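To make the generate, check, retry loop concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how Louie.AI or any particular framework does it: `generate_with_retries`, `check_syntax`, and `make_fake_llm` are hypothetical names, the "LLM" is a canned stub so the example runs offline, and a plain `ast.parse` syntax check stands in for real lint and type-checking tools.

```python
import ast
from typing import Callable, Optional, Tuple


def check_syntax(code: str) -> Optional[str]:
    """Stand-in for lint/type checking: return an error message, or None if OK."""
    try:
        ast.parse(code)
        return None
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"


def generate_with_retries(
    llm: Callable[[str], str], task: str, max_attempts: int = 3
) -> Tuple[str, int]:
    """Ask the model for code, check it, and feed failures back until it passes."""
    prompt = task
    for attempt in range(1, max_attempts + 1):
        code = llm(prompt)
        error = check_syntax(code)
        if error is None:
            return code, attempt
        # Each retry is another model call, which is where the extra cost comes from.
        prompt = (
            f"{task}\n\nThe previous attempt failed a check ({error}). "
            f"Please fix it:\n{code}"
        )
    raise RuntimeError(f"no passing output after {max_attempts} attempts")


def make_fake_llm() -> Callable[[str], str]:
    """Hypothetical stub model: first reply is broken, second reply is valid."""
    replies = iter(
        [
            "def add(a, b) return a + b",  # missing colon, fails the check
            "def add(a, b):\n    return a + b",
        ]
    )
    return lambda prompt: next(replies)


if __name__ == "__main__":
    code, attempts = generate_with_retries(make_fake_llm(), "Write add(a, b).")
    print(f"accepted after {attempts} attempt(s):\n{code}")
```

In a real system the check step would call out to actual tools (a linter, a type checker, a test run in a sandbox), and each failed round adds model calls plus tool time, which is how the layers on top can end up costing more than the single direct query.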
So MoE models are a bit like thinking tools running concurrently, right? Sieving through training data on paths that are the same contextually, but different in terms of specificity and sensitivity.
If the agents, experts, or architectures (the code) don't have the minimum required amount of memory & processing power, they might even miss entire bunches of tokens that are, or might be, relevant within the given context (the prompt) and the predicted/requested context. So more processing power and/or time is relevant only in proportion to the size of the training data (tokens and weights) that has to be queried at inference time.
Now here's where I find myself exactly within the realm that I was in when I phrased my question: analysing the result of a request and evaluating different sets of tokens, which, I now understand, makes much more sense within the subject of code generation than with the recitation of facts or bits of narratives.
Generated code has functions (things to do with other things). Functions can be written more or less efficiently, while even the least efficient code works "more than good and fast enough". There is no value in looping through versions of fact and fiction when the answer fits the expectation. And if it doesn't fit, users can have an actual conversation. That gives me another part of my answer: more processing power only becomes relevant in relation to the number of concurrent requests and the parts of the training data that are queried at inference time.
No single request will ever query so much data at the same time that memory and compute become a bottleneck.
It definitely can become a bottleneck when a long/large/broad (but specific) request gets processed by MoEs simultaneously, or when versions of results of engineering tasks are being evaluated. But that is simply not within the task or design of current LLMs and is instead added on top (or as a wrapper, for example, for which I still fail to find a non-replaceable use case, while also still being certain that I will find one once I get to LLMs and AIs).
Again, thanks!