by FezzikTheGiant
765 days ago
Are they able to run at a good speed? I'm just wondering what the economics would look like if I want to create agents in my games. I don't think many players are going to be willing to go along with usage-based / token-based pricing. That's the biggest roadblock to building LLM-based games right now. Is there a way to reliably package these models with existing games and make them run locally? This would virtually make inference free, right? From my limited understanding of this field, if smaller models can run reliably and quickly on consumer hardware, that would be a game changer.
|
Not on most consumer computers, which likely lack a dedicated GPU. My M2 struggles with a 7B model (it's the only thing that makes it warm), and the token speed is unbearable. I switched to remote APIs for the speed.
If you are targeting gamers with a GPU, the answer may change, but as others have pointed out, there are numerous issues here.
> This would virtually make inference free right?
Yes-ish, if you are only counting your dollars. However, it will slow their computer down and give slow response times, which will hurt adoption of your game.
If you want to go this route, I'd start with a 2B-sized model and not worry about shipping it nicely. Get some early users to see if this is the way forward.
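For a rough sense of why 2B is an easier starting point than 7B, here is a back-of-envelope estimate of how much memory the weights alone take. It's only a sketch: `weight_memory_gib` is a made-up helper, the bits-per-weight values are assumptions about how the model is quantized, and it ignores the KV cache, activations, and whatever the game itself needs.

```python
def weight_memory_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights only, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumed configurations: 16-bit weights vs. 4-bit quantized weights.
    for params, bits in [(7, 16), (7, 4), (2, 4)]:
        gib = weight_memory_gib(params, bits)
        print(f"{params}B model at {bits}-bit: ~{gib:.1f} GiB of weights")
```

Memory is only half the story (token speed depends on the hardware too), but it shows how much headroom a smaller model leaves for the game.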
I suspect that remote LLM calls with sophisticated caching (cross-user / cross-conversation / pre-generated) are worth exploring as well. IIRC, people suspected gpt-3.5-turbo was caching common queries and skipping the LLM when it could, for the speed.
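To make the caching idea concrete, here is a minimal sketch of an exact-match cache in front of a remote call. Everything in it is hypothetical: `CachedLLM`, `normalize`, and the `call_remote` callback are illustration names, not any provider's API, and a real cross-user cache would live on a server and probably need fuzzier matching than lowercasing and whitespace cleanup.

```python
import hashlib
from typing import Callable, Dict


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivially different prompts match."""
    return " ".join(prompt.lower().split())


class CachedLLM:
    """Wraps a remote LLM call with an in-memory, exact-match cache."""

    def __init__(self, call_remote: Callable[[str], str]) -> None:
        self._call_remote = call_remote  # hypothetical remote API wrapper
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(normalize(prompt).encode("utf-8")).hexdigest()

    def preload(self, pairs: Dict[str, str]) -> None:
        """Seed the cache with pre-generated replies (e.g. common NPC lines)."""
        for prompt, reply in pairs.items():
            self._cache[self._key(prompt)] = reply

    def ask(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Cache miss: pay for one remote call, then remember the reply.
        self.misses += 1
        reply = self._call_remote(prompt)
        self._cache[key] = reply
        return reply
```

You would pass in whatever function actually hits your provider, preload replies for the prompts you expect most, and watch `hits` vs `misses` to see whether the cache is saving enough calls to matter.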