Hacker News new | ask | show | jobs
by behnamoh 888 days ago
you can spawn multiple llama.cpp servers and query them simultaneously. It’s actually better this way because you get to run different models for different purposes or do sanity checks via a second model.
1 comments

that is correct, however I am already using all of my VRAM. it would mean I have to degrade my model quality. I instead decided that I would rather have one solid model, and have all my use cases tied to that one model. using RAM instead proved to be problematic for the reasons I mentioned above.

if I had any free VRAM at all, I would fit faster-whisper before I touch any other LLM lol