you can spawn multiple llama.cpp servers and query them simultaneously. It’s actually better this way because you get to run different models for different purposes or do sanity checks via a second model.
that is correct, however I am already using all of my VRAM. it would mean I have to degrade my model quality. I instead decided that I would rather have one solid model, and have all my use cases tied to that one model. using RAM instead proved to be problematic for the reasons I mentioned above.
if I had any free VRAM at all, I would fit faster-whisper before I touch any other LLM lol
if I had any free VRAM at all, I would fit faster-whisper before I touch any other LLM lol