by rdli 760 days ago
I'm running Ollama, but it's still slow on the cloud VM (it's actually quite fast on my M2). My working theory is that on standard cloud VMs, memory <-> CPU bandwidth is the bottleneck. I'm looking into vLLM.

And as to sidestepping inference, I can totally do that. But I think it's so much better to be able to ask the LLM a question, run a vector similarity search to pull relevant content, and then have the LLM summarize this all in a way that answers my question.
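
For anyone who wants that flow spelled out, here is a rough sketch of question -> similarity search -> LLM summary. It is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and `llm` is a hypothetical callable you would wire up to whatever you actually run (Ollama, vLLM, etc.). None of these names come from a real library.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, docs: List[str], k: int = 2) -> List[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def answer(question: str, docs: List[str], llm: Callable[[str], str], k: int = 2) -> str:
    """Build a prompt from the retrieved context and hand it to the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs, k))
    prompt = (
        "Answer the question using only this context:\n"
        f"{context}\n\nQuestion: {question}"
    )
    return llm(prompt)  # llm is a hypothetical callable: prompt in, text out
```

In practice you would swap `embed` for real embeddings and a vector store, and pass a function that calls your model as `llm`. The slow part I'm complaining about is that last call.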

1 comment

Oh yeah! What I meant is having Ollama run on the user's machine. Might not work for the use case you're trying to build for though :)