


	by rdli 806 days ago
	I'm running Ollama, but it's still slow (though it's actually quite fast on my M2). My working theory is that on standard cloud VMs, memory-to-CPU bandwidth is the bottleneck. I'm looking into vLLM. As for sidestepping inference, I can totally do that, but I think it's much better to be able to ask the LLM a question, run a vector similarity search to pull relevant content, and then have the LLM summarize it all in a way that answers my question.
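
For anyone following along, here is a rough sketch of the flow described above (question, then vector similarity search, then a prompt for the LLM to summarize). It is not the poster's actual code. Everything in it is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `DOCS` is made-up sample data, and `build_prompt` only assembles the text you would then send to whatever model you run (Ollama, vLLM, or something else).

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (hypothetical stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_k(question, docs, k=2):
    """Return the k documents most similar to the question."""
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(question, docs, k=2):
    """Assemble retrieved context plus the question into a single prompt string."""
    context = "\n".join(f"- {d}" for d in top_k(question, docs, k))
    return (
        "Answer the question using only this context:\n"
        f"{context}\n\nQuestion: {question}"
    )

# Made-up sample corpus, for illustration only.
DOCS = [
    "vLLM is a serving engine for large language models.",
    "Ollama runs language models locally on a laptop.",
    "Sourdough bread needs a long fermentation.",
]

question = "How do I run language models locally?"
prompt = build_prompt(question, DOCS, k=1)
print(prompt)  # this string is what you would hand to the LLM for the final answer
```

In a real setup you would swap `embed` for an actual embedding model and keep the vectors in an index rather than re-embedding every document per query. The slow part discussed in this thread is the final LLM call, which this sketch deliberately leaves out.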

1 comment

Oh yeah! What I meant is having Ollama run on the user's machine. Might not work for the use case you're trying to build for though :)