| HN Mirror



	by jnnnthnn 807 days ago
	Could you sidestep inference altogether? Just return the top N results by cosine similarity (or full-text search) and let the user find what they need? https://ollama.com models also work really well on most modern hardware.
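
For illustration, returning the top N results by cosine similarity without any LLM call could look roughly like the sketch below. It assumes the documents have already been embedded into plain lists of floats; the names `cosine_similarity` and `top_n` and the toy vectors are hypothetical, not taken from any project in this thread.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(query_vec, docs, n=3):
    """Return the n (score, text) pairs most similar to query_vec.

    docs is an iterable of (text, embedding) pairs (assumed precomputed).
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, purely for demonstration.
    docs = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
    for score, text in top_n([1.0, 0.0], docs, n=2):
        print(f"{score:.3f} {text}")
```

For more than a few thousand documents, a vector index would likely replace the linear scan, but the ranking idea stays the same.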


rdli 807 days ago

I'm running ollama, but it's still slow in the cloud (it's actually quite fast on my M2). My working theory is that with standard cloud VMs, memory <-> CPU bandwidth is an issue. I'm looking into vLLM.

And as to sidestepping inference, I can totally do that. But I think it's so much better to be able to ask the LLM a question, run a vector similarity search to pull relevant content, and then have the LLM summarize this all in a way that answers my question.
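
The flow described above (ask a question, pull relevant content with a similarity search, have the LLM summarize it) might be sketched as follows. This is only a rough outline under stated assumptions: `retrieve` and `generate` are hypothetical callables standing in for whatever vector store and model server (ollama, vLLM, or something else) is actually used, and the prompt wording is made up.

```python
from typing import Callable, List


def build_prompt(question: str, passages: List[str]) -> str:
    """Combine retrieved passages and the question into one prompt string."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(
    question: str,
    retrieve: Callable[[str, int], List[str]],  # hypothetical: vector search
    generate: Callable[[str], str],             # hypothetical: LLM call
    k: int = 3,
) -> str:
    """Retrieve k passages for the question, then ask the model to answer."""
    passages = retrieve(question, k)
    return generate(build_prompt(question, passages))
```

Keeping retrieval and generation as injected functions means the slow part (inference) can be swapped between backends without touching the rest.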


jnnnthnn 807 days ago

Oh yeah! What I meant is having Ollama run on the user's machine. Might not work for the use case you're trying to build for though :)
