


	by rdli 763 days ago
	I'm working on something like this! It's simple in concept, but there are lots of fiddly bits. A big one is performance (at least without spending $$$$$ on GPUs). I haven't found much on how to tune/deploy LLMs on commodity cloud hardware, which is what I'm trying this out on.

3 comments

leobg 763 days ago

You can use ONNX versions of embedding models. Those run faster on CPU.

Also, don’t discount plain old BM25 and fastText. For many queries, keyword or bag-of-words based search works just as well as fancy 1536 dim vectors.

You can also do things like tokenize your text using the tokenizer that GPT-4 uses (via tiktoken for instance) and then index those tokens instead of words in BM25.
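
For illustration, here is a minimal sketch of BM25 with a pluggable tokenizer, so the index can be built over token IDs instead of words. The class name `BM25Index`, the `simple_tokenize` placeholder, and the `k1`/`b` defaults are assumptions made for this example, not anything from a particular library. To index GPT-4 tokens you would pass tiktoken's `encoding_for_model("gpt-4").encode` as the `tokenize` argument (not imported here so the sketch stays dependency-free).

```python
import math
from collections import Counter


def simple_tokenize(text):
    """Placeholder tokenizer (assumption): lowercase whitespace split."""
    return text.lower().split()


class BM25Index:
    """Tiny in-memory Okapi BM25 index over whatever tokens `tokenize` returns."""

    def __init__(self, docs, tokenize=simple_tokenize, k1=1.5, b=0.75):
        self.tokenize = tokenize
        self.k1, self.b = k1, b
        self.doc_tf = [Counter(tokenize(d)) for d in docs]
        self.doc_len = [sum(tf.values()) for tf in self.doc_tf]
        # Fall back to 1.0 so an empty corpus does not divide by zero.
        self.avg_len = (sum(self.doc_len) / max(len(docs), 1)) or 1.0
        # Document frequency: number of documents containing each token.
        df = Counter()
        for tf in self.doc_tf:
            df.update(tf.keys())
        n = len(docs)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        tf = self.doc_tf[i]
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avg_len)
        total = 0.0
        for t in self.tokenize(query):
            f = tf.get(t, 0)
            if f:
                total += self.idf[t] * f * (self.k1 + 1) / (f + norm)
        return total

    def search(self, query, top_n=3):
        """Return up to top_n (doc_index, score) pairs with score > 0, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tf))]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda p: (-p[0], p[1]))
        return [(i, s) for s, i in scored[:top_n]]
```

Because tokens are only used as dictionary keys, integer token IDs work the same way as word strings.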


rdli 763 days ago

Thanks! I should have been clearer -- embeddings are pretty fast (relatively) -- it's inference that's slow (I'm at 5 tokens/second on AKS).


jnnnthnn 763 days ago

Could you sidestep inference altogether? Just return the top N results by cosine similarity (or full text search) and let the user find what they need?

Models from https://ollama.com also work really well on most modern hardware.
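
As a rough sketch of the "top N results by cosine similarity" idea, assuming the document embeddings are already computed and held in memory as plain lists of floats (the function names here are made up for the example):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(query_vec, doc_vecs, n=3):
    """Return the n best (doc_index, similarity) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [(i, s) for s, i in scored[:n]]
```

This is a brute-force scan, which is usually fine for a few thousand documents. Larger collections would typically want a vector index instead.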


rdli 763 days ago

I'm running ollama, but it's still slow (it's actually quite fast on my M2). My working theory is that with standard cloud VMs, memory <-> CPU bandwidth is an issue. I'm looking into vLLM.
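
A back-of-the-envelope way to check the bandwidth theory: when token generation is memory-bound, each generated token requires reading roughly all of the model weights once, so tokens/second is capped at about memory bandwidth divided by model size. The function below is only that arithmetic. The numbers in the usage note are hypothetical, not measurements of any particular VM or chip.

```python
def max_tokens_per_second(params_billions, bytes_per_param, bandwidth_gb_s):
    """Upper bound on decode speed if every token streams all weights from RAM once.

    params_billions: model size in billions of parameters
    bytes_per_param: e.g. 2.0 for fp16, 0.5 for 4-bit quantization
    bandwidth_gb_s:  sustained memory bandwidth in GB/s (look this up or measure it)
    """
    model_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_gb
```

For example, a 7B model at 4 bits per weight is about 3.5 GB, so a machine that sustains a hypothetical 35 GB/s would top out near 10 tokens/second no matter how many cores it has.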

And as to sidestepping inference, I can totally do that. But I think it's so much better to be able to ask the LLM a question, run a vector similarity search to pull relevant content, and then have the LLM summarize this all in a way that answers my question.
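
The flow described above (ask a question, pull relevant content by vector similarity, have the LLM answer from it) can be sketched roughly as follows. `embed` and `generate` are hypothetical callables you would supply yourself, for instance wrappers around whatever embedding model and LLM you run. The prompt wording is also just an assumption.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def answer_with_context(question, chunks, embed, generate, top_n=3):
    """Retrieve the top_n most similar chunks, then ask the LLM to answer from them.

    embed:    hypothetical callable, text -> list of floats
    generate: hypothetical callable, prompt string -> completion string
    Returns (answer, context_chunks).
    """
    q_vec = embed(question)
    # In a real system the chunk embeddings would be precomputed and stored.
    ranked = sorted(chunks, key=lambda c: _cosine(q_vec, embed(c)), reverse=True)
    context = ranked[:top_n]
    prompt = (
        "Answer the question using only the context below.\n\n"
        + "\n".join(f"- {c}" for c in context)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt), context
```

Returning the retrieved chunks alongside the answer makes it easy to show sources, or to fall back to plain search results when inference is too slow.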


jnnnthnn 763 days ago

Oh yeah! What I meant is having Ollama run on the user's machine. Might not work for the use case you're trying to build for though :)


pizza 763 days ago

This style of embeddings could be quite lightweight/cheap/efficient: https://github.com/cohere-ai/BinaryVectorDB
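
To give a feel for why binary embeddings are cheap, here is a minimal sketch of the general idea: keep one bit per dimension (the sign) and compare vectors by Hamming distance. This is not the linked repository's API. The function names and the sign-threshold quantization are assumptions for illustration.

```python
def binarize(vec):
    """Pack a float vector into an int, one bit per dimension (1 if the value is > 0)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def nearest(query_vec, codes, n=3):
    """Return indices of the n stored codes closest to the query in Hamming distance."""
    q = binarize(query_vec)
    ranked = sorted(range(len(codes)), key=lambda i: (hamming(q, codes[i]), i))
    return ranked[:n]
```

A 1024-dim float32 vector takes 4096 bytes while its binary code takes 128, and the comparison is just an XOR and a bit count. A common refinement is to rescore the top candidates with the original float vectors.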


Tostino 763 days ago

Embedding models are generally lightweight enough to run on CPU, and the embedding can be done in the background while the user isn't using their device.
