Hacker News
by bauefi 478 days ago
It depends on how you do retrieval. If you just use dense embeddings, for example, you can get the latency of one search query down to maybe something like 400ms. In that case multiple sequential lookups would be OK, but your embeddings need to be good enough, of course.
2 comments

It's not just the retrieval: each tool call entails another call to the LLM (ToolMessage), and the result may then require further tool calls. Massive latency.
These ultra-fast embeddings are really cool, because you can just spam them at everything and it's pretty much instant.
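
To put rough numbers on the tool-call latency above, here is a back-of-the-envelope sketch. The cost model is my assumption (one initial LLM call, then every round is one retrieval plus one more LLM call to read the ToolMessage), and the 2-second LLM call is a made-up figure. Only the 400ms retrieval comes from the parent comment.

```python
def agent_latency_s(rounds: int, retrieval_s: float, llm_call_s: float) -> float:
    """Estimate wall-clock seconds for a strictly sequential tool-calling loop.

    Assumed model: one initial LLM call decides to call the tool, then each
    round costs one retrieval plus one LLM call that reads the tool result.
    """
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    return llm_call_s + rounds * (retrieval_s + llm_call_s)


if __name__ == "__main__":
    # 0.4 s retrieval is the figure from the thread; 2.0 s per LLM call is hypothetical.
    for rounds in (0, 1, 3):
        total = agent_latency_s(rounds, retrieval_s=0.4, llm_call_s=2.0)
        print(f"{rounds} tool round(s): about {total:.1f} s")
```

Under these assumptions the LLM calls dominate, so shaving retrieval time only helps so much once the model is calling tools in sequence.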

I was able to get them to answer very simple questions without any vector database or pre-indexing: just expanding the search query with synonyms, then using normal full-text search, using embeddings to match article titles to the query, plus adding a few "personality documents" that are always in every result set no matter what.
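
Roughly, the no-index pipeline described above could look like the sketch below. This is not my actual code: the synonym table, the function names, and the term-overlap scoring are all placeholders invented for illustration, and the embedding-based title matching is left out to keep it short.

```python
import re

# Hypothetical synonym table; a real one could come from a thesaurus or any other source.
SYNONYMS = {"car": ["automobile", "vehicle"], "fix": ["repair"]}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_query(query: str) -> set[str]:
    """Return the query terms plus any synonyms listed for them."""
    terms = set(tokenize(query))
    for term in list(terms):
        terms.update(SYNONYMS.get(term, []))
    return terms


def fulltext_search(terms: set[str], docs: dict[str, str], limit: int = 5) -> list[str]:
    """Rank document titles by how many expanded terms appear in the body."""
    scored = []
    for title, body in docs.items():
        hits = len(terms & set(tokenize(body)))
        if hits:
            scored.append((hits, title))
    # Most hits first; ties broken alphabetically so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:limit]]


def retrieve(query: str, docs: dict[str, str], always_include: list[str]) -> list[str]:
    """Search results plus the documents that go into every result set."""
    results = fulltext_search(expand_query(query), docs)
    return results + [t for t in always_include if t not in results]


if __name__ == "__main__":
    docs = {
        "Auto repair basics": "How to repair an automobile engine.",
        "Baking bread": "Flour, water, yeast.",
        "Persona": "You are a helpful assistant.",
    }
    print(retrieve("fix my car", docs, always_include=["Persona"]))
```

Note that "fix my car" shares no literal word with the repair article here, so it is only found because of the synonym expansion.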

Then I do chunking on the fly based on similarity to the query.
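
A minimal sketch of what on-the-fly chunking by query similarity can look like, under some loud assumptions: `embed` here is a toy bag-of-words counter standing in for a real static embedding model, chunks are single sentences, and `select_chunks` and its byte budget parameter are names I made up for this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real static embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_chunks(query: str, document: str, budget_bytes: int = 1500) -> list[str]:
    """Keep the sentences most similar to the query that fit in the byte budget.

    Sentences with zero similarity are dropped; the rest are returned in
    their original document order.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    q = embed(query)
    scores = [cosine(q, embed(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        size = len(sentences[i].encode("utf-8"))
        if scores[i] <= 0 or used + size > budget_bytes:
            continue
        chosen.append(i)
        used += size
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    doc = (
        "Static embeddings run fast on a CPU. "
        "Bread needs yeast and flour. "
        "Embeddings map text to vectors."
    )
    for chunk in select_chunks("how fast are static embeddings", doc):
        print(chunk)
```

Swapping the toy `embed` for a real model only changes that one function and the similarity call; the budget loop stays the same.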

Retrieval takes about 1 second on a CPU, but then the actual LLM call takes 10 to 40 seconds, because you need about 1500 bytes of context to consistently get something that has the answers in it... Not exactly useful at the moment on cheap consumer hardware but still very interesting.

https://huggingface.co/blog/static-embeddings