Hacker News
by phillipcarter 1040 days ago
I find it a little funny that Redis is considered here. We use it! We just store vectors in redis, fetch what we need, and run cosine similarity in memory. It’s very fast and works well. It’s not suitable for large amounts of data, but if your “knowledge base” can be measured in MB of vectors (instead of GB or TB) then it’s worth considering.

I’m just not sure if I’d consider it a database. It’s just a long lived cache for us.
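
For anyone curious what "fetch the vectors and run cosine similarity in memory" can look like, here is a minimal sketch, not the commenter's actual code. The plain dict stands in for Redis, and `pack_vector`, `unpack_vector`, and `top_k` are hypothetical helper names. It sticks to the standard library so it runs anywhere; in practice NumPy would likely be the faster choice.

```python
import math
import struct


def pack_vector(vec):
    """Serialize a list of floats to little-endian float32 bytes."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(blob):
    """Inverse of pack_vector: float32 bytes back to a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, store, k=3):
    """Score every stored vector against the query and keep the best k."""
    scored = [
        (key, cosine_similarity(query, unpack_vector(blob)))
        for key, blob in store.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Assumption: a dict of key -> bytes stands in for values fetched from Redis.
store = {
    "doc:a": pack_vector([1.0, 0.0, 0.0]),
    "doc:b": pack_vector([0.0, 1.0, 0.0]),
    "doc:c": pack_vector([0.5, 0.5, 0.0]),
}
results = top_k([1.0, 0.0, 0.0], store, k=2)
```

This is a brute-force scan, so it only makes sense at the "MB of vectors" scale mentioned above.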

How do you generate your vectors?

I'm working on something that needs a similar, small and fast, vector search implementation. Crucially, we also need fast indexing for our use case, but a bottleneck we're hitting is the time it takes to generate vector embeddings for the larger documents in our dataset (a few megabytes in our case). What's the fastest way to approach that?

Are you tied to any particular transformer model? Using a smaller model, throwing more hardware at the problem, or generating embeddings in parallel are easy ways to make it faster. Depending on what you're doing with the output you may also consider truncating your documents (can be good for stuff like clustering) or breaking apart your documents (can improve search performance).

Another option if you just want search (and aren't training or tuning your own models) is a managed search offering where you aren't responsible for generating embeddings.

Thanks for the advice! We're not tied to any model, no.

Naively I guess, at first we hoped to get by using a 3rd party API. We're hosted in GCP and tried using the Vertex AI `textembedding-gecko` model initially. But now we're investigating running models on our own infra, although not sure where we've got with it yet as someone else is working on that.

If you're committed to using a 3rd-party API, then parallelizing your API calls seems like the easiest way to speed things up. The benefits of a 3rd party API are - of course - that you're likely going to be able to generate embeddings using a much more powerful model. That being said, you may not need something as powerful as PaLM and having everything go over a network might just take too long. IME (which is entirely use-case dependent) something like SentenceTransformers (even the smallest pretrained models) can get you up and running on your own infra pretty quickly and generate embeddings with reasonable performance in a reasonable amount of time on modest hardware.
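
A rough sketch of parallelizing embedding calls with a thread pool, assuming the calls are network-bound. `embed_text` here is a hypothetical placeholder that just sleeps and returns a fake vector; swap in whatever client you actually use, and mind the provider's rate limits when picking `max_workers`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def embed_text(text):
    """Hypothetical stand-in for a network call to an embedding API."""
    time.sleep(0.05)  # simulated network latency
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def embed_all(texts, max_workers=8):
    """Embed texts concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(embed_text, texts))


texts = [f"chunk {i}" for i in range(16)]
vectors = embed_all(texts)
```

Threads are enough here because each call spends its time waiting on I/O rather than computing. If you run the model locally instead, batching inputs into the model is usually the better lever.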

It's just OpenAI embeddings. We fetch them and push them to Redis with a 30-day TTL. The backing data that's embedded rarely changes, so we don't need to create new embeddings very often. We batch what needs to be embedded.
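
Roughly, that batch-and-cache pattern could be sketched like this. Everything here is illustrative: `TTLCache` is a hypothetical in-memory stand-in for Redis, and `refresh` and `embed_batch` are made-up names. With redis-py, the `set` call takes an `ex` argument (seconds) for the expiry.

```python
import time

TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days


class TTLCache:
    """Tiny in-memory stand-in for a Redis-style cache with expiry."""

    def __init__(self, clock=time.time):
        self._data = {}
        self._clock = clock

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return None
        return value


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def refresh(cache, docs, embed_batch, batch_size=2):
    """Embed only the docs missing from the cache, one batch per call."""
    missing = [key for key in docs if cache.get(key) is None]
    for keys in batched(missing, batch_size):
        vectors = embed_batch([docs[key] for key in keys])
        for key, vector in zip(keys, vectors):
            cache.set(key, vector, TTL_SECONDS)
    return missing
```

`embed_batch` is assumed to take a list of texts and return one vector per text, in the same order.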

The full RAG workflow - using ADA to embed the user input, deserialize embeddings, run cosine similarity, and call gpt-3.5-turbo - is about 3 seconds end-to-end to get a result.

Thanks!

OpenAI embeddings are 1 per request payload, right? Have you hit any rate limits doing that?

We have a performance budget of ~1 second for the generate-index-search pipeline, which may or may not be feasible. I discounted OpenAI because it seemed like we're guaranteed to hit the rate limit if we flood them with concurrent requests for embeddings. Typical corpus size that we need to work with is 20 concurrent documents ranging from ~100kb to ~2mb. Chunking those documents to fit the 8k token context window balloons the request count further.

You absolutely want to chunk them smaller than 8k. Have you tested different chunk strategies? It can make a huge difference for actually recalling useful information in small enough chunks to be usable.
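
As a starting point for experimenting, here is a minimal sketch of fixed-size chunking with overlap. It counts whitespace-separated words as a crude proxy for tokens (an assumption; a real tokenizer will give different counts), and `chunk_words` and its defaults are hypothetical.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into word windows that overlap by `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(text, chunk_size=4, overlap=1)
# chunks == ["w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9"]
```

Varying `chunk_size` and `overlap` and checking retrieval quality on a handful of real queries is one cheap way to compare strategies.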

Thanks for the tip, I haven't played around with chunk size much at all so far.

marqo.ai has excellent indexing throughput as vector generation and vector retrieval are both contained within a marqo cluster. You can use it with multi-gpu, cpu, etc. It's also horizontally scalable.