by juxtaposicion
983 days ago
We’re also building a billion-scale pipeline for indexing embeddings. Like the author, most of our pain has been scaling. If you only had to do millions, this whole pipeline would be about 100 LoC. But billions? Our system is at 20k LoC and growing. The biggest surprise to me here is using Weaviate at the scale of billions: my understanding was that this would require a tremendous amount of memory (on the order of a TB of RAM), which is prohibitively expensive (10-50k/month for that much memory). Instead, we’ve been using Lance, which stores its vector index on disk instead of in memory.
|
Yeah, a ton of the time and effort has gone into building robustness and observability into the process. When dealing with millions of files, it is imperative to be able to recover from a failure halfway through.
RE: Weaviate: Yeah, we needed to use large amounts of memory with Weaviate, which has been a drawback from a cost perspective, but from a performance perspective it delivers on the requirements of our customers. (On Weaviate we also explored using product quantization.)
What type of performance have you gotten with Lance, both on ingestion and retrieval? Is disk retrieval fast enough?
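For a rough sense of where the "order of a TB in RAM" figure comes from, and why product quantization is worth exploring, here is a back-of-the-envelope sketch. The embedding dimension (256) and the number of PQ subvectors (32) are illustrative assumptions, not numbers from either system in this thread, and the estimate ignores index graph overhead and codebook storage.

```python
# Back-of-the-envelope memory estimate for holding vectors in RAM.
# NOTE: dim=256 and num_subvectors=32 are hypothetical values chosen for
# illustration; they are not taken from the thread.

def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Memory for uncompressed vectors (float32 by default)."""
    return num_vectors * dim * bytes_per_value


def pq_code_bytes(num_vectors: int, num_subvectors: int, bits_per_code: int = 8) -> int:
    """Memory for product-quantized codes: one small code per subvector."""
    return num_vectors * num_subvectors * bits_per_code // 8


NUM_VECTORS = 1_000_000_000  # "billions" scale, lower bound

raw = raw_vector_bytes(NUM_VECTORS, dim=256)
pq = pq_code_bytes(NUM_VECTORS, num_subvectors=32)

print(f"raw float32 vectors: {raw / 10**12:.2f} TB")
print(f"PQ codes only:       {pq / 10**9:.0f} GB")
```

Under these assumptions the raw vectors alone land around a terabyte, while the PQ codes are a small fraction of that, at the cost of some recall. Real numbers depend heavily on the actual dimension and index type.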