by juxtaposicion 983 days ago

We’re also building a billion-scale pipeline for indexing embeddings. As with the author, most of our pain has been scaling. If you only had to do millions, this whole pipeline would be 100 LoC. But billions? Our system is at 20k LoC and growing.

The biggest surprise to me here is using Weaviate at the scale of billions. My understanding was that this would require a tremendous amount of memory (on the order of a TB of RAM), which is prohibitively expensive ($10-50k/month for that much memory).

Instead, we’ve been using Lance, which stores its vector index on disk instead of in memory.
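
As a rough sanity check on the "order of a TB" figure above, here is a back-of-envelope sketch. It assumes float32 storage, counts only the raw vectors (no index overhead), and uses 384 dimensions purely as an illustrative size; none of these parameters come from this comment.

```python
# Back-of-envelope RAM needed to hold raw vectors in memory.
# Assumptions (illustrative, not from the comment): float32 values,
# no index or metadata overhead, 384 dimensions per vector.
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_vectors: int, dims: int) -> int:
    """Bytes needed to store the raw vectors alone."""
    return num_vectors * dims * BYTES_PER_FLOAT32


total = raw_vector_bytes(1_000_000_000, dims=384)
print(f"{total / 1e12:.2f} TB")  # 1.54 TB
```

Larger embeddings or graph-index overhead would push the number higher, so this is closer to a floor than an estimate.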

ddematheu 983 days ago

Co-author of article here.

Yeah, a ton of the time and effort has gone into building robustness and observability into the process. When dealing with millions of files, it is imperative to be able to recover from a failure halfway through.
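
One common way to get that kind of recoverability is a progress checkpoint that a rerun can resume from. The sketch below is a minimal illustration, not the pipeline described in the article; `process_all`, `handle`, and the JSON checkpoint layout are hypothetical names chosen for the example.

```python
import json
import os


def process_all(items, handle, checkpoint_path):
    """Process items in order, resuming from the last recorded position.

    Returns the number of items handled in this run.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]

    for i in range(done, len(items)):
        handle(items[i])
        # Write to a temp file and rename it, so a crash cannot leave a
        # half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"done": i + 1}, f)
        os.replace(tmp_path, checkpoint_path)

    return len(items) - done
```

This gives at-least-once semantics: if the process dies after `handle` returns but before the checkpoint is written, that item is handled again on the next run, so `handle` should be idempotent.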

RE: Weaviate: Yeah, we needed to use large amounts of memory with Weaviate, which has been a drawback from a cost perspective, but from a performance perspective it delivers on the requirements of our customers. (On Weaviate, we explored using product quantization.)
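
For context on why product quantization helps with memory: each vector is split into subvectors, and each subvector is replaced by a small code pointing into a learned codebook. The arithmetic below is a sketch with made-up parameters (384 float32 dimensions, 48 subvectors, 8-bit codes); it is not the configuration used in the article or a Weaviate default.

```python
# Memory per vector before and after product quantization (PQ).
# All parameters are hypothetical and chosen only for illustration.
BYTES_PER_FLOAT32 = 4


def pq_code_bytes(num_subvectors: int, bits_per_code: int = 8) -> int:
    """Bytes per vector once each subvector is stored as a codebook index."""
    return num_subvectors * bits_per_code // 8


raw_bytes = 384 * BYTES_PER_FLOAT32              # 1536 bytes per raw vector
compressed_bytes = pq_code_bytes(num_subvectors=48)  # 48 bytes per PQ code
ratio = raw_bytes / compressed_bytes
print(f"{raw_bytes} B -> {compressed_bytes} B ({ratio:.0f}x smaller)")
```

The codebooks themselves add a fixed cost that does not grow with the number of vectors, and the compression is lossy, so recall generally drops somewhat compared with full-precision vectors.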

What type of performance have you gotten with Lance on both ingestion and retrieval? Is disk retrieval fast enough?

juxtaposicion 983 days ago

Disk retrieval is definitely slower. In-memory retrieval can typically be ~1ms or less, whereas disk retrieval on a fast network drive is 50-100ms. But frankly, for any use case I can think of, 50ms of latency is good enough. The best part is that the cost is driven by disk, not RAM, which means that instead of $50k/month for ~1 TB of RAM you're talking about $1k/mo for a fast NVMe on a fast link. That's 50x cheaper, because disks are 50x cheaper. Paying an extra 50ms of latency to avoid a ~$50k/mo bill is a pretty clear, easy tradeoff.

bryan0 983 days ago

We've been using pgvector at the 100M scale without any major problems so far, but I guess it depends on your specific use case. We've also been using Elasticsearch dense vector fields, which also seem to scale well. It's pricey, of course, but we already have it in our infra, so it works well.

ddematheu 983 days ago

What type of latency requirements are you dealing with? (e.g. lookup time, ingestion time)

Were you already using Postgres, or did you migrate data into it?

juxtaposicion 983 days ago

I'd love to know the answer here too!

I've run a few tests on pg and retrieving 100 random indices from a billion-scale table -- without vectors, just a vanilla table with an int64 primary key -- easily took 700ms on beefy GCP instances. And that was without a vector index.

Entirely possible my take was too cursory; would love to know what latencies you're getting, bryan0!
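
For anyone wanting to repeat this kind of measurement, the shape of the test is roughly the sketch below: pick random primary keys and time a single batched lookup. It uses an in-memory SQLite table of 100k rows as a stand-in so it runs anywhere; it does not reproduce the Postgres setup or the 700ms figure above, and the `items` table and `fetch_random_rows` helper are hypothetical.

```python
import random
import sqlite3
import time

# Small stand-in table: integer primary key plus a dummy payload column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    ((i, f"row-{i}") for i in range(100_000)),
)


def fetch_random_rows(conn, n_rows, table_size, rng):
    """Fetch n_rows random rows by primary key in one query; return (rows, ms)."""
    ids = rng.sample(range(table_size), n_rows)
    placeholders = ",".join("?" * n_rows)
    start = time.perf_counter()
    rows = conn.execute(
        f"SELECT id, payload FROM items WHERE id IN ({placeholders})", ids
    ).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return rows, elapsed_ms


rows, elapsed_ms = fetch_random_rows(conn, 100, 100_000, random.Random(0))
print(f"fetched {len(rows)} rows in {elapsed_ms:.2f} ms")
```

On a real billion-row table, the result depends heavily on whether the touched index and heap pages are already cached, so cold and warm runs are worth timing separately.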

losteric 983 days ago

> 100 random indices from a billion-scale table -- without vectors, just a vanilla table with an int64 primary key -- easily took 700ms on beefy GCP instances.

Is there a write-up of the analysis? Something seems very wrong with that taking 700ms.

bryan0 983 days ago

We have lookup latency requirements on the Elastic side. pgvector is currently a staging and aggregation database for us, so lookup latency is not so important there. Our requirement right now is to be able to embed and ingest ~100M vectors/day, which we can achieve without any problems.
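
To put ~100M vectors/day in per-second terms, a quick conversion (assuming the load is spread evenly over 24 hours, which is an assumption and not something stated here):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(vectors_per_day: int) -> float:
    """Sustained vectors per second needed to hit a daily target."""
    return vectors_per_day / SECONDS_PER_DAY


rate = required_rate(100_000_000)
print(f"{rate:.0f} vectors/s")  # 1157 vectors/s
```

Bursty workloads would need a correspondingly higher peak rate.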

For future lookup queries on pgvector, we can almost always pre-filter on an index before the vector search.

Yes, we use Postgres pretty extensively already.
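
The pre-filtering idea mentioned above can be sketched in plain Python: narrow the candidates with a cheap metadata predicate first, then rank only the survivors by vector similarity. The `tenant` field and the brute-force cosine ranking are assumptions for illustration; in pgvector the equivalent would be a `WHERE` clause on an indexed column combined with an `ORDER BY` on a vector distance operator.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prefiltered_search(rows, query, tenant, k):
    """Filter rows on metadata first, then rank the remainder by similarity."""
    # Step 1: cheap filter (stands in for an indexed WHERE clause).
    candidates = [r for r in rows if r["tenant"] == tenant]
    # Step 2: similarity ranking over the much smaller candidate set.
    candidates.sort(key=lambda r: cosine_similarity(r["vec"], query), reverse=True)
    return candidates[:k]


rows = [
    {"id": 1, "tenant": "a", "vec": [1.0, 0.0]},
    {"id": 2, "tenant": "a", "vec": [0.0, 1.0]},
    {"id": 3, "tenant": "b", "vec": [1.0, 0.0]},
]
top = prefiltered_search(rows, query=[1.0, 0.0], tenant="a", k=1)
print([r["id"] for r in top])  # [1]
```

The smaller the filtered set, the less the vector search has to scan, which is why a selective pre-filter can make lookup latency much less of a concern.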

omneity 983 days ago

What size are your embeddings?

bryan0 983 days ago

384 dims. We're using: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...

esafak 983 days ago

What kind of retrieval performance are you observing with Lance?

juxtaposicion 983 days ago

For a "small" dataset of 50M and 0.5TB in size, returning 20 results, I get around 50-100ms.
