by gpderetta 717 days ago
Using an LLM is just one way to generate embeddings. To do k-means you still need to pick a distance function, such as Jaccard. k-means is probably not ideal for near-duplicates anyway, and you can use min-hash as a pre-pass to speed it up.
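To make the min-hash point concrete, here is a rough sketch of how a min-hash signature approximates Jaccard similarity between two documents. Everything in it is an assumption for illustration: the 5-character shingling, the salted `blake2b` hash family, and the helper names (`shingles`, `minhash`, `estimate_jaccard`) are mine, not something from the thread or from any particular library.

```python
import hashlib

def shingles(text, k=5):
    """Return the set of k-character shingles of whitespace-normalized text."""
    text = " ".join(text.lower().split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def _hash(seed, item):
    # One member of a hash family: blake2b salted with the seed (hypothetical choice).
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")

def minhash(items, num_perm=128):
    """Signature: for each seeded hash function, the minimum hash over the set."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_perm)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Comparing two fixed-length signatures is much cheaper than comparing full shingle sets, which is why it works as a pre-pass before any heavier clustering.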

I don't think the vector DB adds much. You could use it to speed up the lookup of the min-hash sketches if you have hundreds of millions of documents, but it is probably overkill.

1 comment

I was picturing the deduplication as a map-reduce process over a huge (petabyte-scale) dataset, where every worker is blind to the embeddings other workers have already generated. In that case, pushing each worker's embeddings into a shared vector DB, and having it cluster the vectors (maybe incrementally) as it receives them, would act as the "reduce" step.
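A toy, single-process sketch of that map/reduce split, with the caveat that it swaps in min-hash signatures with LSH banding for the LLM embeddings and a plain in-memory dict for the shared vector DB. The names `map_worker` and `reduce_candidates`, and the choice of 32 hashes in 8 bands, are hypothetical. Each "worker" only sees its own document; the grouping by band key is the only place documents meet.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 32             # hashes per signature (assumed value)
BANDS = 8                 # LSH bands (assumed value)
ROWS = NUM_PERM // BANDS  # hashes per band

def signature(text, k=5):
    """Min-hash signature over k-character shingles, as a list of 8-byte digests."""
    text = " ".join(text.lower().split())
    grams = {text[i:i + k] for i in range(max(1, len(text) - k + 1))}
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest()
            for g in grams))
    return sig

def map_worker(doc_id, text):
    """Map step: sees one document, emits one (band key, doc id) pair per band."""
    sig = signature(text)
    for band in range(BANDS):
        chunk = b"".join(sig[band * ROWS:(band + 1) * ROWS])
        yield (band, chunk), doc_id

def reduce_candidates(emitted):
    """Reduce step: documents sharing any band key become candidate duplicate pairs."""
    buckets = defaultdict(set)
    for key, doc_id in emitted:
        buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Usage would look roughly like `reduce_candidates(kv for doc_id, text in docs.items() for kv in map_worker(doc_id, text))`. In a real pipeline the grouping would be done by the shuffle stage or the shared store, not a local dict, and the candidate pairs would still need to be verified.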