Hacker News new | ask | show | jobs
by derefr 711 days ago
As a document clustering / dataset deduplication technique, how well does "throwing ML at the problem" (e.g. using a pre-trained LLM encoder to generate vector embeddings of your documents, and then sticking those vectors into a vector DB and doing k-means to cluster the vectors) compare, quality-wise and performance-wise, to these simpler discrete-algorithm methods?
2 comments

Using an LLM is just one of the ways to generate embedding. To do k-means you still need to pick a distance function, like jaccard; k-means is probably not ideal for near duplicates, and you can use min-hash to speed up k-means as a pre-pass.

I don't think the vector DB adds much. You could use it to speed up the lookup of the min-hash sketches if you have hundreds of millions of documents, but it is probably overkill.

I was picturing doing the deduplication as a map-reduce process over a huge (petabytes) dataset, where every worker is blind to what embeddings other workers have already generated. In such a case, shoving the embeddings generated by each worker into a shared vector DB, and having it (maybe incrementally) clustering the vectors as it receives them, would be acting as the "reduce" step.
I have seen it work better than LSH.

Each time you embed a document you search for approximate nearest neighbours before adding it, so it is O(N) like MinHash. Vector indexes like HNSW and PQ have better performance/quality tradeoffs than SimHash LSH which is the analogue of MinHash for cosine distance.

The quality depends on what you mean by near duplicate and the embedding model you use. Current models work well, and if you have labelled data you can fine tune them to be better.

The main drawback is the additional cost of embedding all the documents, especially for longer documents. But this cost has dropped really quickly with smaller models, better optimisations, and faster hardware.

If this is true, then why do you think that — as the OP article states — the developers of GPT3 chose to use non-ML-based techniques to deduplicate their dataset, when they would be the most equipped to use ML-based approaches?

Just the pure compute cost of needing to run an ML encoder over petabytes of data?

Or maybe because for their use-case — eliminating redundancy to reduce total dataset size and therefore training time — a non-domain-specific vectorization with a high-false-negative cluster-discovery rate was acceptable, because it just meant they'd "compress" the dataset slightly less well, and so get slightly more training time? (At the expense of increased bias in training toward the saliency of the features that weren't dedup'ed out; but that was going to happen regardless, and they likely already had a fully-general technique later in the pipeline for countering that.)