by derefr
711 days ago
As a document clustering / dataset deduplication technique, how well does "throwing ML at the problem" (e.g. using a pre-trained LLM encoder to generate vector embeddings of your documents, and then sticking those vectors into a vector DB and doing k-means to cluster the vectors) compare, quality-wise and performance-wise, to these simpler discrete-algorithm methods?

I don't think the vector DB adds much. You could use it to speed up the lookup of the min-hash sketches if you have hundreds of millions of documents, but it is probably overkill.
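
For anyone unfamiliar with what a min-hash sketch looks like in practice, here is a rough, stdlib-only sketch. The function names, the 5-character shingles, and the 128 hash functions are my own illustrative choices, not anything from the article. It estimates Jaccard similarity between two documents by comparing their sketches position by position.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(items, num_perm=128):
    """Build a sketch: for each salted hash function, keep the minimum hash value."""
    sketch = []
    for seed in range(num_perm):
        # A different salt per position stands in for a different hash function.
        salt = seed.to_bytes(8, "little")
        sketch.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return sketch


def estimate_jaccard(sketch_a, sketch_b):
    """The fraction of matching positions approximates the Jaccard similarity of the sets."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


doc_a = "The quick brown fox jumps over the lazy dog"
doc_b = "The quick brown fox jumped over the lazy dog"
doc_c = "Vector databases index high-dimensional embeddings"

sk_a, sk_b, sk_c = (minhash(shingles(d)) for d in (doc_a, doc_b, doc_c))
near = estimate_jaccard(sk_a, sk_b)  # near-duplicate pair
far = estimate_jaccard(sk_a, sk_c)   # unrelated pair
print(f"near-duplicate: {near:.2f}, unrelated: {far:.2f}")
```

Comparing every pair of sketches like this is quadratic in the number of documents, which is where an index (typically LSH banding over the sketch, or possibly a vector DB as mentioned above) would come in for very large corpora.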