| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by dleeftink 894 days ago

Thanks for sharing. I 'd like to know what the (re)compute time might be when adding, say, another million documents using this pipeline. The cluster embedding approach in my view, while streamlined, still adds a (sometimes significant) timebump when high throughput is required.

I see some significant speedups can be achieved when discretising dimensions into buckets, and doing a simple frequency count of associated buckets -- leaving only highly related buckets per document. These 'signatures' can then be indexed LSH style and a graph construed from documents with similar hashes.

When the input set is sufficiently large, this graph contains 'natural' clusters, without any UMAP or k-means parameter tuning required. When implemented in BQ, I achieve sub minute performance for 5-10 million documents, from indexing to clustering.