I built a pipeline that automatically clusters and visualizes large numbers of text documents in a completely unsupervised manner:

- Embed all the text documents.
- Project the embeddings to 2D using UMAP, which also creates its own emergent "clusters".
- Run k-means clustering with a high cluster count that depends on the dataset size.
- Feed the ChatGPT API ~10 examples from each cluster and ask it to provide a concise label for the cluster.
- Bonus: use DBSCAN to identify arbitrary subclusters within each cluster.

It is extremely effective, and I have a theoretical implementation of a more practical use case: using the same UMAP dimensionality reduction for better inference. There is evidence that current popular text embedding models (e.g. OpenAI ada, which outputs 1536D embeddings) are way too big for most use cases and, as a result, could be giving poorly specified embedding-similarity results, in addition to raising costs for the entire pipeline.
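For anyone who wants to try the clustering and labeling steps, here is a minimal, dependency-free sketch, not my actual code. It assumes the documents are already embedded and projected to 2D (e.g. with umap-learn's `umap.UMAP(n_components=2).fit_transform(embeddings)`), and that k-means runs on that 2D projection. The names `choose_k`, `kmeans`, `sample_per_cluster`, and `label_prompt` are hypothetical, the `docs_per_cluster` heuristic is a made-up placeholder for "cluster count depending on dataset size", and the prompt wording is illustrative. In practice you would probably swap the toy k-means for scikit-learn's `KMeans`. Sending the prompt to the ChatGPT API and the DBSCAN subclustering are left out.

```python
import random
from collections import defaultdict


def choose_k(n_docs, docs_per_cluster=50, min_k=2, max_k=200):
    """Hypothetical heuristic: scale the cluster count with dataset size."""
    return max(min_k, min(max_k, n_docs // docs_per_cluster))


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means over a list of tuples; returns one label per point."""
    # Farthest-first initialisation: deterministic and spreads the centers out.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: move each non-empty cluster's center to the mean of its members.
        members = defaultdict(list)
        for p, lab in zip(points, labels):
            members[lab].append(p)
        for lab, pts in members.items():
            centers[lab] = tuple(sum(col) / len(pts) for col in zip(*pts))
    return labels


def sample_per_cluster(docs, labels, n_examples=10, seed=0):
    """Pick up to n_examples documents from each cluster for the labeling prompt."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for doc, lab in zip(docs, labels):
        by_cluster[lab].append(doc)
    return {lab: rng.sample(m, min(n_examples, len(m))) for lab, m in by_cluster.items()}


def label_prompt(examples):
    """Build the text to send to a chat model (wording is illustrative only)."""
    bullets = "\n".join(f"- {e}" for e in examples)
    return "Give a concise label for the topic shared by these documents:\n" + bullets


if __name__ == "__main__":
    # Toy stand-in for 2D UMAP output: two well-separated groups of points.
    points = [(0, 0), (0, 1), (1, 0), (1, 1), (100, 100), (100, 101), (101, 100), (101, 101)]
    docs = [f"doc about cats {i}" for i in range(4)] + [f"doc about taxes {i}" for i in range(4)]
    labels = kmeans(points, k=2)
    for lab, examples in sample_per_cluster(docs, labels, n_examples=3).items():
        print(f"cluster {lab}:\n{label_prompt(examples)}\n")
```

Each prompt would then go to the chat model as one request per cluster, with the returned label attached to that cluster in the plot.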