by minimaxir 777 days ago
A modern recommendation for UMAP is Parametric UMAP (https://umap-learn.readthedocs.io/en/latest/parametric_umap....), which instead trains a small Keras MLP to perform the dimensionality reduction down to 2D by minimizing the UMAP loss. The advantage is that this model is small and can be saved and reused to predict on unknown new data (a traditionally trained UMAP model is large), and training is theoretically much faster because GPUs are GPUs.

The downside is that the implementation in the Python UMAP package isn't great and creates/pushes the whole expanded node/edge dataset to the GPU, which means you can only train it on about 100k embeddings before going OOM.

The UMAP -> HDBSCAN -> AI cluster labeling pipeline that's all unsupervised is so useful that I'm tempted to figure out a more scalable implementation of Parametric UMAP.


It exists in cuML with a fast GPU implementation. Not sure why cuML is so poorly known though…
I'll give that a look: the feature set of GPU-accelerated ops seems just up my alley for this pipeline: https://github.com/rapidsai/cuml

EDIT: looking through the docs, it's just GPU-accelerated UMAP, not a parametric UMAP which trains a NN model. That's easy to work around though by training a new NN model to predict the reduced dimensionality values and minimizing RMSE.
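A minimal sketch of that workaround, under loud assumptions: `train_surrogate` is a made-up name, the targets `Y` below are synthetic stand-ins for the 2D coordinates a non-parametric UMAP run would produce, and the MLP is written in plain Python only so the snippet runs anywhere (in practice you'd use Keras or PyTorch). It minimizes squared error per sample, which has the same minimizer as RMSE.

```python
import math
import random

def train_surrogate(X, Y, hidden=8, lr=0.05, epochs=100, seed=0):
    """Fit a one-hidden-layer tanh MLP mapping X (n x d) to Y (n x k) with SGD."""
    rng = random.Random(seed)
    d, k = len(X[0]), len(Y[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(k)]
    b2 = [0.0] * k

    def forward(x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(W1, b1)]
        out = [sum(w * hi for w, hi in zip(row, h)) + b
               for row, b in zip(W2, b2)]
        return h, out

    def rmse():
        se = 0.0
        for x, y in zip(X, Y):
            _, out = forward(x)
            se += sum((o - t) ** 2 for o, t in zip(out, y))
        return math.sqrt(se / (len(X) * k))

    history = [rmse()]  # RMSE before training, then after each epoch
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = X[i], Y[i]
            h, out = forward(x)
            err = [o - t for o, t in zip(out, y)]  # gradient of 0.5 * squared error
            # Backpropagate through tanh before touching W2.
            grad_h = [sum(err[m] * W2[m][j] for m in range(k)) * (1 - h[j] ** 2)
                      for j in range(hidden)]
            for m in range(k):
                for j in range(hidden):
                    W2[m][j] -= lr * err[m] * h[j]
                b2[m] -= lr * err[m]
            for j in range(hidden):
                for c in range(d):
                    W1[j][c] -= lr * grad_h[j] * x[c]
                b1[j] -= lr * grad_h[j]
        history.append(rmse())

    def predict(x):
        return forward(x)[1]

    return predict, history

# Synthetic stand-ins: X plays the role of the embeddings, Y the role of
# the 2D coordinates from a UMAP run (NOT real UMAP output).
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(200)]
Y = [[math.sin(x[0] + x[1]), x[2] * x[3]] for x in X]

predict, history = train_surrogate(X, Y)
new_point_2d = predict([0.1, -0.2, 0.3, 0.4])  # project unseen data
```

The surrogate is only as good as its fit to the original projection, so checking held-out RMSE before trusting it on new data seems prudent.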

Tested it out and the UMAP implementation with this library is very very fast compared to Parametric UMAP: running it on 100k embeddings took about 7 seconds, whereas the same pipeline with Parametric UMAP on the same GPU took about half an hour. I will definitely be playing around with it more.
Yeah, we advise Graphistry users to keep GPU umap training sets to < 100k rows, and instead focus on doing careful sampling within that, and multiple models for going beyond that. It'd be more accessible for teams if we could raise the limit, but quality-wise, it's generally fine for security logs, customer activity, genomes, etc.

RAPIDS umap is darn impressive tho. It did the job, so we never had to focus on improving it further. Our bottleneck shifted to optimizing the ingest pipeline to feed umap, so we released cu_cat as a GPU-accelerated automated feature engineering library to get all that data into umap. RAPIDS cudf helps take care of the intermediate IO and wrangling in-between.

Downstream, we generally stopped doing DBSCAN, despite it being so pretty. We replace it with cugraph/GFQL on the umap similarity graph, to avoid quality issues we see in practice, and then visually & interactively investigate the similarity graph in pygraphistry. Once you can see the k-nn similarity edges - and lack thereof - you realize why scatter plot clusterings (visual or algorithmic) are so misleading to analysts, and you treat them with more caution. There is a variety of umap contenders nowadays, but with this pipeline, we haven't felt the need to go beyond. That's a multi-year testament to Leland and team.
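To make the "cluster on the similarity graph instead of the scatter plot" idea concrete, here is a toy sketch. It is an assumption-laden stand-in, not the cugraph/GFQL pipeline described above: `knn_edges` and `connected_components` are hypothetical helpers, the k-NN search is brute force, and connected components is just the simplest possible graph grouping.

```python
import math

def knn_edges(points, k):
    """Return undirected edges linking each point to its k nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def connected_components(n, edges):
    """Label each of n nodes with the root of its component (union-find)."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in edges:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n)]

# Toy vectors: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = knn_edges(points, k=2)
labels = connected_components(len(points), edges)
```

The point of working on edges is that a missing edge between two visually adjacent points is information a 2D scatter plot hides.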

The result is we can now umap and interactively visualize most real world large datasets, database query results, and LLM embeddings that pygraphistry & louie.ai users encounter in seconds. Many years to get here, and now it is so easy!

From a quick glance, it appears that it's because the implementation pushes the entire graph (all edges) to the GPU. Sampling of edges during training could alleviate this.
Indeed, TensorFlow likes pushing everything to the GPU by default whereas many PyTorch DL implementations encourage feeding data from the CPU to the GPU as needed with a DataLoader.
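A rough sketch of what edge sampling could look like, as a framework-free stand-in for a DataLoader-style feed. Everything here is an assumption for illustration: `edge_batches` is a made-up name, the edge list and weights are toy values, and drawing edges in proportion to their weight with random negatives only mirrors the general shape of UMAP-style training, not the package's actual code.

```python
import random

def edge_batches(edges, weights, n_nodes, batch_size, n_batches, n_neg=5, seed=0):
    """Yield minibatches of (head, tail, negatives) from a host-side edge list.

    The full edge list stays in CPU memory; only the rows referenced by one
    batch would need to be moved to the GPU at a time.
    """
    rng = random.Random(seed)
    for _ in range(n_batches):
        # Positive edges, drawn in proportion to their graph weight.
        picked = rng.choices(edges, weights=weights, k=batch_size)
        batch = []
        for head, tail in picked:
            # Negative samples: uniformly random nodes to push away from head.
            negatives = [rng.randrange(n_nodes) for _ in range(n_neg)]
            batch.append((head, tail, negatives))
        yield batch

# Toy graph with 4 nodes and 4 weighted edges (hypothetical values).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
weights = [1.0, 0.5, 0.25, 0.1]
batches = list(edge_batches(edges, weights, n_nodes=4, batch_size=8, n_batches=3))
```

With this shape, peak GPU memory scales with `batch_size * (2 + n_neg)` embedding rows rather than with the total number of edges.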

There have been attempts at a PyTorch port of Parametric UMAP (https://github.com/lmcinnes/umap/issues/580) but nothing as good.

Looks like there is a little motion on this topic:

https://github.com/lmcinnes/umap/pull/1103