by lmeyerov 226 days ago

Fwiw, we are heavy UMAP users (pygraphistry), and find CPU UMAP fine for interactive use at up to 30K rows and GPU UMAP at up to 100K rows, then generally switch to a trained mode when > 100K rows. Our use case is often highly visual: seeing correlations, and linking similar entities together into explorable & interactive network diagrams. For headless use, like daily anomaly detection, we do this at much larger scales.
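
The row-count rule of thumb above can be written down as a small dispatch function. This is only a sketch: the name `choose_umap_mode`, the mode labels, and the fallback for machines without a GPU are assumptions for illustration, not pygraphistry's API. Only the 30K and 100K thresholds come from the comment.

```python
def choose_umap_mode(n_rows: int, has_gpu: bool) -> str:
    """Pick a UMAP execution mode from the row count (hypothetical helper)."""
    cpu_max_rows = 30_000    # "CPU fine for interactive use at up to 30K rows"
    gpu_max_rows = 100_000   # "GPU at 100K rows"

    if n_rows <= cpu_max_rows:
        return "cpu"
    if n_rows <= gpu_max_rows and has_gpu:
        return "gpu"
    # Assumption: with no GPU, anything past the CPU limit also uses the
    # trained (fit on a sample, then transform the rest) mode.
    return "trained"


for n in (10_000, 50_000, 1_000_000):
    print(n, choose_umap_mode(n, has_gpu=True))
```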

We see a lot of wide social, log, and cyber data where this works, anywhere from 5-200 dim. Our bio users are trickier, as we can have 1K+ dimensions pretty fast. We find success there too, and mostly get into preconditioning tricks for those.

At the same time, I'm increasingly thinking of learning neural embeddings in general for these instead of traditional clustering algorithms. As scales go up, the performance argument here goes up too.

abhgh 226 days ago

I was not aware this existed and it looks cool! I am definitely going to take out some time to explore it further.

I have a couple of questions for now: (1) I am confused by your last sentence. It seems you're saying embeddings are a substitute for clustering. My understanding is that you usually apply a clustering algorithm over embeddings - good embeddings just ensure that the grouping produced by the clustering algo "makes sense".

(2) Have you tried PaCMAP [1]? I found it to produce high-quality results quickly when I tried it. I haven't tried it in a while though, and I vaguely remember that it wouldn't install properly on my machine (a Mac) the last time I reached for it. Their group has some new stuff coming out too (on the linked page).

[1] https://github.com/YingfanWang/PaCMAP

lmeyerov 226 days ago

We generally run UMAP on regular semi-structured data like database query results. We automatically feature encode that for dates, bools, low-cardinality vals, etc. If there is text, and the right libs available, we may also use text embeddings for those columns. (cucat is our GPU port of dirtycat/skrub, and pygraphistry's .featurize() wraps around that).
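
To make the automatic feature encoding concrete, here is a minimal stdlib-only sketch of turning rows with dates, bools, numbers, and low-cardinality values into numeric vectors that a tool like UMAP could consume. It is not how `.featurize()`, cucat, or skrub work internally; the function name `encode_rows`, the `max_cardinality` cutoff, and the choice to skip high-cardinality columns (where text embeddings would go) are all assumptions for illustration.

```python
from datetime import date


def encode_rows(rows, max_cardinality=10):
    """Encode a list of dicts as numeric vectors. Returns (names, matrix)."""
    names, encoders = [], []
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        if all(isinstance(v, bool) for v in values):
            # Bools become 0/1 (checked before numbers: bool is an int subclass).
            names.append(col)
            encoders.append(lambda r, c=col: [1.0 if r[c] else 0.0])
        elif all(isinstance(v, date) for v in values):
            # Dates become a day count so nearby dates get nearby values.
            names.append(col + "_ordinal")
            encoders.append(lambda r, c=col: [float(r[c].toordinal())])
        elif all(isinstance(v, (int, float)) for v in values):
            names.append(col)
            encoders.append(lambda r, c=col: [float(r[c])])
        else:
            cats = sorted({str(v) for v in values})
            if len(cats) > max_cardinality:
                continue  # high-cardinality or free text: skipped in this sketch
            # Low-cardinality values are one-hot encoded.
            names.extend(f"{col}={cat}" for cat in cats)
            encoders.append(
                lambda r, c=col, cats=cats: [1.0 if str(r[c]) == cat else 0.0 for cat in cats]
            )
    matrix = [[x for enc in encoders for x in enc(r)] for r in rows]
    return names, matrix


rows = [
    {"seen": date(2024, 1, 1), "admin": True, "proto": "tcp", "bytes": 120},
    {"seen": date(2024, 1, 2), "admin": False, "proto": "udp", "bytes": 80},
]
names, X = encode_rows(rows)
print(names)
```

In practice the numeric columns would also need scaling before a distance-based method sees them; that step is left out here.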

My last sentence was that, on more valuable problems, we are finding it makes sense to go straight to GNNs, LLMs, etc. and embed multidimensional data that way vs via UMAP dim reductions. We can still use UMAP as a generic hammer to control further dimensionality reductions, but the 'hard' part would be handled by the model. With neural graph layouts, we can potentially even skip the UMAP for that too.

Re: PaCMAP, we have been eyeing several new tools here, but so far haven't felt the need internally to go from UMAP to them. We'd need to see significant improvements, given the quality engineering in UMAP has set the bar high. In theory I can imagine some tools doing better in the future, but the creators haven't done the engineering investment, so internally, we'd rather stay with UMAP. We make our API pluggable, so you can pass in results from other tools, and we haven't heard much about that path from others.

abhgh 226 days ago

Thank you. Your comment about LLMs to semantically parse diverse data, as a first step, makes sense. In fact come to think of it, in the area of prompt optimization too - such as MIPROv2 [1] - the LLM is used to create initial prompt guesses based on its understanding of data. And I agree that UMAP still works well out of the box and has been pretty much like this since its introduction.

[1] Section C.1 in the Appendix here https://arxiv.org/pdf/2406.11695

nighthawk454 226 days ago

I’m working on a new UMAP alternative - curious what kinds of improvements you’d be interested in?

lmeyerov 224 days ago

A few things

Table stakes for our bigger users:

- parity or improvement on perf, for both CPU & GPU mode

- better support for learning (fit->transform) so we can embed billion+ scale data

- expose inferred similarity edges so we can do interactive and human-optimized graph viz, vs overplotted scatterplots
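
As an illustration of what "similarity edges" means in the last bullet, the sketch below builds a brute-force k-nearest-neighbor edge list from a set of points, which is the kind of structure a graph viz can draw instead of a bare scatterplot. This is a stand-in, not UMAP's own fuzzy neighbor graph; the name `knn_edges` and the `1 / (1 + distance)` weight are assumptions for illustration.

```python
import math


def knn_edges(points, k=2):
    """Return (source, target, weight) edges to each point's k nearest neighbors."""
    edges = []
    for i, p in enumerate(points):
        # Brute force O(n^2): fine for a sketch, too slow for large data.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            # Closer neighbors get weights nearer 1.0 (arbitrary choice here).
            edges.append((i, j, 1.0 / (1.0 + d)))
    return edges


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
for source, target, weight in knn_edges(points, k=1):
    print(source, target, round(weight, 3))
```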

New frontiers:

- alignment tooling is fascinating, as we increasingly want to re-fit->embed over time as our envs change and compare, eg, day-over-day analysis. This area is not well-defined yet common for anyone operational so seems ripe for innovation

- maybe better support for mixing input embeddings. This seems increasingly common in practice, and seems worth examining as special cases
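
One simple baseline for the day-over-day comparison problem is to rigidly align one 2D embedding onto another before diffing them, since separate fits can differ by an arbitrary rotation and shift. The sketch below uses rigid Procrustes alignment (rotation plus translation, no scaling or reflection), which is a swapped-in classical technique and not anything UMAP provides. The name `align_2d` is hypothetical, and it assumes both embeddings list the same entities in the same order.

```python
import math


def align_2d(reference, moving):
    """Rotate and translate `moving` onto `reference` (lists of (x, y) pairs)."""
    n = len(reference)
    rcx = sum(x for x, _ in reference) / n
    rcy = sum(y for _, y in reference) / n
    mcx = sum(x for x, _ in moving) / n
    mcy = sum(y for _, y in moving) / n
    # Center both point sets on their centroids.
    ref_c = [(x - rcx, y - rcy) for x, y in reference]
    mov_c = [(x - mcx, y - mcy) for x, y in moving]
    # Closed-form optimal rotation angle for the 2D least-squares fit.
    cross = sum(mx * ry - my * rx for (rx, ry), (mx, my) in zip(ref_c, mov_c))
    dot = sum(mx * rx + my * ry for (rx, ry), (mx, my) in zip(ref_c, mov_c))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Rotate the centered points, then move them to the reference centroid.
    return [(c * mx - s * my + rcx, s * mx + c * my + rcy) for mx, my in mov_c]


day1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
day2 = [(5.0, 5.0), (5.0, 6.0), (3.0, 5.0)]  # day1 rotated 90 degrees and shifted
print(align_2d(day1, day2))
```

After alignment, per-entity displacement between the two days becomes meaningful to inspect. Entities that appear or disappear between fits are outside what this sketch handles.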

Always happy to pair with folks in getting new plugins into the pygraphistry / graphistry community, so if/when ready, happy to help push a PR & demo through!

lmcinnes 223 days ago

> alignment tooling is fascinating, as we increasingly want to re-fit->embed over time as our envs change and compare, eg, day-over-day analysis. This area is not well-defined yet common for anyone operational so seems ripe for innovation

It is probably not all the things you want, but AlignedUMAP can do some of this right now: https://umap-learn.readthedocs.io/en/latest/aligned_umap_bas...

If you want to do better than that, I would suggest that the quite new landmarked parametric UMAP options are actually very good at this: https://umap-learn.readthedocs.io/en/latest/transform_landma...

Training the parametric UMAP is a little more expensive, but the new landmark-based updating really does allow you to steadily update with new data and have new clusters appear as required. Happy to chat as always, so reach out if you haven't already looked at this and it seems interesting.

romanfll 226 days ago

The shift from Explicit Reduction to GNNs/Embeddings is where the high-end is going in my view… We hit this exact fork in the road with our forecasting/anomaly detection engine (DriftMind). We considered heavy embedding models but realised that for edge streams, we couldn't afford the inference cost or the latency of round-tripping to a GPU server. It feels like the domain is splitting into 'Massive Server-Side Intelligence' (I am a big fan of Graphistry) and 'Hyper-Optimized Edge Intelligence' (where we are focused).

lmeyerov 224 days ago

Interesting, mind sharing the context here?

My experience has been that as workloads get heavier, it's "cheaper" to push to an accelerated & dedicated inferencing server. This doesn't always work though, eg, there's a world of difference between realtime video on phones vs an interactive chat app.

Re:edge embedding, I've been curious about the push by a few to 'foundation GNNs', and it may be fun to compare UMAP on property-rich edges to those. So far we focus on custom models, but the success of neural graph drawing NNs & newer tabular NNs suggest something pretrained can replace UMAP as a generic hammer here too...
