by frgtpsswrdlame 781 days ago
Judging by my Twitter feed, some scientists are starting to push back on the use of UMAP (and t-SNE).
2 comments

I think a common problem is that these techniques get repurposed to solve problems they weren't meant for. I have often seen people fall into the trap of using these visualizations to guess whether a dataset can be classified with high accuracy. I'm talking about cases where labels already exist, but the visualization is used as a cheap preliminary step to decide whether to bother with classification at all, or whether to pick a weak or a strong classifier, etc.

The problem, of course, is that the insights from a visualization are "one-sided": IF instances from different classes look separated, then you know that a decent classifier would do the job well. But if they don't appear separated, you can't conclude that they can't be accurately classified: for all you know, you just don't have the right hyperparameters. Also account for the fact that you're projecting d-dimensional data down to 2D/3D, which is heavily lossy; even with the right hyperparameters there is a chance you won't see clear separation. If you want to classify, just classify.
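
To make the "lossy projection" point concrete, here is a rough sketch on made-up data. Everything in it is an assumption for illustration: the synthetic dataset, the least-squares linear classifier, and the use of a plain PCA projection (via NumPy's SVD) as a deterministic stand-in for a 2D embedding. UMAP and t-SNE behave differently from PCA, so this only shows the general failure mode, not what those tools would do on this data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20  # hypothetical sizes: n points per class, d dimensions

# Synthetic data (an assumption): 19 high-variance noise dimensions,
# with the class signal hidden in one low-variance dimension.
y = np.repeat([0, 1], n)
X = rng.normal(scale=5.0, size=(2 * n, d))
X[:, -1] = rng.normal(scale=0.3, size=2 * n) + np.where(y == 1, 1.0, -1.0)

# Random train/test split.
idx = rng.permutation(2 * n)
train, test = idx[:750], idx[750:]


def linear_classifier_accuracy(X_train, y_train, X_test, y_test):
    """Fit a least-squares linear classifier and return test accuracy."""
    A_train = np.hstack([X_train, np.ones((len(X_train), 1))])
    A_test = np.hstack([X_test, np.ones((len(X_test), 1))])
    targets = np.where(y_train == 1, 1.0, -1.0)
    w, *_ = np.linalg.lstsq(A_train, targets, rcond=None)
    pred = (A_test @ w > 0).astype(int)
    return float((pred == y_test).mean())


# "Just classify": use all d dimensions.
full_acc = linear_classifier_accuracy(X[train], y[train], X[test], y[test])

# "Look at a 2D picture first": keep only the top two principal components.
mean = X[train].mean(axis=0)
_, _, Vt = np.linalg.svd(X[train] - mean, full_matrices=False)
Z_train = (X[train] - mean) @ Vt[:2].T
Z_test = (X[test] - mean) @ Vt[:2].T
proj_acc = linear_classifier_accuracy(Z_train, y[train], Z_test, y[test])

print(f"accuracy on all {d} dimensions: {full_acc:.2f}")
print(f"accuracy on the 2D projection: {proj_acc:.2f}")
```

In this setup the 2D view throws away the one dimension that carries the label, so the classes would look mixed in a scatter plot even though a simple classifier on the full data separates them almost perfectly.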

What do they suggest you use instead?
PCA for starting out
As a sibling comment already mentioned, that doesn't really work for non-linear data. UMAP and t-SNE are techniques used on non-linear data.
PCA performs horribly on non-linear data.
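
A small sketch of what that can look like, on hypothetical data: two noisy concentric circles, which no linear projection can pull apart. The dataset, the `best_threshold_accuracy` helper, and the hand-picked radius feature are all assumptions made up for this example, and the accuracies are measured in-sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # hypothetical number of points per class

# Two noisy concentric circles: class 0 at radius ~1, class 1 at radius ~3.
y = np.repeat([0, 1], n)
angles = rng.uniform(0.0, 2.0 * np.pi, size=2 * n)
radii = np.where(y == 1, 3.0, 1.0) + rng.normal(scale=0.1, size=2 * n)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

# PCA down to one component, computed via SVD of the centered data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]


def best_threshold_accuracy(feature, labels):
    """Best in-sample accuracy of any single-threshold rule on a 1-D feature."""
    sorted_labels = labels[np.argsort(feature)]
    # ones_left[k] = number of class-1 points among the k smallest values.
    ones_left = np.concatenate([[0], np.cumsum(sorted_labels)])
    zeros_left = np.arange(len(labels) + 1) - ones_left
    ones_right = ones_left[-1] - ones_left
    # Rule "predict 0 below the cut, 1 above"; 1 - acc covers the flipped rule.
    acc = (zeros_left + ones_right) / len(labels)
    return float(np.maximum(acc, 1.0 - acc).max())


pc1_acc = best_threshold_accuracy(pc1, y)
# A non-linear feature chosen by hand: distance from the origin.
radius_acc = best_threshold_accuracy(np.linalg.norm(X, axis=1), y)

print(f"best threshold on first principal component: {pc1_acc:.2f}")
print(f"best threshold on radius: {radius_acc:.2f}")
```

The linear projection leaves the inner circle sitting inside the outer one, so a single cut cannot separate them, while a threshold on the radius does so almost trivially.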
How will you know if you have non-linear data?