Hacker News new | ask | show | jobs
by rckoepke 1228 days ago
I've been trying to use convex hulls to explore ML embedding spaces. However, the dimensionality (768+ dimensions) seems to crash established options like QHull[0], even with 64GB RAM (and 16 CPU cores, albeit libqhull is not multi-threaded).

Are there more appropriate algorithms for finding convex hulls where dimensions are ~768? Or any parallelized / GPU-optimized options that I should look into?

0: http://www.qhull.org

2 comments

> Qhull does not support triangulation of non-convex surfaces, mesh generation of non-convex objects, medium-sized inputs in 9-D and higher, alpha shapes, weighted Voronoi diagrams, Voronoi volumes, or constrained Delaunay triangulations,

You are going to have to roll your own.

One trick you can use is that most convex hull algorithms chase O(nlg(n)). That lg(n) implies a branching step which lowers efficiency on GPUs. Your coefficients in high dimensions likely mean an O(n^2) branchless algorithm could run faster on a GPU.

Cull points aggressively too, for what little that is worth in high dimensions.

I found https://www.sciencedirect.com/science/article/abs/pii/S01678... which looks like it could be a starting point.

The real problem is that in dimensions that high, the point set probably already is the hull and all this is a zero signal gain operation.

> The real problem is that in dimensions that high, the point set probably already is the hull and all this is a zero signal gain operation.

Well, if I have 10,000 samples of a 768-dimension volume, most of those points will probably be inside the volume, and not per se a vertex of the hull.

I’m very comfortable rolling my own solution, so thank you for pointing me to Jarvis’ algorithm!

So, about that. Do the math on how many faces a 768-simplex has.
Revisiting this. Isn't it a bit of a red herring to enquire about the number of 2-faces that an n-simplex has? It still only has n+1 vertices. A 768-simplex may have 75.5 million faces but it will still only have 769 vertices which completely define the shape. So why would I expect a large number of the other >90% of the 10,000 samples I have to lie on the surface, rather than inside the interior volume?

To be more direct, what's the specific relevance of bringing up the number of 2-faces that an n-simplex has?

So you won't be able to define a hull at all without (n+1) points. He has 10k points so that sounds like a lot relatively.

So the definition of a convex hull is, put generally, the set of points that define faces such that every point is on the face or on the "inside" of it (mean point side)

To test if a point is inside that simplex hull, you need to check every one of those faces. But that isn't the problem.

Every face is a "filter" that all the other points have to pass. Moving beyond the simplex, at higher dimensions, the number of faces that need to "accept" every other point scales faster in higher dimensions. The odds of that aren't horribly clear, you are right in other comments to call out that the structure of points in this context is by definition not independent or random, but you need enough structure to get around the fact that high dimensional hulls are basically all surface.

I believe the answer is:

(n+1)!/((k+1)!*(n-k)!) where n=768 and k=2

Or about 75.5 million triangular faces. Which explains a lot. Thanks for that.

Depends on how complicated the branches are. One solution to branching on GPUs is to compute every branch, which is only a constant-time increase in computation. I find it hard to believe there are many cases where n^2 would be faster than lg(n) for large n.
Due to the curse of dimensionality, more of the points are pushed toward the hull in higher dimensions. So I don't think this is likely to be very effective as a data exploration technique as stated.

There may be more general expected value formulas for higher dimensions, I only know of examples in two-dimensions offhand, https://blogs.sas.com/content/iml/2021/12/06/expected-number....

There may be smarter reduced form embeddings though to make pretty pictures, e.g. https://www.youtube.com/watch?v=sD-uDZ8zXkc&ab_channel=Cynth....

I wonder if the curse of dimensionality would necessarily apply. In general, yes, most of the points in a n-dimensional volume lie near the surface in high dimensional spaces. While true, this seems mostly to be an issue when sampling random, independent points from the space. For an ML problem, all of the points are likely dependent on a few underlying processes.

Wikipedia has more on this: https://en.wikipedia.org/wiki/Curse_of_dimensionality

> This is often cited as distance functions losing their usefulness (for the nearest-neighbor criterion in feature-comparison algorithms, for example) in high dimensions. However, recent research has shown this to only hold in the artificial scenario when the one-dimensional distributions R \mathbb {R} are independent and identically distributed.[13] When attributes are correlated, data can become easier and provide higher distance contrast and the signal-to-noise ratio was found to play an important role, thus feature selection should be used.

> More recently, it has been suggested that there may be a conceptual flaw in the argument that contrast-loss creates a curse in high dimensions. Machine learning can be understood as the problem of assigning instances to their respective generative process of origin, with class labels acting as symbolic representations of individual generative processes. The curse's derivation assumes all instances are independent, identical outcomes of a single high dimensional generative process. If there is only one generative process, there would exist only one (naturally occurring) class and machine learning would be conceptually ill-defined in both high and low dimensions. Thus, the traditional argument that contrast-loss creates a curse, may be fundamentally inappropriate. In addition, it has been shown that when the generative model is modified to accommodate multiple generative processes, contrast-loss can morph from a curse to a blessing, as it ensures that the nearest-neighbor of an instance is almost-surely its most closely related instance. From this perspective, contrast-loss makes high dimensional distances especially meaningful and not especially non-meaningful as is often argued.