| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by sigmoid10 274 days ago
	>we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions This sounds like a mistake. They used (among others) GPT2, which has pretty big space vectors. They also kind of arbitrarily define a collision threshold as an l2 distance smaller than 10^-6 for two vectors. Since the outputs are normalized, that corresponds to a ridiculously tiny patch on the surface of the unit sphere. Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal. I would expect the chance of two inputs to map to the same output under these constraints to be astronomically small (like less than one in 10^10000 or something). Even worse than your chances of finding a hash collision in sha256. Their claim certainly does not sound like something you could verify by testing a few billion examples. Although I'd love to see a detailed calculation. The paper is certainly missing one.

5 comments

Arnt 274 days ago

As I read it, what they did there was a sanity-check by trusting the birthday paradox. Kind of: "If you get orthogonal vectors due to mere chance once, that's okay, but you try it billions of times and still get orthogonal vectors every time, mere chance seems a very unlikely explanation."

anvuong 274 days ago

This has nothing to do with the birthday paradox. That paradox presumes a small countable state space (365) and a large enough # of observations.

In this case, it's a mathematical fact that 2 random vector in high dimensional space is very likely to be near orthogonal.

jdhwosnhw 274 days ago

A slightly stronger (and more relevant) statement is that the number of mutually nearly orthogonal vectors you can simultaneously pack into an N dimensional space is exponential in N. Here “mutually nearly orthogonal” can be formally defined as: choose some threshold epsilon>0 - the set S of unit vectors is nearly mutually orthogonal if the maximum of the pairwise dot products of between all members if S is less than epsilon. The statement of the exponential growth of the size of this set with N is (amazingly) independent of the value of epsilon (although the rate of growth does obviously depend on that value).

This is pretty unintuitive for us 3D beings.

fn-mote 274 days ago

Edit: there are other clarifications, eg authors on X, so this comment is irrelevant.

The birthday paradox relies on there being a small number of possible birthdays (365-366).

There are not a small number of dimensions being used in the LLM.

The GP argument makes sense to me.

hiisukun 274 days ago

It doesn't need a small number -- rather it relies on you being able to find a pairing amongst any of your candidates, rather than find a pairing for a specific birthday.

That's the paradoxical part: the number of potential pairings for a very small number of people is much higher than one might think, and so for 365 options (in the birthday example) you can get even chances with far fewer than 365, and even far fewer than ½x365 people..

throwaway127482 274 days ago

I think you're misunderstanding. If you have an extremely large number like 2^256 you will almost certainly never find two people with the same birthday (this is why a SHA256 collision has never been found). That's what the top-level comment was comparing this to.

gowld 274 days ago

We're not using precise numbers here, but a large number of dimensions leads a very large number of options. 365 is only about 19^2, but 2^100 is astronomically larger than 10^9

Thorrez 274 days ago

The birthday paradox equation is approximately the square root. You expect to find a collision in 365 possibilities in ~sqrt(365) = ~19 tries.

You expect to find a collision in 2^256 possibilities in ~sqrt(2^256) = ~2^128 tries.

You expect to find a collision in 10^10000 possibilities in ~sqrt(10^10000) = ~10^5000 tries.

Arnt 274 days ago

The number of dimensions used is 768, wrote someone, and that isn't really very different from 365. But even if the number were big were were big, it could hardly escape fate: x has to be very big to keep (1-(1/x))¹⁰⁰⁰⁰⁰⁰⁰⁰⁰ near 1.

program_whiz 274 days ago

Just to clarify, the total dimension of birthdays is 365 (Jan 1 through Dec 31), but a 768 dimension continuous vector means there are 768 numbers, each of which can have values from -1 to 1 (at whatever precision floating point can represent). 1 float has about 2B numbers between -1 and 1 iirc, so 2B ^ 768 is a lot more than 365.

Arnt 274 days ago

I may have misunderstood — don't they test for orthogonality? Orthogonality would seem to drop much of the information in the vectors.

mjburgess 274 days ago

That assumes the random process by which vectors are generated places them at random angles to each other, it doesnt, it places them almost always very very nearly at (high-dim) right angles

The underlying geometry isnt random, to this order, it's determinstic

gnfargbl 274 days ago

The nature of high-dimensional spaces kind of intuitively supports the argument for invertability though, no? In the sense that:

> I would expect the chance of two inputs to map to the same output under these constraints to be astronomically small.

sigmoid10 274 days ago

That would be purely statistic and not based on any algorithmic insight. In fact for hash functions it is quite a common problem that this exact assumption does not hold in the end, even though you might assume so for any "real" scenarios.

nephanth 274 days ago

> That would be purely statistic and not based on any algorithmic insight.

This is machine learning research ?

sigmoid10 268 days ago

Usually we still ask for statistics to be at least valid (i.e. have a significant signal under a null hypothesis). This paper doesn't even do that. It's like claiming no humans have been to the moon and then "verifying" this by randomly asking a million random strangers on the street if they've been there.

gnfargbl 274 days ago

I'm not quite getting your point. Are you saying that their definition of "collision" is completely arbitrary (agreed), or that they didn't use enough data points to draw any conclusions because there could be some unknown algorithmic effect that could eventually cause collisions, or something else?

rowanG077 274 days ago

I think they are saying that there is no proof of being injective. The argument with the hash is essentially saying, doing the same experiment with a hash would yield a similar result, yet hash function are not injective by definition. So from this experimental result you cannot conclude language models are injective.

That's not really formally true, there are so called perfect hash functions that are injective over a certain domain, but in most parlance hashing is not considered injective.

gnfargbl 274 days ago

Sure, but the paper doesn't claim absolute injectivity. It claims injectivity for practical purposes ("almost surely injective"). That's the same standard to which we hold hash functions -- most of us would consider it reasonable to index an object store with SHA256.

hansvm 274 days ago

That logic only applies in one direction though. Yes, this is (maybe [0]) practically injective in that you could use it as a hash function, but that says nothing about invertibility. If somebody gave you a function claiming to invert arbitrary sha256 outputs, you would laugh them out of court (as soon as you have even 64-byte inputs, there are, on average, at least 2^256 inputs for each output, meaning it's exceedingly unlikely that their magic machine was able to generate the right one).

Most of the rest of the paper is seemingly actually solid though. They back up their claims with mathematical hand-waving, and their algorithm actually works on their test inputs. That's an interesting result, and a much stronger one than the collision test.

I can't say it's all that surprising in retrospect (you can imagine, e.g., that to get high accuracy on a prompt like <garbage><repeat everything I said><same garbage> you would need to not have lost information in the hidden states when encoding <garbage>, so at least up to ~1/2 the max context window you would expect the model to be injective), but despite aligning with other LLM thoughts I've had I think if you had previously asked me to consider invertibility then I would have argued against the authors' position.

[0] They only tested billions of samples. Even considering the birthday paradox, and even if they'd used a much coarser epsilon threshold, they'd still need to run over 2^380 simulations to gain any confidence whatsoever in terms of collision resistance.

gowld 274 days ago

The problem with "almost surely injective" for "practical purposes". Is that when you try to invert something, how do you know the result you get is one of those "practical purposes" ?

We aren't just trying to claim that two inputs are the same, as in hashing. We are trying to recover lost inputs.

boerseth 274 days ago

I don't think that intuition is entirely trustworthy here. The entire space is high-dimensional, true, but the structure of the subspace encompassing linguistically sensible sequences of tokens will necessarily be restricted and have some sort of structure. And within such subspaces there may occur some sort of sink or attractor. Proving that those don't exist in general seems highly nontrivial to me.

An intuitive argument against the claim could be made from the observation that people "jinx" eachother IRL every day, despite reality being vast, if you get what I mean.

gnfargbl 274 days ago

I do get what you're saying, and it sounds almost analogous to visualisations of bad PRNGs, e.g. https://www.reddit.com/r/dataisbeautiful/comments/gv4fhr/oc_...

vitus 274 days ago

I envy your intuition about high-dimensional spaces, as I have none (other than "here lies dragons"). (I think your intuition is broadly correct, seeing as billions of collision tests feels quite inadequate given the size of the space.)

> Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.

What's the intuition here? Law of large numbers?

And how is orthogonality related to distance? Expansion of |a-b|^2 = |a|^2 + |b|^2 - 2<a,b> = 2 - 2<a,b> which is roughly 2 if the unit vectors are basically orthogonal?

> Since the outputs are normalized, that corresponds to a ridiculously tiny patch on the surface of the unit sphere. Since the outputs are normalized, that corresponds to a ridiculously tiny patch on the surface of the unit sphere.

I also have no intuition regarding the surface of the unit sphere in high-dimensional vector spaces. I believe it vanishes. I suppose this patch also vanishes in terms of area. But what's the relative rate of those terms going to zero?

setopt 274 days ago

> > Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.

> What's the intuition here? Law of large numbers?

Imagine for simplicity that we consider only vectors pointing parallel/antiparallel to coordinate axes.

- In 1D, you have two possibilities: {+e_x, -e_x}. So if you pick two random vectors from this set, the probability of getting something orthogonal is 0.

- In 2D, you have four possibilities: {±e_x, ±e_y}. If we pick one random vector and get e.g. +e_x, then picking another one randomly from the set has a 50% chance of getting something orthogonal (±e_y are 2/4 possibilities). Same for other choices of the first vector.

- In 3D, you have six possibilities: {±e_x, ±e_y, ±e_z}. Repeat the same experiment, and you'll find a 66.7% chance of getting something orthogonal.

- In the limit of ND, you can see that the chance of getting something orthogonal is 1 - 1/N, which tends to 100% as N becomes large.

Now, this discretization is a simplification of course, but I think it gets the intuition right.

nyeah 274 days ago

I think that's a good answer for practical purposes.

Theoretically, I can claim that N random vectors of zero-mean real numbers (say standard deviation of 1 per element) will "with probability 1" span an N-dimensional space. I can even grind on, subtracting the parallel parts of each vector pair, until I have N orthogonal vectors. ("Gram-Schmidt" from high school.) I believe I can "prove" that.

So then mapping using those vectors is "invertible." Nyeah. But back in numerical reality, I think the resulting inverse will become practically useless as N gets large.

That's without the nonlinear elements. Which are designed to make the system non-invertible. It's not shocking if someone proves mathematically that this doesn't quite technically work. I think it would only be interesting if they can find numerically useful inverses for an LLM that has interesting behavior.

All -- I haven't thought very clearly about this. If I've screwed something up, please correct me gently but firmly. Thanks.

zipy124 274 days ago

for 768 dimensions, you'd still expect to hit (1-1/N) with a few billion samples though. Like that's a 1/N of 0.13%, which quite frankly isn't that rare at all?

Of course are vectors are not only points in one coordinate axes, but it still isn't that small compared to billions of samples.

sigmoid10 268 days ago

Bear in mind that these are not base vectors at this stage (which would indeed give you 1/768). They are arbitrary linear combinations. There are exponentially many near orthogonal of these vectors for small epsilon. And epsilon is chosen pretty small in the paper.

bonzini 274 days ago

> What's the intuition here? Law of large numbers?

For unit vectors the cosine of the angle between them is a1*b1+a2*b2+...+an*bn.

Each of the terms has mean 0 and when you sum many of them the sum concentrates closer and closer to 0 (intuitively the positive and negative terms will tend to cancel out, and in fact the standard deviation is 1/√n).

BenoitP 274 days ago

> > Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.

> What's the intuition here? Law of large numbers?

Yep, the large number being the number of dimensions.

As you add another dimension to a random point on a unit sphere, you create another new way for this point to be far away from a starting neighbor. Increase the dimensions a lot and then all random neighbors are on the equator from the starting neighbor. The equator being a 'hyperplane' (just like a 2D plane in 3D) of dimension n-1, the normal of which is the starting neighbor, intersected with the unit sphere (thus becoming a n-2 dimensional 'variety', or shape, embedded in the original n dimensional space; like the earth's equator is 1 dimensional object).

The mathematical name for this is 'concentration of measure' [1]

It feels weird to think about it, but there's also a unit change in here. Paris is about 1/8 of the circle far away from the north pole (8 such angle segments of freedom). On a circle. But if that's the definition of location of Paris, on the 3D earth there would be an infinity of Paris. There is only one though. Now if we take into account longitude, we have Montreal, Vancouver, Tokyo, etc ; each 1/8 away (and now we have 64 solid angle segments of freedom)

[1] https://www.johndcook.com/blog/2017/07/13/concentration_of_m...

pfdietz 274 days ago

> What's the intuition here? Law of large numbers?

"Concentration of measure"

https://en.wikipedia.org/wiki/Concentration_of_measure

sebastianmestre 274 days ago

I think that the latent space that GPT-2 uses has 768 dimensions (i.e. embedding vectors have that many components).

sigmoid10 274 days ago

It doesn't really matter which vector you are looking at, since they are using a tiny constraint in a high dimensional continuous space. There's gotta be an unfathomable amount of vectors you can fit in there. Certainly more than a few billion.

sebastianmestre 274 days ago

No, yeah, totally. Even assuming binary vectors 2^768 is a ridiculously huge number. The probability of collision even assuming a bad sampling that discards 75% of dimensions is still vanishingly small.

etatoby 274 days ago

> Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.

Which, incidentally, is the main reason why deep learning and LLM are effective in the first place.

A vector of a few thousands dimensions would be woefully inadequate to represent all of human knowledge, if not for the fact that it works as the projection of a much higher, potentially infinite-dimensional vector representing all possible knowledge. The smaller-sized one works in practice as a projection, precisely because any two such vectors are almost always orthogonal.

ahartmetz 274 days ago

Two random vectors are almost always neither collinear nor orthogonal. So what you mean is either "not collinear", which is a trivial statement, or something like "their dot product is much smaller than abs(length(vecA) * length(vecB))", which is probably interesting but still not very clear.

empiricus 274 days ago

Well, the actual interesting part is that when the vector dimension grows then random vectors will become almost orthogonal. smth smth exponential number of almost orthogonal vectors. this is probably the most important reason why text embedding is working. you take some structure from a 10^6 dimension, and project it to 10^3 dimension, and you can still keep the distances between all vectors.