by markpapadakis 2379 days ago
What does each of those 200 dimensions represent/encode?

This is a very interesting and pragmatic design. I tried a couple of queries and the results were great for most of them, but not so much for esoteric ones, or for queries involving names, which makes sense according to the implementation specifics outlined in their blog posts.

I had never heard of this company before, and I am very interested in IR technologies (https://github.com/phaistos-networks/Trinity). I am glad they are trying, and it looks like they have a chance at something great there.

Ahrefs.com is also working on a large scale web search engine, but other than their CEO announcing it on Twitter, there hasn’t been any other update since.

4 comments

By default, there's no attempt to connect a particular semantic meaning with a particular dimension. It's worth noting that all the popular methods of calculating word vectors can give results that differ by an arbitrary linear transformation, so even if the vectors contain a very particular "semantic bit", it's still likely to be "smeared" across all 200 dimensions. You could apply a linear (unambiguous, reversible) transformation to a different 200-dimensional space where that particular factor is isolated in a separate dimension, but you would have to explicitly try to do that.

So the default situation is that each individual dimension means "nothing and everything". If you had some specific factors that you know beforehand and want to determine, you could calculate a transformation that projects all the vectors into a different vector space where #1 means thing A, #2 means thing B, etc. For example, there are some nice experiments with 'face' vectors that separate age/masculinity/hair length/happiness/etc. out of the initial vectors coming out of an image-analysis neural network, where the meaning of each separate dimension is unclear.
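
As a rough illustration of isolating a known factor, here is a minimal sketch that assumes the simplest possible approach: take the difference between the mean vector of examples that have the factor and the mean of those that don't, then project every vector onto that direction. The vectors and group labels are made up, and this is not necessarily how the face experiments were done.

```python
import math

# Toy 4-dimensional vectors (made-up numbers). A hidden binary factor shifts
# the vectors along a mixed direction that touches all four coordinates.
with_factor = [
    [0.9, -0.2, 0.7, 0.4],
    [0.6, -0.5, 0.8, 0.9],
    [0.8, -0.1, 0.5, 0.6],
]
without_factor = [
    [0.3, 0.4, 0.1, -0.1],
    [0.1, 0.2, 0.3, 0.2],
    [0.4, 0.5, -0.1, 0.1],
]

def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Direction of the factor: difference of the two group means, scaled to unit length.
diff = [p - q for p, q in zip(mean(with_factor), mean(without_factor))]
norm = math.sqrt(dot(diff, diff))
direction = [x / norm for x in diff]

# Projecting onto that direction gives one new coordinate that carries the factor.
scores_with = [dot(v, direction) for v in with_factor]
scores_without = [dot(v, direction) for v in without_factor]
print(min(scores_with) > max(scores_without))
```

The single projected coordinate separates the two groups here, even though the factor was spread over every original coordinate.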

Even without a priori semantic goals, I think you can also transform/rotate your dimensions to maximize their interpretability.

Simplified example: if your two-dimensional system gives you two points:

    -0.5, 0.5
    0.5, 0.5

Then you losslessly rotate to

    0, 1.0
    1.0, 0

With the idea that the latter is simpler for humans to assign semantics to.

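A quick sketch of that rotation with the two points exactly as written above. The `rotate` helper is illustrative. Since a pure rotation preserves lengths, the points land on the axes at about 0.707 rather than 1.0; reaching 1.0 would need an extra scaling by the square root of 2.

```python
import math

def rotate(point, degrees):
    """Rotate a 2-D point counter-clockwise by the given angle."""
    t = math.radians(degrees)
    x, y = point
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))

points = [(-0.5, 0.5), (0.5, 0.5)]

# Rotating by -45 degrees (i.e. clockwise) puts each point on an axis.
rotated = [rotate(p, -45) for p in points]

for before, after in zip(points, rotated):
    print(before, "->", tuple(round(c, 3) for c in after))
```

The distance between the two points and the angle between them are the same before and after, which is what makes the rotation lossless.
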
This is the idea behind non-negative matrix formulation (NMF). As the name implies, it forces the entries of the embedding matrices (for both the reduced document and term matrix) to be nonnegative, which results in a more interpretable “sum of parts” representation. You can really see the difference (compared to LSA/SVD/PCA, which does not have this constraint) when it’s applied to images of faces. Also, NMF has been shown to be equivalent to word2vec. The classic paper is here: http://www.cs.columbia.edu/~blei/fogm/2019F/readings/LeeSeun...
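
For a feel of how the non-negativity constraint works in practice, here is a small pure-Python sketch of NMF using the multiplicative update rules associated with Lee and Seung for the squared-error objective. The toy term-document counts, the `nmf` helper, and its parameters are made up for illustration; a real system would use a numerical library.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error ||v - w @ h||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, steps=200, seed=0, eps=1e-9):
    """Factor non-negative v (n x m) as w (n x rank) @ h (rank x m), with w, h >= 0."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(steps):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

# Toy term-document counts (rows: terms, columns: documents).
v = [
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [0, 0, 4, 3],
    [0, 1, 3, 4],
]
w, h = nmf(v, rank=2)
print(round(frobenius_error(v, w, h), 3))
```

Because the updates only ever multiply non-negative numbers, every entry of `w` and `h` stays non-negative, so each term or document is described as an additive mix of parts.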

PS—There should be a negative sign on the (2,2) entry of the first matrix.

> non-negative matrix formulation (NMF)

*factorization ;)

Also, PCA follows a similar idea (I mean, rotating vectors), but it's usually done in a much lower-dimensional space.

Ugh, that one was auto-correct, I swear. I have no idea what’s going on at Apple’s NLP department.

Is this the same thing a 'variational autoencoder' aims to do?

Also, is it possible in non-variational implementations (like this one) that some of the dimensions represent multiple groups? For example, not just 0.5 and -0.5 groups, but also a 0.0 group in the middle. Then your rotation wouldn't be sufficient, you would need to increase the dimensionality to cleanly separate the groups.

In general, the individual coordinates of a word embedding (and hence a sentence embedding) have no semantic meaning, at least not in any "normal" sense.

The n-dimensional space in which the embeddings are represented is mainly just a space where similarity between two vectors/embeddings has some semantic meaning, e.g. query or word similarity.

A few more details are here: https://www.quora.com/Do-the-dimensions-of-Word2Vec-have-an-...

My favorite paper for introducing the idea is this oldie but goodie: http://lsa.colorado.edu/papers/dp1.LSAintro.pdf

The math behind Word2Vec and GloVe is different. Most word embedding models have been shown to be equivalent to some version of matrix factorization, though, and, at least if you're comfortable with linear algebra, the SVD formulation makes it relatively easy to get an intuitive grasp on what the dimensions in an embedding really "mean".
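
To make the SVD view concrete, here is a small sketch under simplifying assumptions: a made-up, deliberately block-structured term-document matrix, with the top two singular directions found by power iteration plus deflation instead of a library SVD call. Each resulting "dimension" is just a weighted combination of terms.

```python
import math

terms = ["cat", "dog", "stock", "bond"]
# Toy term-document counts (rows follow `terms`, one column per document).
A = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_eigenpair(m, steps=200):
    """Power iteration: dominant eigenvalue/eigenvector of a symmetric PSD matrix."""
    v = normalize([1.0] * len(m))
    for _ in range(steps):
        v = normalize(matvec(m, v))
    value = sum(a * b for a, b in zip(v, matvec(m, v)))
    return value, v

# Term-term Gram matrix A @ A^T; its eigenvectors are the left singular vectors of A.
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in A] for r1 in A]

dimensions = []
for _ in range(2):
    value, vec = top_eigenpair(gram)
    dimensions.append((math.sqrt(value), vec))
    # Deflate: remove the component just found so the next pass finds the next one.
    gram = [[gram[i][j] - value * vec[i] * vec[j] for j in range(len(vec))]
            for i in range(len(vec))]

for sigma, vec in dimensions:
    print(round(sigma, 3), dict(zip(terms, (round(x, 3) for x in vec))))
```

With this matrix the singular values are 4 and 3: the first direction loads equally on "stock" and "bond", the second equally on "cat" and "dog". Real corpora give far messier mixtures, which is why individual dimensions are hard to name.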

Word embeddings are unsupervised learning, so the features are not chosen, only the number of features. The model then learns the scalars for each feature as a single vector, depending on the algorithm/architecture.

When using CBOW, for instance, with a set window size N, the features learned for a single term are based on the up-to-N terms on either side of it; their order within the window is ignored.

This will result in similar vectors for terms appearing in the same context. It has its pros and cons though - a great example being “the reservation is confirmed” vs “the reservation is cancelled” - where confirmed/cancelled will have similar features.
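
A small sketch of why that happens, assuming the standard word2vec CBOW setup, where the context is the set of words within the window on both sides of the target and their vectors are averaged. The helper names are illustrative.

```python
def cbow_examples(tokens, window):
    """Return (context, target) pairs: up to `window` words on each side of the target."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        # CBOW averages the context vectors, so word order inside the window
        # is discarded; sorting here only makes that explicit.
        examples.append((sorted(left + right), target))
    return examples

def contexts_of(word, sentence, window=2):
    """All contexts in which `word` appears as the prediction target."""
    return [ctx for ctx, target in cbow_examples(sentence.split(), window)
            if target == word]

confirmed = contexts_of("confirmed", "the reservation is confirmed")
cancelled = contexts_of("cancelled", "the reservation is cancelled")
print(confirmed, cancelled)
print(confirmed == cancelled)  # True
```

Both words are trained against exactly the same context, so nothing in these sentences pushes their vectors apart.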

It seems odd to me that they ended up computing the average of the word vectors, which is way too simple to represent meaning. A similarity metric over this averaged vector will not be accurate.

Just a couple of things:

- We are actually using a weighted average.

- This is just one of the techniques used for recall. The aim at this point is to be super fast and scalable, and not super accurate. Once we reach the precision stage of the ranking (as defined in https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...), we can afford to do fancier matching.
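
The thread does not say what the weights are. As a purely hypothetical sketch, here is a weighted average that assumes IDF weights; the word vectors, document frequencies, and function names are invented for illustration and are not the authors' implementation.

```python
import math

# Hypothetical 3-dimensional word vectors and document frequencies (made-up numbers).
word_vectors = {
    "cheap":   [0.9, 0.1, 0.0],
    "flights": [0.1, 0.8, 0.3],
    "to":      [0.3, 0.3, 0.3],
    "berlin":  [0.0, 0.2, 0.9],
}
doc_freq = {"cheap": 50, "flights": 20, "to": 900, "berlin": 10}
num_docs = 1000

def idf(word):
    """Inverse document frequency: rare words get large weights (assumed weighting)."""
    return math.log(num_docs / doc_freq[word])

def query_vector(query):
    """IDF-weighted average of the vectors of the known words in the query."""
    words = [w for w in query.lower().split() if w in word_vectors]
    if not words:
        raise ValueError("no known words in query")
    total = sum(idf(w) for w in words)
    dims = len(word_vectors[words[0]])
    return [sum(idf(w) * word_vectors[w][d] for w in words) / total
            for d in range(dims)]

with_stopword = query_vector("cheap flights to berlin")
without_stopword = query_vector("cheap flights berlin")
print([round(x, 3) for x in with_stopword])
print([round(x, 3) for x in without_stopword])
```

With these weights a very common word such as "to" barely moves the query vector, which is one reason a weighted average can behave better than a plain one.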

/ One of the authors

Weighted based on what, do you keep IDF values of some sort?

Even then, it's hard to imagine how this performs well if these are plain Word2vec vectors. Saying it's just the recall step is a bit hand-wavy, as this step actually selects the documents you will perform additional scoring on, and it may very well end up excluding a multitude of great results.

In any case, once more, these posts are very interesting to read, and as a search nerd I can't help but wonder about all the alternatives considered.

We do have our own custom word (piece) embeddings that we have trained on <good query, bad query>-pairs. There are a few more details about it in https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr....