Hacker News new | ask | show | jobs
by overlords 2341 days ago
Vowpal Wabbit has been doing this 'hashing trick' since the 200s.

It also the feature interaction, which are the same thing as a layer in transformers (all against all matrix).

So it seems like they are still catching up to where John Langford and crew were over a decade ago.

And, the vowpal wabbit approach is extremely fast to train because it's only doing stochastic gradient descent on a linear function - linear regression. Transformers are much slower to train.

EDIT: Downvoters, please see my last leaf to see why they're effectively the same. The guy responding here seems unfamiliar with all the functionality of vowpal wabbit.

2 comments

The Google paper's hashing has, as best I can see, nothing to do with the Vowpal Wabbit's 'hashing trick.'

The VW hashing trick is about hashing your input data (ie: words, fields, etc.) into an array to lower storage requirements and deal with novel data at run time.

The google paper is about ordering the intermediate states of the neural network (ie: vectors) while preserving distance. This is done so you can chunk the resulting ordered list and perform computations on individual chunks (and their neighbors).

The only thing in common I see is the fact they both use the word hashing.

They are doing the same thing - using less memory by hashing.

The hashing trick in VW hashes multiple same words into one integer, not the same as reformer, but similar to how reformer puts similar vectors together.

With VW's ngram/skipgram features, you get the same kind of effect - similar strings hash into the same hash.

So locality sensitive hashing = (is around about the same thing as) ngram/skipgram on strings plus hashing trick.

Except in Google's paper the hashing does not directly reduce memory usage in any way. It's a lossless operation on the original vectors unlike VW's lossy operation. Google's representation allows for memory reduction down the line but those mechanisms have nothing to do with hashing.
Let's think through this clearly.

Locality sensitive hashing is a way to put similar vectors into the same buckets - by omission etc. It does this by hashing, but the intent is to approximate nearest neigbours.

skipgram/ngrams turn features into other features by omission etc, and so makes similar things the same. The hashing trick then reduces memory usage.

So yes, you're right the hashing in locality sensitive hashing is different in intent, but my point is, that both these approaches are designed to be more memory and compute efficient.

And vowpal's feature interactions give you transformer layers.

Add up all these together, and they have about the same net effect.

You keep insisting that they're the same when they're not, and then you try to subtly expand your original claim of "using less memory by hashing" to "to be more memory and compute efficient" (emphasis mine), just to force them into the same bucket.

Yes, obviously locality sensitive hashing is a form of hashing. The fact that it's locality sensitive is important for this application, but you'd rather ignore that and insist on labeling them as the same thing just because they're both hashing.

http://matpalm.com/resemblance/simhash/

Simhash algorithm, the LSH i knew about (which i mistakenly thought is LSH) works exactly like VW. It is ngrams + hash.

Downvoters, please see

http://matpalm.com/resemblance/simhash/

https://en.wikipedia.org/wiki/SimHash

Simhash, a type of local sensitive hashing - using hash functions on ngrammed data.

That is exactly what Vowpal Wabbit does.

I'm going to write this out more clearly, because I'm still getting downvotes for my correct answer.

Why neural networks? https://en.wikipedia.org/wiki/Universal_approximation_theore...

Can polynomials do this? (Yes) https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theo...

What is transformer and attention? https://pathmind.com/wiki/attention-mechanism-memory-network

Attention = Polynomial (x2,x3 etc.)

Polynomial = interaction. VW flag -interaction

1 layer transformer = xx. (x^2)

2 layer tranformer = xxx. (x^3)

3 ... etc

What is reformer? Transformer where LSH is applied.

One type of LSH is SimHash. ngrams of strings, followed by 32 bit hash.

Vowpal Wabbit -n flag for ngrams.

vw -interact xxx -n2 -n3 and you get ngrams + 32 bit hash doing SGD over a vector.

This vector is equivalent to a 2 layer reformer.

Non-linear activation is not needed because polynomials are already nonlinear.

So vw + interact + ngrams (almost)= reformer encoder. (if reformer uses SimHash, then they are identical).

Transformer/Reformer have an advantage, the encoder-decoder can learn from unlabeled data.

However, you can get similar results from unlabeled data using preprocessing such as introducing noise to the data, and then treating it as noise/non-noise binary classification. (it can even be thought of as reinforcement learning, with the 0-1 labels as the reward using vw's contextual bandits functionality. This can then do what GAN's do - climb from noise to perfection).

> This vector is equivalent to a 2 layer reformer.

There is no feed forward layer, no skip connections and no layer normalization in VW. In the reformer, hashing is followed by dot products. In VW hashing just collides some tokens, followed by a linear layer.

Also, 2 layers of transformer is a little shallow. In practice it's 12-14 layers or more.

In order to be equivalent, there would need to be equally good results on translation from VW, but I've never seen it used for translation. I'm wondering why?

- hashing followed by dot product in transformer you said

- you were doing dot products at each layer to introduce non-linearity in transformer (and neural nets in general). Polynomials are already non-linear, so you don't need that. Transformer and vw -interact are polynomials. Maybe the feedforward layers and skip connections are not actually needed.

- 12 layers ? vw -interact xxxxxxxxxxxxx is 12 layers. You need a lot of memory for that, but in principle vw interact can do any number of them

These results are coming from google and their massive compute resources. If they ran vw with -interact x^13 they might get similar results.

We're really talking about polynomial approximation here, both transformer and vw used in this way. And that is in theory able to approximate any continuous function (just like neural networks).

I guess the simpler proof that they are the same thing would be: Do they work the same?