Hacker News new | ask | show | jobs
by fiatmoney 4818 days ago
Essentially random projection, which has good theoretical justification and comes in handy quite often, for instance, in SVMs [1]. I'd be concerned about using something as naive as CRC32 though, ideally they'd be using a cryptographic-strength hash fn.

There's also an entire area of research around "semantic hashing" and local embedding, that starts with such a random projection, and tries to improve the mapping to be better at a certain task while still remaining low-dimensional.

[1] http://arxiv.org/pdf/1211.6085v2.pdf

3 comments

I think this is actually quite different from random projection, which creates linear combinations of several continuous variables. RP is much more similar to PCA in that respect. OP's method works on a single categorical variable and essentially shuffles the possible values, then 'folds' them down to an arbitrary size (the size of your hashing function).

>ideally they'd be using a cryptographic-strength hash fn

Cryptographic adds nothing; there is no danger from someone reversing the hash (and if there is you may have bigger problems). Any hash function that is suitably random in its redistribution of the variables should suffice.

It's a sparse random matrix, but definitely in the same family of techniques.

I was concerned more about distributional properties of different hash functions than reversibility. Checksums-as-hashes often work, but it's not that much work to swap in something a little more robust. If you want a good output distribution, pick something where that's a goal of the algorithm.

The cited Weinberger article says:

  Different from random projections, the hashing-trick
  preserves sparsity and introduces no additional overhead 
  to store projection matrices.
The storage of random projection matrices is puzzling, because there's a well-known trick in always building the projection matrices from a known seed. It gives you the right distribution properties, and none of the storage woes. With a sufficiently fast RNG, this could actually be faster than storing it in main memory because of cache pollution.
while cryptographic hashes would fit the bill, there's plenty non-cryptographic hashes with equally good key distribution properties and avalanche behaviour, that are an order of magnitude faster to compute (some, like the FNV hash are even super-easy to implement yourself). These are the hashes used in hash-tables etc, not cryptographic hashes.