by datahipster 4820 days ago
Meh. Digging a little further into the results, how the hash kernels algorithm did on the UCI Adult Names data set is mildly disappointing. Take a look at the results (http://archive.ics.uci.edu/ml/machine-learning-databases/adu...) and you'll see that hash kernels rank 14th of 17.

To be fair, I would definitely like to see how this algorithm does with other data sets as well.

All I see this algorithm doing is projecting a high-dimensional feature set onto a random n-dimensional space via a hashing function. In other words, it's not clear to me how an optimal classification boundary can be constructed from this random projection. I feel comfortable with the performance characteristics of techniques such as SVM or vector quantization, since both focus on algorithms that optimally reduce the dimensionality of the feature space. However, random projections are, in my humble opinion, overrated.
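
For concreteness, here is a minimal sketch of the kind of projection I mean. It is not the OP's code: the names `hashed_features` and `n_buckets` are made up for illustration, and MD5 is just a convenient stand-in for whatever hash function the tool actually uses. Each token is hashed to one of `n_buckets` slots, and a second bit of the hash picks a sign.

```python
import hashlib


def hashed_features(tokens, n_buckets=16):
    """Map a bag of string tokens to a fixed-length vector (hashing trick).

    Hypothetical helper for illustration; MD5 is an assumed stand-in hash.
    """
    vec = [0.0] * n_buckets
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # The first 8 bytes choose the bucket.
        index = int.from_bytes(digest[:8], "big") % n_buckets
        # One bit of a later byte chooses the sign of the contribution.
        sign = 1.0 if (digest[8] & 1) == 0 else -1.0
        vec[index] += sign
    return vec


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


x = hashed_features(["age=39", "workclass=State-gov", "education=Bachelors"])
y = hashed_features(["age=39", "workclass=Private", "education=Bachelors"])
print(dot(x, y))
```

Nothing in the mapping looks at the labels or at the data distribution, which is exactly why I'm unsure how well a boundary learned in the hashed space can do. Tokens that land in the same bucket get added together, and the classifier can't tell them apart afterwards.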

Also, there are two additional sources of parameterization that might make this tool difficult to use: the choice of hash function (and accounting for the information lost in the random projection) and the arbitrary size of the hash kernel array. That makes it somewhat difficult to train and validate a model.
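
To give a rough feel for the array-size knob, here is a small sketch that counts how many tokens of a synthetic vocabulary share a bucket at a few sizes. Everything in it is an assumption for illustration: the `bucket` and `collision_fraction` helpers, the MD5 hash, and the made-up `feature_i` vocabulary. None of it comes from the OP's tool.

```python
import hashlib


def bucket(token, n_buckets):
    """Assign a token to a bucket using MD5 (an assumed stand-in hash)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def collision_fraction(vocab, n_buckets):
    """Fraction of tokens that share a bucket with at least one other token."""
    counts = {}
    for tok in vocab:
        b = bucket(tok, n_buckets)
        counts[b] = counts.get(b, 0) + 1
    collided = sum(c for c in counts.values() if c > 1)
    return collided / len(vocab)


# Synthetic vocabulary, purely illustrative.
vocab = [f"feature_{i}" for i in range(1000)]
for bits in (6, 10, 14, 18):
    print(bits, collision_fraction(vocab, 2 ** bits))
```

The array size would have to be swept like any other hyperparameter, and swapping the hash function changes which features collide, so the validation results depend on both choices.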

However! I really do like the ideas presented here and look forward to diving into this and other variants. I really appreciate the OP taking the time to piece together this cool tool and compare its performance with other algorithms.