by rm999 4818 days ago
This is pretty similar to the approach that many predictive modelers already use: compress data down into 'dummy variables', where each variable represents some attribute. For example, you could convert the variable state into 51 dummy variables, one for California, one for DC, etc. Hashing makes the programming a little easier and helps avoid throwing out data in the long tail of the distribution when the number of possible values is very high. It's common in NLP for exactly this reason.
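
To make the comparison concrete, here is a minimal sketch of both encodings. The function names, the three-state vocabulary, and the bucket count of 16 are all assumptions for illustration, not anything from the original post. It uses hashlib rather than Python's built-in hash(), which is salted per process for strings.

```python
import hashlib

def one_hot(value, vocabulary):
    """Dummy-variable encoding: one slot per known value."""
    return [1 if value == v else 0 for v in vocabulary]

def bucket(key, n_buckets):
    """Stable bucket index for a string key (md5 chosen only for stability)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hashed(record, n_buckets=16):
    """Hashing-trick encoding: fixed width, no vocabulary to maintain."""
    vec = [0] * n_buckets
    for name, value in record.items():
        vec[bucket(f"{name}={value}", n_buckets)] += 1
    return vec

states = ["CA", "DC", "NY"]  # a real list would have 51 entries
print(one_hot("DC", states))  # [0, 1, 0]

# A value never seen before still gets a slot; collisions are possible.
print(hashed({"state": "DC", "word": "some-rare-token"}))
```

The dummy-variable version needs the full vocabulary up front, which is what gets painful in the long tail; the hashed version trades that for possible collisions.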

Hashing comes with costs, though. A model with hashed variables is much harder to interpret because the variables lose their meaning. Some information may also get thrown out: as OP mentions, integers may have continuous meaning, and categorizing them can really damage the model. If you have a very large dataset it probably won't matter, because most modeling methods asymptotically converge (i.e., with more and more data the model can learn that 333 and 334 are similar just by seeing enough examples that this is the case), but if you don't, you could be throwing out valuable information. In a case like this I suppose the modeler could manually go in and convert the integers to floats to 'tell' the algorithm the data is continuous.
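
A small sketch of what gets lost when integers are categorized, using plain one-hot categories instead of hashing to keep it deterministic. The helper names and the three example values are assumptions for illustration.

```python
def as_category(value, vocabulary):
    """Treat each integer as an unrelated label (one slot per value)."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

values = [333, 334, 900]
near = distance(as_category(333, values), as_category(334, values))
far = distance(as_category(333, values), as_category(900, values))

# As categories, 334 is exactly as far from 333 as 900 is.
print(near == far)  # True

# As floats, the ordering is available to the learner from the start.
print(abs(333.0 - 334.0) < abs(333.0 - 900.0))  # True
```

With enough data a learner can recover the similarity from examples, but with the categorical encoding nothing in the representation itself says 333 and 334 are neighbors.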

Your point regarding lost meaning is quite salient. It can be useful and enlightening when a learner reports a measure of feature importance (such as in random forest models, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home....).

When you say "hashing makes the programming a little easier," I think you hit the nail on the head. I'm not trying to improve classification accuracy -- my goal was just to make it as easy as possible to learn on arbitrary structured data.

>my goal was just to make it as easy as possible to learn on arbitrary structured data

I'd be very careful about throwing arbitrary data at your learner, at least if you don't understand your data well. Oftentimes the predictors and response are not properly separated in the same way they will be during real-world usage (for example, in time); this leads to target leaks, where your model is effectively cheating by using data it won't have in production.

Target leaks are obvious when the classifier performs suspiciously well on in-sample test data, but sometimes the repercussions are subtler and still very damaging in a production environment.
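
One way to guard against the time-separation problem is to split on a cutoff date instead of at random. This is only a sketch: the field name observed_on, the rows, and the cutoff are made up for illustration, and it covers only the 'in time' case, not leaks hidden inside individual predictors.

```python
from datetime import date

def split_by_time(rows, cutoff):
    """Train on rows observed strictly before cutoff, test on the rest."""
    train = [r for r in rows if r["observed_on"] < cutoff]
    test = [r for r in rows if r["observed_on"] >= cutoff]
    return train, test

# Hypothetical rows; "observed_on" is an assumed field name.
rows = [
    {"observed_on": date(2012, 1, 5), "x": 1.0, "y": 0},
    {"observed_on": date(2012, 2, 9), "x": 2.5, "y": 1},
    {"observed_on": date(2012, 3, 14), "x": 0.7, "y": 0},
    {"observed_on": date(2012, 4, 2), "x": 3.1, "y": 1},
]

train, test = split_by_time(rows, cutoff=date(2012, 3, 1))

# Every training row predates every test row, as it would in production.
assert max(r["observed_on"] for r in train) < min(r["observed_on"] for r in test)
print(len(train), len(test))  # 2 2
```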

Hm, couldn't a hybrid approach deal with this? E.g., hash all the data except a few dimensions you think are vital, and append those to the resulting hashed array?
Or location-aware hashing?
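
A rough sketch of the hybrid idea, assuming the 'vital' dimensions are numeric and are simply appended after the hashed slots. The function names, the keep parameter, and the bucket count are hypothetical.

```python
import hashlib

def bucket(key, n_buckets):
    """Stable bucket index for a string key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hybrid_vector(record, keep, n_buckets=8):
    """Hash every field except those in `keep`; append kept fields as raw floats."""
    vec = [0.0] * n_buckets
    for name, value in record.items():
        if name not in keep:
            vec[bucket(f"{name}={value}", n_buckets)] += 1.0
    return vec + [float(record[name]) for name in keep]

row = {"state": "CA", "browser": "firefox", "age": 33}
v = hybrid_vector(row, keep=["age"])
print(len(v), v[-1])  # 9 33.0
```

The last slot keeps its name and its numeric ordering, so it stays interpretable; only the hashed part loses meaning.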
The hashing is not for security, though, so why not keep a store of hash + key? Again, that adds overhead, but you wouldn't have to hash twice, and you could use the mapping table just for debugging rather than in operational code, at the expense of resources.
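
A sketch of what such a mapping table might look like, built as a side effect of encoding. The names and bucket count are assumptions, and the table grows with the number of distinct keys, which is the overhead mentioned above.

```python
import hashlib
from collections import defaultdict

def bucket(key, n_buckets):
    """Stable bucket index for a string key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def encode_with_lookup(records, n_buckets=8):
    """Hash each record and remember which keys landed in which bucket."""
    lookup = defaultdict(set)  # bucket index -> original "name=value" keys
    vectors = []
    for record in records:
        vec = [0] * n_buckets
        for name, value in record.items():
            key = f"{name}={value}"
            idx = bucket(key, n_buckets)
            vec[idx] += 1
            lookup[idx].add(key)
        vectors.append(vec)
    return vectors, lookup

records = [{"state": "CA", "browser": "firefox"}, {"state": "DC"}]
vectors, lookup = encode_with_lookup(records)

# For debugging: list what each occupied bucket actually contains.
for idx in sorted(lookup):
    print(idx, sorted(lookup[idx]))
```

The learner only ever sees the vectors; the lookup is consulted afterwards to translate an interesting bucket back into the keys (possibly several, if they collided) that fed it.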