by rm999
4818 days ago
This is pretty similar to the approach that many predictive modelers already use: compress data down into 'dummy variables', where each variable represents one value of some attribute. For example, you could convert the variable state into 51 dummy variables, one for California, one for DC, etc. Hashing makes the programming a little easier and helps avoid throwing out data in the long tail of the distribution when the number of possible values is very high. It's common in NLP for exactly this reason.

Hashing comes with costs though. It's much harder to interpret a model with hashed variables because the variables lose their meaning. Also, some information may get thrown out; as OP mentions, integers may have continuous meaning, and categorizing them can really damage the model. If you have a very large dataset it probably won't matter, because most modeling methods converge asymptotically (i.e. with more and more data the model can learn that 333 and 334 are similar just by seeing enough examples where this is the case), but if you don't, you could be throwing out valuable information. In a case like this I suppose the modeler could manually go in and convert their integers to floats to 'tell' the algorithm the data is continuous.
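To make the comparison concrete, here is a rough sketch of both encodings in plain Python. It is not OP's code: the function names, the MD5-based bucket choice, the 16-bucket default, and the rule that floats are treated as continuous while everything else is treated as categorical are all assumptions made for illustration.

```python
import hashlib


def one_hot(value, vocabulary):
    """Dummy-variable encoding: one slot per known category."""
    return [1 if value == v else 0 for v in vocabulary]


def hashed_index(field, value, n_buckets):
    """Map a (field, value) pair to a bucket using a stable hash.

    hashlib is used instead of the built-in hash() because string
    hashes from hash() change between Python processes.
    """
    key = f"{field}={value}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_encode(record, n_buckets=16):
    """Hashing trick over a dict of fields (hypothetical helper).

    Floats are assumed to be continuous: the bucket depends only on the
    field name and the value itself is stored. Anything else is assumed
    to be categorical: the bucket depends on field and value, and the
    stored quantity is a count.
    """
    vec = [0.0] * n_buckets
    for field, value in record.items():
        if isinstance(value, float):
            vec[hashed_index(field, "", n_buckets)] += value
        else:
            vec[hashed_index(field, value, n_buckets)] += 1.0
    return vec


states = ["CA", "DC", "NY"]
print(one_hot("DC", states))  # [0, 1, 0]

# As integers, 333 and 334 are hashed independently, so their buckets
# are unrelated. As floats they share a bucket and differ only in value.
as_category = hash_encode({"state": "CA", "age": 333})
as_continuous = hash_encode({"state": "CA", "age": 333.0})
```

With the integer version the model has to see both 333 and 334 often enough to learn that they behave alike; with the float version that ordering is preserved in the feature value. Collisions between unrelated fields are still possible in either case, which is part of why hashed models are harder to interpret.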
|
When you say "hashing makes the programming a little easier," I think you hit the nail on the head. I'm not trying to improve classification accuracy -- my goal is just to make it as easy as possible to learn on arbitrary structured data.