	by elliott34 4252 days ago
	Great commentary. Encoding high-cardinality (many-level) categorical variables as unordered integers, or even as frequency counts, is pretty common practice, but it is really only valid (as the author says) in tree-based methods. Given deep enough trees (or enough iterations in the boosting scenario), my anecdotal experience is that trees are able to parse out the relevant values. The problem with indicator variables in general is the enormous potential cost in computation time and the increase in complexity. My models include variables such as vehicle/car type (hundreds of levels), U.S. state, etc., so this type of coding helps quite a bit with speed without sacrificing predictive accuracy (and sometimes improves it, since it sidesteps the curse of dimensionality). IMHO, YMMV.
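To make the three encodings concrete, here is a minimal sketch in plain Python. The helper names (`integer_encode`, `frequency_encode`, `one_hot_encode`) and the toy `states` column are made up for illustration; they are not from the article or from any particular library. The point is only that the first two encodings keep a single column no matter how many levels there are, while indicator variables add one column per level.

```python
from collections import Counter

# Hypothetical helpers for illustration only; not any library's API.

def integer_encode(values):
    """Map each distinct category to an arbitrary (unordered) integer code."""
    codes = {}
    for v in values:
        # Assign the next unused integer the first time a level is seen.
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes

def frequency_encode(values):
    """Replace each category with how often it occurs in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]

def one_hot_encode(values):
    """Build indicator variables: one 0/1 column per distinct level."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1
        rows.append(row)
    return rows, levels

# Toy column standing in for something like U.S. state.
states = ["CA", "NY", "CA", "TX", "NY", "CA"]

int_col, mapping = integer_encode(states)
freq_col = frequency_encode(states)
onehot_rows, levels = one_hot_encode(states)

print(int_col)              # [0, 1, 0, 2, 1, 0]
print(freq_col)             # [3, 2, 3, 1, 2, 3]
print(len(onehot_rows[0]))  # 3 columns here; hundreds for a real vehicle-type variable
```

The integer codes carry no meaningful order, which is why this is only reasonable for tree-based models: a tree can isolate individual codes with repeated splits, whereas a linear model would treat the codes as a numeric scale.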