Hacker News
by elliott34 4205 days ago
Great commentary. Encoding high-cardinality categorical variables (ones with many levels) as arbitrary, unordered integers, or even as frequency counts, is pretty common practice, but it is really only valid (as the author says) in tree-based methods. Given deep enough trees (or enough iterations in the boosting scenario), my anecdotal experience is that trees are able to parse out the relevant values. The problem with indicator variables in general is the potentially enormous cost in computation time and the increase in complexity. My models have variables like vehicle/car type (hundreds of levels), U.S. state, etc., so this type of coding helps quite a bit with computational performance without sacrificing model performance, and sometimes improves it by avoiding the curse of dimensionality (IMHO, YMMV).
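A minimal sketch of the three encodings discussed above, in plain Python. The function names and the toy `vehicle_type` column are hypothetical, chosen only for illustration; the comment does not say which library or implementation was actually used.

```python
from collections import Counter


def integer_encode(values):
    """Map each distinct category to an arbitrary integer code.

    Codes are assigned in order of first appearance; the numeric order
    carries no meaning, which is why this suits tree-based models only.
    """
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes


def frequency_encode(values):
    """Replace each category with the number of times it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]


def one_hot_encode(values):
    """Indicator-variable encoding: one 0/1 column per distinct category."""
    levels = sorted(set(values))
    return [[1 if v == level else 0 for level in levels] for v in values]


# Hypothetical toy column; a real one might have hundreds of levels.
vehicle_type = ["sedan", "suv", "sedan", "truck", "suv", "sedan"]

int_codes, mapping = integer_encode(vehicle_type)
freq_codes = frequency_encode(vehicle_type)
indicators = one_hot_encode(vehicle_type)

print(int_codes)           # [0, 1, 0, 2, 1, 0]
print(freq_codes)          # [3, 2, 3, 1, 2, 3]
print(len(indicators[0]))  # 3 columns, one per level
```

The integer and frequency encodings each produce a single column no matter how many levels exist, while the indicator encoding produces one column per level. Note that frequency encoding maps any two levels with the same count to the same value, so a tree cannot separate them on that feature alone.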