Hacker News
by smu3l 4205 days ago
Here's some commentary on the article: http://www.win-vector.com/blog/2014/12/a-comment-on-preparin...
2 comments

Great commentary. Encoding categorical variables with many levels as unordered integers, or even as frequency-count maps, is pretty common practice, but it's really only valid (as the author says) in tree-based methods. Given deep enough trees (or enough iterations in the boosting scenario), it is my anecdotal experience that trees are able to pick out the relevant values. The problem with indicator variables in general is the enormous potential cost in computation time and the increase in complexity. My models have hundreds of vehicle/car types, U.S. states, etc., so this type of coding helps quite a bit with run time without sacrificing model accuracy, and sometimes improves it because it sidesteps the curse of dimensionality (IMHO, YMMV).
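To make the three encodings concrete, here's a rough sketch in plain Python. The function names and the toy `states` list are made up for illustration, not taken from the article or the commentary, and a real pipeline would more likely lean on a library encoder. It just shows that integer and frequency coding each give one column, while indicator coding gives one column per level.

```python
from collections import Counter


def integer_encode(values):
    """Map each distinct level to an arbitrary integer (the order means nothing)."""
    codes = {}
    for v in values:
        # Assign the next unused integer the first time a level is seen.
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes


def frequency_encode(values):
    """Replace each level with the number of times it occurs in the data."""
    counts = Counter(values)
    return [counts[v] for v in values]


def indicator_encode(values):
    """One 0/1 column per distinct level (a.k.a. one-hot / dummy coding)."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return rows, levels


# Hypothetical toy data: a U.S. state column with three levels.
states = ["CA", "TX", "CA", "NY", "TX", "CA"]

int_codes, mapping = integer_encode(states)
freq_codes = frequency_encode(states)
indicator_rows, levels = indicator_encode(states)

print(int_codes)               # a single column of integer codes
print(freq_codes)              # a single column of counts
print(len(indicator_rows[0]))  # number of indicator columns = number of levels
```

With hundreds of levels the indicator version grows to hundreds of columns, which is where the run-time cost comes from. Note the frequency map also collapses any two levels that happen to share a count, so it is lossy in a way the integer coding is not.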
From the commentary: "we are always more accepting of an expected outcome", which I think says a lot. Random Forests are great, and work well, but sometimes it feels like there's a world of difference between Kaggle-contest-style short burst projects and long-term "we need maximum accuracy" projects.

Other approaches always have a place, and the skill is in knowing when they should be deployed.