Hacker News
by savant_penguin 1415 days ago
An important point is that it's an absolute pain in the ass to preprocess tabular data for neural networks.

Categorical > one-hot encoding > deal with new categories at test time (sklearn does this, but it's really slow and clunky)
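To make the "new categories at test time" problem concrete, here is a minimal pure-Python sketch. The helper names `fit_categories` and `one_hot` are made up for illustration, and mapping an unseen category to an all-zero row is just one possible policy, not the only sensible one.

```python
def fit_categories(values):
    """Map each category seen in training to a column index (sorted for determinism)."""
    return {cat: i for i, cat in enumerate(sorted(set(values)))}


def one_hot(value, index):
    """Encode one value; a category unseen in training becomes an all-zero row."""
    row = [0] * len(index)
    if value in index:
        row[index[value]] = 1
    return row


train = ["red", "green", "blue", "green"]
index = fit_categories(train)              # columns: blue, green, red

encoded_known = one_hot("green", index)    # a category seen in training
encoded_unseen = one_hot("purple", index)  # never seen in training: all zeros
```

As far as I know this is the same behaviour you get from sklearn's `OneHotEncoder(handle_unknown="ignore")`; the point is only that you have to pick a policy for unseen values before you hit them in production.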

Numerical > either figure out the data distribution for each column and normalize by that, or normalize everything by z-score. Found an outlier?? Oops, the rest of that column collapsed to roughly 0
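A small sketch of the outlier problem, using only the standard library. The column values are invented, and median/IQR scaling is shown as one common workaround, not as something the comment itself proposes.

```python
import statistics


def z_score(col):
    """Standardize with mean and population standard deviation."""
    mean = statistics.fmean(col)
    std = statistics.pstdev(col)
    return [(x - mean) / std for x in col]


def robust_scale(col):
    """Center on the median and divide by the interquartile range."""
    median = statistics.median(col)
    q1, _, q3 = statistics.quantiles(col, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in col]


# Five ordinary values and one extreme outlier (made-up data).
col = [1.0, 2.0, 3.0, 4.0, 5.0, 10_000.0]

z = z_score(col)
r = robust_scale(col)

# Spread of the five ordinary values after each transform.
z_spread = max(z[:5]) - min(z[:5])  # tiny: the outlier inflates the std
r_spread = max(r[:5]) - min(r[:5])  # still usable: median and IQR ignore the outlier
```

With z-scores the five ordinary values end up almost indistinguishable from each other, while the median/IQR version keeps them spread out. You still have to decide per column which one to use, which is exactly the chore being complained about.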

Can you do that for 10 features? Sure, now try it again with 500, it's not fun

OK, now that you've done all that, you can begin training and possibly get a reasonable result.

Compare that with tree models: data>model>results

1 comment

You could encode categorical variables into embeddings; they are more natural to deal with that way.

https://tech.instacart.com/deep-learning-with-emojis-not-mat...
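For anyone unsure what "encode into embeddings" means mechanically, here is a rough pure-Python sketch: each category gets a row in a table of small dense vectors, and one extra row is reserved for categories never seen in training. The class name `CategoryEmbedding`, the reserved index 0, and the random initialization are all assumptions for illustration. In a real network the table would be a trainable layer (for example `torch.nn.Embedding`) updated by gradient descent, which this sketch does not do.

```python
import random


class CategoryEmbedding:
    """Lookup table from category to a dense vector (untrained, illustration only)."""

    def __init__(self, categories, dim, seed=0):
        rng = random.Random(seed)
        # Index 0 is reserved for categories not seen when the table was built.
        self.index = {cat: i + 1 for i, cat in enumerate(sorted(set(categories)))}
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(len(self.index) + 1)
        ]

    def lookup(self, value):
        """Return the vector for a category, or the shared 'unknown' vector."""
        return self.table[self.index.get(value, 0)]


emb = CategoryEmbedding(["dairy", "produce", "bakery"], dim=4)
known_vec = emb.lookup("dairy")     # 4 floats, one row of the table
unknown_vec = emb.lookup("frozen")  # falls back to row 0
```

The input width stays at `dim` no matter how many categories there are, and new categories at test time get a defined fallback instead of a shape mismatch.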