| HN Mirror

That's all true but matters significantly less "at scale". In the days of lean models, you needed to verify that your input parameters were functionally independent variables, meaning they couldn't correlate with other input parameters. When every document is transformed into a billions-long vector -- even if you took the noninsignificant amount of time it would take to compute a correlation matrix -- the heavy associations between a few features don't mean much, especially when you can just add more data. Plus, people misusing or repurposing words can introduce some interesting twists to features you'd assume 1:1 on paper.