by alecst 492 days ago
This sounds logical enough, but I think I remember reading that larger embeddings were only better for certain classes of words, because increasing the dimensionality (in those cases) introduced noise. Will try to find the paper; it's escaping me at the moment.
3 comments

That's all true, but it matters significantly less "at scale". In the days of lean models, you needed to verify that your input parameters were functionally independent variables, meaning they couldn't correlate with other input parameters. When every document is transformed into a billions-long vector, the heavy associations between a few features don't mean much (even if you took the not-insignificant amount of time it would take to compute a correlation matrix), especially when you can just add more data. Plus, people misusing or repurposing words can introduce some interesting twists to features you'd assume are 1:1 on paper.
Does the paper reach some conclusion on an optimal embedding size?
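For anyone who hasn't done the independence check mentioned above, here is a rough sketch of what it looks like on a small feature set. The function names, the toy data, and the 0.9 threshold are all made up for illustration, not taken from any paper or library.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Constant column: correlation is undefined; treating it as 0.0
        # is a convention chosen for this sketch.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (i, j, r) for every pair of feature columns with |r| >= threshold."""
    flagged = []
    for i, j in combinations(range(len(columns)), 2):
        r = pearson(columns[i], columns[j])
        if abs(r) >= threshold:
            flagged.append((i, j, r))
    return flagged


# Toy data (hypothetical): feature 1 is exactly twice feature 0,
# feature 2 is only weakly related to either.
features = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [5.0, 1.0, 4.0, 2.0],
]
flagged = correlated_pairs(features)
print(flagged)
```

The pairwise loop grows quadratically with the number of features, which is why this is cheap for a handful of hand-picked inputs and impractical for very long vectors.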

I was thinking about that the other day. It's interesting from a linguistics perspective. I wonder if each dimension in an optimal-size vector could be given a human-comprehensible label.

Isn't it all about the context too?