by joaogui1 760 days ago
I would say 2 big problems are:

1. Latency, which gets worse when you have to generate more output sequentially (the same text is many more characters than word parts)

2. These models very roughly turn tokens into an "average meaning" in the embedding layer, followed by attention layers that combine those meanings and feed-forward layers that match the current combination of meanings to something like a learned archetype/prototype. When you move from word parts to characters, all of that gets murkier (what's the average meaning of "a"?), so I don't think there are good enough techniques to learn character-based models yet
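
To put a rough number on the latency point, here is a minimal sketch that counts how many sequential generation steps the same string would need at the character level versus with word parts. The vocabulary `TOY_VOCAB` and the helper `greedy_subword_split` are made up for illustration; real tokenizers (BPE and friends) learn their vocabulary from data and split differently, so treat the counts as indicative only.

```python
# Hypothetical word-part vocabulary, invented for this example only.
TOY_VOCAB = {"token", "iz", "ation", "model", "s", " "}


def greedy_subword_split(text, vocab):
    """Greedy longest-match split; unknown spans fall back to single characters."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: emit one character.
            pieces.append(text[i])
            i += 1
    return pieces


text = "tokenization models"
subword_pieces = greedy_subword_split(text, TOY_VOCAB)
char_pieces = list(text)

# An autoregressive model emits one piece per step, so the piece count
# is the number of sequential forward passes needed to produce the text.
print(subword_pieces)       # ['token', 'iz', 'ation', ' ', 'model', 's']
print(len(subword_pieces))  # 6
print(len(char_pieces))     # 19
```

With this toy vocabulary the character-level version needs roughly 3x as many sequential steps for the same output, and each of those steps also carries far less "meaning" per embedding, which is the second problem above.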