by freecodyx 1102 days ago
The main thing about LLMs, in my opinion, is the tokenization part: words are already clustered and converted into numbers (vectors), which is already a big deal. We are using learned weights, and the attention part feels like a brute-force approach to learning how those vectors are likely to be used together (if you add positional encoding as additional information).

Statistics on a large amount of data just seems to work after all.
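
To make the "brute force" description concrete, here is a rough sketch of self-attention over token vectors with sinusoidal positional encoding added. It is a toy under stated assumptions, not how any particular model is implemented: the `embeddings` values are made up, and the learned query/key/value projections used in real transformers are left out, so each vector acts as its own query, key, and value.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal encoding: even indices use sin, odd indices use cos."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Score every position against every position (n * n pairs)."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Scaled dot product between this vector and every vector.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in vectors
        ]
        weights = softmax(scores)
        # Weighted average of all vectors, one output per position.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 4-dimensional embeddings, invented for illustration.
embeddings = {
    "the": [0.1, 0.3, 0.0, 0.2],
    "cat": [0.9, 0.1, 0.4, 0.0],
    "sat": [0.2, 0.8, 0.1, 0.5],
}
tokens = ["the", "cat", "sat"]

# Add position information to each token vector before attention.
inputs = [
    [e + p for e, p in zip(embeddings[tok], positional_encoding(pos, 4))]
    for pos, tok in enumerate(tokens)
]
outputs, weights = self_attention(inputs)
```

The nested loop over all pairs of positions is the part that feels like brute force: the cost grows with the square of the sequence length.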

1 comment

This is wrong. Byte-level models work fine, even if not as well as word-level models. From comparisons of byte-level and word-level models, we know that tokenization is responsible for only a minuscule part of the performance.
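
For anyone unsure what the two input schemes look like, here is a minimal sketch of word-level versus byte-level encoding of the same string. The sample text and the tiny `word_vocab` are assumptions made up for illustration, and the snippet says nothing about model quality; it only shows what each scheme feeds the model.

```python
text = "statistics on large data"

# Word-level: every distinct word gets its own id from a fixed vocabulary.
words = text.split()
word_vocab = {w: i for i, w in enumerate(sorted(set(words)))}
word_ids = [word_vocab[w] for w in words]

# Byte-level: no vocabulary to build, each UTF-8 byte (0-255) is a token.
byte_ids = list(text.encode("utf-8"))

print(len(word_ids), word_ids)
print(len(byte_ids), byte_ids[:10])

# A word outside the vocabulary has no word-level id,
# but it can always be encoded as bytes.
unseen = "tokenization"
print(unseen in word_vocab)
print(list(unseen.encode("utf-8")))
```

The byte-level sequence is several times longer than the word-level one, so the model has to learn word-like structure itself instead of receiving it from the tokenizer.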