|
|
|
|
|
by akomtu
1212 days ago
|
|
By "parameters" they probably mean float32s, and 65B of those is 0.25 TB of data - more than enough to memorize a 1.5T sequence of "tokens" (3 letter triplets?). This begs the question: are these models better than a fuzzy hash table? |
|
Moreover, anything even kind of looking like a hash table in the input/output space is ruled out by the observed facts that the models can extremely respond frequently to samples crafted to not be in the training set and that it takes into account many long-range dependencies (i.e., the hash table would have to be exponentially larger than it is to match the model's performance).
That said, they are just statistical party tricks. The magic happens because the lookup tables are in a latent space. That's why you can drop in garbage like "uberworldchefinatormichelingodfoodpleasureorgasmmaestro" when asking for recipes and food recommendations and get an experience planets apart from queries excluding the nonsense phrases. The model is just pulling together some token associations, and throwing in the right tokens can take advantage of those in situations where a thinking person would barely be able to parse what you're asking.
Your question feels like it has a motive though. What are you really asking?