by viraptor
318 days ago
Do we really have the data on this? I mean, it does happen on a smaller scale, but where's the 300B version of RWKV? Where's the hybrid symbolic/LLM model? Where are the other experiments? I only see larger companies making relatively small tweaks to the standard transformer, where memory use still explodes with context size. They're not even addressing that part.

We know that transformers have the smallest constant in the neural scaling laws, so it seems irresponsible to scale another architecture class to extreme parameter sizes without a very good reason.