Hacker News
by HaZeust 698 days ago
Probably because the benchmark gains from larger models are, at this point, negligible. Scaling up transformers and iterating on attention might be a dead end for building more capable models beyond 2T parameters. But I'm not sure.