by lumost 852 days ago
For small values of N, the linear terms of the transformer dominate. At the end of the day, a double layer of 764*2048 is still north of 3.1M flops/token/layer, counting each multiply-add as one flop.
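
A rough back-of-envelope sketch of that arithmetic, in case it helps. The assumptions are mine, not from the comment: one multiply-add counts as one op, the "double layer" is two dense layers (764 -> 2048 -> 764), and the N-dependent attention cost per token is approximated as 2*N*d (scores plus weighted sum), ignoring the Q/K/V/output projections, softmax, and biases. The function names are made up for illustration.

```python
def ffn_macs_per_token(d_model: int, d_hidden: int) -> int:
    """Multiply-adds per token for two dense layers: d_model -> d_hidden -> d_model."""
    return 2 * d_model * d_hidden


def attention_macs_per_token(d_model: int, n_tokens: int) -> int:
    """Assumed N-dependent attention cost per token: QK^T scores plus the
    weighted sum over values, each roughly n_tokens * d_model multiply-adds."""
    return 2 * n_tokens * d_model


D_MODEL, D_HIDDEN = 764, 2048  # sizes quoted in the comment

ffn = ffn_macs_per_token(D_MODEL, D_HIDDEN)
print(f"feed-forward: {ffn:,} multiply-adds/token/layer")

for n in (128, 512, 2048, 8192):
    attn = attention_macs_per_token(D_MODEL, n)
    print(f"N={n:>5}: attention {attn:>12,}  ratio to feed-forward {attn / ffn:.2f}")
```

Under these assumptions the feed-forward term is constant in N while the attention term grows linearly per token, so the two only meet once N reaches the hidden width (2048 here); below that, the linear layers are the bigger cost.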