by islewis
622 days ago
> "As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime."

I haven't fully ingested the paper yet, but it looks like it's focused more on compute optimization than on the size of the dataset:

> ... and (2) are fully parallelizable during training (175x faster for a sequence of length 512)

Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale.
|
This! Not just the fastest, but the one that uses the fewest resources in total.
Fully connected neural networks are universal function approximators. Technically we don't need anything but an FNN, but the memory requirements and speed would be abysmal, far beyond the realm of practicality.
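To put rough numbers on the memory point, here is a back-of-the-envelope sketch. It only counts the parameters of a single layer, and everything in it is my own assumption for illustration (the function names, the layer sizes, the vanilla RNN cell); none of it comes from the paper. The dense layer sees the whole sequence flattened into one vector, while the recurrent cell reuses the same weights at every time step.

```python
def dense_params(seq_len: int, d_in: int, d_hidden: int) -> int:
    """Weights plus biases of one fully connected layer over the flattened sequence."""
    return seq_len * d_in * d_hidden + d_hidden


def recurrent_params(d_in: int, d_hidden: int) -> int:
    """Weights plus biases of one vanilla RNN cell, shared across all time steps."""
    return d_in * d_hidden + d_hidden * d_hidden + d_hidden


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the trend.
    d_in, d_hidden = 256, 1024
    for seq_len in (512, 4096, 32768):
        dense = dense_params(seq_len, d_in, d_hidden)
        recurrent = recurrent_params(d_in, d_hidden)
        print(f"seq_len={seq_len}: dense={dense:,} recurrent={recurrent:,}")
```

The dense count grows linearly with sequence length while the recurrent count stays fixed, and this is only the first layer. It says nothing about training speed, which is the other half of the argument.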