by mudrockbestgirl
1291 days ago
That's a great summary, but it's important to understand that much more goes into training these models. The architecture is not any kind of secret sauce, or special in any way. It's just a typical Transformer. I call this "architecture porn" - people love looking at neural net architectures and think that's the key to success. If only you knew the algorithm! It's so simple! But reality is usually much messier. The real training code will be littered with hundreds of ugly little tricks to make it work. A large part of it will be input preprocessing and data engineering, tricks to deal with exploding/vanishing gradients, monitoring, learning rate schedules and optimizer cycling, complexity for distributed training, regularization tricks, changing parts of the architecture (like attention) for performance reasons, and so on.
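
To make two of those "ugly little tricks" concrete, here is a minimal sketch in plain Python of a learning rate schedule (linear warmup, then cosine decay) and gradient clipping by global norm, a common guard against exploding gradients. This is not the training code of any particular model: the function names `lr_at_step` and `clip_by_global_norm` and every default value are made up for illustration.

```python
import math


def lr_at_step(step, base_lr=3e-4, warmup_steps=1000,
               total_steps=100_000, min_lr=3e-5):
    """Linear warmup followed by cosine decay. All defaults are illustrative."""
    if step < warmup_steps:
        # Ramp up linearly so early, noisy gradients don't blow up training.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, capped at 1.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients so their combined L2 norm is at most max_norm.

    `grads` is a list of lists of floats, one inner list per parameter tensor.
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm or total == 0.0:
        return grads, total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


if __name__ == "__main__":
    for step in (0, 999, 1000, 50_000, 100_000):
        print(step, lr_at_step(step))
    clipped, norm = clip_by_global_norm([[3.0, 4.0]], max_norm=1.0)
    print(norm, clipped)
```

Even this toy version already has five hyperparameters to tune, and real training code stacks dozens of pieces like it on top of the architecture.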