by mudrockbestgirl 1291 days ago
That's a great summary, but it's important to understand that much more goes into training these models. The architecture is not any kind of secret sauce, or special in any way. It's just a typical Transformer. I call this "architecture porn" - people love looking at neural net architectures and think that's the key to success. If only you know the algorithm! It's so simple!

But reality is usually much messier. The real training code will be littered with hundreds of ugly little tricks to make it work. A large part of it will be input preprocessing and data engineering, tricks to deal with exploding/vanishing gradients, monitoring, learning rate schedules and optimizer cycling, complexity for distributed training, regularization tricks, changing parts of the architecture for performance reasons (like attention), and so on.
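
As a rough illustration of two of the tricks listed above, here is a minimal sketch of gradient clipping by global norm and a warmup-plus-cosine learning rate schedule. The function names and parameters (`clip_by_global_norm`, `warmup_cosine_lr`, `peak_lr`, `warmup_steps`, and so on) are hypothetical and chosen for this example; nothing here is taken from any real GPT training code, and real pipelines would operate on tensors rather than flat Python lists.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm or total == 0.0:
        return list(grads)  # already small enough, return an unchanged copy
    scale = max_norm / total
    return [g * scale for g in grads]


def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Clipping guards against a single exploding gradient step, and the warmup phase keeps early updates small while optimizer statistics are still noisy. Both are only two of the many pieces the comment is pointing at.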


I'm far from an expert in this field, but based on my conversations with people who are, I think this is getting less true. Normally these models are trained with straightforward optimizers (basically naive SGD), since advances like batch normalization and residual connections make the fancier stuff unnecessary. I think the learning rate schedules used for these big networks tend to be simple as well, just two or three steps.
I work in this field (PhD candidate), and what you say is true for smaller models, but not for GPT-3 scale models. Training large-scale models involves a lot more, as the OP said. It's not just learning rate schedulers, it's a whole bunch of stuff.

See this logbook from training the GPT-3 sized OPT model - https://github.com/facebookresearch/metaseq/blob/main/projec...

It seems like the majority of problems in this log are devops problems, which seem to come from a combination of ML people doing devops work without devops experience and a really bad cloud vendor. I've been running multiple bare-metal nodes with 8 GPUs each, 24/7 for months at almost 100% utilization, and had 100x fewer problems than they had.
It is neither as simple as the person you are responding to claims, nor as complicated as you make it seem. It will only get simpler with time.
So creating each new rev of GPT-3 would involve going through something like all those messy steps in that logbook?
You put your finger on exactly what I find incredible about the recent progress in ML - the reason I wrote this post was to see how much I could de-mystify these state-of-the-art models for myself, and the conclusion is that (after the model is trained) it all really boils down to a couple of matrix multiplications! All the impressive results we see are not coming from an extremely complicated system ('complicated' like a fighter jet is, with many different subsystems, which you'd need to read many books to memorize).

Of course, there's all the secret sauce to actually getting the models to learn anything, and all the empirical progress we make to make the training more efficient (ReLUs, etc). But how many of those are fundamental, vs. simply efficiency shortcuts? And: if you'd asked me 10 years ago what I thought it would take to get the kind of output these large models are getting these days, I would not have guessed anything nearly as simple as what those models actually are.
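
To make the "couple of matrix multiplications" point concrete, here is a minimal sketch of single-head causal self-attention in plain Python. It is an illustration under simplifying assumptions, not the code of any actual model: the helper names (`matmul`, `softmax`, `causal_self_attention`) and the weight matrices `wq`, `wk`, `wv` are made up for this example, and a real Transformer adds multiple heads, layer norm, residual connections and an MLP block on top.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x, wq, wk, wv):
    """x: (tokens x dim). Returns (output, attention weights)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = []
    for i, row in enumerate(scores):
        # Token i may only attend to positions j <= i (causal mask).
        masked = [s / math.sqrt(d) if j <= i else float("-inf")
                  for j, s in enumerate(row)]
        weights.append(softmax(masked))
    return matmul(weights, v), weights
```

Apart from the softmax and the mask, every step is a matrix product, which is the sense in which inference is "simple" compared with the training process discussed elsewhere in this thread.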

Don't know. Karpathy has a very compact implementation of GPT [0] using standard technology (it could be even more compact, but it reimplements, for example, the attention layer for teaching purposes). While he presumably has no access to how the real model was trained exactly, if there were more to it, I think he would be the kind of person to point it out.

[0] https://github.com/karpathy/minGPT/tree/master/mingpt

I've recently come to the conclusion that the magic of fully connected neural networks is that there are almost no tricks to reach close to sota. Dense layers + relu + adam = it just works
Sorry, but this is just wrong: using only fully connected layers would result in pretty bad performance on images, text, audio, etc., or at the very least require much more data to perform well. At least use the right type of architecture for each data modality; then I agree that the basic version won't perform much worse than sota in the real world.
I think part of parent is wrong but part is correct.

There are many rules of thumb that took the last 5+ years to discover but are now quite standard. You are nitpicking on fully connected, but if we add dropout, weight initialization, and adaptive learning rate to what they said, then we are fairly close to being able to at least get a deep architecture to overfit a toy dataset and be off to the races for then applying it to a larger dataset.
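
The "overfit a toy dataset first" sanity check can be sketched in a few lines. This is a hypothetical example, not anyone's actual workflow: it trains a tiny dense + ReLU network on XOR with plain full-batch gradient descent (not Adam, and without dropout), using a simple uniform weight initialization. The function name `train_xor` and all hyperparameter values are assumptions picked for illustration.

```python
import random


def train_xor(hidden=8, lr=0.03, steps=2000, seed=0):
    """Fit a 2-hidden-1 ReLU network to XOR; returns the loss at every step."""
    rng = random.Random(seed)
    data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
    w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    losses = []
    for _ in range(steps):
        gw1 = [[0.0, 0.0] for _ in range(hidden)]
        gb1 = [0.0] * hidden
        gw2 = [0.0] * hidden
        gb2 = 0.0
        loss = 0.0
        for x, y in data:
            # Forward pass: dense -> ReLU -> dense.
            pre = [w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j] for j in range(hidden)]
            h = [p if p > 0.0 else 0.0 for p in pre]
            out = sum(w2[j] * h[j] for j in range(hidden)) + b2
            err = out - y
            loss += err * err / len(data)
            # Backward pass for the mean squared error.
            dout = 2.0 * err / len(data)
            gb2 += dout
            for j in range(hidden):
                gw2[j] += dout * h[j]
                if pre[j] > 0.0:  # ReLU passes gradient only where it is active
                    dpre = dout * w2[j]
                    gw1[j][0] += dpre * x[0]
                    gw1[j][1] += dpre * x[1]
                    gb1[j] += dpre
        losses.append(loss)
        # Plain gradient descent update.
        for j in range(hidden):
            w1[j][0] -= lr * gw1[j][0]
            w1[j][1] -= lr * gw1[j][1]
            b1[j] -= lr * gb1[j]
            w2[j] -= lr * gw2[j]
        b2 -= lr * gb2
    return losses
```

If the loss in `train_xor()` does not fall over time on four data points, something is broken in the forward or backward pass, which is the whole point of the check before moving to a larger dataset.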

The smart money should be on research on current shortcomings that will become deal breakers when AI is fully pervasive in society. For example, addressing catastrophic forgetting seems to me to be a very profitable research aim.
Maybe I wasn't clear enough but of course I'm not implying that you can reach sota on image classification with fcnns. There are many problems where the input space is not as noisy, redundant and structure bearing as with images.
I used to work in data engineering for ML and yes, I'd say 90% of our technical expertise on both the science and engineering side went into designing the datasets.
It feels like this is less true for GPT though, especially as OpenAI seems to be adopting a 'kitchen sink' approach.
Just getting plain text out of the web without getting flooded with boilerplate, noise, SEO spam, duplication, infinite pages like calendars, etc. is already a hard data engineering problem.
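
A toy version of that cleanup step might look like the sketch below. It is deliberately naive and purely illustrative: the heuristics (dropping lines with fewer than `min_words` words as likely boilerplate, and removing exact duplicates after whitespace and case normalization) and the names `normalize` and `clean_documents` are assumptions for this example. Production pipelines typically need near-duplicate detection, language identification, and quality filtering on top.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_documents(docs, min_words=5):
    """Drop short boilerplate-like lines, then drop exact duplicate documents."""
    seen = set()
    kept = []
    for doc in docs:
        lines = [ln.strip() for ln in doc.splitlines()]
        # Heuristic: very short lines are often menus, buttons or footers.
        body = [ln for ln in lines if len(ln.split()) >= min_words]
        text = "\n".join(body)
        if not text:
            continue
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already kept
        seen.add(digest)
        kept.append(text)
    return kept
```

Even this tiny filter shows the trade-off: raising `min_words` removes more navigation debris but also starts deleting short, legitimate sentences.
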
You're only thinking of the training data. But the pre-trained model is like a newborn, thrashing and yelling and not listening. It needs a second level of training made of a mix of about 1800 supervised tasks. Now it has progressed a little and you can get it to listen, but it's still not ok; it's like a 5-year-old. You need to label more data with human preferences and fine-tune the model to align it with what we think is good behaviour. Now it behaves like a 10-year-old.

In the original dataset you already combine dozens of sources - web scrapes, book collections, paper collections, materials in many languages, etc. In the second stage you have thousands of small supervised datasets. In the third stage you have to label. So I think the dataset building phase is pretty difficult.