by LeanderK
1291 days ago
It has long been shown experimentally that neural networks do in fact generalise and do not just memorise the training samples. What we do not see here is the convergence of the empirical distribution to the ideal distribution: the data is too sparse and the dimensionality too high. The amount of data is undoubtedly enormous, but it is not that simple. Only years and years of research have led to models that are capable of learning from such enormous amounts of data. At the same time we see steady improvements on fixed datasets, which means we are in fact making real progress on quite a lot of fronts. More data efficiency would be great, but at least we do have those datasets for language-related tasks. It has also been shown that fine-tuning works quite well, which might be a way to escape the dreaded data inefficiency of our learning models. In the end, we are not really in the business of copying the brain but of creating models that learn from data. If we arrive at a model that solves the problem we are interested in through different means than a human would, e.g. first pre-training on half of the internet and then fine-tuning on your task, we would be quite happy and it would not be seen as a dealbreaker. Of course, we would really like to have models that learn faster or have more skills, but it's amazing what's possible right now. What I find inspiring is how simple the fundamental building blocks of our models are, from gradient descent to matrix multiplication to ReLUs (just a max(x, 0)). It's not magic, just research.
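To make the "simple building blocks" point concrete, here is a minimal sketch in plain Python that uses only those three pieces: a matrix-vector product, a ReLU, and full-batch gradient descent. Everything in it is an assumption chosen for illustration (the toy target |x|, the layer size, the learning rate, and the names `relu`, `matvec`, `forward` and `train`); it is not how any real model is trained.

```python
import random

def relu(x):
    # The ReLU from the comment above: just max(x, 0).
    return max(x, 0.0)

def matvec(W, x):
    # Matrix-vector product, one dot product per row of W.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(W1, w2, x):
    # One hidden layer: matrix multiply, ReLU, then a weighted sum.
    pre = matvec(W1, x)
    hidden = [relu(p) for p in pre]
    y = sum(w * h for w, h in zip(w2, hidden))
    return pre, hidden, y

def train(data, n_hidden=8, lr=0.02, steps=1000, seed=0):
    # Full-batch gradient descent on the mean of 0.5 * (y - target)^2.
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    losses = []
    for _ in range(steps):
        grad_W1 = [[0.0] * n_in for _ in range(n_hidden)]
        grad_w2 = [0.0] * n_hidden
        total = 0.0
        for x, target in data:
            pre, hidden, y = forward(W1, w2, x)
            err = y - target
            total += 0.5 * err * err
            for j in range(n_hidden):
                grad_w2[j] += err * hidden[j]
                # The ReLU passes gradient only where its input was positive.
                if pre[j] > 0.0:
                    for i in range(n_in):
                        grad_W1[j][i] += err * w2[j] * x[i]
        n = len(data)
        for j in range(n_hidden):
            w2[j] -= lr * grad_w2[j] / n
            for i in range(n_in):
                W1[j][i] -= lr * grad_W1[j][i] / n
        losses.append(total / n)
    return W1, w2, losses

# Toy data: learn y = |x| on a small grid; the constant 1.0 acts as a bias input.
xs = [i / 4.0 for i in range(-4, 5)]
data = [([x, 1.0], abs(x)) for x in xs]
W1, w2, losses = train(data)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The printed loss should shrink over training, which is the whole trick in miniature: nothing here goes beyond multiplication, addition and a max.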
|
Transformers famously employ softmax inside the attention mechanism, where it normalises the attention scores into the attention matrix. It is very rare to see softmax anywhere other than the final layer.
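For anyone who has not seen where the softmax sits, here is a rough sketch of single-head scaled dot-product attention in plain Python. The function names and the tiny `Q`, `K`, `V` inputs are made up for illustration, and real implementations add batching, masking and multiple heads on top of this.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # For each query: score it against every key, softmax the scores,
    # then take the weighted average of the value vectors.
    d = len(K[0])
    weights = []
    outputs = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Hypothetical toy inputs: two tokens with two-dimensional embeddings.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is one row of the attention matrix: non-negative and summing to 1, which is exactly what the softmax is there to guarantee.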