Current AI/ML is in fact a reflection of its training data. Change the manifold, and you change the response. Unfortunately, of course, the training data is the entire internet.
GP mentioned that the current slate of transformer based AIs are not transformative in the same way the Internet was. Rather it's more of a triumph of data engineering practices.
OP disagrees with GP. OP's main thesis is that AI enables a lot of new applications. OP claims that GP is simply looking at it as if it were training data.
I stated that current AI techniques ARE indeed just reflections of the data used in training. I agree with GP that the current "AI"s are simply not transformative in the same way the Internet was.
If you change the training data for the current generation of AI, you get different behaviours. The training data forms a manifold - which you can think of as a landscape with features forming valleys and hills. What the current generation of AI does is try to find a shape that fits the landscape - think of it like taking a very large sheet of cloth to cover a landscape. The stiffer the cloth, the less well it fits the landscape. The "stiffness" of the cloth corresponds to how few parameters the neural network has: the more parameters, the more pliable the cloth. Modern deep nets are highly overparameterized - imagine a very soft, pliable cloth - so of course they fit the landscape well.
So if you have different training data, the neural network will fit that different landscape just as well. Hence the response will be different.
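To make the cloth analogy concrete, here is a rough sketch, not how a real deep net is trained. It assumes a toy piecewise-constant "model" whose parameters are just the per-bin means, and two made-up 1-D "landscapes" (a sine and a cosine). The names `fit_bins`, `predict`, `landscape_a`, and `landscape_b` are all hypothetical.

```python
import math

def fit_bins(xs, ys, n_bins, lo=0.0, hi=1.0):
    """Fit a piecewise-constant model: one parameter (mean of y) per bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for x, y in zip(xs, ys):
        i = min(int((x - lo) / width), n_bins - 1)
        sums[i] += y
        counts[i] += 1
    overall = sum(ys) / len(ys)
    # Empty bins fall back to the overall mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]

def predict(params, x, lo=0.0, hi=1.0):
    """Look up the parameter of the bin that x falls into."""
    width = (hi - lo) / len(params)
    return params[min(int((x - lo) / width), len(params) - 1)]

def mse(params, xs, ys):
    """Mean squared error of the fitted model on (xs, ys)."""
    return sum((predict(params, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Two toy "landscapes" (assumed for illustration only).
xs = [i / 199 for i in range(200)]
landscape_a = [math.sin(2 * math.pi * x) for x in xs]
landscape_b = [math.cos(2 * math.pi * x) for x in xs]

# Few parameters = stiff cloth; many parameters = pliable cloth.
stiff = fit_bins(xs, landscape_a, n_bins=2)
pliable = fit_bins(xs, landscape_a, n_bins=50)
err_stiff = mse(stiff, xs, landscape_a)
err_pliable = mse(pliable, xs, landscape_a)

# Same "architecture", different training data, different response.
other = fit_bins(xs, landscape_b, n_bins=50)
resp_a = predict(pliable, 0.25)
resp_b = predict(other, 0.25)

print(f"stiff error={err_stiff:.4f}, pliable error={err_pliable:.4f}")
print(f"response at x=0.25: trained on A={resp_a:.3f}, trained on B={resp_b:.3f}")
```

The 50-bin model hugs the landscape far more closely than the 2-bin one, and the same 50-bin model answers the same query differently depending on which landscape it was fitted to.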
It's unfortunate that the training data is the entire internet for a few reasons:
1. Only the rich can train a vaguely competent AI. You're at the whims of those well-resourced enough.
2. There's no "alternate" training dataset anymore. (Though one clever approach, which OpenAI is reported to use, is Mixture of Experts models: several expert sub-networks are trained together with a gating network that routes each input to a few of them, so different experts specialize on different parts of the data and you get multiple competencies.)
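For anyone unfamiliar with Mixture of Experts, the routing idea can be sketched roughly as below. This is a toy forward pass only. The gate weights and the two "experts" are hand-set and hypothetical, whereas in a real MoE layer the gate and the experts are learned jointly. Nothing here reflects any particular lab's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_weights, experts, top_k=1):
    """Route x to the top_k experts chosen by a linear gate and mix their outputs."""
    scores = [w * x + b for (w, b) in gate_weights]
    probs = softmax(scores)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)  # renormalise over the selected experts
    output = sum(probs[i] / norm * experts[i](x) for i in top)
    return output, top

# Hypothetical experts: one "specialises" in negative inputs, one in positive.
experts = [lambda x: -x, lambda x: x * x]
# Hand-set gate (assumption): favours expert 0 for x < 0 and expert 1 for x > 0.
gate_weights = [(-4.0, 0.0), (4.0, 0.0)]

out_neg, chosen_neg = moe_forward(-2.0, gate_weights, experts)
out_pos, chosen_pos = moe_forward(3.0, gate_weights, experts)
print(out_neg, chosen_neg)
print(out_pos, chosen_pos)
```

Only the selected experts run for a given input, which is where the "multiple competencies" intuition comes from.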
But you are specifically talking about one type of AI, which is a generative language model. There are tons of other AIs with different applications that do not need to be trained on the entire internet. You have computer vision, which splits into object recognition, classification, OCR, etc.; you have audio, which covers text-to-speech (and the reverse), music generation, and all sorts of other things; machine translation; sentiment analysis (I won't list all the categories on Hugging Face, but you get my point). These are not differentiated merely by 'training data' to my understanding, so that's why your comment didn't make sense to me.
Calling all AI LLMs is like calling all of the internet the web. Of course if I am mistaken, corrections are welcome.
I agree. There are other types of AIs with different applications that do not need to be trained on the internet. The examples you have given, however, are ones where the deep nets are extremely data hungry.
Take computer vision for example - a "hello world" version of object recognition would use ImageNet, which is 14 million hand-annotated images. Or CIFAR-10, which is 60,000 labeled images drawn from the 80 Million Tiny Images dataset. But that of course only sets the stage for training data differentiation. Google's image recognition algorithm is far superior to other search engines'. Why? Because of Google's data set.
Any Tom, Dick and Harry can create their own image recognition AI and train it on all the public datasets (COCO, CIFAR, ImageNet), but that's considered pretty baseline nowadays. The differentiator is what _other_ datasets you have.
Different datasets yield different results. The network doesn't matter much. More data is better (usually).
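A toy illustration of the "what _other_ data do you have" point, under heavy assumptions: the model is a 1-nearest-neighbour lookup, not a deep net, the "public" set covers only half of the input range, and a hypothetical "private" set covers the rest. All names and data are made up for the sketch.

```python
import math

def target(x):
    """The ground truth the model is supposed to learn (toy assumption)."""
    return math.sin(2 * math.pi * x)

def nearest_neighbor_predict(train, x):
    """Predict with the label of the closest training point (1-NN)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mean_abs_error(train, test_xs):
    """Average absolute error of the 1-NN model over the test inputs."""
    errors = (abs(nearest_neighbor_predict(train, x) - target(x)) for x in test_xs)
    return sum(errors) / len(test_xs)

# "Public" data everyone has: covers only the left half of the input range.
public = [(i / 100, target(i / 100)) for i in range(0, 50)]
# Hypothetical "private" data: covers the right half.
private = [(i / 100, target(i / 100)) for i in range(50, 101)]

# Evaluate over the whole range with the exact same model in both cases.
test_xs = [i / 1000 for i in range(1001)]
err_public_only = mean_abs_error(public, test_xs)
err_with_private = mean_abs_error(public + private, test_xs)

print(f"public only: {err_public_only:.3f}")
print(f"public + private: {err_with_private:.3f}")
```

The model is identical in both runs. Only the extra data changes, and the error over the full range drops sharply.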
> But you are specifically talking about one type of AI, which is a generative language model.
...Because that's easily and widely understood to be what people mean in recent times when they're talking about "AI", referring to the stuff that's in the news, without further qualifiers.
If you want to talk about something more specific, you are going to need to be explicit about it, rather than expecting everyone else to understand what you've got in your head without actually saying it.
This is like saying "but 'crypto' means so much more than just cryptocurrency! There's a whole cryptography field out there that does lots of good stuff!" It's true, but it's not helpful, because it's ignoring the obvious (at least to the other participants in the discussion) context. In this particular case, the context should be even more obvious because it's so clear that's what the article is talking about.
It doesn't matter how knowledgeable and precise the people you're talking with are; you still need to communicate clearly about what you're actually talking about.
In my opinion, your response is tautological and not related to the point I made that existing AI is good enough to start building applications and functionality.
It was good enough in 2015/2016 for me to run a startup that allowed people to program in natural language. We even had paying clients though eventually none could stomach the $2000 per month for incremental/on-line training costs.
The only real difference between then and now is that OpenAI's models are significantly better than my models from 2015, and they are better because, well, they can afford to pile on more data. TBH, I never even considered using a large proportion of the whole internet as a training set to be remotely possible, due to the sheer mind-boggling costs.
Even now, to go through about 10% of The Pile would cost me way too much money.
And it's not really cost effective - as well as being an epicenter of culture wars: "Your AI is woke! Your AI is fascist!"
THIS part of the AI sector is just a giant pyramid scheme - impress the investors so they shovel trillions your way. That's not exactly new in Silicon Valley - keep the valuation of a hot potato going up until someone is left holding the bag.
AI's most useful applications are not the generalist ones.