by Eisenstein
828 days ago
But you are specifically talking about one type of AI, which is a generative language model. There are tons of other AIs with different applications that do not need to be trained on the entire internet. You have computer vision, which splits into object recognition, classification, OCR, etc.; you have audio, which covers text-to-speech (and the reverse), music generation, and all sorts of other things; machine translation; sentiment analysis (I won't list all the categories on Hugging Face, but you get my point). These are not differentiated merely by 'training data', to my understanding, so that's why your comment didn't make sense to me. Calling all AI LLMs is like calling all of the internet the web. Of course, if I am mistaken, corrections are welcome.
|
Take computer vision, for example: a "hello world" version of object recognition would use ImageNet, which is about 14 million hand-annotated images, or CIFAR-10, which is 60,000 labeled images (a subset of the 80 Million Tiny Images collection). That, of course, only sets the stage for training data differentiation. Google's image recognition algorithm is far superior to other search engines'. Why? Because of Google's data set.
Any Tom, Dick, and Harry can go create their own image recognition AI and train it on all the public datasets (COCO, CIFAR, ImageNet), but that's considered pretty baseline nowadays. The differentiator is what _other_ datasets you have.
Different datasets yield different results, whatever the network. More data is better (usually).
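To make the "same network, different data" point concrete, here is a toy sketch. It assumes nothing about how Google or anyone else actually trains: the data is synthetic (two 2-D Gaussian blobs), the "network" is just a nearest-centroid classifier, and every name in it (`make_data`, `fit_centroids`, and so on) is made up for illustration. The only thing that changes between the two runs is how much training data the model sees.

```python
import random

def make_data(n_per_class, rng):
    """Synthetic stand-in for a dataset: class 0 near (0, 0), class 1 near (3, 0)."""
    data = []
    for label, center_x in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            point = (rng.gauss(center_x, 1.0), rng.gauss(0.0, 1.0))
            data.append((point, label))
    return data

def fit_centroids(data):
    """The 'network': average the points of each class. Identical in both runs."""
    sums = {0: [0.0, 0.0, 0], 1: [0.0, 0.0, 0]}
    for (x, y), label in data:
        sums[label][0] += x
        sums[label][1] += y
        sums[label][2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}

def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    def squared_distance(label):
        cx, cy = centroids[label]
        return (point[0] - cx) ** 2 + (point[1] - cy) ** 2
    return min(centroids, key=squared_distance)

def accuracy(centroids, data):
    """Fraction of points in data that the model labels correctly."""
    correct = sum(predict(centroids, point) == label for point, label in data)
    return correct / len(data)

rng = random.Random(0)                   # fixed seed so reruns give the same numbers
test_set = make_data(2000, rng)          # shared held-out set for both models

small_model = fit_centroids(make_data(2, rng))     # 4 training points in total
large_model = fit_centroids(make_data(2000, rng))  # 4000 training points in total

acc_small = accuracy(small_model, test_set)
acc_large = accuracy(large_model, test_set)
print(f"small dataset: {acc_small:.3f}   large dataset: {acc_large:.3f}")
```

With only a couple of points per class the centroids are noisy, so the small-data score moves around a lot from seed to seed, while the large-data score should sit close to the best this model can do on this data. That is the "(usually)" part: a lucky small sample can still land near the right answer.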