by nightski
1039 days ago
It sounds nice in theory, but the data itself could be problematic. There is no temporal nature to it. You can have duplicate data points, or many data points that are closely related but describe the same thing, event, etc. So while showing the model each data point only once ensures you do not put any extra weight on a data point, it doesn't help you at all if the dataset itself is skewed. And just by trying to make the dataset diverse, you could skew it so that it no longer reflects reality. I just don't think enough attention has been paid to the data, and too much has been paid to the model. But I could be very wrong. There is a natural temporality to the data humans receive. You can't relive the same moment twice. That said, human intelligence is on a scale too and may be affected in the same way.
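To make the duplicate problem concrete, here is a rough sketch of what catching exact and closely related data points might look like. Everything in it is an assumption for illustration: the function names, the 3-word shingles and the 0.5 threshold are hypothetical, not taken from any real training pipeline.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def shingles(text: str, n: int = 3) -> set:
    """Set of overlapping n-word windows taken from the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe(docs, threshold=0.5):
    """Keep the first copy of each document; drop exact and near duplicates.

    threshold is a hypothetical cut-off: a document is dropped when its
    shingle overlap with an already kept document reaches this value.
    """
    seen_hashes = set()
    kept = []
    kept_shingles = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        s = shingles(doc)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # closely related to something already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(s)
    return kept


docs = [
    "The cat sat on the mat today",
    "the cat  sat on the mat today",
    "The cat sat on the mat yesterday",
    "Stocks fell sharply after the announcement",
]
result = dedupe(docs)
print(result)
```

This compares every document against every kept one, so it only works for small samples; at scale people usually reach for something like MinHash with locality-sensitive hashing. It also does nothing about the skew problem: a deduplicated dataset can still misrepresent reality.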
|
I wholly agree. Everyone is blinded by models - GPT4 this, LLaMA2 that - but the real source of the smarts is in the dataset. Why else would any model, no matter how its architecture is tweaked, learn about the same abilities from the same data? Why would humans all be able to learn the same skills when every brain is quite different? It was the data, not the model.
And since we are exhausting all the available quality text online, we need to start engineering new data with LLMs and validation systems. AIs need to introspect more into their training sets: not just train to reproduce them, but analyse, summarise and comment on them. We reflect on our information; AIs should do more reflection before learning.
More fundamentally, how are AIs going to evolve past human level unless they make their own data or they collect data from external systems?