HN Mirror



	by zarzavat 1158 days ago
	After the low-hanging fruit (the high-quality data such as scientific papers, libgen, stackexchange, wikipedia, etc.) has been exhausted, that’s it. There’s no more data of that kind. There’s not 9 other wikipedias or 9 other libgens. There is only a certain quantity of high-quality codified knowledge in existence, and models need to be able to deal with that constraint. Feeding them more and more lower-quality text is not going to improve performance, because we have already fed them all the text that we use. There’s a reason that PhDs don’t involve reading tumblr all day.

2 comments

PoignardAzur 1157 days ago

> There’s not 9 other wikipedias

By the way, I wonder how much you could get from "history" data: wikipedia history pages, talk pages, commit diffs on github, pull request discussions, etc.

AFAIK so far we've only been using the finished code "artifacts", but if we're desperate for more tokens to train on, we might get a lot of mileage from just "all different versions of this dataset over time".


visarga 1158 days ago

There's a reason there are so many review papers, which are just syntheses of a topic over a certain period of time. Second-order analysis is useful content, not junk. It can cross-reference facts and detect inconsistencies. Combining multiple sources can lead to new insights and reveal trends.
