| HN Mirror

	by jpalomaki 48 days ago
	Isn't training material the biggest problem for truly open source LLMs (ones that could compete with top-tier models)? The computation part can be solved with money, but compiling a comprehensive training set that can be freely shared and is free of copyright issues is pretty much impossible.

3 comments

ajdegol 48 days ago

I wonder if we could gamify and democratise it somehow, like Folding@home and Wikipedia...

I've been training a teeny specialised model to run in a browser on a phone and detect harmonium notes played in a song (harmonium, it turns out, is a PITA; another story for another day). Getting good labelled data is _all_ of the hard work.

That being said, for cheap inference, using a big model to train something ultra-suited to the task at hand might be how we could handle local inference; I'm thinking of language-specific models.

reedciccio 48 days ago

You don't need to have fully copyright-unencumbered datasets to build Open Source AI, as that (as you say) would be impossible. https://opensource.org/ai

dorfsmay 48 days ago

Didn't the courts decide that if it's just for learning everything is fair game?