Hacker News new | ask | show | jobs
by jpalomaki 1 day ago
Isn't training material the biggest problem for truly open source LLMs (such that could compete with top tier models)? The computation part can be solved with money, but compiling a comprehensive training set that could be freely shared and free of copyright issues is pretty much impossible.
3 comments

I wonder if we could gamify and democratise it somehow, like fold-at-home and wikipedia...

I've been training a teeny specialised model to run in a browser on a phone to detect harmonium notes played in a song (harmonium turns out is a pita, another story for another day), getting good labelled data is _all_ of the hard work.

That being said, maybe for cheap inference, using a big model to train something ultra-suited for the task at hand might be how we could handle local inference; thinking language specific models.

You don't need to have fully copyright-unencumbered datasets to build Open Source AI, as that (as you say) would be impossible. https://opensource.org/ai
Didn't the courts decide that if it's just for learning everything is fair game?