by leetharris 643 days ago
There's plenty of open source AI out there that isn't Meta. It's just not as good.

The #1 problem is not compute, but data and the manpower required to clean that data up.

The main thing you can do is support companies and groups who are releasing open source models. They are usually using their own data.

4 comments

> There's plenty of open source AI out there that isn't Meta. It's just not as good.

To my knowledge all of the notable open source models are subsidised by corporations in one way or another, whether by being the side project of a mega-corp which can absorb the loss (Meta) or coasting on investor hype (Mistral, Stability). Neither of those gives me much confidence that they will continue forever, especially the latter category, which will just run out of money eventually.

For open source AI to actually be sustainable it needs to stand on its own, which will likely require orders of magnitude more efficient training, and even then the data cleaning and RLHF are a huge money sink.

If you can do 100x more efficient training with open source, closeAI can simply take that and train a model that's 100x bigger/longer/more tokens.
AKA why Unsloth is now YC backed for their even better (but closed source) fine-tuning.
https://huggingface.co/datasets/HuggingFaceFW/fineweb

The #1 problem is absolutely compute. People barely get funding for fine tunes, and even if you physically buy the GPUs it'll cost you in power consumption.

That said, good data is definitely the #2 problem. But nowadays you can just get good synthetic datasets from calling closed model APIs or just using existing local LLMs to sift through trash. That'll cost you too.

> The main thing you can do is support companies and groups who are releasing open source models. They are usually using their own data.

Alternatively, we could create standardized open source training data from sources like Wikipedia and Wikimedia, as well as public domain literature and open courseware. I'm sure that there are many other such free and legal sources of data.

But the training data is one of the key bits that makes or breaks your model's performance.

There is a reason why datasets are private and the model weights aren't.

Compute is for sure the number one problem. Look at how long it’s taking for anything better than Pony Diffusion to come out for NSFW image gen despite the insane amount of demand for it.

Look at how much compute purple AI actually has. It’s basically nothing.