by tqi
890 days ago
"the project does not benefit from the OSS feedback loop"

You can't submit PRs against training data to fix specific issues the way you can submit bug fixes to code, so I'm skeptical you would see much of a feedback loop.

"it’s hard to verify that the model has no backdoors (eg sleeper agents)"

Again, given the size of the datasets and the opaque way training works, I am skeptical that anyone would be able to tell whether there is a backdoor in the training data.

"impossible to verify the data and content filter and whether they match your company policy"

I don't totally know what this means. For one, you can (and probably should) apply company policies to the model outputs, which you can do without access to training data. Is the idea that every company could or should filter input data and train their own models?

"you are dependent on the company to refresh the model"

At the current cost of training, this is probably already true for most people.

"A true open-source LLM project — where everything is open from the codebase to the data pipeline — could unlock a lot of value, creativity, and improve security."

I am overall skeptical that this is true in the case of LLMs. If anything, I think this creates a larger surface for bad actors to attack.
>I am overall skeptical that this is true in the case of LLMs
This skepticism seems reasonable. EleutherAI has documentation for reproducing training (https://github.com/EleutherAI/pythia#reproducing-training), but so far I haven't seen it lead to anything. Lots of arXiv papers I've seen complain about time and budget constraints even for finetunes, let alone pretraining.