by Trapais 879 days ago
>Are there any true open-source LLM models, where all the training data is publicly-available (with a compatible license)

Mamba has a version trained on the publicly available SlimPajama dataset. RedPajama-INCITE was trained on the non-slimmed version of the same dataset (it's only one dataset).

I'm not sure if training scripts are available.

Pythia definitely has scripts. However, it was trained on the Pile, so you have to find Books3 on your own.

Also, I believe LLM360 is an explicit attempt to do it with LLaMA.

>Is training nondeterministic?

Correct. The PyTorch documentation has a section on the reproducibility of training.
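
For what it's worth, here is a rough sketch of the kind of setup those reproducibility notes describe. The helper name `seed_everything` is my own (hypothetical, not a PyTorch API), and I'm assuming NumPy and PyTorch may or may not be installed, so both imports are guarded. Even with all of this, results are not guaranteed to match across different hardware, drivers, or PyTorch versions.

```python
import random


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed the RNGs a typical training script touches."""
    random.seed(seed)  # Python's built-in RNG

    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's legacy global RNG
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; only the RNGs above are seeded

    torch.manual_seed(seed)  # seeds PyTorch's RNG on all devices
    # Disable cuDNN autotuning, which can select different kernels per run.
    torch.backends.cudnn.benchmark = False
    # Ask PyTorch to use deterministic algorithms and to raise an error when
    # an op has no deterministic implementation. Some CUDA ops additionally
    # need the CUBLAS_WORKSPACE_CONFIG environment variable to be set.
    torch.use_deterministic_algorithms(True)


seed_everything(1234)
first = [random.random() for _ in range(3)]
seed_everything(1234)
second = [random.random() for _ in range(3)]
print(first == second)
```

Seeding only fixes the random number streams. Data loader worker ordering and multi-GPU reduction order are separate sources of run-to-run differences that this sketch doesn't touch.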