by albertan017 829 days ago
Thanks! Training a language model from scratch is data-intensive: Llama2 was trained on 2 trillion tokens, while our dataset is around 4 billion tokens, roughly 500 times smaller.

The appropriate size of the model is not straightforward to determine. In our experiments, a 7 billion parameter model achieved 21% executability compared to just 10% for a 1 billion parameter model. However, their re-compilability rates are quite similar.

Running a 1 billion parameter model takes at least 2GB of GPU memory, which most GPUs have. A 7 billion parameter model needs about 14GB, which fits on cards like the 3090/4090. For a 33 billion parameter model, an A100 (80GB) is the single-card option. Technically a MacBook could work, but you won't really want to use it.
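
The memory figures above work out to about 2 bytes per parameter. Here is a rough sketch of that arithmetic, assuming 16-bit (fp16/bf16) weights and counting only the weights, not activations or the KV cache. The function name and the 2-byte default are illustrative assumptions, not something taken from our code.

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Rough memory needed just to hold the model weights, in GB (1e9 bytes).

    Assumption: 2 bytes per parameter (fp16/bf16). Real usage is higher
    once activations and the KV cache are included.
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, params in [("1B", 1e9), ("7B", 7e9), ("33B", 33e9)]:
        print(f"{name}: ~{estimate_weight_memory_gb(params):.0f} GB for weights")
```

By this estimate the 33B model needs about 66GB for weights alone, which is why a single 80GB card is the practical minimum.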