by pama 987 days ago
In the context of this article, the small/tiny models had 1-30 million parameters and the large model had 1.5 billion parameters. Efficient training of a 7-billion-parameter model already requires algorithms for training across multiple GPUs, because the memory needed for gradients and optimizer state will not fit in the typical 80 GB of RAM on current high-end GPUs.
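
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes mixed-precision training with Adam, which is my assumption and not something the article states: fp16 weights and gradients plus fp32 master weights and two fp32 Adam moments, so about 16 bytes per parameter. Activations are ignored, so this is a lower bound. The function names are made up for illustration.

```python
# Assumed per-parameter training state for mixed-precision Adam (an assumption,
# not taken from the article). Activation memory is deliberately left out.
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}

GPU_MEMORY_GB = 80  # typical high-end GPU mentioned above


def training_state_gb(num_params: int) -> float:
    """Memory for weights, gradients and optimizer state, in decimal GB."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


def fits_on_one_gpu(num_params: int, gpu_gb: float = GPU_MEMORY_GB) -> bool:
    """True if the training state alone fits in a single GPU's memory."""
    return training_state_gb(num_params) <= gpu_gb


if __name__ == "__main__":
    for name, n in [("30M", 30_000_000), ("1.5B", 1_500_000_000), ("7B", 7_000_000_000)]:
        gb = training_state_gb(n)
        print(f"{name}: {gb:.2f} GB, fits on one GPU: {fits_on_one_gpu(n)}")
```

Under these assumptions the 1.5B model needs about 24 GB of training state and the 7B model about 112 GB, so the 7B model overflows a single 80 GB card before any activations are counted. Different optimizers or precisions change the bytes-per-parameter figure, but not by enough to change the conclusion much.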