Hacker News new | ask | show | jobs
by moyix 1944 days ago
Yep, I trained my own BPE using HuggingFace's tokenizer library. During training I didn't keep the entire dataset in memory because even on an RTX8000 the full dataset + model weights + data used by the optimizer (ADAM) is too big.