by omneity
294 days ago
The first thing that comes to mind when reading “custom tokenizer” and “slice off the embedding layers” is that this sounds very much like pre-training from scratch, for which 2GB of data is far from enough. Even assuming you do get the data, for a model at the sizes you’re evaluating you’re most likely looking at weeks on a Colab A100-40GB. My recommendation would be to approach this with a smaller model and a different training method that doesn’t involve a new tokenizer or new embedding layers, because those are what cause the cost and time to balloon beyond feasibility.
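
For a rough sense of where "weeks" comes from, here is a back-of-envelope sketch using the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token. Everything in it is an assumption for illustration: the function name `estimate_training_days`, the example model size and token count, and the 40% utilization figure are all hypothetical, and the 312 TFLOPS default is the A100's advertised dense BF16 peak, which real training runs do not reach.

```python
def estimate_training_days(
    n_params: float,
    n_tokens: float,
    peak_flops: float = 312e12,  # A100 advertised dense BF16 peak (FLOP/s)
    utilization: float = 0.4,    # assumed fraction of peak actually achieved
) -> float:
    """Rough single-GPU training time in days, using FLOPs ~= 6 * N * D."""
    total_flops = 6.0 * n_params * n_tokens
    seconds = total_flops / (peak_flops * utilization)
    return seconds / 86_400  # seconds per day


# Hypothetical example: a 1B-parameter model trained on 20B tokens.
days = estimate_training_days(n_params=1e9, n_tokens=20e9)
print(f"about {days:.1f} days on one GPU")
```

Under these assumed numbers the estimate already lands in the multi-day-to-weeks range, and it scales linearly with both model size and token count, which is why shrinking the model or avoiding a from-scratch run changes the picture so much.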