by kir-gadjello
1138 days ago
Impressive model, thank you for releasing it under a business-friendly license! Have you considered using Google's sparse "Scaling Transformer" architecture as the base? Even at 3B scale it can generate 3-4x more tokens per FLOP while remaining competitive in perplexity with a dense transformer. I think OpenAI uses a variant of it in their ChatGPT-3.5-Turbo product. Here is the paper: https://arxiv.org/abs/2111.12763
and the implementation: https://github.com/google/trax/blob/master/trax/models/resea... if you are interested. Hope you get to look into this!
|
Like why did we even get excited? This? Great work.