This is a small transformer, trained from scratch in 1.5 hours on a 5090, that beats many LLMs. The code is open source.
I want to solve sample efficiency. This work is an attempt to find the limits of transformers and today's methods while keeping costs low so I can iterate fast.