Maybe it's just way too small; you wouldn't use Karatsuba multiplication to compute 3*5.
I'm not using a transformer, just a plain feedforward network with ReLU and dropout for a simple classifier.
I don't know, I could be wrong. I hope, and some toy experiments suggest, that even with a low parameter count it works just as well as Adam.