by bra-ket 3219 days ago
does it mean we don't need gradient descent after all to achieve the same result?
1 comment

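Nope, the Transformer is still trained with gradient descent. Attention replaces recurrence and convolution in the architecture; it does not replace the optimizer.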
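
To make that concrete, here is a toy sketch, not anything from the paper: a single self-attention head over scalar tokens whose three weights are fitted with plain gradient descent. Everything in it is an assumption for illustration (the names `attention`, `numerical_grad`, `train`, the toy data, the learning rate), and the gradients are estimated by finite differences only to keep it dependency-free. Real Transformers get their gradients from backpropagation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(params, xs):
    """Single-head self-attention over scalar tokens (toy, hypothetical)."""
    wq, wk, wv = params
    q = [wq * x for x in xs]
    k = [wk * x for x in xs]
    v = [wv * x for x in xs]
    outputs = []
    for qi in q:
        # Attention weights for this query over all keys.
        weights = softmax([qi * kj for kj in k])
        outputs.append(sum(w * vj for w, vj in zip(weights, v)))
    return outputs


def loss(params, xs, targets):
    """Mean squared error between attention outputs and targets."""
    outs = attention(params, xs)
    return sum((o - t) ** 2 for o, t in zip(outs, targets)) / len(xs)


def numerical_grad(params, xs, targets, eps=1e-5):
    """Central finite-difference estimate of d(loss)/d(params)."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grads.append((loss(up, xs, targets) - loss(down, xs, targets)) / (2 * eps))
    return grads


def train(steps=500, lr=0.01):
    """Fit the attention weights with plain gradient descent."""
    xs = [0.5, -1.0, 1.5]             # toy input tokens (assumed)
    targets = [2.0 * x for x in xs]   # toy regression targets (assumed)
    params = [0.1, 0.1, 0.1]          # wq, wk, wv
    history = [loss(params, xs, targets)]
    for _ in range(steps):
        grads = numerical_grad(params, xs, targets)
        # The gradient descent update: step against the gradient.
        params = [p - lr * g for p, g in zip(params, grads)]
        history.append(loss(params, xs, targets))
    return params, history


if __name__ == "__main__":
    params, history = train()
    print("initial loss:", history[0])
    print("final loss:", history[-1])
```

The attention layer only defines the function being fitted; the update `p - lr * g` is what does the learning, which is the part the paper's title does not do away with.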