Hacker News new | ask | show | jobs
by leod 2607 days ago
Thank you so much for your comprehensive answer, this helps a lot.

If I understand nshepperd's code correctly, it uses a constant and small learning rate. Do you know if this works better than the learning rate schedule that is usually used for Transformer models (https://www.tensorflow.org/alpha/tutorials/text/transformer_...)?

1 comments

It's a constant, yes. We haven't tried any other learning rate schedules (for my poetry GPT-2s, I simply drop the LR 10x each day or so). I have no idea if this is optimal for transfer learning or not.