| HN Mirror



	by leod 2607 days ago
	Thank you so much for your comprehensive answer; this helps a lot. If I understand nshepperd's code correctly, it uses a small, constant learning rate. Do you know if this works better than the learning rate schedule that is usually used for Transformer models (https://www.tensorflow.org/alpha/tutorials/text/transformer_...)?

1 comment

gwern 2607 days ago

It's a constant, yes. We haven't tried any other learning rate schedules (for my poetry GPT-2s, I simply drop the LR 10x each day or so). I have no idea if this is optimal for transfer learning or not.
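
For readers comparing the options mentioned in this exchange, here is a rough sketch of the three schedules as plain Python functions: a constant rate, a manual 10x drop at fixed intervals, and the warmup-then-inverse-square-root schedule from the original Transformer paper. All numeric defaults (`base_lr`, `steps_per_drop`, `d_model`, `warmup_steps`) are illustrative assumptions, not values taken from nshepperd's code or from the poetry runs, and "each day or so" is approximated here by a hypothetical fixed step count.

```python
def constant_lr(step, base_lr=1e-4):
    """Constant learning rate; base_lr is an assumed placeholder value."""
    return base_lr


def step_drop_lr(step, base_lr=1e-4, steps_per_drop=10_000, factor=0.1):
    """Multiply the rate by `factor` (10x smaller) every `steps_per_drop` steps.

    steps_per_drop stands in for "each day or so" and is purely hypothetical.
    """
    return base_lr * factor ** (step // steps_per_drop)


def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Schedule from "Attention Is All You Need":

    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)

    Linear warmup for the first warmup_steps, then inverse-sqrt decay.
    """
    step = max(step, 1)  # avoid raising 0 to a negative power at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    for s in (1, 1_000, 4_000, 10_000, 20_000):
        print(s, constant_lr(s), step_drop_lr(s), transformer_lr(s))
```

Printing the three side by side shows the qualitative difference: the Transformer schedule ramps up before decaying, while the other two never increase. Whether any of them is better for fine-tuning is exactly the open question above.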
