	by euclaise 1119 days ago
	Training RWKV as a GPT or as an RNN gives you the same results (up to floating-point rounding); they're just two ways of computing the same thing. It's trained in GPT mode because that's cheaper: you can parallelize over the sequence length. In practice it isn't going to be any different from training with back-propagation through time at the same sequence length.
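
	To make the "two ways of computing the same thing" point concrete, here is a rough sketch in plain Python. It is not RWKV's actual kernel: it assumes a simplified, single-channel decayed weighted average loosely modeled on the WKV term, and leaves out the current-token bonus and the numerical stabilization a real implementation would need. The names `wkv_parallel` and `wkv_recurrent` and the toy inputs are made up for illustration.

```python
import math

def wkv_parallel(k, v, w):
    """GPT-style: every position is computed independently from its prefix,
    so the work for different t could be done in parallel."""
    out = []
    for t in range(len(k)):
        # Weight of token i at position t: decayed by distance, scaled by exp(k_i).
        weights = [math.exp(-(t - i) * w + k[i]) for i in range(t + 1)]
        num = sum(wt * v[i] for i, wt in enumerate(weights))
        out.append(num / sum(weights))
    return out

def wkv_recurrent(k, v, w):
    """RNN-style: carry a running numerator and denominator from step to step."""
    num, den = 0.0, 0.0
    decay = math.exp(-w)
    out = []
    for kt, vt in zip(k, v):
        # Decay the old state once per step, then add the new token's contribution.
        num = decay * num + math.exp(kt) * vt
        den = decay * den + math.exp(kt)
        out.append(num / den)
    return out

# Toy inputs (hypothetical values, assumed only for this sketch).
k = [0.1, -0.3, 0.7, 0.2]
v = [1.0, 2.0, -1.0, 0.5]
w = 0.5

a = wkv_parallel(k, v, w)
b = wkv_recurrent(k, v, w)
print(all(math.isclose(x, y, rel_tol=1e-9) for x, y in zip(a, b)))
```

	The two functions agree up to floating-point rounding, not necessarily bit for bit, since they add the same terms in a different order. The parallel form does more total work per sequence in this naive version but has no step-to-step dependency, which is what makes it attractive for training.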