| HN Mirror

	by sasjaws 109 days ago
	The idea I tried to express was purely the loss-function point you mentioned: both tasks (1 vs 2 vs n) lead to identical training runs, at least with nanoGPT. I don't know whether that extrapolates to current LLM internals and current training.