HN Mirror

	by andai 73 days ago
	Is there a benchmark for these long tasks? That kind of seems like the only number worth measuring. (Of course at that point it involves memory and context management and so on, so you're testing the harness as well as the model.)