| HN Mirror



	by fc417fc802 125 days ago
	Given that models don't currently learn as they go, isn't that exactly what this benchmark is testing? If the model needs either to have been explicitly trained in a similar environment or to have a human manually input a carefully crafted prompt, then it isn't general. The latter case is a human tuning a powerful tool. If it can add the necessary bits to its own prompt while working on the benchmark, then it's generalizing.