| HN Mirror



	by wongarsu 19 days ago
	A major limitation is that they only test GPT-4o. Previous research investigating the same question, such as [1], has shown significant differences between models, and even between the languages the prompt is written in.

	[1] https://aclanthology.org/2024.sicon-1.2.pdf