| HN Mirror

	by maciejgryka 954 days ago
	I've reached similar conclusions so far. We have a custom dataset (basically visual Q&A about web apps), and `gpt4` gets roughly 90% correct while `gpt-4-1106-preview` gets only 86%. The results are a little noisy (I haven't tried the new seed functionality yet), but roughly consistent across runs. Since I created this dataset by hand, it can't really have been memorized. I'm sure there's _similar_ data in the training set, but answering correctly still requires some reasoning-like capabilities.
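
For a rough sense of how much noise a 90% vs. 86% gap can absorb, here is a minimal sketch using the binomial standard error of an accuracy score. The comment does not state the dataset size, so the `n_questions` values below are purely hypothetical, and the helper names are made up for illustration. It also assumes the two runs are independent samples, which is only an approximation when both models answer the same questions.

```python
import math


def accuracy_std_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def gap_z_score(acc_a: float, acc_b: float, n_questions: int) -> float:
    """Approximate z-score for the gap between two accuracies.

    Assumes both accuracies come from independent runs of the same size.
    """
    se_gap = math.sqrt(
        accuracy_std_error(acc_a, n_questions) ** 2
        + accuracy_std_error(acc_b, n_questions) ** 2
    )
    return (acc_a - acc_b) / se_gap


# Hypothetical dataset sizes; the real size is not given in the comment.
for n_questions in (100, 500, 2000):
    z = gap_z_score(0.90, 0.86, n_questions)
    print(f"n={n_questions}: z = {z:.2f}")
```

The z-score grows with the square root of the dataset size, so the same four-point gap can be within noise on a small hand-built set and clearly significant on a large one. When both models are scored on identical questions, a paired test such as McNemar's would be a better fit than this independent-samples approximation.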