by maciejgryka 954 days ago
I've reached similar conclusions so far. We have a custom dataset (basically visual Q&A about web apps) and `gpt4` gets roughly 90% correct, while `gpt-4-1106-preview` gets only 86%. It's a little noisy (I haven't tried the new seed functionality yet), but roughly consistent.

Since I created this dataset by hand, it can't really be memorized. I'm sure there's _similar_ data in the training set, but answering correctly still requires some reasoning-like capabilities.
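
To get a rough feel for how much of the 90% vs 86% gap could be noise, here is a minimal sketch that puts a normal-approximation 95% interval around each accuracy. The dataset size `n_items` is a made-up placeholder (the real size isn't stated above), and `accuracy_interval` is just an illustrative helper, not part of any library.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation interval for an accuracy measured on n items."""
    # Standard error of a binomial proportion.
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


# Hypothetical dataset size; substitute the real number of questions.
n_items = 200

for name, acc in [("gpt4", 0.90), ("gpt-4-1106-preview", 0.86)]:
    low, high = accuracy_interval(acc, n_items)
    print(f"{name}: {acc:.0%} (95% interval {low:.1%} to {high:.1%})")
```

With a placeholder size of 200 the two intervals overlap, so a 4-point gap alone wouldn't be conclusive at that scale; a larger dataset narrows them. Since both models answer the same questions, a paired comparison (for example McNemar's test on per-question results) would likely be more sensitive than comparing two independent intervals.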