by fergie 22 days ago
> The entire AI industry has been (not so secretly) paying a lot of experts in many fields to generate large amounts of novel training data. Novel training data that isn't found anywhere else--they hoard it--and which could actually contain original ideas.

Really? Any references to read more?


  - https://www.theverge.com/cs/features/831818/ai-mercor-handshake-scale-surge-staffing-companies
  - https://outlier.ai/math/en-us
  - https://www.opentrain.ai/
  - https://www.pin.com/blog/ai-labs-hiring-train-models/
Much of this is data annotation, reasoning trace evaluation, and problem set curation. But there is no way they haven't at least paid some mathematicians to work on research-grade problems in tandem with their models, and then used that work as training data.

Does this expert data likely contain this proof within it? No. Would it temper the impressiveness to know they have a large amount of novel mathematical training data, an internal Lean harness for evaluation of open conjectures, and spent hundreds of millions in compute to calculate this? Yes.

At some point you just have to look in the mirror and ask: what would impress you?

Also, why pay anyone when the models can keep up with all the papers, more than any one person could ever read? That seems like wasted money to me.

Another point is that that's not how AI training works anyway. It's much easier to put new material in context than to retrain the models on every bit of random maths you come across. Things at the tail end of the power law don't stick. At least, last I checked...

The opening line of my parent comment is: "This is impressive, no question." I am impressed, and the chain you are replying to is questioning how much that impression should be tempered.

They pay people for expert training data they do not share because it gives them an edge over other AI companies. And, as always, deep learning is enormously data-hungry, and we've gotten to the point where publicly available data has been exhausted.

AI companies absolutely retrain models regularly to keep up with the cutting edge. There is a reason why this announcement references an internal, unreleased model, rather than "we just put a lot of new math papers within the GPT5.5 context window and found this."