Hacker News
by mufasachan 945 days ago
> Although the author OCR’ed the SAT questions and believes that they weren’t in the training data

I agree that the author of the tweet significantly underestimates how much OCR'ed content may be in OpenAI's training data. In late August, Meta released Nougat[1], an OCR model. Its performance is wild and the model is open source.

I find it hard to believe that OpenAI is not putting effort into getting more training data from OCR'ed content. I also find it hard to believe that OpenAI waited for a Meta paper before having a performant internal OCR model.

[1]: https://arxiv.org/abs/2308.13418