by gwern 1828 days ago
It's worth noting that GPT-f already gets a big performance boost from pretraining on arXiv etc. (https://arxiv.org/pdf/2009.03393.pdf#page=7), even though those sources contain next to no Metamath or anything that looks like a raw Metamath proof, just regular natural language & LaTeX discussing math...

How well does text extraction from a PDF work? I almost never try it but thought there were random spaces in the output and such things.
A fair summary would be "often very well, but not always". A good example is the S2ORC dataset [0], a dataset of full parses of scientific PDFs. In their paper, the authors write about the difficulty of getting the parsers to work reliably, and how having multiple published versions of a PDF helped when the parser failed on the first one.

[0] https://allenai.org/data/s2orc
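
To make the "try another version when the parser fails" idea concrete, here is a minimal sketch in Python. It is not S2ORC's actual pipeline: `looks_garbled`, `first_good_parse`, the 0.3 threshold, and the `extract_text` callable (a stand-in for whatever PDF parser you use) are all hypothetical names chosen for illustration. The heuristic only targets the "random spaces" failure mode mentioned above.

```python
from typing import Callable, Iterable, Optional


def looks_garbled(text: str, max_single_char_ratio: float = 0.3) -> bool:
    """Rough heuristic (an assumption, not a standard metric): stray spaces
    inside words produce many one-letter tokens, so flag text where they
    make up more than max_single_char_ratio of all tokens."""
    tokens = text.split()
    if not tokens:
        return True  # empty output counts as a failed parse
    single = sum(1 for t in tokens if len(t) == 1 and t.isalpha())
    return single / len(tokens) > max_single_char_ratio


def first_good_parse(
    versions: Iterable[str],
    extract_text: Callable[[str], str],
) -> Optional[str]:
    """Try each available version of a paper in order and return the first
    extraction that does not look garbled, or None if all of them fail."""
    for path in versions:
        try:
            text = extract_text(path)
        except Exception:
            continue  # parser crashed on this version; try the next one
        if not looks_garbled(text):
            return text
    return None
```

Here `versions` would be something like the publisher PDF followed by the arXiv PDF of the same paper, and `extract_text` wraps the parser of your choice. Where LaTeX source is available, skipping PDF parsing entirely is likely the more reliable route.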

It's worth noting that for most papers, arXiv provides the LaTeX source for download, which is presumably what they trained on.