| HN Mirror

Hacker News


	by gwern 1828 days ago
	It's worth noting that GPT-f already gets a big performance boost from pretraining on arXiv etc. (https://arxiv.org/pdf/2009.03393.pdf#page=7) despite those sources containing next to no Metamath or anything that looks like a raw Metamath proof, just regular natural language & LaTeX discussing math...


astrange 1828 days ago

How well does text extraction from a PDF work? I almost never try it but thought there were random spaces in the output and such things.


hangsi 1828 days ago

A fair summary would be "often very well, but not always". A good example would be the S2ORC dataset [0]: a dataset of full parses of scientific PDFs. In their paper, the authors write about the difficulty of getting the parsers to work reliably, and how having multiple published versions of a PDF helped when the parser failed on the first one.

[0] https://allenai.org/data/s2orc
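
To make the fallback idea concrete, here is a minimal sketch in Python. It is not S2ORC's actual pipeline: `first_successful_parse`, `parse_pdf` and `fake_parser` are hypothetical names, and the parser is passed in as a plain function so that any real PDF library could be swapped in. It assumes a failed parse either raises an exception or returns empty text.

```python
from typing import Callable, Iterable, Optional


def first_successful_parse(
    versions: Iterable[str],
    parse_pdf: Callable[[str], str],
) -> Optional[str]:
    """Return the text of the first version that parses, or None if all fail."""
    for path in versions:
        try:
            text = parse_pdf(path)
        except Exception:
            # Assumption: a parser failure surfaces as an exception.
            continue
        if text.strip():
            # Assumption: empty output also counts as a failed parse.
            return text
    return None


def fake_parser(path: str) -> str:
    """Hypothetical stand-in for a real PDF parser; fails on the v1 file."""
    if path.endswith("v1.pdf"):
        raise ValueError("could not parse " + path)
    return "parsed text from " + path


result = first_successful_parse(["paper_v1.pdf", "paper_v2.pdf"], fake_parser)
print(result)
```

The order of `versions` decides which copy is preferred, so you would list the most trusted version first.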


moyix 1827 days ago

It's worth noting that for most papers, arXiv provides the LaTeX source for download, which is presumably what they trained on.
