by tastroder 996 days ago
This is from 2021, so it's not really news, and both the arXiv version and the version published at NeurIPS have quite a few citations. No one's suppressing this; people who don't reflect on their datasets or how they use them either don't care or fail to acknowledge that these are actual issues.

IANA AI developer, but I have been looking into this in detail recently for other purposes. I was puzzled at the lack of info about "books", and when I searched for detail (in what I believe was a reasonably diligent manner) I found surprisingly little. I assumed there would be more knowledge out there, so I asked for it here. Now I will go look up those papers to get a better sense of things. Thank you for the tip.

I note neither this paper nor any discussion of "BookCorpus" or even "book corpus" has appeared on HN previously.

Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus, 2021, Jack Bandy and Nicholas Vincent

https://arxiv.org/pdf/2105.05241.pdf?