by bnewbold 1493 days ago
It is true that some might need to be done manually, but Google Scholar shows that it can be done, with some level of accuracy, via HTML and PDF scraping. PIDs and more formalized metadata make things much easier, but Google Scholar did put pressure on platforms, publishers, and repositories to include at least minimal metadata in HTML meta tags, and that metadata can be machine-extracted. There is also a ton of content and metadata available via OAI-PMH. Neither of these technologies costs publishers anything on the margin once implemented, and many have implemented them to reap the discovery benefits of large search indices.
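To make the meta tag point concrete, here is a rough sketch of pulling the `citation_*` tags that Google Scholar asks publishers to include out of a landing page, using only the Python standard library. The sample page, the class name `CitationMetaParser`, and the helper `extract_citation_metadata` are made up for illustration; real pages are messier and this is not how any particular crawler does it.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        # Tag name -> list of values, since tags like citation_author repeat.
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = (attr_map.get("name") or "").lower()
        content = attr_map.get("content")
        if name.startswith("citation_") and content:
            self.metadata.setdefault(name, []).append(content.strip())


def extract_citation_metadata(html):
    """Return a dict of citation_* meta tag values found in an HTML string."""
    parser = CitationMetaParser()
    parser.feed(html)
    return parser.metadata


# Made-up landing page for illustration only.
SAMPLE_PAGE = """
<html><head>
  <title>Example article</title>
  <meta name="citation_title" content="An Example Article">
  <meta name="citation_author" content="Doe, Jane">
  <meta name="citation_author" content="Roe, Richard">
  <meta name="citation_doi" content="10.1234/example.5678">
  <meta name="viewport" content="width=device-width">
</head><body></body></html>
"""

if __name__ == "__main__":
    for key, values in extract_citation_metadata(SAMPLE_PAGE).items():
        print(key, values)
```

Values are kept as lists because author tags are normally repeated once per author; anything not prefixed with `citation_` (the viewport tag here) is ignored.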