Hacker News
by jasonpriem 1510 days ago
Agreed, the author disambiguation isn't quite as good as Scopus's. They have a bit of a head start on us, but we're improving it quickly.

Thanks for the suggestion about the data dump. A lot of that weight is abstracts, which come in at over 30GB just by themselves. But it's true that the JSON format has some redundancies. For now we think those are worth it, because the denormalized schema is very compatible with the API and easy for beginners to get started with. Plus you only have to download it once (for free! HT to AWS Open Data sponsorship), and after that the updates are very light.

We'll certainly consider offering a smaller, normalized format in the future though, if we get more requests for it.

1 comments

Author disambiguation can be a very hard problem in practice, especially with authors who share the same name (a relatively common occurrence in some non-Western countries) and work on similar things. It's even somewhat common to see the exact same ORCID (supposedly a unique identifier) attached to what are clearly papers published by distinct individuals. At some point, one pretty much has to guess.
Yes, I think some services attempt to retro-assign ORCIDs. I usually have to grab the affiliation and date to make sure I have the right person. Someday I want to train a model that gives me a match score for the authors I follow, so I can tell apart my Smiths and Lees etc.
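
For what it's worth, the affiliation-and-date check can be roughed out without any training at all. The sketch below is only an illustration: the `AuthorRecord` shape, the field names, the weights, and the 10-year window are all made-up assumptions, not anything from a real API or dataset, and a trained model would presumably replace the hand-picked weights.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class AuthorRecord:
    # Hypothetical record shape; field names are illustrative only.
    name: str
    affiliation: str
    year: int


def text_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] via difflib."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(known: AuthorRecord, candidate: AuthorRecord,
                max_year_gap: int = 10) -> float:
    """Weighted score in [0, 1]; the weights are arbitrary assumptions."""
    name_sim = text_similarity(known.name, candidate.name)
    affil_sim = text_similarity(known.affiliation, candidate.affiliation)
    # Publication dates far apart count against a match, down to zero.
    gap = abs(known.year - candidate.year)
    year_sim = max(0.0, 1.0 - gap / max_year_gap)
    return 0.4 * name_sim + 0.4 * affil_sim + 0.2 * year_sim


if __name__ == "__main__":
    known = AuthorRecord("J. Smith", "University of Leeds", 2019)
    candidates = [
        AuthorRecord("J. Smith", "University of Leeds", 2020),
        AuthorRecord("J. Smith", "Kyoto University", 1995),
    ]
    # Rank candidates from most to least likely to be the same person.
    for c in sorted(candidates, key=lambda c: match_score(known, c), reverse=True):
        print(f"{match_score(known, c):.2f}  {c}")
```

Two same-named authors at the same institution in the same year would still score as a match here, so at some point it remains a guess.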