Hacker News
by oliverluk 913 days ago
Thanks a lot for the detailed response! I was indeed not aware of the scope of the problem and the fact that it has its own Wikipedia page.

Also, many thanks for making me aware of ORCID (together with "DamonHD"). After doing some digging I found ORCID API endpoints that can resolve DOIs and PMIDs to ORCID profiles, which is great.
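
For illustration, here is a minimal sketch of building such a lookup URL. It assumes the ORCID public API v3.0 search endpoint and its `doi-self` / `pmid-self` query fields; the post does not name the endpoints it found, so check both against the current ORCID documentation. The helper name `orcid_search_url` is made up for this example.

```python
from urllib.parse import urlencode

# Assumption: public ORCID search endpoint (v3.0). Verify against the ORCID docs.
ORCID_SEARCH = "https://pub.orcid.org/v3.0/search/"


def orcid_search_url(doi=None, pmid=None):
    """Build a search URL for ORCID records linked to a DOI and/or a PMID."""
    clauses = []
    if doi:
        # Assumption: `doi-self` is the search field for a work's DOI.
        clauses.append(f'doi-self:"{doi}"')
    if pmid:
        # Assumption: `pmid-self` is the search field for a work's PMID.
        clauses.append(f'pmid-self:"{pmid}"')
    if not clauses:
        raise ValueError("need a DOI or a PMID")
    # urlencode percent-escapes the quotes, colon and slash in the query.
    return ORCID_SEARCH + "?" + urlencode({"q": " OR ".join(clauses)})
```

The resulting URL would be fetched with an `Accept: application/json` header. The request itself is left out here so the sketch stays free of network access.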

I will take a closer look at the "disambiguation" project you linked to and will see what approach I can take for reliably resolving email addresses to first names (while filtering out non-personal email addresses).

That being said, I fear that resolving non-ORCID + non-email authors using the SerpApi + LLM approach I described in my initial post is still the best shot I currently have.

1 comment

Even ORCID is not foolproof; there have been cases where it misattributes papers. For a proper job you'd need to look at the actual journal article text and cross-reference possible authors' CVs where they are easily available. It's not something that can be seamlessly automated.

Interesting, thanks for the heads up. I will implement the following safeguard:

* Known: Paper DOI, paper PMID, author last name, author initials

* Get connected ORCIDs based on the DOI / PMIDs

* Check if the known last name matches the last name of the ORCID profile (also include the "Also known as" section of the profile)

This may lead to some false negatives (for example, name changes that were not properly recorded), but if I can reduce the number of manual lookups to below 100, it's already a win.
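
The safeguard above could be sketched roughly as follows. This is only an illustration under stated assumptions: each ORCID profile has already been fetched and flattened into a plain dict with hypothetical keys `family_name` and `other_names` (the "Also known as" entries), and names are compared after stripping accents, case, and punctuation. The normalization rule is an added assumption, not something specified in the post.

```python
import unicodedata


def normalize(name):
    """Casefold, drop accents and punctuation: 'Müller' -> 'muller'."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.casefold())
    return " ".join(cleaned.split())


def profile_names(profile):
    """Family name plus every 'Also known as' entry, normalized."""
    names = [profile.get("family_name", "")]
    names.extend(profile.get("other_names", []))
    return [normalize(n) for n in names if n]


def matches_last_name(known_last_name, profile):
    """True if the known last name matches the profile's family name or an alias."""
    target = normalize(known_last_name)
    if not target:
        return False
    for name in profile_names(profile):
        # Aliases are often full names ("Jane Muller"), so also accept
        # the known last name as the trailing part of an alias.
        if name == target or name.endswith(" " + target):
            return True
    return False


def filter_candidates(known_last_name, profiles):
    """Keep only the ORCID profiles whose names agree with the known last name."""
    return [p for p in profiles if matches_last_name(known_last_name, p)]
```

A quick usage example with made-up profiles (the ORCID iDs are placeholders):

```python
profiles = [
    {"orcid": "0000-0000-0000-0001", "family_name": "Müller", "other_names": []},
    {"orcid": "0000-0000-0000-0002", "family_name": "Smith", "other_names": ["Jane Muller"]},
    {"orcid": "0000-0000-0000-0003", "family_name": "Jones", "other_names": ["J. Jones"]},
]
kept = filter_candidates("Muller", profiles)  # keeps the first two profiles
```

Papers for which no profile survives the filter would then go to the manual-lookup pile. The known initials could be compared against the profile's given names as a further check.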