by Someone 914 days ago
You’re doing author name disambiguation (https://en.wikipedia.org/wiki/Author_name_disambiguation). That’s a difficult, messy problem.

For example, there’s https://revstat.ine.pt/index.php/REVSTAT/article/view/382, which has two authors with the same first name and surname working in the same institute and field. Try separating them without ORCID.

Such problems aren’t rare if your papers list only initials, and they are especially common with Chinese or Korean names, since name clashes are far more frequent in those countries than in the ‘west’: over 1 in 5 South Koreans have the surname 김 (Kim) and 1 in 7 이 (Lee), according to https://en.wikipedia.org/wiki/List_of_Korean_surnames. China is slightly less bad (https://en.wikipedia.org/wiki/List_of_common_Chinese_surname...), but compensates by having a much larger population.

I can’t find it, but I remember seeing a paper with 4 or 5 authors named “Kim” with the same initials.

That author name disambiguation Wikipedia page links to https://github.com/neozhangthe1/disambiguation. I don’t know how good it is, but you should consider it.

And as others have said, you should use ORCID, if available. You should also use the email address (often included in article metadata, at least for the corresponding author), but you can assume neither that every author has a single email address nor that a single email address belongs to a single person.

Another case to worry about is that names can change, for example because of use of a different romanization (https://en.wikipedia.org/wiki/Romanization), marriage, or gender change.


Thanks a lot for the detailed response! I was indeed not aware of the scope of the problem and the fact that it has its own Wikipedia page.

Also, many thanks for making me aware of ORCID (together with "DamonHD"). After doing some digging I found ORCID API endpoints that can resolve DOIs and PMIDs to ORCID profiles, which is great.

I will take a closer look at the "disambiguation" project you linked to and will see what approach I can take for reliably resolving email addresses to first names (while filtering out non-personal email addresses).
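For the email part, a minimal sketch of what such a filter could look like, assuming that role-style local parts (info@, office@, ...) mark non-personal addresses. The `ROLE_LOCAL_PARTS` set and both function names are made up for illustration, and the first-name guess only fires on the simple "first.last" pattern, so treat its output as a weak hint rather than a fact:

```python
import re
from typing import Optional

# Hypothetical, deliberately incomplete list of role-style local parts.
ROLE_LOCAL_PARTS = {
    "info", "admin", "office", "contact", "editor", "support", "noreply",
    "no-reply", "secretary", "lab", "dept", "department", "library",
    "help", "mail", "enquiries", "inquiries", "webmaster",
}


def _local_part(email: str) -> str:
    """Return the lowercased part before '@', without any '+tag' suffix."""
    local, sep, domain = email.strip().lower().partition("@")
    if not sep or not local or not domain:
        return ""
    return local.split("+", 1)[0]


def _tokens(local: str) -> list:
    """Split a local part on the usual separators (dot, underscore, hyphen)."""
    return [t for t in re.split(r"[._\-]+", local) if t]


def is_probably_personal(email: str) -> bool:
    """Heuristic: reject malformed addresses and role-style mailboxes."""
    local = _local_part(email)
    if not local or local in ROLE_LOCAL_PARTS:
        return False
    tokens = _tokens(local)
    return bool(tokens) and not any(t in ROLE_LOCAL_PARTS for t in tokens)


def guess_first_name(email: str, last_name: str) -> Optional[str]:
    """Guess a first name only when the local part looks like first.last."""
    if not is_probably_personal(email):
        return None
    tokens = _tokens(_local_part(email))
    last = last_name.strip().lower()
    if last not in tokens:
        return None
    # Ignore the surname itself, bare initials, and tokens containing digits.
    rest = [t for t in tokens if t != last and t.isalpha() and len(t) > 1]
    return rest[0].capitalize() if len(rest) == 1 else None
```

Anything that comes back as `None` would still need another signal, and since one address can be shared by several people, a successful guess still needs checking.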

That being said, I fear that resolving non-ORCID + non-email authors using the SerpApi + LLM approach I described in my initial post is still the best shot I currently have.

Even ORCID is not foolproof; there have been cases where it misattributes papers. For a proper job you'd need to look at the actual journal article text and cross-reference the possible authors' CVs, if easily available. It's not something that can be seamlessly automated.

Interesting, thanks for the heads-up. I will implement the following safeguard:

* Known: Paper DOI, paper PMID, author last name, author initials

* Get connected ORCIDs based on the DOI / PMIDs

* Check if the known last name matches the last name of the ORCID profile (also include the "Also known as" section of the profile)

This may lead to some false negatives (for example, name changes that were not properly recorded), but if I can reduce the number of manual lookups to below 100, it's already a win.
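The name check in the last step could be sketched roughly like this. It assumes the ORCID profile has already been fetched and reduced to a family name plus a list of "Also known as" strings; the function names and that input shape are my own assumptions, not the ORCID API. Names are compared after stripping diacritics and case, so "Müller" matches "Muller":

```python
import unicodedata


def normalize(name: str) -> str:
    """Casefold, strip diacritics, and keep letters only."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining marks are not alphabetic, so this filter drops them.
    return "".join(ch for ch in decomposed.casefold() if ch.isalpha())


def last_name_matches(known_last_name: str, family_name: str,
                      other_names: list) -> bool:
    """Check the known surname against the profile's family name and aliases."""
    target = normalize(known_last_name)
    if not target:
        return False
    if normalize(family_name) == target:
        return True
    for alias in other_names:
        # Aliases are free text ("Jane Kim", "Kim, Jane"), so compare single
        # tokens and every trailing run of tokens (for "van der Berg").
        tokens = alias.replace(",", " ").split()
        candidates = tokens + [" ".join(tokens[i:]) for i in range(len(tokens))]
        if any(normalize(c) == target for c in candidates):
            return True
    return False
```

A profile that fails this check would go to the manual pile rather than being rejected outright, which keeps the unrecorded name change case recoverable.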