| You’re doing author name disambiguation (https://en.wikipedia.org/wiki/Author_name_disambiguation). That’s a difficult, messy problem. For example, there’s https://revstat.ine.pt/index.php/REVSTAT/article/view/382, with two authors who share the same first name and surname and work in the same institute and field. Try separating them without ORCID. Such problems aren’t rare if you have papers that only mention initials, and certainly not with Chinese or Korean names, since name clashes are a lot more common in those countries than in the ‘west’ (over 1 in 5 South Koreans have the surname 김 (Kim) and 1 in 7 have 이 (Lee), according to https://en.wikipedia.org/wiki/List_of_Korean_surnames. China is slightly less bad (https://en.wikipedia.org/wiki/List_of_common_Chinese_surname...), but compensates by having a much larger population). I can’t find it, but I remember seeing a paper with 4 or 5 authors named “Kim” with the same initials. That author name disambiguation Wikipedia page links to https://github.com/neozhangthe1/disambiguation. I don’t know how good it is, but you should consider it. And as others have said, you should use ORCID, if available. You should also use email addresses (often included in article metadata, at least for the corresponding author), but you can assume neither that every author has a single email address nor that a single email address belongs to a single person. Another case to worry about is that names can change, for example because of a different romanization (https://en.wikipedia.org/wiki/Romanization), marriage, or gender change. |
Also, many thanks for making me aware of ORCID (together with "DamonHD"). After doing some digging I found ORCID API endpoints that can resolve DOIs and PMIDs to ORCID profiles, which is great.
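A minimal sketch of how such a lookup URL could be built, assuming the ORCID public search endpoint at `https://pub.orcid.org/v3.0/search/` and the `doi-self` / `pmid-self` query fields. Both the endpoint version and the field names are assumptions here and should be checked against the current ORCID documentation; the helper name `orcid_search_url` is made up for illustration. The code only builds the URL and does not perform the request.

```python
from urllib.parse import urlencode

# Assumed base URL of the ORCID public search API; verify before relying on it.
ORCID_SEARCH_BASE = "https://pub.orcid.org/v3.0/search/"

# Assumed mapping from identifier type to ORCID search field name.
ID_FIELDS = {"doi": "doi-self", "pmid": "pmid-self"}


def orcid_search_url(id_type: str, value: str) -> str:
    """Build a search URL asking for ORCID records that list the given work.

    id_type is "doi" or "pmid" (hypothetical helper, not part of any library).
    """
    field = ID_FIELDS[id_type.lower()]
    # Quote the value so characters such as "/" in a DOI stay in one term.
    query = f'{field}:"{value.strip()}"'
    return ORCID_SEARCH_BASE + "?" + urlencode({"q": query})


if __name__ == "__main__":
    # Example DOI and PMID are placeholders, not real works.
    print(orcid_search_url("doi", "10.1000/xyz"))
    print(orcid_search_url("pmid", "12345678"))
```

The resulting URL would then be fetched with an `Accept: application/json` header; the matching ORCID iDs still need to be mapped to the right author position on the paper, which the search result alone does not tell you.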
I will take a closer look at the "disambiguation" project you linked to and will see what approach I can take for reliably resolving email addresses to first names (while filtering out non-personal email addresses).
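A rough sketch of the email filtering step, assuming a hand-picked list of role-style local parts (`ROLE_LOCAL_PARTS` below is illustrative, not exhaustive) and the common `first.last@` convention. The function names are hypothetical. It only yields candidate name tokens; it cannot tell which token is the first name, and many addresses follow no convention at all.

```python
import re

# Illustrative, incomplete list of local parts that usually denote a role
# or shared mailbox rather than a person.
ROLE_LOCAL_PARTS = {
    "info", "office", "editor", "editorial", "admin", "contact",
    "support", "secretary", "noreply", "webmaster", "enquiries",
}

_SEPARATORS = re.compile(r"[._\-]+")


def _base_local_part(email: str) -> str:
    """Lower-cased part before '@', with any '+tag' suffix removed."""
    local = email.strip().lower().split("@", 1)[0]
    return local.split("+", 1)[0]


def is_role_address(email: str) -> bool:
    """Heuristically flag non-personal addresses such as editor@ or no-reply@."""
    base = _base_local_part(email)
    tokens = [t for t in _SEPARATORS.split(base) if t]
    collapsed = _SEPARATORS.sub("", base)  # "no-reply" -> "noreply"
    return collapsed in ROLE_LOCAL_PARTS or any(t in ROLE_LOCAL_PARTS for t in tokens)


def name_tokens(email: str) -> list[str]:
    """Return candidate name tokens from a personal-looking address."""
    if is_role_address(email):
        return []
    tokens = _SEPARATORS.split(_base_local_part(email))
    # Strip digits ("doe42" -> "doe") and drop bare initials.
    cleaned = [re.sub(r"\d+", "", t) for t in tokens]
    return [t for t in cleaned if len(t) >= 2]


if __name__ == "__main__":
    for addr in ("jane.doe42@uni.example", "editor@journal.example", "j.kim@lab.example"):
        print(addr, is_role_address(addr), name_tokens(addr))
```

The tokens are best treated as weak evidence to compare against the author list of the paper, since one address can be shared and one author can have several.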
That being said, I fear that resolving non-ORCID + non-email authors using the SerpApi + LLM approach I described in my initial post is still the best shot I currently have.