by refibrillator 656 days ago
Hmm, there may be a bug in the authors’ Python script that searches Google Scholar for the phrases "as of my last knowledge update" or "I don't have access to real-time data". You can see the code in appendix B.

The bug happens if the ‘bib’ key doesn’t exist in the API response. That leads to the urls array having more rows than the paper_data array, so the columns could become mismatched in the final data frame. It seems they made a third array called flag, which could be used to detect and remove the bad results, but it’s not used anywhere in the posted code.
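
To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is not the authors' script: the response shape, the function names `collect_parallel` and `collect_records`, and the `missing_bib` field are all made up for illustration. It only shows how two independently appended lists can drift apart, and how building one record per result avoids that.

```python
# Hypothetical response shape (an assumption, not the real API output):
# every result has a "url", but the "bib" entry may be absent.
results = [
    {"url": "https://example.org/a", "bib": {"title": "Paper A"}},
    {"url": "https://example.org/b"},  # no "bib" key
    {"url": "https://example.org/c", "bib": {"title": "Paper C"}},
]


def collect_parallel(results):
    """Buggy pattern: two lists that are appended to independently."""
    urls, paper_data = [], []
    for r in results:
        urls.append(r["url"])
        if "bib" in r:  # skipped results shorten paper_data only
            paper_data.append(r["bib"]["title"])
    return urls, paper_data


def collect_records(results):
    """Safer pattern: one record per result, with an explicit flag."""
    records = []
    for r in results:
        bib = r.get("bib")
        records.append({
            "url": r["url"],
            "title": bib["title"] if bib else None,
            "missing_bib": bib is None,
        })
    return records


urls, paper_data = collect_parallel(results)
# Pairing by position puts "Paper C" next to the URL of result b.
misaligned = list(zip(urls, paper_data))

records = collect_records(results)
# The flag is actually used here to drop incomplete rows.
clean = [r for r in records if not r["missing_bib"]]
```

With one dict per result, a missing field can only produce an empty cell in its own row; it cannot shift the values of later rows.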

It's not clear to me how this would affect their analysis; it does seem like something they would catch when manually reviewing the papers. But perhaps the bibliographic data wasn’t reviewed and was only used to calculate the summary stats etc.

4 comments

That sounds important enough to contact the authors. Best case, they fixed it up manually; worst case, lots of papers are publicly accused of being made up and the whole farming/fish-focused summary they produced is completely wrong.

Hi there! My name is Kristofer, one of the authors of this research note. I also wrote the script. We were notified via email about this comment. Please see below for our response. Thank you for your interest in our research! (I'm removing the sender's name to respect their privacy)

""" Dear XXXX,

My name is Kristofer, I’m one of the co-authors of the GPT paper. I also wrote the script for the data collection. Jutta forwarded your email regarding the possible bug.

First of all, let me apologise for the late response. Apparently your email made its way to the spam folder, which of course is regrettable. I would also like to thank you for reaching out to us. We are pleased to see the interest of the HN community in transparent and reliable research.

We looked at the comment and the concern around the bug. We’d like to point out that the original commenter was right in saying “it does seem like something they would catch when manually reviewing the papers”. We in fact reviewed the output manually and carefully for any potential errors. In other words, we opened each paper and searched for the query string manually, which also helped determine whether the use of LLMs was declared in some form or other. This is of course a sensitive topic and we took great care to be thorough.

Nevertheless, we once more did a manual review of the code and the data, in light of this potential bug, and we’re glad to say no row-column mismatch is present. You can find the data here: https://doi.org/10.7910/DVN/WUVD8X

Please don’t hesitate to reach out if you have any more questions.

All the best, Kristofer """

As a tangent to the paper topic itself, what should be the standard procedure for publishing data gathering code like this? They don't specify which versions of the libraries or APIs were used, and since updates occur over time, APIs change, etc., the result is inevitably code rot. It will eventually be impossible to figure out exactly what this code did.

With meticulous version records it should at least be possible to ascertain what the code did by reconstructing that exact version (assuming stored back versions exist).
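
A rough sketch of what such a version record could look like, using only the Python standard library. The function name `snapshot_environment` and the package names passed to it are placeholders, not anything taken from the paper; the idea is just to write the interpreter, platform, and installed package versions next to the collected data.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment(packages):
    """Return interpreter, platform, and package versions as a dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": versions,
    }


# Package names below are placeholders; list whatever the script imports.
snapshot = snapshot_environment(["pandas", "requests"])
print(json.dumps(snapshot, indent=2))
```

This does not capture the version of a remote API, which usually has to be noted by hand along with the date of the run.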

In my opinion, archive the data that was actually gathered and the code's intermediate & final outputs. Write the code clearly enough that what it did can be understood by reading it alone, since with pervasive software churn it won't be runnable as-is forever. As a bonus, this approach works even when some steps are manual processes.
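
One way this could look in practice, as a sketch only: each step writes its output to a file and appends a checksum line to a manifest, so the archived data can be verified later even if the code no longer runs. The helper `archive_step`, the step names, and the `manifest.jsonl` file name are invented for this example.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def archive_step(out_dir, step_name, payload):
    """Write one step's output as JSON and log its SHA-256 in a manifest."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    body = json.dumps(payload, indent=2, sort_keys=True).encode("utf-8")
    data_path = out_dir / f"{step_name}.json"
    data_path.write_bytes(body)
    entry = {
        "step": step_name,
        "file": data_path.name,
        "sha256": hashlib.sha256(body).hexdigest(),
        "written_at": datetime.now(timezone.utc).isoformat(),
    }
    # Append-only manifest: one JSON object per line.
    with open(out_dir / "manifest.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


# Example run in a temporary directory with made-up data.
archive_dir = Path(tempfile.mkdtemp())
raw = [{"url": "https://example.org/a", "bib": {"title": "Paper A"}}]
entry_raw = archive_step(archive_dir, "01_raw_results", raw)
entry_final = archive_step(archive_dir, "02_final_table", [{"title": "Paper A"}])
```

Manual steps can be recorded the same way, by archiving the file as it looked before and after the manual edit.
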
Using a Colab notebook with printed outputs could be a good option; at the very least, it gives a hint for reproducing the results independently.