Hacker News
by dlkf 2532 days ago
> No one will be allowed to read or download work from the repository, because that would breach publishers’ copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world’s scientific literature to pull out insights without actually reading the text.

I find this totally unconvincing. The average scientific article isn't any good, and the NLP algorithms that do tasks like this are even worse.

For scientific literature to be useful to those of us outside the academy, we need to be able to see the full document - what methodology was employed, what assumptions were made, how the data was gathered - just to be able to gauge whether the authors had any idea what they were doing. Ideally we would also be able to search over documents, explore the citation tree in some sort of UI, and access articles in multiple formats (sometimes you want a PDF, sometimes you want plaintext, and I imagine that the mathy-types might like LaTeX source).

I applaud Malamud's efforts to overcome this problem, and I hope his results prove me wrong. I just think it's sad that we have to resort to hacks like this to overcome what is obviously an enormous scam that is stealing our tax dollars and stifling academic and economic creativity.

3 comments

> The average scientific article isn't any good, and the NLP algorithms that do tasks like this are even worse.

In my day job, I'm often tasked with implementing algorithms from recently published physics papers. In order to do so, I normally have to read through at least 10 related papers (both cited papers and cited by papers) in order to have a clear idea in my head about what is going on. Even then, I often have to discuss what I have read with 2 or 3 other people and try out many different approaches before we finally understand what the paper meant.

This is because few papers include enough detail to reproduce what they are doing. Many of these papers have mistakes or typos in their equations. Many of the equations are also underspecified. For example, an equation may list a sum over an index, but then in the text there may be a whole paragraph that describes what that index means (there is nothing wrong with this, but it would make it hard for AI to "just use the equations").

At the end of this whole process, which can take a week, about half of the time we choose not to implement the algorithm for one of several reasons:

* The authors misrepresented their work and it does not perform as well as claimed (e.g. the chosen examples are special cases that make their approach look better than the state-of-the-art approach)

* The work is not reproducible from the information in the paper

* The amount of work to implement it is far greater than an initial reading of the paper would suggest, due to additional details that were left out of their discussion

So, all of this is to conclude that understanding scientific papers is at the very limit of human ability for a group of PhDs in the field. I do not think AI has any hope of making sense of this mess until it is much more powerful.

edit: P.S. I am guilty of these same mistakes when publishing. I understand the deadlines and pressure to publish that lead to these issues. It is a huge amount of work to fully document and publish all the details needed to reproduce some new algorithm.

This sums up my experience pretty well. I used to work in a field of physics where a lot of projects were not open-sourced, which is probably your experience too. Now, I work in computer vision, where there is a lot of open sourced work and I _still_ find myself coming up against most of these problems when dealing with implementations of cutting edge results.
> In my day job, I'm often tasked with implementing algorithms from recently published physics papers. In order to do so, I normally have to read through at least 10 related papers (both cited papers and cited by papers) in order to have a clear idea in my head about what is going on. Even then, I often have to discuss what I have read with 2 or 3 other people and try out many different approaches before we finally understand what the paper meant.

This post makes me feel slightly better about how often I come away from reading a paper feeling like I only have a shallow understanding of the content.

Yes. I worked with a company in India looking to mine biology papers for information on genes and automatically populate a database from that. Given the various ways in which people wrote (required to pass anti-plagiarism checks, as well as different writing styles), it turned out that any kind of automated annotation was rife with errors. Given that it was supposed to be used for developing drugs, they dropped it in favor of hiring master's students part time at $200/month and annotating manually.
Great anecdote. I'm increasingly of the opinion that "pay some humans to do it" is the most underrated data product engineering pattern out there. It's well known that FAANG invest heavily in human annotators (it's not a coincidence that Mechanical Turk was developed at Amazon) and it's unclear why anyone else building data products shouldn't have to.
Yeah, this is silly. Elsevier and the like should be burned down for crimes against humanity. I mean, it generally is illegal to price gouge during famines.
Elsevier is a useless profiteer, but let's please remember that scientists are voluntarily submitting their articles to Elsevier journals, and the funding agencies (and universities) are doing nothing to stop them. I find it a little frustrating to read the wailing from institutions like UC (where I used to work) about subscription prices when they are part of the problem in the first place.
Sure. It's the prisoners' dilemma.

But that doesn't exonerate the prisons.