Hacker News
by lrei 3468 days ago
> about machine learning: how hard it is to actually trust results

I find the opposite true: code is easy to replicate and the datasets for algorithm comparison are open (e.g. IMDb, used in the PV paper). If you show very good results (especially with a simple approach such as PVs), people will immediately implement your algorithm, and if their results don't match your published results, it will be known. PS: I implemented PVs shortly after the paper was published - though I don't care so much about the 1-3% or whatever accuracy discrepancy on the IMDb dataset, the idea is great.

> Graduate students almost never write tests for their code

1) I doubt a standard software test would've helped here (cross-validation probably would've caught it); 2) Who writes tests for experiment code? 3) The graduate student story is concerning: either a) someone is doing a lot of the heavy lifting for the paper w/o being credited or b) this someone doesn't exist
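
For readers unfamiliar with the cross-validation mentioned in point 1, here is a minimal sketch in plain Python. It is not the PV authors' evaluation code. The names `k_fold_indices`, `cross_validate`, and the `train_and_score` callback are hypothetical, and the toy labels and majority-class baseline exist only to make the example runnable.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_samples, k, train_and_score, seed=0):
    """Hold out each fold once; return mean and standard deviation of the scores.

    train_and_score(train_idx, test_idx) is a hypothetical callback that
    fits a model on train_idx and returns its accuracy on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return statistics.mean(scores), statistics.stdev(scores)

# Toy data (an assumption for illustration): one third of the labels are positive.
labels = [i % 3 == 0 for i in range(300)]

def majority_baseline(train_idx, test_idx):
    """Predict the majority training label for every test sample."""
    majority = sum(labels[j] for j in train_idx) * 2 >= len(train_idx)
    return sum(labels[j] == majority for j in test_idx) / len(test_idx)

mean_acc, std_acc = cross_validate(len(labels), 5, majority_baseline)
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

The spread across folds gives a rough sense of how much a score moves with the split alone, which is useful context when judging a discrepancy of a few percent.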

Code is only easy to replicate when they give you or publish the code. Many ML papers do not come with code. In the words of the second author, a 3% accuracy difference on this particular dataset is a "huge difference."

In fact, dismissing a 3% difference is actually reflective again of how delicate understanding ML results is. A jump from 90% to 93% accuracy is massively different from a jump from 50% to 53%, or even from 80% to 83%: the first removes 30% of the remaining errors, while the other two remove only 6% and 15%.
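
To make the comparison of these three jumps concrete, here is a small sketch that converts each accuracy gain into the fraction of remaining errors it removes. The function name `relative_error_reduction` is made up for illustration.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the remaining errors removed by an accuracy gain.

    Both arguments are accuracies in percent (0-100).
    """
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before

for before, after in [(50, 53), (80, 83), (90, 93)]:
    reduction = relative_error_reduction(before, after)
    print(f"{before}% -> {after}%: {reduction:.0%} fewer errors")
# 50% -> 53%: 6% fewer errors
# 80% -> 83%: 15% fewer errors
# 90% -> 93%: 30% fewer errors
```

The same 3-point gain is worth five times as much in relative terms at 90% as it is at 50%.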

Almost nobody writes tests for experiment code. You're proving my point :)

> Code is only easy to replicate when they give you or publish the code

No. Graduate ML students can implement the papers they read w/o a reference implementation - just search GitHub. As I said, I implemented PV w/o the reference code. Many others did the same even before I did.

> dismissing a 3% difference is actually reflective again of how delicate understanding ML results

Not really. I understand ML results very well (otherwise I would be a pretty incompetent graduate student). But does a 3% increase on, say, IMDb translate to an increase on another text classification task? Possibly, but usually not. If it does translate well across text classification datasets, you will almost certainly see the different datasets and their results in the paper.

> Almost nobody writes tests for experiment code. You're proving my point :)

It's a good point, but in my experience the kinds of mistakes I've usually found in my own or others' experimental code could not be caught by a software test. They only become obvious when you analyze the results.