by lrei 3516 days ago

> about machine learning: how hard it is to actually trust results

I find the opposite to be true: code is easy to replicate, and the datasets used for algorithm comparison are open (e.g. IMDB, used in the PV paper). If you show very good results, especially with a simple approach such as PVs, people will immediately implement your algorithm, and if their results don't match your published ones, it will be known. PS: I implemented PVs shortly after the paper was published. I don't care much about the 1-3% (or whatever) accuracy discrepancy on the IMDB dataset; the idea is great.

> Graduate students almost never write tests for their code

1) I doubt a standard software test would have helped here (cross-validation probably would have caught it). 2) Who writes tests for experiment code? 3) The graduate student story is concerning: either a) someone did a lot of the heavy lifting for the paper without being credited, or b) this someone doesn't exist.
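
For readers unfamiliar with the cross-validation check mentioned in point 1, here is a minimal sketch in plain Python. It is not the procedure from the PV paper. The helper names (`k_fold_indices`, `cross_validate`, `train_majority`, `accuracy`) and the toy majority-class "model" are made up for illustration. The idea is to shuffle once, hold out each fold in turn, and look at the spread of scores instead of a single number.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices once, then deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(data, labels, train_fn, eval_fn, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, return one score per fold."""
    scores = []
    for held_out in k_fold_indices(len(data), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(data)) if i not in held]
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(eval_fn(model,
                              [data[i] for i in held_out],
                              [labels[i] for i in held_out]))
    return scores


# Hypothetical stand-ins for a real model: always predict the majority class.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)


def accuracy(model, xs, ys):
    return sum(1 for y in ys if y == model) / len(ys)


labels = [1] * 70 + [0] * 30   # toy labels, sorted by class on purpose
data = list(range(100))        # toy features, unused by the majority model
scores = cross_validate(data, labels, train_majority, accuracy, k=5)
print("per-fold accuracy:", scores)
print("mean:", statistics.mean(scores), "stdev:", statistics.stdev(scores))
```

A large gap between folds, or between the cross-validated mean and a single reported test score, is the kind of signal that would prompt a closer look at how the data was split.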


argonaut 3516 days ago

Code is only easy to replicate when they give you or publish the code. This is not true of many ML papers. In the words of the second author, a 3% accuracy difference on this particular dataset is a "huge difference."

In fact, dismissing a 3% difference reflects, again, how delicate it is to interpret ML results. A jump from 90% to 93% accuracy is massively different from a jump from 50% to 53%, or even from 80% to 83%.
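
One way to make the comparison above concrete is to look at the share of remaining errors each jump removes. This is a small sketch, and the function name `relative_error_reduction` is chosen here for illustration. It treats accuracy as a fraction and assumes the error rate is simply one minus accuracy.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the remaining errors removed by an accuracy gain."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before


# The three jumps mentioned in the comment, each a 3-point absolute gain.
for before, after in [(0.50, 0.53), (0.80, 0.83), (0.90, 0.93)]:
    rer = relative_error_reduction(before, after)
    print(f"{before:.0%} -> {after:.0%}: {rer:.0%} of errors removed")
```

The same three-point gain removes about 6% of the errors at 50% accuracy, about 15% at 80%, and about 30% at 90%.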

Almost nobody writes tests for experiment code. You're proving my point :)

lrei 3516 days ago

> Code is only easy to replicate when they give you or publish the code

No. Graduate ML students can implement the papers they read without a reference implementation; just search GitHub. As I said, I implemented PV without the reference code. Many others did the same even before I did.

> dismissing a 3% difference is actually reflective again of how delicate understanding ML results

Not really. I understand ML results very well (otherwise I would be a pretty incompetent graduate student). But does a 3% increase on, say, IMDB translate to an increase on another text classification task? Possibly, but usually not. If it does translate well across text classification datasets, you will almost certainly see the different datasets and their results in the paper.

> Almost nobody writes tests for experiment code. You're proving my point :)

It's a good point, but in my experience the kinds of mistakes I've usually found in my own or others' experimental code could not be caught by a software test. They only become obvious when you analyze the results.