by 2bitencryption 1640 days ago
Question -

I get that your run-of-the-mill paper saying "Here we present a novel algorithm for xyz" will usually have the algorithm defined in simple pseudo-code, maybe with an implementation in a "real" language as a proof of concept.

But for the many papers describing novel ML models, how does that work? They seem to use images that diagram out the different layers of the model. But is that truly "universal" the way that a pseudo-code algorithm is universal? As in, if the authors use PyTorch (or whatever), can I take the exact model they describe in their paper and apply it in MyFavoriteMLToolkit and achieve similar results?

I guess my question is, what are the "primitives" of papers describing ML models? Is saying "convolutional layer" enough, or do they also describe the dozens of hyper-parameters, etc?

6 comments

So porting between ML frameworks was my job for a while, and the short answer is yes: common layers can be quite simple to describe and reproduce in different frameworks. E.g. "Conv2D(2,3)" is enough info, in code or text, to describe a 2D convolution layer with 2 outputs and a 3x3 kernel.

The longer answer is that the rest of the Conv2D configuration is easily overlooked unless it was changed from the defaults. Those defaults can differ across frameworks and potentially break things, if the options even exist in your preferred framework. You can always create custom layers though, if needed.
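To make the "defaults can differ" point concrete, here is a minimal NumPy sketch of one commonly cited case: with stride 2, TensorFlow-style "SAME" padding puts the extra zero on the right, while PyTorch-style `padding=1` pads both sides equally. The `conv1d` helper is made up for illustration and is not either framework's API; it only mimics the two padding conventions in 1D.

```python
import numpy as np

def conv1d(x, w, stride, pad_left, pad_right):
    """Cross-correlation (what DL frameworks call convolution) with explicit zero padding.

    Hypothetical helper for illustration only.
    """
    xp = np.pad(x, (pad_left, pad_right))
    n_out = (len(xp) - len(w)) // stride + 1
    return np.array(
        [np.dot(xp[i * stride : i * stride + len(w)], w) for i in range(n_out)]
    )

x = np.arange(1.0, 7.0)          # input [1, 2, 3, 4, 5, 6]
w = np.array([1.0, 1.0, 1.0])    # kernel of size 3, all ones

# Symmetric padding: one zero on each side (PyTorch-style padding=1).
symmetric = conv1d(x, w, stride=2, pad_left=1, pad_right=1)

# "SAME"-style padding for this input: total padding of 1, placed on the right.
same_style = conv1d(x, w, stride=2, pad_left=0, pad_right=1)

print(symmetric, same_style)
```

Both outputs have length 3, so the model still runs after porting; the values just differ, which is exactly the kind of thing that goes unnoticed.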

But many papers also seem to do a bad job describing the actual structure of their own ML network. The descriptions can be vague, confusing, or simply inaccurate. That can be because the model is a general concept with flexible details, or because the authors struggle to put their model into clear words and diagrams. Or simply because they know the code is going to do the lifting.

It's a good question which might yield a very complex answer depending on how far down the rabbit hole of reproducible science/computation/machine learning you're willing to go.

To keep things simple, I'd say the true "primitives" of ML models can be reduced to mathematical formulas. For example, a plain old feed-forward network is implemented as matrix multiplication. Sprinkle in a bit of calculus to analytically derive the formula for back-propagating errors (aka training), and you have the basic building blocks of modern deep learning. Convolutions, Transformers, etc. are just slightly fancier spins on the same mathematical foundations.
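As a rough sketch of that claim (assuming a single linear layer with a squared-error loss, and with all variable names made up for the example): the forward pass is one matrix multiplication, the gradient follows from calculus, and a finite-difference check confirms the two agree.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 samples, 3 input features
y = rng.normal(size=(4, 2))   # 2 target values per sample
W = rng.normal(size=(3, 2))   # weights of a single linear layer

def loss(W):
    # Forward pass is just a matrix multiplication, followed by squared error.
    return 0.5 * np.sum((X @ W - y) ** 2)

# Gradient derived by hand: dL/dW = X^T (X W - y).
grad_analytic = X.T @ (X @ W - y)

# Numerical check with central finite differences, one weight at a time.
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        grad_numeric[i, j] = (loss(W + dW) - loss(W - dW)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))
```

Any framework that implements the same formula should produce the same gradient up to floating-point error, which is what makes the math the portable part.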

Hyper-parameters are essentially tunable variables in a formula. I'd say your instinct is spot on - they are absolutely necessary to capture for reproducible results.

If you have the code and the data the answer should be yes. You should be able to take that PyTorch code and translate it to MyFavoriteMLToolkit to obtain numerically identical results.

In practice, we face the same universal difficulties as other computer science based research: fighting inconsistencies in software, hardware, all the way down to the physics of the universe with cosmic ray induced bit flips, etc.
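One small, concrete source of such inconsistencies: floating-point addition is not associative, so two implementations that sum the same numbers in a different order can disagree in the last bits. A minimal sketch in plain Python:

```python
import math

# Same three numbers, two different summation orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The results are close, but not bit-for-bit identical.
print(left_to_right == right_to_left)
print(math.isclose(left_to_right, right_to_left))
```

Over millions of accumulations in a large reduction, differences like this are one reason "numerically identical" often ends up meaning "equal within a tolerance".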

> But for the many papers describing novel ML models, how does that work? They seem to use images that diagram out the different layers of the model. But is that truly "universal" the way that a pseudo-code algorithm is universal? As in, if the authors use PyTorch (or whatever), can I take the exact model they describe in their paper and apply it in MyFavoriteMLToolkit and achieve similar results?

Generally, yes.

If they are standard, well-known layers that exist in both PyTorch and TF you can take a paper that was implemented in one and implement in the other and expect similar results (assuming you know a reasonable number of details[1]).

If they are non-standard layers it can be hard. There are lots of details that you need to port and even with access to the source code it can be easy to miss things.

[1] Here's an example of how things are implemented differently - you can still get the same result, but you need to know what you are doing: https://stackoverflow.com/questions/60079783/difference-betw...
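Another detail of the "you need to know what you are doing" kind, sketched with NumPy only: PyTorch stores Conv2d weights as (out_channels, in_channels, kH, kW), while Keras Conv2D kernels are (kH, kW, in_channels, out_channels). Porting weights therefore needs an axis permutation. The array names and the toy shape below (2 filters, 1 input channel, 3x3 kernel) are made up for illustration.

```python
import numpy as np

# Toy weights in PyTorch layout: (out_channels, in_channels, kH, kW).
w_torch = np.arange(2 * 1 * 3 * 3, dtype=np.float32).reshape(2, 1, 3, 3)

# Same weights in Keras layout: (kH, kW, in_channels, out_channels).
w_keras = np.transpose(w_torch, (2, 3, 1, 0))

# And back again, to confirm nothing was scrambled.
w_back = np.transpose(w_keras, (3, 2, 0, 1))

print(w_torch.shape, w_keras.shape)
```

A plain reshape to the target shape would also "work" without raising an error, but it would silently scramble the kernels, which is why the explicit transpose matters.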

In my experience there are many less significant hyperparameters that can impact performance when going from the released code to your personal favorite framework.

Nothing you can't figure out by reading source code of the two frameworks or by reading the documentation closely.
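A small illustration of such a hyperparameter, as a NumPy sketch rather than either framework's code: batch normalization adds a small epsilon to the variance, and to my knowledge the documented defaults differ (1e-5 for PyTorch's BatchNorm layers, 1e-3 for Keras's BatchNormalization; check the docs for your versions). The helper and the numbers below are made up to show the effect when the variance is small.

```python
import numpy as np

def batch_norm_inference(x, mean, var, eps):
    """Inference-time batch norm without scale/shift (hypothetical helper)."""
    return (x - mean) / np.sqrt(var + eps)

x = np.array([0.01, 0.02, 0.03])
mean, var = 0.02, 1e-4   # deliberately small variance so eps matters

out_eps_1e5 = batch_norm_inference(x, mean, var, eps=1e-5)
out_eps_1e3 = batch_norm_inference(x, mean, var, eps=1e-3)

print(out_eps_1e5)
print(out_eps_1e3)
```

With the same weights and statistics, the two settings give visibly different activations here, so copying the layer name without the epsilon is not quite a faithful port.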

Generally, people don't seem to care about reproducing exact metrics - as long as it is close enough they're happy. You need to dig a bit deeper if you want the full quality.

>But is that truly "universal" the way that a pseudo-code algorithm is universal?

My experience has been that pseudo-code is anything but universal.

In fact, having had to implement actual working code from research papers' pseudo-code many times, I would posit that pseudo-code is nothing but a license for academics to provide stuff that simply doesn't work to the reader and get away with it. Thanks to pseudo-code, they get to gently skip over the hard bits to get the paper out the door as quickly as possible.

Papers with actual, git-clonable, working code, should be the standard for CS academic publishing.

It depends. Usually a paper doesn't have enough room to mention all of the possible choices in preprocessing, architecture, optimiser, etc. You can usually get pretty close with details just in the paper, but it's not always possible.

That's why a large number of journals now have requirements for publishing code and/or pretrained models (if applicable).

An annoying trend I've noticed: a number of SotA ML papers in video classification present multiple models and only publish the exact architecture & weights for the smaller models, which are only as good as SotA (see Tiny Video Networks and X3D for examples).