Hacker News new | ask | show | jobs
by flyingcircus3 890 days ago
I'm genuinely confused at how you made these assumptions about what I'm describing. Because the "more correct" design you contrast with the strawman you've concluded I'm describing is actually what I'm talking about, if perhaps imprecisely. A pretrained model like mobilenetV2, with its final layer removed, and custom models trained on bird and house images, which take this mobilenetv2[:-1] output as input. MobilenetV2 is 2ish megabytes at 224x224, and these final bird and house layers will be kilobytes. Having two multiple-megabyte models that are 95% identical is a giant waste of our embedded target's resources. It also means that a scheme that processed a single image with two full models (instead of one big, two small) would spend 95% of the second full model's processing time redundantly performing the same operations on the same data. Breaking up the models across two stages produces substantial savings of both processing time and flash storage, with a single big model as the "feature detection" first stage of both overall inferences, with small specialized models as a second stage.
1 comments

Sorry to upset you. It was not clear from your description that this was the process you were referring to. Others will read what you wrote and likely misunderstand as I did. (Which was my concern because I've seen the "mixture of idiots" architecture attempted since 2015. Even now... Its a common misconception and an argument every ml practitioner has at one point or another with a higher up.)

As for your ammendment, it is good to reduce compute when you can, and reduce up front effort for model creation when you can. Reusing models may be valid, but even in your ammended process you will still end up not reaching the peak performance of a single end to end model trained on the right data. Composite models are simply worse, even when transfer learning is done correctly.

As for the compute cost, if you train an end to end model and then minify it to the same size as the sum of your composite models it will have identical inference cost, but higher peak accuracy.

You could even do that with the "Shared Backbone" architecture, as youve described where two tailnetworks share a head network. It has been attempted thoroughly in the Deep Reinforcement Learning subdomain I am most familiar, and result in unnecessary performance loss. So it's not generally done anymore.

Man, everyone at work is going to be really bummed when I tell them that some guy on the internet has invalidated our empirical evidence of acceptable accuracy and performance with assumptions and appeals to authority.
I did not say it would not work, nor that it couldnt be acceptable performance for a given task.

Just that its peak performance is lower than an end to end model, and that if youre going to encourage model kit-bashing be clear how you communicate it, so people dont make human centipede architectures and wonder why feces is what comes out the end.

I was a very polite enough "some guy on the internet". Thank you.