by jncfhnb
888 days ago
> Maybe I think the model author should have used a deeper CNN layer in the middle of the model. Without the inputs I can’t do a comparison. You can fine tune into a different model architecture. You’re right on not being able to retrain the model from scratch on half its data without that data but that’s likely pointless.
> likely pointless
It doesn’t take much creativity to come up with reasons why someone might want to do that:
- researchers who want to investigate how far the dataset (and thus the training cost) can be reduced, and what the accuracy penalty is
- someone who wants, for religious or ethical reasons, to minimize the probability that the model was trained on pornography
- someone who’s curious about whether there’s significant redundancy in the existing input datasets
- someone who’s curious about whether a much smaller subset of images in the input dataset can help the first few CNN input layers converge quickly, before the middle and output layers are trained on the larger dataset
Edit: I suspect the real reason they don’t want to share the input dataset is simply that a high-quality annotated dataset is a valuable commodity. While I don’t do ML work myself day to day, I do work with a team that does, in a very niche field, and I can only imagine how much effort they had to go through to get the annotated dataset that they’ve put together. Even just collecting the images for it involved many hours of drone flights in different locales around North America in varying weather and lighting.