Hacker News new | ask | show | jobs
by a2128 531 days ago
Sadly companies will hoard datasets and model research in the name of competitive advantage. Obviously with this specific model Microsoft chose to make it open, but this is not always the case, and it's not uncommon to read papers or technical reports saying they trained on an "internal dataset"
1 comments

Companies do have a lot of data, and some of that data might be useful for training AI. but >99% isn't. When companies do release a cool model or paper that doesn't have open data, (as you point out for competitive or other reasons privacy etc) people can then help build/collect similar open datasets. Unfortunately companies generally don't owe you their data, and if they are in the business of making models they probably won't share the model either, the situation is similar to source code for proprietary LoB applications. but fortunately the best AI researchers mostly do like to share their knowledge and because companies want to attract the best AI researchers they seem to generally allow researchers to publish if its not too commercially sensitive. It could be worse while the competitive situation has reduced some visibility of the cutting edge science, lots of datasets and papers are still published.