by GuiA 2740 days ago
Indeed. For commercially useful applications, collecting the data, labeling it, etc., costs orders of magnitude more than a team of PhDs building the models.

> For commercially useful applications, collecting the data, labeling it, etc., costs orders of magnitude more than a team of PhDs building the models.

I don't think that's typical. For example, JFT has 350e6 images, and it probably cost ~$35M to hand-label, but Google has paid people far more than that to work on image classification.
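
For anyone checking the arithmetic, here is a back-of-envelope sketch. The flat $0.10 per image is an assumption (it is simply the rate implied by ~$35M over 350e6 images), not a published figure, and the function name is made up for illustration.

```python
# Back-of-envelope hand-labeling cost.
# ASSUMPTION: a flat per-image rate of $0.10, which is just the rate implied
# by the ~$35M / 350e6 images figures above; it is not a published number.
NUM_IMAGES = 350_000_000      # 350e6 images, as stated in the comment
COST_PER_IMAGE_USD = 0.10     # assumed flat rate per hand-labeled image


def labeling_cost_usd(num_images: int, cost_per_image: float) -> float:
    """Total cost if every image is hand-labeled once at a flat rate."""
    return num_images * cost_per_image


total = labeling_cost_usd(NUM_IMAGES, COST_PER_IMAGE_USD)
print(f"~${total / 1e6:.0f}M")
```

Changing the assumed rate scales the total linearly, so at $0.01 per image the same dataset would come to roughly a tenth of that.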

Google doesn’t even have to pay people. Anyone who has picked out cars or fire hydrants in a reCAPTCHA is helping label their dataset.
JFT has 17K classes. I'm assuming that they used specialized experts to tell them apart (dog breeds, plant and animal species, etc.)
Thanks.

From Google:

>Of course, the elephant in the room is where can we obtain a dataset that is 300x larger than ImageNet? At Google, we have been continuously working on building such datasets automatically to improve computer vision algorithms. Specifically, we have built an internal dataset of 300M images that are labeled with 18291 categories, which we call JFT-300M. The images are labeled using an algorithm that uses complex mixture of raw web signals, connections between web-pages and user feedback. This results in over one billion labels for the 300M images (a single image can have multiple labels). Of the billion image labels, approximately 375M are selected via an algorithm that aims to maximize label precision of selected images. However, there is still considerable noise in the labels: approximately 20% of the labels for selected images are noisy. Since there is no exhaustive annotation, we have no way to estimate the recall of the labels.

https://ai.googleblog.com/2017/07/revisiting-unreasonable-ef...

That doesn't sound like reCAPTCHA. More likely, they label a picture as "Golden Retriever" when N people (or n% of searchers) click it after searching for "Golden Retriever" in image search; that would be the "raw web signal".
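
To make that guess concrete, here is a minimal sketch of click-threshold labeling. This is purely speculative and not Google's actual pipeline: the log format, the function name `labels_from_clicks`, and both thresholds are hypothetical.

```python
from collections import Counter

# Speculative sketch of click-based labeling; NOT Google's actual method.
# ASSUMPTIONS: the click log is a list of (query, image_id) pairs, one per
# click, and an image gets the query as a label once it passes both an
# absolute click threshold (N) and a share-of-clicks threshold (n%).


def labels_from_clicks(click_log, min_clicks=3, min_share=0.5):
    """Return {image_id: set of queries} for images clicked often enough."""
    clicks = Counter(click_log)  # (query, image_id) -> number of clicks

    # Total clicks per query, used to compute each image's share.
    per_query = Counter()
    for (query, _image), n in clicks.items():
        per_query[query] += n

    labels = {}
    for (query, image), n in clicks.items():
        if n >= min_clicks and n / per_query[query] >= min_share:
            # A single image can collect labels from several queries.
            labels.setdefault(image, set()).add(query)
    return labels


# Made-up click log for illustration only.
log = (
    [("golden retriever", "img1")] * 5
    + [("golden retriever", "img2")] * 1
    + [("cat", "img1")] * 1
    + [("cat", "img3")] * 4
)
labels = labels_from_clicks(log)
print(labels)
```

Setting `min_clicks=0` or `min_share=0.0` turns off that threshold, which covers the "N or n%" alternatives. Raising either one trades recall for precision, which is the kind of trade-off the quoted post describes when it selects a high-precision subset of labels.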