Hacker News
by visarga 3270 days ago
> they have access, and can purchase, the largest and best data sets available

Google might have an advantage in personal data that can be used for advertising and health, but general data, such as image and NLP datasets, can be found in the public domain and is growing fast. Google's advantage in datasets is specific and limited, mostly to ads.

The largest, most interesting recent public datasets in image and NLP were released by Google.

For example, here are some of their recent NLP datasets: https://github.com/google-research-datasets

In images, OpenImages is theirs, and there are assorted ones derived from YouTube.

Stanford's SNLI is the most recent non-Google NLP dataset that is getting used a lot. bAbI (from FB) too, if you count that as NLP.

The best data set will in general only be as good as the raw data that was used to prepare it.

I think you underestimate just how far along Google is with respect to the huge amounts of raw data they handle. They've been around for 20 years now and have amassed a lot of expertise handling every kind of data imaginable at scale.

If you disagree, who would you say is ahead of Google wrt general data sets that are valuable?

That's true, but Google can also afford to acquire and monopolize data that other companies are sitting on but don't have the resources or talent to utilize internally.