by nl
4226 days ago
There is great, big-data-driven research coming out of Stanford using Common Crawl. For example, see http://www-nlp.stanford.edu/projects/glove/ . They successfully train on an 840-billion-token corpus. I haven't seen this paper before (thanks!!). How different is it from Word2Vec? Clearly the pre-trained vectors at that scale (much bigger than the ones released with Word2Vec) are new and very exciting.
They don't actually use the 840 billion token model in the paper, because it was trained with parameters that didn't allow for a direct comparison, but the code and the models are all released on their site for anyone to use.
This is one of many great examples of open datasets like Common Crawl allowing talented people from academia and start-ups to compete with those who have access to the large proprietary datasets of Google or Bing.
(disclaimer: data scientist at Common Crawl who does the crawling)
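For anyone who wants to poke at the released vectors, here is a minimal sketch of reading them, assuming the files are plain text with one token per line followed by space-separated floats (check the download to confirm). The function names and the tiny in-memory sample are made up for illustration and are not part of the GloVe code.

```python
import math


def parse_glove_lines(lines, dim):
    """Parse lines of the assumed form '<token> <float> ... <float>'."""
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) <= dim:
            continue  # skip blank or malformed lines
        # Take the last `dim` fields as the vector, so a token that
        # itself contains spaces is kept in one piece.
        token = " ".join(parts[:-dim])
        vectors[token] = [float(x) for x in parts[-dim:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical 3-dimensional sample standing in for a real vector file.
sample = [
    "king 0.9 0.1 0.0",
    "queen 0.8 0.2 0.1",
    "apple 0.0 0.1 0.9",
]
vecs = parse_glove_lines(sample, dim=3)
```

With a real download you would pass an open file handle instead of the sample list and set `dim` to whatever dimension that file uses.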