by Smerity
4226 days ago
The paper compares in detail against word2vec, but (spoiler alert) GloVe trained on 42 billion tokens from Common Crawl beats word2vec trained on 100 billion tokens from the Google News corpus! They don't actually use the 840 billion token model in the paper, as it was made with some parameters that didn't allow for direct comparison, but the code and the models are all released for anyone to use from their site. This is one of many great examples of open datasets like Common Crawl allowing talented people from academia and start-ups to compete with the large proprietary datasets of Google or Bing. (Disclaimer: I'm a data scientist at Common Crawl who does the crawling.)