Hacker News
by nl 4226 days ago
There is great big-data-driven research coming out of Stanford using Common Crawl. For example, see http://www-nlp.stanford.edu/projects/glove/ . They successfully train on an 840 billion token corpus.

I haven't seen this paper before (thanks!!). How different is it to Word2Vec?

Clearly the pre-trained vectors at that scale (and much bigger than the ones released with Word2Vec) are new and very exciting.


The paper compares in detail against word2vec, but (spoiler alert) GloVe using 42 billion tokens from Common Crawl beats word2vec using 100 billion tokens from the Google News corpus!

They don't actually use the 840 billion token model in the paper as it was made with some parameters that didn't allow for direct comparison, but the code and the models are all released for anyone to use from their site.
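
For anyone who wants to poke at the released vectors, here is a minimal sketch of reading them. It assumes the plain-text layout of one token per line followed by space-separated floats; the helper names (`load_vectors`, `cosine`) and the three sample rows are made up for illustration, not taken from the GloVe code or data.

```python
import io
import math


def load_vectors(handle):
    """Parse 'word v1 v2 ... vn' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented 3-dimensional rows standing in for a real vector file.
SAMPLE = io.StringIO(
    "king 0.9 0.1 0.4\n"
    "queen 0.8 0.2 0.5\n"
    "banana -0.3 0.9 -0.1\n"
)

vectors = load_vectors(SAMPLE)
print(cosine(vectors["king"], vectors["queen"]))
print(cosine(vectors["king"], vectors["banana"]))
```

With a real download you would pass an open file handle instead of `SAMPLE`; the larger files are big enough that you may want to load only the words you need.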

This is one of many great examples of open datasets like Common Crawl allowing talented people from academia and start-ups to compete with the large proprietary datasets of Google or Bing.

(disclaimer: data scientist at Common Crawl who does the crawling)

Good link, thanks for pointing it out. Re: not clear cut, that's always the case to varying degrees :) To quote the author's in-document response to the Google Doc you just linked to:

"Update by Richard Socher (Nov 2014): This document is outdated and its concerns have been addressed in the final version of the GloVe paper. Glove gets better performance on the same training data when actually run to convergence. See last section of Glove paper for details."

This is a good example of peer review in academia beyond just the paper review committee -- other researchers point out concerns or issues with methodology and they're addressed by the authors or other contributors. It's also great that the initial concerns could be properly tested thanks to the open source nature of both projects.

I will admit I didn't discuss the intricacies of the evaluation in my few paragraphs above; I was primarily speaking to the broader point that open data is helping academia compete with the goliaths of industrial research! =]

Interesting.

As I said in my other comment, one of the strengths of Word2Vec is how robust it is against various metrics.

While it looks like GloVe's advantages over Word2Vec may not be as large as initially claimed, it is mostly as robust (which is good). However, the jump in Word+Context over just Word vectors when evaluated on semantic relations is interesting.
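
For readers unsure what "Word+Context" means here: as I understand it, GloVe learns two vectors per token, one as a target word and one as a context word, and the combined representation is their element-wise sum. A minimal sketch, with a hypothetical `combine` helper and invented two-dimensional values:

```python
def combine(word_vecs, context_vecs):
    """Return word + context vectors, element-wise, for tokens present in both."""
    combined = {}
    for word, w in word_vecs.items():
        c = context_vecs.get(word)
        if c is None or len(c) != len(w):
            continue  # skip tokens missing a context vector or with mismatched size
        combined[word] = [wi + ci for wi, ci in zip(w, c)]
    return combined


# Invented values, chosen only to make the sums easy to check by hand.
word_vecs = {"ice": [0.25, -0.5], "steam": [1.0, 0.25]}
context_vecs = {"ice": [0.5, 1.0], "steam": [-0.5, 0.25]}

combined = combine(word_vecs, context_vecs)
print(combined["ice"])
print(combined["steam"])
```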

(To be clear: I'm very interested in being able to use the same system over diverse datasets, without having to tune it differently for each system - hence my interest in the robustness of the methodologies)

Edit: Were you and Smerity at Sydney Uni at the same time?

The paper compares in detail against word2vec, but (spoiler alert) GloVe using 42 billion tokens from Common Crawl beats word2vec using 100 billion tokens from the Google News corpus!

Damn!!

Background for those who don't follow this field: Word2Vec is an apparently miraculous demonstration and poster-child of the unreasonable effectiveness of big data. Beating it at all is impressive, assuming the performance is as robust as Word2Vec is against different metrics.

Beating it with only 42% of the tokens is wondrous.