HN Mirror


by microtonal 4022 days ago

Mikolov et al. (2013) [1] do a proper evaluation of this. For example, they report that the skip-gram model reaches 50.0% accuracy on semantic analogy queries and 55.9% accuracy on syntactic queries.

word2vec comes with a data set that you can use to evaluate language models.
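
For anyone who wants to run that kind of evaluation themselves, here is a rough sketch of the scoring loop. It assumes the layout of the questions-words.txt file shipped with word2vec (section headers starting with ": ", then four words per line) and that the syntactic sections are the ones whose names start with "gram". The `predict` function and the sample lines are made up for illustration; plug in your own model.

```python
from collections import defaultdict


def parse_questions(lines):
    """Yield (section, a, b, c, expected) tuples from analogy-file lines."""
    section = None
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == ":":
            # Section header, e.g. ": capital-common-countries"
            section = parts[1]
        elif len(parts) == 4:
            yield (section, *parts)


def accuracy_by_kind(lines, predict):
    """Score predict(a, b, c) separately on semantic and syntactic questions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for section, a, b, c, expected in parse_questions(lines):
        # Assumption: syntactic sections have names starting with "gram".
        is_syntactic = section is not None and section.startswith("gram")
        kind = "syntactic" if is_syntactic else "semantic"
        total[kind] += 1
        correct[kind] += predict(a, b, c) == expected
    return {kind: correct[kind] / total[kind] for kind in total}


# Invented sample lines in the assumed layout, not copied from the real file.
sample = [
    ": capital-common-countries",
    "Athens Greece Berlin Germany",
    "Berlin Germany Paris France",
    ": gram3-comparative",
    "good better bad worse",
]


def toy_predict(a, b, c):
    """Hypothetical stand-in for a model: it only knows two capitals."""
    return {"Berlin": "Germany", "Paris": "France"}.get(c)


print(accuracy_by_kind(sample, toy_predict))
```

Reporting the two kinds separately mirrors the semantic/syntactic split in the numbers above; a single combined score would hide which kind of question a model is actually good at.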

[1] http://arxiv.org/pdf/1301.3781.pdf


rspeer 4022 days ago

I would insist on a better dataset before really calling these "semantic analogies" (and don't just take my word for it: Chris Manning complained about exactly this in his recent NAACL talk).

The only semantics that it tests are "can you flip a gendered word to the other gender", which is so embedded in language that it's nearly syntax; and "can you remember factoids from Wikipedia infoboxes", a problem that you could solve exactly using DBpedia. Every single semantic analogy in the dataset is one of those two types.

The syntactic analogies are quite solid, though.


microtonal 4021 days ago

> and "can you remember factoids from Wikipedia infoboxes",

That's a simplification. E.g. I have trained vectors on Wikipedia dumps without infoboxes, and queries such as Berlin - Deutschland + Frankreich work fine.
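
A minimal sketch of what such a query does, using invented four-dimensional toy vectors rather than trained ones: add and subtract the vectors, then pick the nearest remaining word by cosine similarity. The `analogy` helper and the vector values are assumptions for illustration, not word2vec's own code.

```python
import math

# Toy vectors invented for illustration; the dimensions loosely mean
# [German, French, Spanish, is-a-capital]. Real vectors are learned from
# text and have hundreds of dimensions with no such clean interpretation.
vectors = {
    "Berlin":      [1.0, 0.0, 0.0, 1.0],
    "Deutschland": [1.0, 0.0, 0.0, 0.0],
    "Paris":       [0.0, 1.0, 0.0, 1.0],
    "Frankreich":  [0.0, 1.0, 0.0, 0.0],
    "Madrid":      [0.0, 0.0, 1.0, 1.0],
    "Spanien":     [0.0, 0.0, 1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("Berlin", "Deutschland", "Frankreich", vectors))  # Paris
```

The query words themselves are excluded from the candidates, since the result vector often stays closest to one of its own inputs.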

Of course, even the remainder of Wikipedia is nice text in that it will contain sentences such as 'Berlin is the capital of Germany'. So, indeed, it makes doing typical factoid analogies easier.

That said -- I am more interested in the syntactic properties :).


rspeer 4020 days ago

I didn't mean that you have to learn the data from Wikipedia infoboxes, just that that's a prominent place to find factoids.

It's a data source that you could consult to pass 99% of the "semantic analogy" evaluation with no machine learning at all, which is an indication that a stronger evaluation is needed.
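
To make that concrete, here is a toy sketch of such a lookup baseline. The tables and the `lookup_analogy` helper are hypothetical; a real one would be filled from a structured source such as DBpedia rather than typed in by hand.

```python
# Hypothetical hand-written fact tables standing in for a structured
# knowledge source. Each relation maps a word to its counterpart.
relations = {
    "capital_to_country": {
        "Athens": "Greece",
        "Berlin": "Germany",
        "Paris": "France",
    },
    "male_to_female": {
        "man": "woman",
        "king": "queen",
        "brother": "sister",
    },
}


def lookup_analogy(a, b, c, relations):
    """Answer a : b :: c : ? by finding a relation that maps a to b
    and applying the same relation to c. Returns None if nothing matches."""
    for mapping in relations.values():
        if mapping.get(a) == b and c in mapping:
            return mapping[c]
    return None


print(lookup_analogy("Athens", "Greece", "Berlin", relations))  # Germany
print(lookup_analogy("man", "woman", "king", relations))        # queen
```

Nothing here is learned: the answer comes from a table lookup, so a test that this passes is measuring coverage of the table, not anything about the vectors.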
