| HN Mirror



	by fewald_net 1092 days ago
	Great project. I learned about the faiss library. Out of curiosity, did you also try it with doc2vec?


julien040 1092 days ago

I didn't try Doc2Vec. I wanted a hosted solution because I wouldn't have been able to compute all this locally (more than 100,000 posts).

If you tried it, did you get good results with it? I may use it in future projects.

fewald_net 1092 days ago

Yes, I am using it on a not-so-small dataset (roughly 1 million docs), and the output is a fairly efficient model. I am using gensim with pre-trained word vectors. Vectors for new docs can be inferred via .infer_vector().

Overall, my approach is less automated than what I have seen in your codebase, so it's likely a bigger investment. I am happy to share more.

julien040 1091 days ago

It's very interesting. I may try it in the future.

jimmySixDOF 1092 days ago

The blog post linked on GitHub was a nice walkthrough of your method. What do you think the hit rate was for successfully extracting text for embeddings from the linked articles (TFA)? 100K is a good-sized corpus, but I'm wondering how many stories got skipped due to paywalls, 404 links, or other problems.

julien040 1091 days ago

Thank you for reading it.

The hit rate is low. I only tried to get embeddings for stories with a score greater than 100. The SQL query "SELECT count(*) FROM story WHERE score > 100;" returns 155,228 stories, and the corpus contains 108,477 of them.

108,477 / 155,228 ≈ 0.699, so about 70% of the candidate stories made it into the corpus.

The main problems were 404 links and posts that weren't articles (such as tweets).
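
For anyone who wants to reproduce the arithmetic, here is a minimal sketch of the hit-rate calculation. Only the "story" table, its "score" column, and the two reported counts come from the comment above. The hit_rate helper, the in-memory SQLite database, and the toy rows are assumptions made for illustration; the actual database engine and schema of the project are not stated here.

```python
import sqlite3


def hit_rate(conn, corpus_size, min_score=100):
    """Return corpus_size divided by the number of stories scoring above min_score.

    Hypothetical helper: it only assumes a `story` table with a `score` column,
    as in the query quoted above.
    """
    (candidates,) = conn.execute(
        "SELECT count(*) FROM story WHERE score > ?", (min_score,)
    ).fetchone()
    if candidates == 0:
        raise ValueError("no stories above the score threshold")
    return corpus_size / candidates


# Toy in-memory table standing in for the real one (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE story (id INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany(
    "INSERT INTO story (score) VALUES (?)",
    [(50,), (100,), (101,), (150,), (300,)],
)

# Three toy stories score above 100; pretend two of them yielded usable text.
toy_rate = hit_rate(conn, corpus_size=2)

# The figures reported in the comment, computed directly.
reported_rate = 108_477 / 155_228
print(f"toy hit rate: {toy_rate:.1%}")
print(f"reported hit rate: {reported_rate:.1%}")
```

The threshold is passed as a query parameter so the same helper can be reused with a different score cutoff.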
