Hacker News new | ask | show | jobs
by fbdab103 696 days ago
I might have a work use case for which this would be perfect.

Having no experience with word2vec, some reference performance numbers would be great. If I have one million PDF pages, how long is that going to take to encode? How long will it take to search? Is it CPU only or will I get a huge performance benefit if I have a GPU?

3 comments

As someone working extensively with word2vec: I would recommend to set up Elasticsearch. It has support for vector embeddings, so you can process your PDF documents once, write the word2vec embeddings and PDF metadata into an index, and search that in milliseconds later on. Doing live vectorisation is neat for exploring data, but using Elasticsearch will be much more convenient in actual products!
I would personally vote for Postgres and one of the many vector indexing extensions over Elasticsearch. I think Elasticsearch can be more challenging to maintain. Certainly a matter of opinion though. Elasticsearch is a very reasonable choice.
Elasticsearch is a pain to maintain, the docs are all over the place, and the API is what you end up with if developers run free and implement everything that jumps to their mind.

But there just isn’t anything comparable when it comes to building your own search engine. Postgres with a vector extension is good if all you want to do is a vector search and some SQL (not dismissing it here, I love PG); but if you want more complex search cases, Elasticsearch is the way to go.

Vector stuff doesn’t take much. It’s essentially the same computational time as regular old text search.
No. Look at the implementation. Gpu isn't going to give you huge gains in this one