	by billconan 3180 days ago
	I thought about word2vec as a service. I gave up because I think customers could easily cache (pirate) my data.


kleebeesh 3180 days ago

That's not really a problem if the user is frequently ingesting and vectorizing new data. They need a place to store it and query it efficiently. They can cache the query results for an old vector, but they still need to run new queries every time they have a new piece of content. Also, old vectors might be related to new vectors that were inserted after the results were cached, so you don't want to serve stale results.


billconan 3180 days ago

If we forget about making the service for a moment, how would you store and process high-dimensional vectors locally? What ready-made library or software would you use? What data structure?


kleebeesh 3180 days ago

Some storage options are:
- Store many vectors in a single HDF5 or LMDB file.
- Store single vectors in many small binary files (e.g. using the NumPy save() function).
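A rough sketch of the two layouts, using only NumPy. The many-small-files option uses numpy save() as mentioned above. For the single-file option, a stacked .npy file opened with memory mapping stands in for HDF5 or LMDB here, purely to keep the example dependency-free; it is not the same format. The function names, the `doc0`-style ids, and the 300-dimensional float32 vectors are all made up for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_vectors_individually(vectors, ids, directory):
    """Write each vector to its own .npy file named after its id."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for vec_id, vec in zip(ids, vectors):
        np.save(directory / f"{vec_id}.npy", vec)


def load_vector(vec_id, directory):
    """Read a single vector back by id."""
    return np.load(Path(directory) / f"{vec_id}.npy")


def save_vectors_stacked(vectors, path):
    """Write all vectors as one 2-D array in a single file."""
    np.save(path, np.asarray(vectors))


def open_vectors_stacked(path):
    """Memory-map the single file so rows are read from disk on demand."""
    return np.load(path, mmap_mode="r")


# Hypothetical data: 5 vectors of dimension 300.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 300)).astype(np.float32)
ids = [f"doc{i}" for i in range(5)]

with tempfile.TemporaryDirectory() as tmp:
    # Option 1: many small files, one per vector.
    save_vectors_individually(vectors, ids, tmp)
    one = load_vector("doc3", tmp)

    # Option 2: one file holding every vector, addressed by row number.
    stacked_path = Path(tmp) / "all_vectors.npy"
    save_vectors_stacked(vectors, stacked_path)
    mapped = open_vectors_stacked(stacked_path)
    row = np.array(mapped[3])  # copy the row out of the memory map
    del mapped  # release the map before the temp directory is removed
```

The per-file layout makes it trivial to add or replace one vector, but gets slow with millions of files. The single-file layout is friendlier to bulk scans, though you then need your own mapping from ids to row numbers.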

To look up neighbors you might:
- Compute neighbors exhaustively (e.g. using SciPy distance functions).
- Use an Approximate Nearest Neighbors approach like one of the ones benchmarked here (https://github.com/erikbern/ann-benchmarks). These can be much faster than exhaustively computing neighbors, at the cost of some accuracy and having to periodically rebuild an index.
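The exhaustive option could look something like the sketch below, which uses SciPy's cdist to compare one query against every stored vector. The function name, the choice of cosine distance, and the random test data are assumptions for the example. The approximate approach is not shown, since each library in that benchmark has its own API.

```python
import numpy as np
from scipy.spatial.distance import cdist


def nearest_neighbors(query, vectors, k=3, metric="cosine"):
    """Return (indices, distances) of the k stored vectors closest to query.

    Brute force: the query is compared against every row of `vectors`,
    so the cost grows linearly with the number of stored vectors.
    """
    # cdist expects 2-D inputs, so the query becomes a 1 x d matrix.
    distances = cdist(query[np.newaxis, :], vectors, metric=metric)[0]
    k = min(k, len(distances))
    order = np.argsort(distances)[:k]  # smallest distance first
    return order, distances[order]


# Hypothetical data: 1000 vectors of dimension 50.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 50))

# A slightly perturbed copy of vector 42 should have 42 as its top match.
query = vectors[42] + 0.01 * rng.normal(size=50)
idx, dist = nearest_neighbors(query, vectors, k=5)
```

This is exact and needs no index, so newly inserted vectors show up in results immediately. That is the trade-off against the approximate methods, which are faster per query but need the index rebuilt now and then.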
