| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by kleebeesh 3140 days ago
	That's not really a problem if the user is frequently ingesting and vectorizing new data. They need a place to store it and efficiently query it. They can cache the queries for an old vector, but still need to compute new queries every time they have a new piece of content. Also old vectors might have relationships to the new vectors that have been inserted since caching, so you don't want to serve stale results.

1 comments

billconan 3140 days ago

if we forget about making the service for a moment, How would you store and process high dimensional vectors locally? What ready-made library/software to use? What datastructure?

kleebeesh 3140 days ago

Some storage options are: - Store many vectors in a single HDF5 or LMDB file. - Store single vectors in many small binary files (e.g. using numpy save() function).

To look up neighbors you might: - Compute neighbors exhaustively (e.g. using Scipy distance functions). - Use an Approximate Nearest Neighbors approach like one of the ones benchmarked here (https://github.com/erikbern/ann-benchmarks). These can be much faster than exhaustively computing neighbors, at the cost of some accuracy and having to periodically re-build an index.