by Uncroyable 1018 days ago
exactly. brute force methods work great for thousands of pieces of data of (almost) any kind. for vector data no vector db is needed; for structured data even a simple csv file with a python script will do the job and a sql db is not necessary.
1 comments

The thing is, depending on the dimensions of the vectors you're using, the growth rate can be pretty drastic. Each OpenAI vector is about 1,500 dimensions, which is roughly 6 KB as float32. So sure, searching 1k vectors, just brute force it. But what if you have 100k vectors and a ton of users hammering at the index searching through it? Each search is now 100,000 dot products, or about 150,000,000 multiply-adds. Go up one more order of magnitude and you definitely need an index. And I don't think 1 million vectors is that much.
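To put rough numbers on that, here is a back-of-envelope sketch. It assumes float32 storage and the approximate 1,500 dimensions quoted above; the names `brute_force_cost` and `top_k` are made up for illustration and are not from any library.

```python
# Back-of-envelope cost of brute-force vector search.
DIMS = 1500            # approximate dimension quoted in the thread
BYTES_PER_FLOAT = 4    # assumption: vectors are stored as float32


def brute_force_cost(num_vectors, dims=DIMS):
    """Return (dot_products, multiply_adds, index_bytes) for one query."""
    dot_products = num_vectors            # one dot product per stored vector
    multiply_adds = num_vectors * dims    # each dot product touches every dimension
    index_bytes = num_vectors * dims * BYTES_PER_FLOAT
    return dot_products, multiply_adds, index_bytes


def top_k(query, vectors, k=3):
    """Brute-force search: score every vector, return indices of the k best."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]


for n in (1_000, 100_000, 1_000_000):
    dots, madds, size = brute_force_cost(n)
    print(f"{n:>9,} vectors: {madds:>13,} multiply-adds/query, {size / 1e9:.3f} GB")
```

Under these assumptions 1M vectors is about 6 GB of raw floats and 1.5 billion multiply-adds per query, which is why per-query cost times the number of concurrent users is the number to watch.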
The number of vectors is determined (other than by the original dataset, ofc) by how you choose to chunk the data you have available. Bigger chunks work better in terms of search (empirically) and they also keep the number of vectors down. For OpenAI, based on prevalent norms and their cookbooks, 1M vectors likely means 1M (more like 700K) PDF pages of text (at a chunk size of 1000 tokens per embedding). That is a lot of textual data for a decent-sized company. Enterprises might reach that stage. Consulting firms definitely would, though they have already trained and announced their own models.
700k PDF pages is not a lot. Also, you might be a business serving other businesses and indexing their documents, and at a reasonable scale (again, not Google scale), 700k pages is again not a lot.

Another way to look at 700k pages is as roughly 2,333 books of 300 pages each.
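The arithmetic behind those figures, as a quick sketch. Every input is a number quoted in this thread; the tokens-per-page value is only what those numbers imply together, not a measured figure.

```python
PAGES = 700_000            # page count quoted in the thread
PAGES_PER_BOOK = 300       # book length used in the comparison above
VECTORS = 1_000_000        # vector count quoted in the thread
TOKENS_PER_VECTOR = 1000   # chunk size quoted in the thread

books = PAGES / PAGES_PER_BOOK
total_tokens = VECTORS * TOKENS_PER_VECTOR
tokens_per_page = total_tokens / PAGES   # implied by the thread's figures

print(f"{books:,.0f} books, {total_tokens:,} tokens, "
      f"{tokens_per_page:,.0f} tokens per page implied")
```

Changing the chunk size changes the vector count in inverse proportion, so halving the chunk size roughly doubles the number of vectors for the same pages.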