HN Mirror

by loxias 1038 days ago

I get your larger point, but the errors and phrasing are a bit off-putting.

Vector similarity alone _IS_ enough for vector search. That's literally what "search" means in this context! Finding another vector within an epsilon bound given a metric. After the 3rd read, I understand the point you're trying to make I think, and I think you might be right.
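To make "within an epsilon bound given a metric" concrete, here is a minimal brute-force sketch. The names `range_search` and `euclidean` are made up for illustration, the data is a toy, and a real index would use an approximate structure rather than a linear scan.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_search(query, vectors, epsilon, metric=euclidean):
    """Return (index, distance) for every vector within epsilon of query.

    Exact linear scan, nearest first. Hypothetical helper, not any
    particular library's API.
    """
    hits = []
    for i, v in enumerate(vectors):
        d = metric(query, v)
        if d <= epsilon:
            hits.append((i, d))
    hits.sort(key=lambda h: h[1])
    return hits

# Toy 2-D data, purely for illustration.
points = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.2), (5.0, 5.0)]
result = range_search((0.0, 0.0), points, epsilon=0.5)
print(result)
```

Swapping `metric` for cosine distance or anything else changes what "similar" means, which is why the metric has to be stated alongside any numbers.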

There might be room in the market for an integrator, an all in one platform. It won't have the best performance or functionality, I doubt it would win in _any_ category. But if you can get the business model working right I could imagine such a product having sizeable market share. Hm...

Edit: I'm also curious about the dimension and metric used. Any numbers about latency or size are kinda pointless without them :).

1 point in 1536-D space (what OpenAI uses), 4-byte floats == 6 KB, so even 100 million points is only ~600 GB...
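The back-of-the-envelope arithmetic, spelled out as a small sketch. It counts raw vector storage only; index overhead, IDs, and metadata are deliberately ignored, and `raw_vector_bytes` is just an illustrative name.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_component=4):
    """Raw storage for dense vectors, ignoring index overhead and metadata."""
    return num_vectors * dim * bytes_per_component

per_point = raw_vector_bytes(1, 1536)            # one 1536-D float32 vector
total = raw_vector_bytes(100_000_000, 1536)      # 100 million of them

print(f"per point: {per_point} bytes ({per_point / 1024:.0f} KiB)")
print(f"100M points: {total / 1e9:.1f} GB")
```

That works out to 6144 bytes per point and roughly 614 GB for 100 million points, consistent with the rounded figures above.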

jn2clark 1038 days ago

Regarding metric and dimension - it is really problem dependent as is throughput. Recall and latency numbers reported in benchmarks are typically on very well curated and structured datasets and average across all queries. Recall is not just a function of the HNSW algorithm. I can tell you, though, that you can do 70M-vector indexes with 768 dimensions in <100ms, including inference, on very real-world datasets. We will publish some benchmarks shortly, as we are doing more evaluations on real-world data. I also compiled throughput numbers for open CLIP models here: https://docs.google.com/spreadsheets/d/1ftHKf4MovnAyKhGyi05e.... If there are particular things you want to see, let us know and we can add them!

loxias 1038 days ago

> it is really problem dependent as is throughput. Recall and latency numbers reported in benchmarks are typically on very well curated and structured datasets and average across all queries

This is correct. :) Don't worry, I know enough to not trust any published benchmarks on this topic... (I'm also not your target market. I wrote my first "vector DB" in 2001 for music recognition.)

I still think it's crucial to include just a few more facts though, because otherwise the statement is meaningless.

Consider:

A. "we can find an approximate NN match, Euclidean, D=768, N=70000000, under 100ms on a modern laptop"

B. "we can find an approximate NN match, Euclidean, D=2, N=70000000, under 100ms on a modern laptop"

C. "we can find an approximate NN match, Euclidean, D=768, N=70000000, under 100ms on 1000x modern laptops"

Notice how B and C aren't impressive, they're trivially beatable. :)
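A rough way to see the gap between A, B, and C is to count the work an exact linear scan would do per query. This is only a sketch: it assumes one multiply-add per vector component and even sharding across machines, and it ignores memory bandwidth, network cost, and the fact that low-dimensional data also admits spatial indexes. `scan_multiply_adds` is an illustrative name.

```python
def scan_multiply_adds(n, dim, machines=1):
    """Multiply-adds per query on each machine for an exact linear scan.

    Assumes the n vectors are sharded evenly across the machines.
    """
    per_machine = -(-n // machines)  # ceiling division
    return per_machine * dim

N = 70_000_000
a = scan_multiply_adds(N, 768)                  # A: one laptop, D=768
b = scan_multiply_adds(N, 2)                    # B: one laptop, D=2
c = scan_multiply_adds(N, 768, machines=1000)   # C: 1000 laptops, D=768

print(f"A: {a:,}  B: {b:,}  C (per machine): {c:,}")
print(f"A is {a // b}x the work of B and {a // c}x the per-machine work of C")
```

Under these assumptions even a naive scan in B or C does hundreds of times less work per machine than in A, which is why the claim means little without D, N, and the hardware.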

jn2clark 1038 days ago

I think it depends a bit on the definition of search here. It might satisfy a literal definition of search but not search as users would expect - which I think is the important point. IMHO vector similarity and vector search are conflated too much and solving search problems as users expect them requires more than similarity.

loxias 1038 days ago

I think you might be on to something, in thinking about it in terms of the platform from the perspective of the end user, and what they build on it.

I humbly posit that you might be better off, at least from a communications/marketing perspective, ditching the "vector search without vectors" verbiage, because that alienates the segment that, uh, for lack of a better term, loves and understands high-dimensional applied math and computers. :)

Perhaps instead find language that couches it as an entirely new category. Blue ocean. Ditch the word "vector" entirely.

-$0.02

jn2clark 1038 days ago

Thanks for the feedback and questions - really appreciate it.

_false 1038 days ago

Why not semantic search?

rmilejczz 1037 days ago

Definitely. RAG programs often grab lots of unneeded context and sometimes miss crucial context. Improving this would be huge imo, for example in something like Cursor.
