by pm90 314 days ago
Slightly meta, but I find it's a good sign that we're back to designing and blogging about in-house data storage systems/query engines again. There was an explosion of these in the 2010s, which seemed to slow down/refocus on AI recently.

It slowed down not because of AI, but because it turned out it was mostly pointless. These were highly specialized stacks whose performance could usually be matched by tweaking an existing system or scaling a different way.

In-house storage/query systems that are not themselves a product being sold are NIH syndrome by a company with too many engineering resources.

Is it good? What's left to innovate on in this space? I don't really want experimental data stores. Give me something rock solid.
I don't disagree that rock solid is a good choice, but there is a ton of innovation necessary for data stores.

Especially in the context of embedding search, which this article is also trying to do. We need databases that can efficiently store/query high-dimensional embeddings and handle the nuances of real-world applications, such as filtered ANN. There is a ton of innovation in this space and it's crucial to powering the next-generation architectures of just about every company out there. At this point, data stores are becoming a bottleneck for serving embedding search, and I cannot overstate how important advancements here are for enabling these solutions. This is why there is an explosion of vector databases right now.

This article is a great example of where the actual data-providers are not providing the solutions companies need right now, and there is so much room for improvement in this space.

I do not think data stores are a bottleneck for serving embedding search. I think the raft of new-fangled vector db services (or pgvector or whatever) can be a bottleneck because they are mostly optimized around the long tail of pretty small data. Real internet-scale search systems like ES or Vespa won’t struggle with serving embedding search assuming you have the necessary scale and time/money to invest in them.
Sure they can handle the basic case of ANN. But ANN still doesn’t have good stories for lots of real-world problems.

* filterable ANN, decomposes into prefiltering or postfiltering.

* dynamic updates and versioning are still very difficult

* slow building of graph indexes

* adding other signals into the search, such as query time boosting for recent docs.

I don’t disagree these systems can work but innovation is still necessary. We are not in a “data stores are solved” world.
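To make the first bullet concrete, here is a minimal sketch of the difference between post-filtering and pre-filtering. It uses an exact scan as a stand-in for an ANN index, and the document fields (`vec`, `lang`) and function names are made up for illustration, not taken from any particular engine.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query, docs, k):
    # Exact nearest neighbours by cosine similarity (stand-in for an ANN index).
    return sorted(docs, key=lambda d: cosine(query, d["vec"]), reverse=True)[:k]

def post_filter(query, docs, k, predicate):
    # Search first, then drop hits that fail the filter: may return fewer than k.
    return [d for d in top_k(query, docs, k) if predicate(d)]

def pre_filter(query, docs, k, predicate):
    # Filter first, then search only the survivors.
    return top_k(query, [d for d in docs if predicate(d)], k)

# Hypothetical toy corpus: the two closest docs are English, the filter wants German.
docs = [
    {"id": 1, "vec": [1.0, 0.0], "lang": "en"},
    {"id": 2, "vec": [0.9, 0.1], "lang": "en"},
    {"id": 3, "vec": [0.8, 0.2], "lang": "de"},
    {"id": 4, "vec": [0.0, 1.0], "lang": "de"},
]
query = [1.0, 0.0]
is_german = lambda d: d["lang"] == "de"

post_ids = [d["id"] for d in post_filter(query, docs, 2, is_german)]
pre_ids = [d["id"] for d in pre_filter(query, docs, 2, is_german)]
print(post_ids, pre_ids)
```

In this toy case post-filtering comes back empty because both of the top-2 hits fail the filter, while pre-filtering still finds the two German docs. With a real graph index the pre-filter path is the expensive one, which is where the difficulty lies.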

* Filterable ANN certainly decomposes into pre- and post-filtering, and there is definitely a lot of interesting innovation occurring around filterable ANN. But large-scale search systems currently do a pretty good job with pre-filtering, falling back to brute force search in the case of restrictive filters.

* You'd have to be a bit more exact re: dynamic updates/versioning for me to understand the challenges you're facing.

* Building graph indices can be slow, but in my experience (billions of embeddings) it is possible to build HNSW indices in tens of minutes.

* How is this any different to combining traditional keyword search with, say, recency boosting?
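The first point above (pre-filtering, with a brute-force fallback when the filter is restrictive) could look roughly like the sketch below. The 10% selectivity threshold, the `tenant` field, and the `ann_search(query, k, allowed_ids)` interface are assumptions for illustration. The "ANN" here is a fake that does an exact scan, since the point is only the routing decision.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query, docs, k):
    # Brute-force scan: score every candidate and keep the best k.
    return sorted(docs, key=lambda d: dot(query, d["vec"]), reverse=True)[:k]

def make_fake_ann(docs):
    # Hypothetical stand-in for a graph index that accepts an allow-list of ids.
    def ann_search(query, k, allowed_ids):
        return exact_top_k(query, [d for d in docs if d["id"] in allowed_ids], k)
    return ann_search

def filtered_search(query, docs, k, predicate, ann_search, brute_force_below=0.1):
    allowed = [d for d in docs if predicate(d)]
    selectivity = len(allowed) / len(docs) if docs else 0.0
    if selectivity < brute_force_below:
        # Restrictive filter: few survivors, so an exact scan is cheap and exact.
        return "brute_force", exact_top_k(query, allowed, k)
    # Permissive filter: let the index do the work, restricted to allowed ids.
    return "ann", ann_search(query, k, {d["id"] for d in allowed})

# Toy corpus: 95 docs for tenant "a", 5 docs for tenant "b".
docs = [{"id": i, "vec": [float(i), 1.0], "tenant": "a" if i < 95 else "b"} for i in range(100)]
ann = make_fake_ann(docs)
query = [1.0, 0.0]

path_b, hits_b = filtered_search(query, docs, 3, lambda d: d["tenant"] == "b", ann)
path_a, hits_a = filtered_search(query, docs, 3, lambda d: d["tenant"] == "a", ann)
print(path_b, [d["id"] for d in hits_b])
print(path_a, [d["id"] for d in hits_a])
```

Tenant "b" matches 5% of the corpus and takes the brute-force path. Tenant "a" matches 95% and goes through the index. Where to put the threshold in practice depends on the index and hardware, and is not something this sketch tries to answer.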

You might be missing my argument here: I stated that there are workable solutions to this, as you have pointed out.

But ANN search is still a sledgehammer, and there is still room for innovation in building hybrid solutions that bridge the gap between it and traditional data stores.
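As one example of a hybrid signal mentioned in this thread, query-time recency boosting can be sketched as re-scoring ANN candidates with an age decay. The linear blend, the 30-day half-life, and the 0.5 weight are arbitrary assumptions, not how any specific engine does it.

```python
def recency_boosted_score(similarity, age_days, half_life_days=30.0, weight=0.5):
    # Exponential decay: a doc loses half of its recency credit every half-life.
    decay = 0.5 ** (age_days / half_life_days)
    # Linear blend of vector similarity and recency (weights are assumptions).
    return (1.0 - weight) * similarity + weight * decay

# Hypothetical candidates, as if returned by an ANN search with similarity scores.
candidates = [
    {"id": "old", "similarity": 0.90, "age_days": 365},
    {"id": "new", "similarity": 0.80, "age_days": 0},
]
ranked = sorted(
    candidates,
    key=lambda c: recency_boosted_score(c["similarity"], c["age_days"]),
    reverse=True,
)
ranked_ids = [c["id"] for c in ranked]
print(ranked_ids)
```

The catch is that this only re-ranks whatever the ANN step already returned. A fresh doc that falls outside the top-k by similarity never gets a chance to be boosted, which is one reason pushing such signals into the index itself is still an open area.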

> Real internet-scale search systems like ES

Oh, then you must have the secret sauce that allows scaling ES vector search beyond 10,000 results without requiring infinite RAM. I know their forums would welcome it, because that question comes up a lot.

Or I guess that's why you included the qualifier about money to invest.

Would you mind putting aside the snark? I have a couple questions. How large is the corpus? I am also curious about the use-case for top-k ANN, k > 10000?
Not the person you asked, but at work (we are a CRM platform) we allow our clients to arbitrarily query their userbase to find matching users for marketing campaigns (email, SMS, WhatsApp). These campaigns can sometimes target a few hundred thousand people. We are on a really ancient version of ES, and it sucks at this job in terms of throughput. Some experimenting with BigQuery indicates it is much better at mass exporting.
Agreed. The only caveat to that being a global rule is: 'At scale in a particular niche, even an excellent generalist platform might not be good enough'

But then the follow-on question is: "Am I really suffering the same problems that a niche, already-scaled business is suffering?"

A question that is relevant to all decision making. I'm looking at you, people who use the entire React ecosystem to deploy a blog page.

NoSQL/alternative databases became kind of a meme once people realized that 95% of enterprises can do fine with just Postgres.