Hacker News
by mpercy 1020 days ago
How are people actually using vector databases?

The closest explanation to a use case architecture I've seen recently was https://mattboegner.com/knowledge-retrieval-architecture-for... - it basically describes doing knowledge retrieval (keyword parsing) from LLM queries, feeding the parsed keywords to a vector db to run a similarity search for the top K most similar documents, then feeding that list back into the LLM as potentially useful documents it can reference in its response. It's neat but it seems a bit hacky. Is that really the killer app for these things?
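
For anyone trying to picture that flow, here is a minimal sketch of the retrieve-then-prompt step, not the linked article's implementation. The documents, their 3-dimensional vectors, and the names `top_k` and `build_prompt` are all made up for illustration; a real setup would get the vectors from an embedding model and the ranking from a vector db.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, docs, k=2):
    """docs is a list of (text, vector); return the k texts closest to query_vec."""
    ranked = heapq.nlargest(k, docs, key=lambda d: cosine(query_vec, d[1]))
    return [text for text, _ in ranked]

def build_prompt(question, context_docs):
    """Paste the retrieved documents into the prompt that goes back to the LLM."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Use these documents if they help:\n{context}\n\nQuestion: {question}"

# Hypothetical corpus with made-up vectors (a real model returns hundreds of dimensions).
DOCS = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Shipping takes 3 to 5 business days.", [0.1, 0.9, 0.0]),
    ("Our office is closed on public holidays.", [0.0, 0.1, 0.9]),
]

question = "How long do refunds take?"
question_vec = [0.8, 0.2, 0.0]  # assumed embedding of the question
prompt = build_prompt(question, top_k(question_vec, DOCS, k=2))
```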

2 comments

We used it in an e-commerce application.

Apparently, one of the hardest things to do is to match a product name + description to a product taxonomy.

There are multiple taxonomies. Here's Google's for example: https://www.google.com/basepages/producttype/taxonomy.en-US....

Amazon has their own. Walmart has their own. Target has their own.

Given a list of tens of thousands of products, how can you automatically match the product to a merchant's taxonomy?

I started with a "clever" SQL query to do this, but it turns out that it's way easier to use vector DBs to do this.

    1. Get the vector embedding for each taxonomy path and store it
    2. Get the vector embedding for a given product using the name and a short description
    3. Find the closest matching taxonomy path using vector similarity

It's astonishingly good at doing this and solved a big problem for us: building a unified taxonomy from the various merchant taxonomies.
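
The three steps above can be sketched roughly like this. It is not our production code: `embed` here is a toy bag-of-words stand-in so the example runs without any API key, the taxonomy paths are made up, and `match_taxonomy` is a hypothetical name. In practice you would swap `embed` for a real embedding model and keep the stored vectors in the database.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Step 1: embed each taxonomy path once and store it (hypothetical paths).
TAXONOMY = [
    "Electronics > Audio > Headphones",
    "Home & Garden > Kitchen > Coffee Makers",
    "Apparel > Shoes > Running Shoes",
]
taxonomy_index = {path: embed(path) for path in TAXONOMY}

def match_taxonomy(name: str, description: str) -> tuple[str, float]:
    # Step 2: embed the product from its name and a short description.
    product_vec = embed(f"{name} {description}")
    # Step 3: pick the taxonomy path with the highest similarity.
    best = max(taxonomy_index, key=lambda p: cosine(product_vec, taxonomy_index[p]))
    return best, cosine(product_vec, taxonomy_index[best])

path, score = match_taxonomy("Wireless headphones", "Over-ear Bluetooth audio headphones")
```

The toy `embed` only matches on shared words, so it misses exactly the cases where real embeddings shine (synonyms, brand names, odd phrasing). The surrounding structure stays the same when you plug in a real model.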

You can use the same technique to match products with high confidence across merchants by storing the second vector embedding (the product embedding from step 2). Now you have a way to determine that product A on Target.com, product A' on Walmart.com, and product A'' on Amazon.com are the same product by comparing vector similarity.

Could this strategy work to match products across retailers? If so, any tips on getting started with vector databases? I've heard of them but have yet to try one out.

Yes. You compute the embedding for the product name + description from Target.com and then the embedding for the product name + description from Walmart.com. They'll have a very close vector similarity.
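
A minimal sketch of that comparison, assuming the two embeddings are already computed. The 4-dimensional vectors are made up, and both the name `same_product` and the 0.95 threshold are assumptions you would tune on your own labeled pairs.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def same_product(a: list[float], b: list[float], threshold: float = 0.95) -> bool:
    """Treat two listings as the same product when similarity clears the threshold.

    The threshold is an assumed value, not one taken from the thread.
    """
    return cosine_similarity(a, b) >= threshold

# Made-up embeddings for illustration; real ones come from the embedding model.
target_listing = [0.12, 0.88, 0.35, 0.05]
walmart_listing = [0.10, 0.90, 0.33, 0.07]
unrelated_listing = [0.95, 0.05, 0.02, 0.40]

is_match = same_product(target_listing, walmart_listing)
is_false_match = same_product(target_listing, unrelated_listing)
```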

The easiest way to get started is with Supabase since it has a free tier and the pgvector extension built in.

You calculate the embedding using OpenAI's embeddings API and store the result. Then it's just a vector similarity query in Postgres (trivially easy).
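
To give an idea of what that query looks like, here is a sketch that builds it from Python. The `taxonomy` table, its `path` and `embedding` columns, and the helper names are hypothetical; `<=>` is pgvector's cosine distance operator, and the `%s` placeholders assume a psycopg-style driver.

```python
def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a Python list as a pgvector text literal such as '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# Hypothetical schema: taxonomy(path text, embedding vector(N)),
# where N is the dimension of whatever embedding model you use.
NEAREST_TAXONOMY_SQL = """
SELECT path, 1 - (embedding <=> %s::vector) AS cosine_similarity
FROM taxonomy
ORDER BY embedding <=> %s::vector
LIMIT %s
"""

def nearest_taxonomy_query(embedding: list[float], k: int = 1):
    """Return (sql, params) ready to pass to cursor.execute(sql, params)."""
    literal = to_pgvector_literal(embedding)
    return NEAREST_TAXONOMY_SQL, (literal, literal, k)

sql, params = nearest_taxonomy_query([0.1, 2, -3.5], k=3)
```

Ordering by the distance operator directly (rather than by the computed similarity column) is what lets Postgres use a vector index if you have created one.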

Another way to do this is using the pgml extension. You can run Hugging Face embedding models, which have surpassed OpenAI's at this point. It's pretty fast if you run it on a machine with a GPU for acceleration. I've created embeddings on my local desktop with a 3090 for ~2,000,000 tokens in chunks of ~100 tokens (~450 characters). It took around 20 min using the gte-base model, including the inserts into an indexed table.
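
The chunking step could look something like the sketch below. It is one simple way to do it, not necessarily what was used here: `chunk_text` is a hypothetical helper that packs whole words into chunks of up to 450 characters, using character count as a rough proxy for the ~100 token target.

```python
def chunk_text(text: str, max_chars: int = 450) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars is kept whole as its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # The next word would overflow the chunk, so close it and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

sample = "word " * 500
chunks = chunk_text(sample)
```

For scale, the numbers above work out to roughly 20,000 chunks in 20 minutes, or about 17 chunks per second including the inserts.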

Still uses pgvector.

Yes, GPU-poor people are just using top-k semantic search to try to fix the issues with low-RAM, low-knowledge LLMs. It's OK for some applications, but other methods need to be investigated.