by CharlieDigital 1020 days ago
We used it in an e-commerce application.

Apparently, one of the hardest things to do is to match a product name + description to a product taxonomy.

There are multiple taxonomies. Here's Google's for example: https://www.google.com/basepages/producttype/taxonomy.en-US....

Amazon has their own. Walmart has their own. Target has their own.

Given a list of tens of thousands of products, how can you automatically match the product to a merchant's taxonomy?

I started with a "clever" SQL query, but it turns out that it's way easier to do this with a vector DB.

    1. Get the vector embedding for each taxonomy path and store it
    2. Get the vector embedding for a given product using the name and a short description
    3. Find the closest matching taxonomy path using vector similarity

It's astonishingly good at this, and it solved a big problem for us: building a unified taxonomy from the various merchant taxonomies.
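
Here is a rough sketch of those three steps in Python. To keep it runnable without any API key, `toy_embed` is a hypothetical stand-in (plain word counts) for a real embedding model, and the taxonomy paths are made up for illustration rather than copied from any merchant. With a real model the vectors would be dense float arrays, but the matching logic would look much the same.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a real embedding model: word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Illustrative paths only (assumption), not a real merchant taxonomy.
TAXONOMY_PATHS = [
    "Apparel & Accessories > Shoes",
    "Electronics > Audio > Headphones",
    "Home & Garden > Kitchen & Dining > Cookware",
]

# Step 1: embed each taxonomy path once and store the result.
taxonomy_index = {path: toy_embed(path) for path in TAXONOMY_PATHS}

def best_taxonomy_path(name: str, description: str) -> tuple[str, float]:
    # Step 2: embed the product from its name and short description.
    product_vec = toy_embed(f"{name} {description}")
    # Step 3: pick the stored path with the highest similarity.
    scored = ((path, cosine_similarity(product_vec, vec))
              for path, vec in taxonomy_index.items())
    return max(scored, key=lambda pair: pair[1])

path, score = best_taxonomy_path(
    "Wireless headphones",
    "Over-ear Bluetooth audio headphones with noise cancelling",
)
print(path, round(score, 3))
```

Word counts only match on shared words, which is exactly the limitation real embeddings get around, so treat this purely as a picture of the data flow.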

You can use the same technique to match products with high confidence across merchants by also storing the product embedding from step 2. Now you have a way to determine that product A on Target.com is the same as product A' on Walmart.com and product A'' on Amazon.com by comparing vector similarity.


Could this strategy work to match products across retailers? If so, any tips on getting started with vector databases? I've heard of them but have yet to try one out.

Yes. You compute the embedding for the product name + description from Target.com and then the embedding for the product name + description from Walmart.com. The two embeddings will have a very high vector similarity.
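
A minimal sketch of that cross-retailer comparison, assuming the embeddings have already been computed and stored. The product IDs, the three-dimensional vectors, and the 0.95 threshold are all made up for illustration; real embeddings have hundreds or thousands of dimensions, and a sensible threshold would have to be tuned on your own data.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def match_products(source, candidates, threshold=0.95):
    """For each source product, return the most similar candidate ID,
    or None when nothing clears the (assumed) threshold."""
    matches = {}
    for source_id, source_vec in source.items():
        best_id, best_score = None, threshold
        for cand_id, cand_vec in candidates.items():
            score = cosine_similarity(source_vec, cand_vec)
            if score >= best_score:
                best_id, best_score = cand_id, score
        matches[source_id] = best_id
    return matches

# Hypothetical stored embeddings keyed by hypothetical product IDs.
target_products = {"T-1": [0.9, 0.1, 0.0], "T-2": [0.0, 0.2, 0.9]}
walmart_products = {"W-7": [0.88, 0.15, 0.02], "W-9": [0.1, 0.9, 0.1]}

matches = match_products(target_products, walmart_products)
print(matches)
```

Returning None below the threshold matters here: not every product exists at every retailer, so "closest" alone would produce false matches.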

The easiest way to get started is with Supabase, since it has a free tier and the pgvector extension built in.

You calculate the embedding using OpenAI's embeddings API and store the result. Then it's just a vector similarity query in Postgres (trivially easy).
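
To give a feel for the query side, here is a sketch that builds a pgvector nearest-neighbor query from Python. The `taxonomy_paths` table and its `path` and `embedding` columns are hypothetical names, and the `%s` placeholders assume a psycopg-style driver. `<=>` is pgvector's cosine distance operator, so `1 - distance` gives the cosine similarity. The embedding itself would come from whatever embeddings API you use; that call is left out.

```python
def to_vector_literal(embedding):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# Hypothetical schema: taxonomy_paths(path text, embedding vector(N)).
NEAREST_TAXONOMY_SQL = """
SELECT path, 1 - (embedding <=> %s::vector) AS cosine_similarity
FROM taxonomy_paths
ORDER BY embedding <=> %s::vector
LIMIT 1
"""

def nearest_taxonomy_query(product_embedding):
    """Return (sql, params) ready to pass to a psycopg-style cursor.execute()."""
    literal = to_vector_literal(product_embedding)
    # The literal appears twice: once in SELECT, once in ORDER BY.
    return NEAREST_TAXONOMY_SQL, (literal, literal)

sql, params = nearest_taxonomy_query([0.25, 1, -0.5])
print(params[0])
```

With a live connection this would be something like `cursor.execute(sql, params)` followed by `cursor.fetchone()`.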

Another way to do this is using the pgml extension (PostgresML). You can run Hugging Face embedding models, which have surpassed OpenAI's at this point. It's pretty fast if you run it on a machine with a GPU for acceleration. I've created embeddings on my local desktop with a 3090 for ~2,000,000 tokens in chunks of ~100 tokens (450 characters). It took around 20 min using the gte-base model, including inserting into an indexed table.

This approach still uses pgvector.
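
For the chunking step, something along these lines would do. It's only a sketch of one way to split text into pieces of at most ~450 characters on word boundaries, not necessarily how the chunks above were produced; `chunk_text` and `max_chars` are names I made up.

```python
def chunk_text(text: str, max_chars: int = 450) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on whitespace.

    A single word longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    current = []      # words in the chunk being built
    current_len = 0   # length of " ".join(current)
    for word in text.split():
        # +1 for the joining space when the chunk already has words.
        added = len(word) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(" ".join(current))
            current = [word]
            current_len = len(word)
        else:
            current.append(word)
            current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "word " * 200
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk then gets embedded and inserted as its own row, so a similarity search returns the relevant passage rather than a whole document.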