by arcturus17 1223 days ago
I'll soon be releasing a CLI app that creates embeddings for entire YouTube channels. I actually checked whether Supabase offered a pgvector plugin, but since it didn't a couple of weeks ago, I ended up going with Pinecone. I will add a mention of this in the docs.
5 comments

I tested pgvector against a vanilla FAISS index, and pgvector was significantly slower with 511d vectors. If you have a small dataset (fewer than 100k vectors?), it's probably fine, but for larger datasets I would look at a distributed vector search provider.
We merged the pgvector PR about 2 weeks ago (https://github.com/supabase/postgres/pull/472). If you're missing anything for your CLI, don't hesitate to reach out and we'll see if we can integrate it into the product (my email is in my profile).

as an aside, Pinecone looks great

https://github.com/nmslib/hnswlib

Used it to index 40M text snippets in the legal domain. Allows incremental adding.

I love how it just works. You know, it doesn’t ANNOY me or make a FAISS. ;-)

What is the memory usage with 40M vectors?
Good point! I believe it was on the order of 20 GB. I used a Hetzner 512 GB bare metal server at $50/month.

P.S. Many people seem to think that for vector search you need a GPU. You don't.
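
For the memory question above, here is a rough back-of-envelope sketch in Python. It assumes float32 vectors and an HNSW-style graph that stores about 2*M neighbor ids of 4 bytes each per element at the base layer, ignoring upper layers and labels. The thread states neither the vector dimension nor M, so dim=128 and m=16 below are placeholder values, and estimate_index_bytes is a made-up helper, not part of hnswlib.

```python
def estimate_index_bytes(num_vectors: int, dim: int, m: int) -> int:
    """Rough RAM estimate for an HNSW-style index (assumed cost model)."""
    vector_bytes = dim * 4   # float32 components
    link_bytes = 2 * m * 4   # about 2*M neighbor ids, 4 bytes each, base layer only
    return num_vectors * (vector_bytes + link_bytes)


if __name__ == "__main__":
    # Placeholder parameters; the thread does not give dim or M.
    total = estimate_index_bytes(40_000_000, dim=128, m=16)
    print(f"{total / 1e9:.1f} GB")
```

With these placeholder values the estimate comes to 25.6 GB, the same order of magnitude as the figure quoted above, and well within the RAM of a single CPU-only machine.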

Looking forward to it. If we (Pinecone) can help with anything shoot me an email! greg@pinecone.io
Do you transcribe the YouTube videos, or what do you mean by embeddings for YouTube channels?
Yes, the CLI is a pipeline that fetches audio -> transcribes -> creates embeddings.
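
A minimal sketch of that fetch -> transcribe -> embed flow, in Python. The three stage functions are hypothetical callables passed in by the caller, since the thread does not say which downloader, speech-to-text model, or embedding model the CLI uses. The fixed-size character chunking is also an assumption, not something described above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class EmbeddedChunk:
    video_id: str
    text: str
    vector: List[float]


def run_pipeline(
    video_ids: Iterable[str],
    fetch_audio: Callable[[str], bytes],     # hypothetical: video id -> raw audio
    transcribe: Callable[[bytes], str],      # hypothetical: audio -> transcript
    embed: Callable[[str], List[float]],     # hypothetical: text -> embedding
    chunk_chars: int = 500,                  # assumed chunk size, in characters
) -> List[EmbeddedChunk]:
    """Run each video through fetch -> transcribe -> embed, one chunk at a time."""
    results: List[EmbeddedChunk] = []
    for video_id in video_ids:
        transcript = transcribe(fetch_audio(video_id))
        # Split the transcript into fixed-size pieces so each gets its own vector.
        for start in range(0, len(transcript), chunk_chars):
            text = transcript[start:start + chunk_chars]
            results.append(EmbeddedChunk(video_id, text, embed(text)))
    return results
```

Passing the stages in as functions keeps the sketch independent of any particular provider; the resulting vectors could then be upserted into whichever store is chosen (Pinecone, pgvector, hnswlib).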