by jt_b 486 days ago
I have tinkered with using DuckDB as a poor man's vector database for a POC and had great results.

One thing I'd love to see is row-group-level metadata statistics for embeddings within a Parquet file: something that would let readers push predicates down to the HTTP request level and avoid loading irrelevant rows from a remote file into the database at all, particularly a file on S3-compatible storage that supports byte-range requests. I'm not sure what the implementation would look like: how to define the sorting algorithm that places "close" rows together, how the metadata would be calculated, or how a reader would use it. But I'd love to be able to apply the same patterns to vector search as with GeoParquet.
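A rough sketch of what that pruning could look like, in plain Python with no Parquet involved. Everything here is an assumption for illustration and not an existing Parquet or DuckDB feature: each "row group" stores a centroid and a radius as its metadata, the rows are assumed to have been clustered before writing, and the names `group_stats` and `groups_to_read` are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def group_stats(rows):
    """Hypothetical per-group metadata: centroid plus radius
    (the largest distance from the centroid to any row in the group)."""
    dim = len(rows[0])
    centroid = [sum(r[i] for r in rows) / len(rows) for i in range(dim)]
    radius = max(dist(centroid, r) for r in rows)
    return {"centroid": centroid, "radius": radius}

def groups_to_read(stats, query, max_dist):
    """Indices of groups that may hold a row within max_dist of query.

    By the triangle inequality, every row in a group is at least
    dist(query, centroid) - radius away from the query, so a group
    whose lower bound exceeds max_dist can be skipped unread.
    """
    return [
        i for i, s in enumerate(stats)
        if dist(query, s["centroid"]) - s["radius"] <= max_dist
    ]

# Made-up data: two "row groups" whose rows were clustered before writing.
groups = [
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]],
    [[5.0, 5.0], [5.1, 5.0], [5.0, 5.2]],
]
stats = [group_stats(g) for g in groups]
query, max_dist = [0.05, 0.05], 0.5
selected = groups_to_read(stats, query, max_dist)
print(selected)
```

In a real reader the `stats` list would come from the file footer, and only the byte ranges of the selected groups would be requested. Whether a centroid-and-radius bound stays tight enough to be useful for high-dimensional embeddings is an open question that this sketch does not answer.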


I thought about this some more and did some research - and found an indexing approach using HNSW, serialized to parquet, and queried from the browser here:

https://github.com/jasonjmcghee/portable-hnsw

This opens up efficient query patterns over larger datasets for RAG projects where you may not have the resources to run an expensive vector database.

Hey, that's my little research project. Let me know if you're interested in chatting about this stuff.

As others have mentioned in other threads, parquet isn't a great tool for the job here, but you could theoretically build a different file format that lends itself better to the problem of static file(s) representing a vector database.