I must be missing something -- why is the size of the documents a factor? If you embedded a document it would become a vector of ~1k floats, and 115k*1k floats is a couple hundred MB, trivial to fit in modern-day RAM.
Embeddings are a type of lossy compression, so roughly speaking, using more embedding bytes for a document preserves more information about what it contains. Typically documents are broken down into chunks, then the embedding for each chunk is stored, so longer documents are represented by more embeddings.
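To make the "longer documents are represented by more embeddings" point concrete, here is a minimal sketch of chunking. It assumes whitespace-separated words as the unit and an optional overlap between chunks; real pipelines often split by model tokens or sentences instead, and `chunk_words` is a hypothetical helper, not something from a particular library.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words. Splitting on whitespace is an
    assumption made for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


short_doc = "one two three"
long_doc = " ".join(str(i) for i in range(10))

# Each chunk would get its own embedding, so the longer document
# ends up with more vectors than the shorter one.
short_chunks = chunk_words(short_doc, chunk_size=4, overlap=1)
long_chunks = chunk_words(long_doc, chunk_size=4, overlap=1)
print(len(short_chunks), len(long_chunks))
```

With a fixed chunk size, the number of stored vectors grows roughly linearly with document length, which is why document size ends up mattering.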
Always felt they're more like hashes/fingerprints for the RAG use cases.
> Typically documents are broken down into chunks
That's what I would have guessed. It's still surprising that the embeddings don't fit into RAM though.
That said (the following I just realized), even if the embeddings don't fit into RAM at the same time, you really don't need to load them all into RAM if you're just performing a linear scan and doing cosine similarity on each of them. Sure it may be slow to load tens of GB of embedding info... but at this rate I'd be wondering what kind of textual data one could feasibly have that goes into the terabyte range. (Also, generating that many embeddings requires a lot of compute!)
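A rough sketch of that linear scan, using only the standard library: vectors are consumed one at a time from an iterator, and only the best k scores are kept in memory. The `toy_rows` generator is a hypothetical stand-in for reading vectors lazily from disk, and the 2-dimensional vectors are toy values, not real embeddings.

```python
import heapq
import math
from typing import Iterable, Iterator, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_streaming(
    query: Vector, rows: Iterable[Tuple[str, Vector]], k: int
) -> List[Tuple[float, str]]:
    """Scan (doc_id, vector) pairs once, holding at most k results in memory."""
    heap: List[Tuple[float, str]] = []  # min-heap: worst kept score sits at heap[0]
    for doc_id, vec in rows:
        score = cosine_similarity(query, vec)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)  # best match first


def toy_rows() -> Iterator[Tuple[str, Vector]]:
    # Hypothetical stand-in for a lazy reader over an on-disk embedding file.
    yield "a", [1.0, 0.0]
    yield "b", [0.0, 1.0]
    yield "c", [1.0, 1.0]
    yield "d", [-1.0, 0.0]


results = top_k_streaming([1.0, 0.1], toy_rows(), k=2)
for score, doc_id in results:
    print(f"{doc_id}: {score:.3f}")
```

Memory use here depends on k and the vector dimension, not on the number of stored vectors, so the trade-off is purely scan time.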
> Always felt they're more like hashes/fingerprints for the RAG use cases.
Yes, I see where you’re coming from. Perceptual hashes[0] are pretty similar; the key is that similar documents should have similar embeddings (unlike cryptographic hashes, where a single bit flip should produce a completely different hash).
Good embeddings encode information spatially. A classic example of embedding arithmetic is king - man + woman = queen[1]. “Concept Sliders” is a cool application of this to image generation [2].
Personally I’ve not had _too_ much trouble with running out of RAM due to embeddings themselves, but I did spend a fair amount of time last week profiling memory usage to make sure I didn’t run out in prod, so it is on my mind!
Each vector is 1536 numbers. I don't know how many bits per number, but I'll assume 64 bits (8 bytes). So total size is 1536 * 115K * 8 / 1024^3, which gives about 1.3GB.
So yes, not a lot.
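The same back-of-the-envelope estimate as a small sketch. The 8-byte width is the assumption stated above; the 4-byte and 2-byte rows are additional assumptions included only for comparison, since the actual storage width isn't known here. This counts raw vector data only, with no index or per-row overhead.

```python
def embedding_bytes(num_vectors: int, dims: int, bytes_per_value: int) -> int:
    """Raw size of a dense embedding matrix, ignoring index and per-row overhead."""
    return num_vectors * dims * bytes_per_value


GIB = 1024 ** 3

# 115K vectors of 1536 numbers each, at three assumed storage widths.
for label, width in (("float64", 8), ("float32", 4), ("float16", 2)):
    size = embedding_bytes(115_000, 1536, width)
    print(f"{label}: {size / GIB:.2f} GiB")
```

Whatever the width turns out to be, the raw vectors stay in the low-GB range or below, so any much larger footprint would have to come from overhead elsewhere.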
I still haven't set it up so I don't know how much space it really will take, but my 40K doc one took 2-3 GB of RAM. It's not a pandas DF but an in-memory DB, so perhaps there's a lot of overhead per row? I haven't debugged.
To be clear, I'm totally fine with your approach if it works. I have very limited time so I was using txtai instead of rolling my own - it's nice to get a RAG up and running in just a few lines of code. But for sure, if the overhead of txtai is really that significant, I'll need to switch to pure pandas.
Going further down the AI == compression path, there’s: http://prize.hutter1.net/