|
|
|
|
|
by spacetime_cmplx
1158 days ago
|
|
Unless you have several hundred million documents, just write a simple encoder that serializes the embedding vectors to a flat binary file. Writing code from scratch to process and search 200k unstructured documents -- parsing, cleaning, chunking, OpenAI embedding API, serialization code, linear search with cosine similarity, and the actual time to debug, test and run all this -- took me less than 3 hours in Go. The flat binary representation of all vectors is under 500 MB. I even went ahead and made it mmap-friendly for the fun of it even though I could read it into all into memory. Even the dumb linear search I wrote takes just 20-30ms per query on my Macbook for the 200k documents. The search results are fantastic. |
|