|
|
|
|
|
by spacetime_cmplx
1156 days ago
|
|
Sure! https://pastebin.com/xm7D1c30 I didn't bother cleaning it so it's just a code dump, but it's fairly straightforward. Not included are a Python script to parse and clean the raw documents into JSON files (used in `summarize` to output results), code to read these files and get the embeddings from OpenAI for use in `newEmbeddingJSON `, and a bunch of random parallelization shell scripts that I didn't save. To use it, I call newDBFromJSON from a directory of JSON embedding vectors and serialize the binary representation. This takes a few minutes mostly because parsing JSON is slow, but you I only needed to do this once. When I need to search for the top 10 documents most similar to document X, I call `search` with the embedding vector for that doc. Alternatively if I need to do semantic search with natural language, I'll call the OpenAI API to get the embedding vector for the query and call `search` with that vector. It's pretty fast thanks to Go concurrency maxing out my CPU. It's super accurate with the search results thanks to OpenAI's embeddings. It's nowhere close to production-ready (it's littered with panics), but it was good enough for me. Hope this helps! Edit: oh and don't use float64 (OpenAI's vectors are float16) |
|