Hacker News
by bebrws 1185 days ago
I tried to implement this using OpenAI's embeddings, then using cosine similarity between the produced vectors instead of the vector DB (as shown in OpenAI's example code here). I do the same thing: I take the highest-ranking code snippets (vectors) and include them in a prompt to ChatGPT along with the original prompt. My code is a mess since I just hacked it together after a long day of work, but it's a small file, around 100 lines.
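
For anyone who wants to see the ranking step in isolation, here is a minimal sketch of it, not the code from my repo. The names `cosine_similarity`, `top_k`, `build_prompt` and `INDEX` are made up for illustration, and the tiny 3-dimensional vectors are hand-written stand-ins for real embeddings, so nothing here calls OpenAI.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k snippet texts whose vectors are closest to query_vec.

    `indexed` is a list of (snippet_text, vector) pairs.
    """
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question, snippets):
    """Put the retrieved snippets ahead of the user's question."""
    context = "\n\n".join(snippets)
    return f"Relevant code:\n\n{context}\n\nQuestion: {question}"


# Toy vectors standing in for real embeddings (assumption for this sketch).
INDEX = [
    ("def parse_config(path): ...", [0.9, 0.1, 0.0]),
    ("def send_email(to, body): ...", [0.0, 0.2, 0.9]),
    ("def load_yaml(path): ...", [0.8, 0.3, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # would be the embedding of the question
best = top_k(query_vec, INDEX, k=2)
prompt = build_prompt("How is the config file read?", best)
```

With a few thousand snippets a linear scan like this is usually fast enough, which is why I skipped the vector DB.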

I used TreeSitter, which I thought was pretty awesome because it can parse a TON of different languages. I had to parse the code to create the individual code snippet strings; I don't want to create a code snippet out of half a function, for example.

So TreeSitter parses the code into an AST, and I send each AST node to OpenAI to get its vector (I optimized this so multiple nodes of the same AST type are combined). Then I send the prompt to OpenAI to get a vector, find the code snippets most similar to the prompt, and include them at the top of a prompt to ChatGPT.
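
To show the chunking idea without the TreeSitter setup, here is a rough sketch that swaps in Python's built-in `ast` module, so it only handles Python source. The helper names `chunk_top_level` and `group_by_type` are hypothetical, and grouping by node type is just one reading of the "combine nodes of the same AST type" optimization.

```python
import ast
from collections import defaultdict

SOURCE = '''
import os

def add(a, b):
    return a + b

class Greeter:
    def hello(self, name):
        return "hi " + name

def sub(a, b):
    return a - b
'''


def chunk_top_level(source):
    """Split source into whole top-level statements, never half a function.

    Returns a list of (node_type_name, source_text) pairs.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        text = ast.get_source_segment(source, node)  # Python 3.8+
        chunks.append((type(node).__name__, text))
    return chunks


def group_by_type(chunks):
    """Merge chunks that share a node type, so fewer strings get embedded."""
    grouped = defaultdict(list)
    for node_type, text in chunks:
        grouped[node_type].append(text)
    return {t: "\n\n".join(texts) for t, texts in grouped.items()}


chunks = chunk_top_level(SOURCE)
grouped = group_by_type(chunks)
```

Each value in `grouped` (or each entry in `chunks`, if you skip the merge) is a string you would send to the embeddings endpoint.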

This is the same idea, right? If anyone's interested, it can be found here: https://bbarrows.com/posts/using-embeddings-ada-and-chatgpt-...

https://github.com/bebrws/openai-search-codebase-and-chat-ab...

1 comment

I implemented this idea as well, but for PDF articles from arXiv. It's one application everyone is doing.

The next level is to select your prompt demonstrations based on the user request. Demonstrations, too, can be chosen by cosine similarity. The more specific they are, the better. You can "train" such a model by adding more demonstrations, especially failing cases (corrected) as demos.
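
A small sketch of what I mean by picking demonstrations per request, under some assumptions: `DemoStore` and `few_shot_prompt` are hypothetical names, the 2-dimensional vectors are toy stand-ins for embeddings of each demo's request, and the demo texts are placeholders.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class DemoStore:
    """Holds (vector, request, answer) demonstrations."""

    def __init__(self):
        self.demos = []

    def add(self, vector, request, answer):
        # "Training" here is just appending another demonstration.
        self.demos.append((vector, request, answer))

    def select(self, query_vec, k=1):
        """Return the k (request, answer) pairs closest to query_vec."""
        ranked = sorted(
            self.demos, key=lambda d: cosine(query_vec, d[0]), reverse=True
        )
        return [(req, ans) for _, req, ans in ranked[:k]]


def few_shot_prompt(demos, user_request):
    """Format the chosen demos as Q/A shots, then the new request."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    return f"{shots}\n\nQ: {user_request}\nA:"


store = DemoStore()
store.add([1.0, 0.0], "Summarize the abstract", "A two-sentence summary ...")
store.add([0.0, 1.0], "List the datasets used", "A bulleted list ...")
# A previously failing case, stored with its corrected answer.
store.add([0.1, 0.9], "Which benchmarks are reported?", "Only the ones named in the results ...")

chosen = store.select([0.0, 1.0], k=2)  # vector of the incoming request
prompt = few_shot_prompt(chosen, "What datasets does the paper evaluate on?")
```

Nothing is fine-tuned; the model only ever sees the few demos nearest to the current request, so each corrected failure you add improves similar requests later.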