Do you know how this compares to [PrivateGPT](https://github.com/imartinez/privateGPT)? I am honestly at the point of choice paralysis with all these new tools.
It's more or less the exact same idea! Use langchain, import an LLM and an embedding model, and query against them. The repo you linked does the exact same thing but uses llama-cpp-python as the backend. I opted to write my own custom LLM class that uses textgen as the API backend so I can use the GPU, since it's way faster. But the new cuBLAS support in llama.cpp is a game changer, so you can use either now. I do find llama.cpp + cuBLAS about 25% slower than pure GPU, which is really good for what it is.
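To give a rough idea of what a custom LLM class over an HTTP backend can look like, here's a minimal sketch. It is not my actual class: the class name `TextGenLLM`, the default URL, the payload fields (`prompt`, `max_new_tokens`, `stopping_strings`), and the `results[0].text` response shape are all assumptions modeled on a text-generation-webui style API, so check them against whatever server you run. It's also plain Python rather than a langchain subclass, to keep it self-contained.

```python
import json
from typing import Callable, List, Optional
from urllib import request


def http_post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response (stdlib only)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


class TextGenLLM:
    """Hypothetical wrapper that forwards prompts to a text-generation server."""

    def __init__(
        self,
        url: str = "http://localhost:5000/api/v1/generate",  # assumed endpoint
        post: Callable[[str, dict], dict] = http_post_json,
    ) -> None:
        self.url = url
        self.post = post  # injectable so the class can be tested offline

    def __call__(
        self, prompt: str, max_tokens: int = 32, stop: Optional[List[str]] = None
    ) -> str:
        stop = stop or []
        payload = {
            "prompt": prompt,
            "max_new_tokens": max_tokens,  # assumed field name
            "stopping_strings": stop,      # assumed field name
        }
        data = self.post(self.url, payload)
        text = data["results"][0]["text"]  # assumed response shape
        # Also cut at stop strings client-side, in case the server ignores them.
        for s in stop:
            idx = text.find(s)
            if idx != -1:
                text = text[:idx]
        return text
```

To use something like this from langchain you'd wrap the same call inside langchain's custom LLM base class; the exact import path has moved between versions, so I've left that part out.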
I get that there are so many choices nowadays and it's overwhelming, but 95% of the repos you'll see just use langchain. For the backend, llama.cpp is your best bet, minus the constant updates that break the quantized models. If you look for TheBloke on huggingface/reddit, you'll find all the best models. Look for the "ggml" ones, which means they're supported by llama.cpp. But like I mentioned, llama.cpp has been making so many model-breaking changes that TheBloke's models are your best bet, because s/he updates really frequently. I personally prefer wizard-vicuna 13B; the uncensored one is pretty damn amazing.
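If you want to check which file format a downloaded model uses before a llama.cpp update bites you, something like the sketch below can help. It assumes the header layout I understand llama.cpp to use at the time of writing: a little-endian uint32 magic (`ggml`, `ggmf`, or `ggjt`), followed by a uint32 version for the two versioned formats. The function names are made up for illustration, and the layout may well change again.

```python
import struct

# Magic numbers as I understand them from llama.cpp at the time of writing
# (assumption: little-endian uint32 at offset 0 of the model file).
MAGICS = {
    0x67676D6C: "ggml",  # original format, no version field
    0x67676D66: "ggmf",  # versioned format
    0x67676A74: "ggjt",  # versioned, mmap-friendly format
}


def parse_ggml_header(head: bytes):
    """Return (format_name, version or None) from the first 8 bytes of a file."""
    if len(head) < 4:
        raise ValueError("header too short")
    (magic,) = struct.unpack("<I", head[:4])
    name = MAGICS.get(magic)
    if name is None:
        raise ValueError(f"unknown magic 0x{magic:08x}")
    if name == "ggml":
        return name, None  # unversioned, nothing more to read
    if len(head) < 8:
        raise ValueError("header too short for version field")
    (version,) = struct.unpack("<I", head[4:8])
    return name, version


def read_ggml_header(path: str):
    """Read and parse the header of a model file on disk."""
    with open(path, "rb") as f:
        return parse_ggml_header(f.read(8))
```

Comparing the reported format and version against what your llama.cpp build expects tells you whether you need to grab a re-quantized file.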
Here's some example output showing how fast it runs 13B on a 3090 with a Ryzen 9 5900X:

    In [5]: output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
    llama_print_timings: load time = 209.23 ms
    llama_print_timings: sample time = 11.39 ms / 32 runs ( 0.36 ms per token)
    llama_print_timings: prompt eval time = 209.16 ms / 15 tokens ( 13.94 ms per token)
    llama_print_timings: eval time = 1806.98 ms / 31 runs ( 58.29 ms per token)
    llama_print_timings: total time = 3033.91 ms
    In [6]: print(output)
    {'id': '', 'object': 'text_completion', 'created': 1684604167, 'model': './models/Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin', 'choices': [{'text': 'Q: Name the planets in the solar system? A: 1. Mercury, 2. Venus, 3. Earth, 4. Mars, 5. Jupiter, 6. Saturn', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 15, 'completion_tokens': 32, 'total_tokens': 47}}

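If you'd rather read those timings as tokens per second, here's a small sketch that parses the `llama_print_timings` lines above. The regex only assumes the line layout shown in this output, and `tokens_per_second` is just a helper name I made up; other llama.cpp versions may print the timings differently.

```python
import re

# Timing lines copied from the output above.
LOG = """\
llama_print_timings: load time = 209.23 ms
llama_print_timings: sample time = 11.39 ms / 32 runs ( 0.36 ms per token)
llama_print_timings: prompt eval time = 209.16 ms / 15 tokens ( 13.94 ms per token)
llama_print_timings: eval time = 1806.98 ms / 31 runs ( 58.29 ms per token)
llama_print_timings: total time = 3033.91 ms
"""

TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<name>[a-z ]+?)\s+time\s*=\s*(?P<ms>[\d.]+)\s+ms"
    r"(?:\s*/\s*(?P<count>\d+)\s+(?:runs|tokens))?"
)


def tokens_per_second(log: str) -> dict:
    """Map each timed stage that reports a count to its tokens-per-second rate."""
    rates = {}
    for m in TIMING_RE.finditer(log):
        if m.group("count") is None:
            continue  # "load" and "total" lines carry no token count
        seconds = float(m.group("ms")) / 1000.0
        rates[m.group("name")] = int(m.group("count")) / seconds
    return rates


if __name__ == "__main__":
    for stage, rate in tokens_per_second(LOG).items():
        print(f"{stage}: {rate:.1f} tokens/s")
```

For this run the `eval` stage works out to roughly 17 tokens per second, which matches the 58.29 ms per token in the log.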
Ditto. There are so many LLMs out there now. Is there a website where these are aggregated and ranked? Man, I've been too busy to keep on top of these developments.