|
|
|
|
|
by zzleeper
1045 days ago
|
|
Do you have any suggestions about how to start implementing something like this in-house? I'm sitting on thousands of PDFs (that can be trivially turned into text) and it would be really useful to train an LLM on them for information retrieval. But the dev and computing cost of this feels so huge that I'm not even sure where to start. |
|
While it really resonated with my management I felt worried I wouldn't be able to replicate these kind of results on other projects.
THE ONLY REAL ADVICE I CAN GIVE ON AI PROJECTS IS . . . don't let your managements expectation of LLMs out weigh its capabilities.
I'm sure I speak for many people here when your non-tech fluent directors get together and think GPT4 is some sort of deity. GPT4 smart (or used to be at least) ill give it that, but small locally hosted 7b/13b LLMs are very limited and people for whatever reason get AI infatuation the second they finally see you show direct value in it they will lose there shit in its assumed capabilities. you got to be direct with them that no matter what dumb video they saw on Sam Altman, what your are proposing is not that. Be very clear in its possible scope because there is some idiot in our organization that will assume assume you can programmatically answer prayers. I actually had this guy from our networking team try and raise a concern about the LLM going sentient and us having a "Skynet" problem. granted this was back in march/2023 so AI histira was a little more rampant but still.
tl;dr my recommendation for your pdf project is run https://github.com/oobabooga/text-generation-webui. if your can get a 30 series GPU in your company Then run a 13B 4bit model that can pull info, assign tags, run minor analysis on your text. else find a spare 16gb machine and do the same but but over a longer time scale.
run a prompt that checks for hallucinations. "does the following text make sense? previous prompt + text if yes then keep else make intern do it.
GPT-j-7b is still one of the best models because it has indexing & categorizing at the main prosperous. other models are great but core idea behind LLMs is that its just a high level auto complete