Hacker News new | ask | show | jobs
by wcedmisten 1039 days ago
How accurate is an LLM for this task? I was thinking of using one for analyzing free form PDF text to find a specific element, but I was worried about hallucinations.
2 comments

Extractive tasks are part of where LLMs shine, and where you get the least amount of hallucination as long as you fine-tune your model.

By fine-tuning the model to extract a specific desired output from the text you give it, it learns that the output always comes from the input, and so you get less random outputs than just by prompting an instruction-tuned model (which was fine-tuned to find the answer in its weights, instead of copying it from the input).

I'm pretty ignorant on which is the best self hosted LLM for such a task or how to fine-tune it. Do you know of any resources on how to set that up?

It seems like llama2 is the biggest name on HN when it comes to self hosting but I have no idea how it actually performs.

You could just try it out if you have the hardware at home.

Grab KoboldCPP and a GGML model from TheBloke that would fit your RAM/VRAM and try it.

Make sure you follow the prompt structure for the model that you will see on TheBloke's download page for the model (very important).

KoboldCPP: https://github.com/LostRuins/koboldcpp

TheBloke: https://huggingface.co/TheBloke

I would start with a 13b or 7b model quantized to 4-bits just to get the hang of it. Some generic or story telling model.

Just make sure you follow the prompt structure that the model card lists.

KoboldCPP is very easy to use. You just drag the model file onto the executable, wait till it loads and go to the web interface.

Won't you run out of context size though? The older models only went up to 2000 tokens, newer ones up to 16k.

Ie how do you feed the LLM the text along with your question without it forgetting most of the text? I assume the text you want to feed it is longer than 16,000 words.

For my use-case the PDFs are only a few pages long generally, so I think the 16k word limit would be well within my needs. I'm trying to find a list of device names from an FDA 510k summary (for medical device clearances). Currently I'm doing this manually and it's quite time consuming. I have around 15,000 PDFs to get through manually, but it's pretty slow work.