| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by daemonologist 681 days ago

This package seems to use llama_cpp for local inference [1] so you can probably use anything supported by that [2]. However, I think it's just passing OCR output for correction - the language model doesn't actually see the original image.

That said, there are some large language models you can run locally which accept image input. Phi-3-Vision [3], LLaVA [4], MiniCPM-V [5], etc.

[1] - https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...

[2] - https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#de...

[3] - https://huggingface.co/microsoft/Phi-3-vision-128k-instruct

[4] - https://github.com/haotian-liu/LLaVA