| HN Mirror



	by sannysanoff 681 days ago
	What are some examples of local LLMs that accept images, as mentioned in the README?

3 comments

daemonologist 681 days ago

This package seems to use llama_cpp for local inference [1] so you can probably use anything supported by that [2]. However, I think it's just passing OCR output for correction - the language model doesn't actually see the original image.
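To illustrate what "just passing OCR output for correction" would look like, here is a minimal sketch. It assumes nothing about the package's actual code: `build_correction_prompt` and the prompt wording are hypothetical, made up for this example.

```python
def build_correction_prompt(ocr_text: str) -> str:
    """Wrap raw OCR text in a text-only correction prompt (hypothetical wording)."""
    return (
        "Fix OCR errors in the text below. Do not add new content.\n\n"
        f"{ocr_text.strip()}\n\nCorrected text:"
    )


raw = "Tbe qu1ck brown f0x"  # made-up OCR output, for illustration only
prompt = build_correction_prompt(raw)

# `prompt` is a plain string: in this kind of setup the model receives
# only the OCR text, never the pixels of the original image.
print(prompt)
```

In a pipeline like this, any text-only model would do, because no image ever reaches the model.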

That said, there are some large language models you can run locally which accept image input. Phi-3-Vision [3], LLaVA [4], MiniCPM-V [5], etc.

[1] - https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...

[2] - https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#de...

[3] - https://huggingface.co/microsoft/Phi-3-vision-128k-instruct

[4] - https://github.com/haotian-liu/LLaVA

[5] - https://github.com/OpenBMB/MiniCPM-V


michaelt 681 days ago

LLaVA is one LLM that takes both text and images as inputs - https://llava-vl.github.io/

LLaVA specifically might not be great for OCR, though; IIRC it scales all input images to 336 x 336, meaning it'll only spot details that are still visible at that scale.

You can also search on HuggingFace for the tag "image-text-to-text" https://huggingface.co/models?pipeline_tag=image-text-to-tex... and find a variety of other models.
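To put the 336 x 336 point in numbers, here is a rough back-of-the-envelope sketch. The page size, the letter height, and the "fit the long side to 336" resize rule are all assumptions for illustration; the model's real preprocessing may pad, crop, or squash differently.

```python
def downscaled_text_height(page_w: int, page_h: int, text_h: float,
                           target: int = 336) -> float:
    """Text height in pixels after fitting the page's long side to `target`."""
    scale = target / max(page_w, page_h)
    return text_h * scale


# Assumed example: a US Letter page scanned at 300 dpi (2550 x 3300 px)
# with letters about 40 px tall (roughly 10 pt body text).
h = downscaled_text_height(2550, 3300, 40)
print(f"{h:.1f} px")  # prints "4.1 px"
```

At about 4 pixels per letter, ordinary body text on a full-page scan is unlikely to stay legible, which matches the concern above.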


katzinsky 681 days ago

I've had very poor results using LLaVA for OCR. It's slow and usually can't transcribe more than a few words. I think this is because it's just using CLIP to encode the image into a single embedding vector for the LLM.

The latest architecture is supposed to improve this, but there are better architectures if all you want is OCR.


eigenvalue 681 days ago

This is the best I've found so far:

https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf

But I see that this new one just came out using Llama 3.1 8B:

https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-fin...
