by simonw 925 days ago
"While there are many OSS for Loading personal data, they dont do images or videos"

Local models for images are getting pretty good.

LLaVA is an LLM with multi-modal image capabilities that runs pretty well on my laptop: https://simonwillison.net/2023/Nov/29/llamafile/

Models like Salesforce BLIP can be used to generate captions for images too - I built a little CLI tool for that here: https://github.com/simonw/blip-caption
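
For anyone who wants to try BLIP captioning directly from Python, here is a minimal sketch using the Hugging Face transformers library. It is not necessarily how blip-caption works internally. The checkpoint name and the `caption_image` helper are assumptions chosen for illustration, and the first run downloads the model weights.

```python
import sys

# Assumption: the smaller public BLIP captioning checkpoint on the Hugging Face hub.
MODEL_ID = "Salesforce/blip-image-captioning-base"


def caption_image(path: str, max_new_tokens: int = 30) -> str:
    """Return a short generated caption for the image at `path` (hypothetical helper)."""
    # Heavy imports live inside the function so importing this file stays cheap.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(MODEL_ID)
    model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

    # BLIP expects RGB input; convert in case the file is grayscale or has alpha.
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # Generate caption token ids, then decode them back to text.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    print(caption_image(sys.argv[1]))
```

Saved as, say, caption.py, this would be run as `python caption.py photo.jpg` after `pip install transformers torch pillow`.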

1 comments

CogVLM blows LLaVA out of the water, although it needs a beefier machine (quantized low-res version barely fits into 12GB VRAM, not sure about the accuracy of that).
I have no actual knowledge in this area, so I'm not sure if it's entirely relevant, but an update from the 7th of December on the CogVLM repo says it now works with 11GB of VRAM.
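
As a rough sanity check on those VRAM figures, here is a back-of-envelope sketch of how much memory the weights alone would take at different precisions. The 17B parameter count and the bit widths are assumptions, not something stated in this thread, and real usage is higher once activations, image features and framework overhead are added.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 17 billion parameters in total.
ASSUMED_PARAMS = 17e9

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(ASSUMED_PARAMS, bits):.1f} GB")
```

Under these assumptions only the 4-bit case leaves any headroom on an 11GB or 12GB card, which would be consistent with "barely fits".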