Hacker News
by johnvanommen 2 days ago
> Indeed, Gemini really is incredible at image analysis. Yesterday I pointed it at some sloppy handwritten notes and asked it to add up the numbers in the right column, and it did it no problem. I've also used it to find out what TV show or actor is on screen, and various other things. It's quite impressive.

I do not know if it works as well as Gemini, but Salesforce (of all places) has a model that does something similar.

What's "neat" about the Salesforce one is that you can run it locally and just iterate it over as many images as you feel like.
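A rough sketch of that "iterate it over a folder" loop, assuming the `transformers`, `torch`, and `Pillow` packages are installed and using the BLIP checkpoint linked further down. The helper names (`load_blip_captioner`, `caption_folder`) and the list of image suffixes are made up for illustration; the processor/model calls follow the usage shown on the model card.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumption: these are the image types we care about.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def load_blip_captioner(
    model_id: str = "Salesforce/blip-image-captioning-base",
) -> Callable[[Path], str]:
    """Load BLIP once and return a function that captions one image file."""
    # Imported lazily so caption_folder() can be used without these packages.
    import torch
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id).to(device)

    def caption(path: Path) -> str:
        image = Image.open(path).convert("RGB")
        inputs = processor(image, return_tensors="pt").to(device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=30)
        return processor.decode(out[0], skip_special_tokens=True)

    return caption


def caption_folder(folder: Path, caption_fn: Callable[[Path], str]) -> Dict[str, str]:
    """Caption every image directly inside `folder`, keyed by file name."""
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            results[path.name] = caption_fn(path)
    return results
```

Usage would be something like `caption_folder(Path("my_images"), load_blip_captioner())`. The captioner is passed in as a plain function, so a different local model can be swapped in without touching the loop.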

For instance, it should be possible to take a movie, pull a hundred frames out of the H.265 file, have the Salesforce model describe what is happening at each of those moments, and then use the descriptions to build an index.
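Here is roughly how that movie-index idea could be wired up. This is a sketch, not something I have benchmarked: it assumes `ffmpeg` and `ffprobe` are on the PATH, and every function name here (`sample_timestamps`, `build_index`, `search`, and so on) is hypothetical. `caption_fn` stands in for whatever captioning model you run locally.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Tuple


def sample_timestamps(duration_s: float, n: int) -> List[float]:
    """Return n evenly spaced timestamps, each in the middle of its slice."""
    if n <= 0:
        raise ValueError("n must be positive")
    step = duration_s / n
    return [step * (i + 0.5) for i in range(n)]


def video_duration(video: Path) -> float:
    """Ask ffprobe for the container duration in seconds."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", str(video)],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())


def extract_frame(video: Path, t: float, dest: Path) -> Path:
    """Grab a single frame at t seconds; ffmpeg handles the H.265 decoding."""
    subprocess.run(
        ["ffmpeg", "-v", "error", "-y", "-ss", f"{t:.3f}", "-i", str(video),
         "-frames:v", "1", str(dest)],
        check=True,
    )
    return dest


def build_index(
    video: Path, out_dir: Path, caption_fn: Callable[[Path], str], n: int = 100
) -> List[Tuple[float, str]]:
    """Caption n sampled frames and return (timestamp, caption) pairs."""
    out_dir.mkdir(parents=True, exist_ok=True)
    index: List[Tuple[float, str]] = []
    for i, t in enumerate(sample_timestamps(video_duration(video), n)):
        frame = extract_frame(video, t, out_dir / f"frame_{i:04d}.jpg")
        index.append((t, caption_fn(frame)))
    return index


def search(index: List[Tuple[float, str]], term: str) -> List[Tuple[float, str]]:
    """Case-insensitive substring search over the captions."""
    term = term.lower()
    return [(t, c) for t, c in index if term in c.lower()]
```

Sampling at fixed intervals is the simplest option; scene-change detection would probably give a better index, but that is a different technique from what is described above.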

That's just ONE use for it, and I can think of dozens.

On a 5090 it was able to generate text descriptions of a folder full of approximately 500 images in under a minute. (Anecdotal evidence, admittedly.)

https://huggingface.co/Salesforce/blip-image-captioning-base

I just looked up some articles on it here, and it looks like it's fairly old, so YMMV.

1 comment

There is a newer BLIP-2, but it's also fairly old. You're better off with one of the many other local models, such as Moondream 3: https://huggingface.co/moondream/moondream3-preview

Moondream is great because it can point, count, draw bounding boxes, write descriptions, and do visually grounded reasoning.