| HN Mirror

	by banana_giraffe 411 days ago
	I used this to create keywords and descriptions for a bunch of photos from a recent trip, using Gemma3 4b. It works impressively well, including doing basic OCR to give me summaries of photos of text, and picking up context clues to figure out where many of the pictures were taken. Very nice for something that's self-hosted.

2 comments

accrual 411 days ago

That's pretty neat. Do you essentially loop over a list of images and run the prompt for each, then store the result somewhere (metadata, sqlite)?

banana_giraffe 411 days ago

Yep, exactly: I just looped through each image with the same prompt and stored the results in a SQLite database, so I can search through them and maybe present them in something more than a simple web UI in the future.

If you want to see, here it is:

https://gist.github.com/Q726kbXuN/f300149131c008798411aa3246...

Here's an example of the kind of detail it built up for me for one image:

https://imgur.com/a/6jpISbk

It's wrapped up in a bunch of POC code around talking to LLMs, so it's very very messy, but it does work. Probably will even work for someone that's not me.
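
For readers who don't want to dig through the gist, here is a minimal sketch of the loop described above. It is not the code from the gist. The `describe` callable is a hypothetical stand-in for whatever sends one image plus the prompt to the model and returns its text, and the `images` table layout and the suffix list are assumptions made for illustration.

```python
import sqlite3
from pathlib import Path
from typing import Callable

# Assumed set of file types to treat as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def index_images(folder: Path, conn: sqlite3.Connection,
                 describe: Callable[[Path], str]) -> int:
    """Run `describe` on each new image in `folder` and store the text.

    Returns the number of images added on this run.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "path TEXT PRIMARY KEY, description TEXT NOT NULL)"
    )
    added = 0
    for path in sorted(folder.iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # Skip images that were already processed on an earlier run.
        seen = conn.execute(
            "SELECT 1 FROM images WHERE path = ?", (str(path),)
        ).fetchone()
        if seen:
            continue
        conn.execute(
            "INSERT INTO images (path, description) VALUES (?, ?)",
            (str(path), describe(path)),
        )
        conn.commit()  # commit per image so an interrupted run loses little
        added += 1
    return added
```

Keying the table on the path makes the loop safe to re-run: images that already have a row are skipped, so the model is only called for new files.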

wisdomseaker 411 days ago

Nice! How complicated do you think it would be to do summaries of all photos in a folder, i.e., say, for a collection of holiday photos or after an event where images are grouped?

banana_giraffe 411 days ago

Very simple. You could either do what I did, and ask for details on each image, then ask for some sort of summary of the group of summaries, or just throw all the images at it in one go:

https://imgur.com/a/1IrCR97

I'm sure there's a context limit if you have enough images, where you need to start map-reducing things, but even that wouldn't be too hard.
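
The map-reduce idea could look roughly like this: summarize the per-image descriptions in batches, then summarize the batch summaries, and repeat until everything fits in one call. This is only a sketch. `summarize` is a hypothetical callable that sends a list of texts to the model and returns one summary, and `batch_size` is an assumed stand-in for however many descriptions fit in the context window.

```python
from typing import Callable, List


def summarize_group(descriptions: List[str],
                    summarize: Callable[[List[str]], str],
                    batch_size: int = 20) -> str:
    """Reduce many per-image descriptions to one summary, batch by batch."""
    if batch_size < 2:
        # A batch of one would never shrink the list.
        raise ValueError("batch_size must be at least 2")
    if not descriptions:
        return ""
    level = list(descriptions)
    # Each pass replaces every batch of texts with a single summary,
    # so the list keeps shrinking until it fits in one call.
    while len(level) > batch_size:
        level = [
            summarize(level[i:i + batch_size])
            for i in range(0, len(level), batch_size)
        ]
    return summarize(level)
```

With a small collection the loop never runs and this is just one call over all the descriptions, so the same function covers both cases.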

wisdomseaker 411 days ago

Thanks for the reply, I'll see if I can work it out :)

sorenjan 411 days ago

You might want to extract the location from the image EXIF data and include it in the prompt as well. There are reverse geocoding libraries and services that take coordinates and return a city, which would probably make for a better summary of a trip.
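
As a rough sketch of the two pieces on either side of the geocoder: EXIF stores each GPS coordinate as degrees, minutes, and seconds plus a hemisphere reference (N/S/E/W), while geocoders usually expect signed decimal degrees. Reading the tags out of the file and the reverse geocoding call itself are not shown here, since both depend on whichever library or service you pick. The function names and the wording added to the prompt are made up for illustration.

```python
from typing import Optional, Sequence


def dms_to_decimal(dms: Sequence[float], ref: str) -> float:
    """Convert EXIF-style (degrees, minutes, seconds) to decimal degrees."""
    degrees, minutes, seconds = (float(v) for v in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative.
    return -value if ref.upper() in ("S", "W") else value


def prompt_with_location(base_prompt: str, place: Optional[str]) -> str:
    """Append a location hint to the prompt when a place name is known."""
    if not place:
        # No GPS data or no geocoder result: leave the prompt unchanged.
        return base_prompt
    return f"{base_prompt}\nThe photo was taken in or near {place}."
```

The decimal latitude and longitude would go to the reverse geocoder, and the place name it returns would go into `prompt_with_location` before the image is sent to the model.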

buyucu 410 days ago

Is Gemma 4b good enough for this? I was playing with larger versions of Gemma because I didn't think 4b would be any good.

banana_giraffe 410 days ago

It certainly seemed good enough for my use. I fed it some random images I found online; you can see the sort of metadata it outputs in a static dump here:

https://q726kbxun.github.io/llama_cpp_vision/index.html

It's not perfect, by any means, but between the keywords and description text, it's good enough for me to be able to find images in a larger collection.
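
Finding images this way can be as simple as a substring match over both fields. A minimal sketch, assuming a hypothetical `images` table with `path`, `keywords`, and `description` columns (the actual schema in the project may differ):

```python
import sqlite3
from typing import List


def find_images(conn: sqlite3.Connection, term: str) -> List[str]:
    """Return paths whose keywords or description contain `term`."""
    # Escape LIKE wildcards so a literal % or _ in the term matches itself.
    escaped = (
        term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    pattern = f"%{escaped}%"
    rows = conn.execute(
        "SELECT path FROM images "
        "WHERE keywords LIKE ? ESCAPE '\\' "
        "OR description LIKE ? ESCAPE '\\' "
        "ORDER BY path",
        (pattern, pattern),
    )
    return [row[0] for row in rows]
```

SQLite's `LIKE` ignores case for ASCII letters by default, so searching for "sunset" also matches "Sunset". That forgiving match is part of why imperfect descriptions are still useful for search.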
