	by petercooper 872 days ago
	It's the multimodal input capability that seems to be of value here – see the transcript at https://2mb.codes/~cmb/ollama-bot/#chat-transcript. Namely, being able to interrogate images in a verbal fashion, such that someone without sight (or perhaps even someone who just doesn't want to see an image) can get an appreciation for their contents.

2 comments

blindgeek 872 days ago

Yes, the image interrogation is exactly the point. This all started out when my friend said that it would be cool to be able to chat on IRC with an LLM running on his own hardware. And then we were like, oh hey, we can get this thing to describe images for us if we use an LMM.

The next thing we want to do is obtain some glasses with cameras and wi-fi and send images to ollama from them for real-time description. The benefits are obvious, especially for mobility purposes.

jpsouth 872 days ago

This is so cool. I’d ask how it works, but I feel like I wouldn’t understand at a fundamental level, even if I read through your codebase. Interpreting an image in the concept of a machine baffles me, it doesn’t have eyes. It surely can’t sense light like humans can. It can’t possibly understand depth (the sofa is in the far left background?!). It can’t know what a goatee is, based on some pixels that are mildly different colours than the skin or background. These are all assumptions I’ve made coming into this, and I am relatively sure I’m wrong at this stage.

If you’d like to briefly post how it works, though, I’m sure a lot of HN denizens would appreciate it. I’ll just stand on the sidelines, post this, watch the commentary, and try it myself with a small group.

blindgeek 872 days ago

To be completely honest, I don't really know what I'm doing. The IRC bot I wrote isn't complicated at all; it basically just acts as a bridge between IRC and a program that has an HTTP API. FWIW I've never written an IRC bot before, so this is "baby's first bot". I also wrote it in Go, even though I'm not a Go programmer. Probably all of that shines through in the code.

The real magic happens in [ollama](https://ollama.ai/), which lets you run LMMs locally.

justsomehnguy 872 days ago

> Interpreting an image in the concept of a machine baffles me, it doesn’t have eyes

Your mistake here is thinking that the machine understands anything. It doesn't. But if you know how human learning works, and what compression and lossy compression are, then it is quite easy to understand.

The machine is fed tons of images along with words describing what is in each image. It then finds what is similar across images of similar objects, i.e. it works much like a compression algorithm, except that it doesn't store exact matches but relationships between markers it finds in the images. That's why it doesn't understand, and doesn't need to understand, where the sofa is or what a sofa is: it just has a relationship between something associated with the word 'sofa' and something that we humans describe as 'position'.

carom 872 days ago

Have you tried ChatGPT yet? It can describe images quite well.

rolltrunhert 872 days ago

It doesn't quite fit the bill of running on their own hardware.

loa_in_ 872 days ago

There's already a thing like this from Google. It's called Lookout, I think.

jpsouth 872 days ago

Thanks! I saw that bit but honestly, skipped past it due to being on mobile and assuming it was just a list of commands. Must’ve missed the header!

Very impressed with the capability here given that transcript, I’ll certainly try it myself. Thank you!

baffled 872 days ago

You can give the current version a test drive at irc.oftc.net channel #speakup. The journey has been fun so far.
