

	by blindgeek 910 days ago
	Yes, the image interrogation is exactly the point. This all started out when my friend said that it would be cool to be able to chat on IRC with an LLM running on his own hardware. And then we were like, oh hey, we can get this thing to describe images for us if we use an LMM. The next thing we want to do is obtain some glasses with cameras and wi-fi and send images to ollama from them for real-time description. The benefits are obvious, especially for mobility purposes.

2 comments

jpsouth 910 days ago

This is so cool. I’d ask how it works, however I feel like I wouldn’t understand at a fundamental level, even if I read through your codebase. Interpreting an image in the concept of a machine baffles me, it doesn’t have eyes. It surely can’t sense light like humans can. It can’t possibly understand depth (the sofa is in the far left background?!). It can’t know what a goatee is, based on some pixels that are mildly different colours than the skin or background. These are all assumptions I’ve made coming into this, and I am relatively sure I’m wrong at this stage.

If you’d like to post a brief explanation, though, I’m sure a lot of HN denizens would appreciate it. I’ll just stand on the sidelines, post this, watch the commentary, and try it myself with a small group.

blindgeek 910 days ago

To be completely honest, I don't really know what I'm doing. The IRC bot I wrote isn't complicated at all; it basically just acts as a bridge between IRC and a program that has an HTTP API. FWIW I've never written an IRC bot before, so this is "baby's first bot". I also wrote it in Go, even though I'm not a Go programmer. Probably all of that shines through in the code.

The real magic happens in [ollama](https://ollama.ai/), which lets you run LMMs locally.
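For readers curious what such a bridge might look like, here is a minimal sketch in Go of the HTTP side only. It is not the bot's actual code. It assumes ollama is listening locally and exposes its `/api/generate` endpoint, which takes a JSON body with `model`, `prompt`, base64-encoded `images`, and `stream` fields. The model name `llava`, the base URL, and the helper names `buildBody` and `describe` are illustrative assumptions.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"net/http"
)

// generateRequest is the JSON body sent to ollama's /api/generate endpoint.
type generateRequest struct {
	Model  string   `json:"model"`
	Prompt string   `json:"prompt"`
	Images []string `json:"images,omitempty"` // base64-encoded image bytes
	Stream bool     `json:"stream"`
}

// generateResponse holds the only field of the reply this sketch reads.
type generateResponse struct {
	Response string `json:"response"`
}

// buildBody (hypothetical helper) encodes a prompt and an optional image
// as a non-streaming generate request.
func buildBody(model, prompt string, image []byte) ([]byte, error) {
	req := generateRequest{Model: model, Prompt: prompt, Stream: false}
	if len(image) > 0 {
		req.Images = []string{base64.StdEncoding.EncodeToString(image)}
	}
	return json.Marshal(req)
}

// describe (hypothetical helper) posts the request and returns the model's
// text, which an IRC bot could then relay back to the channel.
func describe(baseURL, model, prompt string, image []byte) (string, error) {
	body, err := buildBody(model, prompt, image)
	if err != nil {
		return "", err
	}
	resp, err := http.Post(baseURL+"/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("ollama returned %s", resp.Status)
	}
	var out generateResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Response, nil
}

func main() {
	// Only build and print the request here, so the sketch runs without a server.
	body, err := buildBody("llava", "Describe this image.", []byte("placeholder bytes"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(body))
}
```

With a server running, something like `describe("http://localhost:11434", "llava", "Describe this image.", imageBytes)` would return the description as a string. The port is assumed to be ollama's default.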

justsomehnguy 909 days ago

> Interpreting an image in the concept of a machine baffles me, it doesn’t have eyes

Your mistake here is thinking that the machine understands anything. It doesn't. But if you know how human learning works, and what compression and lossy compression are, then it is quite easy to understand.

The machine is fed tons of images along with words describing what is in each image. It then finds what images of similar objects have in common, i.e. it works much like a compression algorithm, except that it stores not exact matches but relationships among the markers it finds in the images. That's why it doesn't understand, and doesn't need to understand, where the sofa is or what a sofa is: it just has a relationship between something associated with the word 'sofa' and something associated with what we humans describe as 'position'.
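One way to picture that kind of "relationship" is as closeness between vectors. The toy sketch below is not how any particular model is implemented: the three-number vectors are invented for illustration, and `cosine` and `bestLabel` are hypothetical helpers. It only shows the idea that an image's learned features can land nearer to the features tied to one word than to another, with no notion of what a sofa is.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two vectors of equal length.
// It returns 0 if either vector has zero length.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// bestLabel returns the word whose vector is most similar to the image vector.
func bestLabel(image []float64, labels map[string][]float64) string {
	best, bestScore := "", math.Inf(-1)
	for name, vec := range labels {
		if s := cosine(image, vec); s > bestScore {
			best, bestScore = name, s
		}
	}
	return best
}

func main() {
	// Made-up feature vectors; a real model learns these from training data.
	labels := map[string][]float64{
		"sofa":   {0.9, 0.1, 0.0},
		"goatee": {0.0, 0.2, 0.9},
	}
	image := []float64{0.8, 0.2, 0.1} // pretend features extracted from a photo
	fmt.Println(bestLabel(image, labels))
}
```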

carom 910 days ago

Have you tried ChatGPT yet? It can describe images quite well.

rolltrunhert 910 days ago

It doesn't quite fit the bill of running on their own hardware.

loa_in_ 910 days ago

There's already a thing like this from Google. It's called Lookout, I think.
