by blindgeek 862 days ago
Yes, the image interrogation is exactly the point. This all started out when my friend said that it would be cool to be able to chat on IRC with an LLM running on his own hardware. And then we were like, oh hey, we can get this thing to describe images for us if we use an LMM.

The next thing we want to do is obtain some glasses with cameras and wi-fi and send images to ollama from them for real-time description. The benefits are obvious, especially for mobility purposes.

2 comments

This is so cool. I’d ask how it works, however I feel like I wouldn’t understand at a fundamental level, even if I read through your codebase. Interpreting an image in the concept of a machine baffles me, it doesn’t have eyes. It surely can’t sense light like humans can. It can’t possibly understand depth (the sofa is in the far left background?!). It can’t know what a goatee is, based on some pixels that are mildly different colours than the skin or background. These are all assumptions I’ve made coming into this, and I am relatively sure I’m wrong at this stage.

If you’d like to briefly post an explanation, I’m sure a lot of HN denizens would appreciate it. I’ll just stand on the sidelines, post this, watch the commentary, and try it myself with a small group.

To be completely honest, I don't really know what I'm doing. The IRC bot I wrote isn't complicated at all; it basically just acts as a bridge between IRC and a program that has an HTTP API. FWIW I've never written an IRC bot before, so this is "baby's first bot". I also wrote it in Go, even though I'm not a Go programmer. Probably all of that shines through in the code.

The real magic happens in [ollama](https://ollama.ai/), which lets you run LMMs locally.

> Interpreting an image in the concept of a machine baffles me, it doesn’t have eyes

Your mistake here is thinking that the machine understands anything. It doesn't. But if you know how human learning works, and what compression and lossy compression are, then it is quite easy to understand.

The machine is fed tons of images, each with words describing what is in it. It then finds what images of similar objects have in common, i.e. it works much like a compression algorithm, except that it doesn't store exact matches but relationships between markers it finds in the images. That's why it doesn't understand, and doesn't need to understand, where the sofa is or what a sofa is. It just has a relationship between something associated with the word 'sofa' and something associated with what we humans describe as 'position'.

Have you tried ChatGPT yet? It can describe images quite well.
It doesn't quite fit the bill of running on their own hardware
There's already a thing like this from Google. It's called Lookout, I think.