| HN Mirror

	by testbjjl 3 days ago
	For me, DeepSeek interpreting the screenshots and images I send it, at a fraction of what I pay for Claude and ChatGPT, is a far higher priority than supporting dictation. There are workarounds for dictation but not for image processing.

3 comments

anthonypasq 3 days ago

Just use one of the various cheap Gemini models.

freedomben 3 days ago

Indeed, Gemini really is incredible at image analysis. Yesterday I pointed it at some sloppy handwritten notes and asked it to add up the numbers in the right column, and it did it no problem. I've also used it to find out what TV show or actor is on screen, and various other things. It's quite impressive.

johnvanommen 2 days ago

> Indeed, Gemini really is incredible at image analysis. Yesterday I pointed it at some sloppy handwritten notes and asked it to add up the numbers in the right column, and it did it no problem. I've also used it to find out what TV show or actor is on screen, and various other things. It's quite impressive.

I do not know if it works as well as Gemini, but Salesforce (of all places) has a model that does something similar.

What's "neat" about the Salesforce one is that you can run it locally and just iterate it over as many images as you feel like.

For instance, it should be possible to take a movie, pull a hundred images out of the H.265 file, have the Salesforce model describe what is happening at each of those moments, and then use the descriptions to create an index.

That's just ONE use for it, and I can think of dozens.

On a 5090 it was able to generate text descriptions of a folder full of approximately 500 images in under a minute. (Anecdotal evidence, admittedly.)

https://huggingface.co/Salesforce/blip-image-captioning-base

I just looked up some articles on it here, and it looks like it's fairly old, so YMMV.
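
A minimal sketch of the movie-index idea above, assuming evenly spaced sampling and a plain word-to-timestamp index. The function names (`sample_timestamps`, `ffmpeg_frame_command`, `build_index`) and the `caption_at` callback are hypothetical, not part of any library; the captioning model itself is deliberately left out, so `caption_at` stands in for whatever local model produces the description.

```python
import re
from typing import Callable, Dict, List


def sample_timestamps(duration_s: float, n_frames: int) -> List[float]:
    """Return n_frames timestamps (seconds) at the midpoints of equal segments."""
    if duration_s <= 0 or n_frames <= 0:
        raise ValueError("duration_s and n_frames must be positive")
    step = duration_s / n_frames
    return [round((i + 0.5) * step, 3) for i in range(n_frames)]


def ffmpeg_frame_command(video_path: str, timestamp_s: float, out_path: str) -> List[str]:
    """Build (but do not run) an ffmpeg command that grabs one frame."""
    return [
        "ffmpeg", "-y",               # overwrite the output file if it exists
        "-ss", f"{timestamp_s:.3f}",  # seek to the timestamp before the input
        "-i", video_path,
        "-frames:v", "1",             # write a single video frame
        out_path,
    ]


def build_index(
    timestamps: List[float],
    caption_at: Callable[[float], str],
) -> Dict[str, List[float]]:
    """Map each lowercase word in a caption to the timestamps where it appears.

    caption_at is an assumed callback: given a timestamp, it returns the
    text description of the frame at that moment.
    """
    index: Dict[str, List[float]] = {}
    for t in timestamps:
        words = set(re.findall(r"[a-z0-9]+", caption_at(t).lower()))
        for word in words:
            index.setdefault(word, []).append(t)
    return index
```

In a real run, `caption_at` would extract the frame (for example by passing the command list to `subprocess.run`) and hand the image to the captioning model. Looking up a word in the returned dictionary then gives the moments in the movie where the model mentioned it.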

brianjking 2 days ago

There is a newer BLIP-2, but it's also fairly old. You're better off with many other local models such as Moondream 3 https://huggingface.co/moondream/moondream3-preview.

Moondream is great as it can point, count, draw bounding boxes, write descriptions, and do visually grounded reasoning.

winstonp 3 days ago

Gemini pretty clearly has the best underlying model, and the worst RL and post-training of the lot.

MattSayar 2 days ago

I got a shirt I liked from a conference, and I didn't know who made it. It was soft, fit comfortably... I took a picture of some random numbers on a tag and Gemini parsed out the numbers and found the manufacturer. Pretty neat

carterschonwald 3 days ago

Gemini models are also fantastic at understanding non-spoken sounds.

jauntywundrkind 2 days ago

I don't know what runs on my phone's Google Translate app, but whatever it is, it's so bad that it's an insult to their models. It's amazing at picking up sound if spoken directly into the unit, but if you try to hold any kind of conversation or listen to anything even a little bit far away, it falls apart completely and is good for basically nothing.

This is obviously different than the models most people are discussing here, which are much bigger. But it's damaging the Gemini brand in general, by association, if nothing else.

Royce-CMR 2 days ago

I’ve long wondered if this was deliberate - only conversations where the participants are overtly using the translator get parsed.

segmondy 3 days ago

You can do that with smaller models at home. Gemma-4-E4B will run on a 12 GB GPU and supports audio, image, and video input.

NooneAtAll3 2 days ago

12GB GPU is a lot

corimaith 3 days ago

Or you could just use a CNN...

bigmadshoe 3 days ago

CNNs are not SoTA anymore when it comes to large models, and also are not used to provide interpretations of images as text, but rather to classify, do semantic segmentation, etc.

bonoboTP 3 days ago

CNNs are fine when trained with a good recipe. There are very few good studies comparing them with proper hyperparam search and all the training tricks applied consistently. Transformers are good but ViT vs CNN is not some settled issue. Transformers are more hyped and more popular with the tech enthusiasts who just read forums and news, but if you need stuff done, CNNs are still great.

bigmadshoe 3 days ago

I agree, but since we're talking about image understanding with text output, clearly a CNN is unsuitable. My previous comment was overly reductive and CNNs can still be SoTA depending on your performance metrics. I spent the earlier part of my career training CNNs, and they are very pleasant to work with.

bonoboTP 2 days ago

You can run a CNN and use the downsampled feature map the same way as patch tokens.
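
A rough sketch of what that looks like, assuming a feature map stored as nested lists indexed `[channel][row][col]`. Each spatial position becomes one token whose vector is the channel values at that position, which gives the same "sequence of vectors" shape that ViT patch embeddings have. The function name `feature_map_to_tokens` is made up for illustration.

```python
from typing import List, Sequence

# Feature map layout assumed here: fmap[channel][row][col]
FeatureMap = Sequence[Sequence[Sequence[float]]]


def feature_map_to_tokens(fmap: FeatureMap) -> List[List[float]]:
    """Flatten a (C, H, W) feature map into H*W tokens of dimension C.

    Tokens are emitted in row-major order, so token index = row * W + col.
    """
    channels = len(fmap)
    height = len(fmap[0])
    width = len(fmap[0][0])
    return [
        [fmap[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


# With PyTorch tensors of shape (B, C, H, W), the usual equivalent is
# fmap.flatten(2).transpose(1, 2), which yields shape (B, H*W, C).
```

The resulting token list can then be projected to the language model's embedding width and fed in alongside text tokens, in the same place patch tokens would go.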

famouswaffles 2 days ago

>Transformers are more hyped and more popular with the tech enthusiasts who just read forums and news, but if you need stuff done, CNNs are still great.

ViTs are straight up more popular for ML research now; it's not just 'tech enthusiasts'.

bonoboTP 2 days ago

There's a dearth of research properly comparing them.

famouswaffles 2 days ago

I'm talking about research pushing the state of the art in computer vision. ViTs have 100% become more popular than CNNs in most CV research.

tehjoker 3 days ago

Can you say more about that? I haven't kept up.

crypto420 3 days ago

CNNs excel in vision tasks where you have limited compute, limited memory, limited data, and want something that works super well and quick. People usually don't hook CNNs up to a transformer to get language understanding either; you have to train bespoke CNNs for specific tasks.

ViTs excel where you're unbounded in compute + data and also want text understanding or to have a conversation about an image.

bonoboTP 2 days ago

These are vibes. ViT has been shown to work fine on small data with proper hyperparameters, and most of what you mention is actually doable just fine with the other architecture as well.

Jabrov 3 days ago

Transformers are superior

nullstyle 3 days ago

Which?
