HN Mirror

	by embedding-shape 213 days ago
	People generally fall asleep when you start talking about fine-tuned BERT and CLIP, even though they do a fairly decent job as long as you have good data and know what you're doing. But no, for some reason they want to pay $0.10 per request to ask a multimodal LLM deployed across 8x GPUs whether a photo has a person in it, instead of just spending a few hours with CLIP and running it efficiently, even on CPU.

4 comments

Aurornis 213 days ago

> they do a fairly decent job as long as you have good data and know what you're doing.

This is the bottleneck in my experience. Going for the expensive per-request LLM gets something shipped now that you can wow the execs with. Setting up a whole process to gather and annotate data, train models, run evals, and iterate takes time. The execs who hired those expensive AI engineers want their results right now, not after a process of hiring more people to collect and annotate the data.

cestith 213 days ago

I’m no ML engineer and far from an LLM expert. Just from reading the article, though, it seemed to me that relying on an SQL database here was the bigger issue, and that traditional ML on the data would have helped too, rather than the LLM being a win specifically. Finding anything better suited to this type of input than string matching on an RDBMS seems like the natural conclusion when the complaint in the article itself was literally about SQL.

keeda 213 days ago

>... as long as you have good data and know what you're doing.

I think you've just identified, in a set-theoretic complementary manner, the TAM for GenAI.

throwaway314155 212 days ago

What's TAM?

keeda 212 days ago

https://en.wikipedia.org/wiki/Total_addressable_market

efavdb 213 days ago

Are you suggesting using the CLIP embedding for the text as a feature to train a standard ML model on?

daemonologist 213 days ago

I think they're suggesting doing that with BERT for text and CLIP for images. Which in my experience is indeed quite effective (and easy/fast).
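
For anyone who wants to see the shape of it, here is a minimal sketch of the "frozen embeddings as features for a standard model" idea, using only the Python standard library. Everything in it is an assumption for illustration: `make_fake_embeddings` is a hypothetical helper that produces synthetic vectors standing in for real BERT/CLIP outputs, and the classifier is a plain logistic regression (a "linear probe") trained with SGD. In practice you would swap in embeddings from an actual encoder and probably an off-the-shelf classifier.

```python
import math
import random


def make_fake_embeddings(n, dim, seed=0):
    """Hypothetical stand-in for frozen encoder outputs (NOT real CLIP/BERT vectors).

    Each sample is Gaussian noise shifted along a hidden direction,
    with the sign of the shift determined by the label.
    """
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.0 if label else -1.0
        vec = [rng.gauss(0, 1) + shift * d for d in direction]
        data.append((vec, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(data, epochs=20, lr=0.1):
    """Logistic regression trained with plain SGD on fixed embedding vectors."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    # Class 1 if the logit is positive, else class 0.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def accuracy(w, b, data):
    return sum(predict(w, b, x) == y for x, y in data) / len(data)


# The encoder is never trained here; only the small linear model on top is.
samples = make_fake_embeddings(n=400, dim=16)
train, test = samples[:300], samples[300:]
w, b = train_linear_probe(train)
print(f"held-out accuracy: {accuracy(w, b, test):.2f}")
```

The point of the design is that the expensive model runs once per item to produce a vector, and the part you actually train is tiny, which is why it is cheap to iterate on and fine to serve on CPU.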

There have been some recent developments in the image-of-text/other-than-photograph area, though. From Meta (although they seem unsure of what exactly their AI division is called): https://arxiv.org/abs/2510.05014 and Qihoo360: https://arxiv.org/abs/2510.27350 for instance.

PaulHoule 213 days ago

I think he is. I do things like that plenty.
