by embedding-shape
213 days ago
People generally sleep when you start talking about fine-tuned BERT and CLIP, although they do a fairly decent job as long as you have good data and know what you're doing. But no, they want to pay $0.1 per request to recognize if a photo has a person in it by asking a multimodal LLM deployed across 8x GPUs, for some reason, instead of just spending some hours with CLIP and run it effectively even on CPU.

This is the bottleneck in my experience. Going for the expensive per-request LLM gets something shipped now that you can wow the execs with. Setting up a whole process to gather and annotate data, train models, run evals, and iterate takes time. The execs who hired those expensive AI engineers want their results right now, not after a process of hiring more people to collect and annotate the data.