Right, and that's what I find frustrating. There are so many use cases where a local, purpose-built model that's dependably good at one thing would really make a difference. But no one is going to throw a billion dollars to give us amazing dust removal, flawless scene segmentation, etc.
Instead, you're supposed to upload it to the cloud and ask a big, multimodal frontier model to maybe please do the thing you want and nothing else.
The highest-return small local model for me has been the built-in OCR on macOS. It has finally "solved" OCR by making high-quality results accessible to everyone. Yet the state of the art outside the Apple ecosystem seems to be either Tesseract (poor results) or extremely heavy VLMs.
I think 3? I feel like that's often enough. Sometimes it's nice to do a quick dumb ass gag on a whim. If I am anything I am a man who loves a dumb ass gag.
(I'm counting only the times I used the generative editing options on my Galaxy phone. If I were to take your question literally, it would be "at least once every other day", simply due to rotating and cropping.)