Moebius: 0.2B image inpainting model with 10B-level performance (hustvl.github.io)
163 points by DSemba 6 hours ago
15 comments

I did an inpainting project for a client a few years ago. They were trying to inpaint banner ads for concert promoters and find an easy way to produce a bunch of differently sized ads for a variety of placements. I was tasked with inpainting an Xmas-themed ad for a few major singers.

The weirdest thing was when the inpainting tool added strange people to an image. This singer was all decked out in tinsel and red, and the inpainting model added a grumpy old man in a top hat. I don't recall clicking the "Add creepy old man" button.

At the time this was Stable Diffusion on the backend, run by a variety of model hosting services, Amazon being one. They all had different requirements for the input image and that made things really complex. For some the aspect ratio was impossible to meet, and it would fail if the banner was 200x60. For others, you had to resize it before input, which meant you were adding an image with poor resolution to start. Garbage in, garbage out.

All of this to say, there was a lot of preproduction that went into it, and the client never ended up using my attempts.
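To make the resizing problem concrete, here is a minimal sketch of letterboxing: scale the banner to fit a square model input while keeping its aspect ratio, then pad the rest. The 512 target is an assumption borrowed from the 512x512 limit mentioned elsewhere in this thread, and `letterbox` is a hypothetical helper, not any hosting service's API.

```python
def letterbox(width, height, target=512):
    """Fit (width, height) inside a target x target canvas without stretching.

    Returns (new_width, new_height, pad_x, pad_y), where pad_x and pad_y are
    the left and top offsets of the scaled image on the square canvas.
    """
    # Scale so the longer side exactly fills the canvas.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Center the scaled image; the remaining area is padding.
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return new_w, new_h, pad_x, pad_y


# A 200x60 banner has to be upscaled about 2.56x before the model sees it.
print(letterbox(200, 60))
```

Padding avoids the aspect-ratio failure, but the upscale is still the "garbage in" step: the model starts from a blown-up low-resolution source.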

> For others, you had to resize it before input, which meant you were adding an image with poor resolution to start.

That's because small models like SD (Stable Diffusion) are trained on very specific resolutions. It's the fancier models that are trained on higher-quality or more diverse sets of resolutions. If you use a higher-quality model to generate lower-resolution images, what's actually happening is that you're trimming a much bigger image and getting a chunk of it as output. At least that's how it feels based on my many hours of experimenting. If I use major models and try to center a thing, I never see it in the center. :) My GPU can only handle so much.

I want a version of this for manga (for translation). Right now I think the go-to lightweight inpainting model for anime and manga is LaMa, which is several years old now, and it feels like there is room for improvement.
I've been working on trying to outpaint an animated program for my son (Leapfrog Letter Factory if you're curious) and then upscale it. Doing so locally has actually been fairly difficult. I wonder if you could retrain or fine-tune this model. They mention building an expert; I wonder if that expert could understand more about translating various characters.
There are some demo spaces using this. This one seems the best (paint your own mask) but it failed on all the images I tried: https://huggingface.co/spaces/multimodalart/Moebius
Tried it a bit, and while it is very impressive for a 0.2B model, it would be very hard to convince me that this matches 10B models. It did work reasonably well with natural images, but inpainted regions were visibly smoother than their surroundings, and it performed very badly on novel objects. It is also limited to 512x512 output, which limits its practical usefulness.
> The core insight of Moebius can be summarized in a single equation: Synergy × (Architecture + Distillation) = Shattering the "Impossible Triangle" of Low Parameters, Fast Inference, and High Quality

Is it just me or is it weird seeing these clickbaity AI-generated taglines in an otherwise scientific work?

After "In Good Company" I can't hear (or see) the word Synergy without cringing.
It IS weird, but it "converts" (ugh...), which is why they keep coming.

Apart from this, the text details amazing work. Congrats.

I don't understand. Is it available somewhere to try or is it just an ad?
Yeah it's great but how do I use it?

Edit: I think I found it https://huggingface.co/hustvl/Moebius

With this size we could have an interactive web demo.
This is the useful AI stuff. There are so many use cases this makes possible.
Right, and that's what I find frustrating. There are so many use cases where a local, purpose-built model that's dependably good at one thing would really make a difference. But no one is going to throw a billion dollars to give us amazing dust removal, flawless scene segmentation, etc.

Instead, you're supposed to upload it to the cloud and ask a big, multimodal frontier model to maybe please do the thing you want and nothing else.

The highest-return small local model for me has been the built-in OCR that macOS has. It has finally "solved" OCR by making high-quality results accessible to everyone. Yet the state of the art outside the Apple ecosystem seems to be Tesseract (poor results) or extremely heavy VLMs.
How many times have you edited a photo you took on your phone in the last 7 days?
I think 3? I feel like that's often enough. Sometimes it's nice to do a quick dumb ass gag on a whim. If I am anything I am a man who loves a dumb ass gag.
Good on you. I've laughed at many dumbass gags but I've only been a passive consumer of them.
Half a dozen at least.

(I'm counting only times I used generative editing options in my Galaxy phone - if I were to take your question literally, it would be "at least once every other day", simply due to rotating and cropping.)

Personally, about 9 times. Would be higher if it were even easier and cheaper.
Nitpick: in the showcase on that page, under Comparison of Natural Scenes, Moebius should definitely get a "structural confusion" tag for the back of the surfboard. If other models get deducted for truncating the surfboard, then surely the elongation that Moebius does should count too.

Also, what's going on behind the in-painted corner of the house? We'd need to see higher resolution pictures, but I'm not convinced that it too shouldn't get a flag. Likewise with the beach just behind the surfboard. Not terrible, but what gets flagged in the competitors is similar.

What is the current SOTA for inpainting?

I have a potential project for my e-commerce site where I want to let users upload images of their house exteriors and inpaint awnings.

Proprietary? Either gpt-image-2 or NB2.

I have an example of interior decorating inpainting where I replaced a large floor-to-ceiling window with a mirror, and the result was pretty impressive using NB Pro from nearly a year ago.

https://imgpb.com/ZXkiXV

Locally hostable? For my money I'd argue Flux.2 Klein but Qwen-Edit still puts in the work.

NB2 means "Nano Banana 2", a Google image generation model. https://blog.google/innovation-and-ai/technology/ai/nano-ban...
For locally hostable image editing models, the edit variant of the recently released Boogu-Image[1] model is very good. Anecdotally, I'd say way better than Flux.2 Klein 9B and Qwen-Edit.

[1]: https://github.com/boogu-project/Boogu-Image

As far as I know, gpt-image-2 doesn't even let you define a mask unless you've already run it through one iteration, and once you do define the mask, it just ignores it 90% of the time. It's utterly useless for inpainting. Also, this and other proprietary models are severely limited in their output resolution.

I do agree, however, that the Flux.2 family is the SOTA at the moment. Running it locally via something like Comfy gets incredible results.

Yeah definitely. You can do workarounds like drawing circles or using highlighters to create pseudo-masks for use with OpenAI or Google models but it’s really just a visual indication more than anything.

If you want real precision (especially for complex polygonal masks), or if you’re concerned about image degradation over multiple edit rounds, you'll slam against the limitations of those approaches.

Even with SOTA proprietary models, repeatedly editing and re-uploading an image is like making a copy of a copy of a VHS tape: you're gonna see subtle color shifts and quality loss steadily accumulate.

At that point, you either need to put in the manual work in something like Photoshop (bringing elements in as layers and masking them properly) or, as you mentioned, use a model or workflow that properly supports masking.
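For what "properly supports masking" can mean in practice: one common approach (not something the commenters spelled out) is to composite the model output back onto the original through the mask, so pixels outside the mask are copied from the source untouched and cannot drift over repeated edits. A minimal sketch on nested lists of grayscale values; `composite` and the sample data are illustrative, not any library's API.

```python
def composite(original, generated, mask):
    """Blend two equally sized grayscale images through a mask.

    mask value 1.0 takes the generated pixel, 0.0 keeps the original pixel,
    and values in between feather the edge.
    """
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


original = [[10.0, 20.0], [30.0, 40.0]]
# Pretend the model changed every pixel slightly, including unmasked ones.
generated = [[11.0, 99.0], [29.0, 41.0]]
# Only the top-right pixel was selected for editing.
mask = [[0.0, 1.0], [0.0, 0.0]]

result = composite(original, generated, mask)
print(result)
```

Only the masked pixel changes; the small shifts the model introduced elsewhere are discarded, which is exactly what a drawn circle or highlighter "pseudo-mask" cannot guarantee.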

Awnings, if I understand correctly (I just learned this word right now), are purely additive attachments to structure exteriors, so perhaps they wouldn't necessarily need a full inpainting model? Wouldn't it be enough to estimate an affine transform for a quad and blend the image of the awning directly (and do the same with a shadow map to fake shade)? Is classical photogrammetry up to such a task these days?
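As a rough sketch of the "estimate an affine transform" step: three point correspondences determine an affine map exactly, so it can be solved with Cramer's rule. Note that a quad seen in perspective strictly needs a projective transform (homography); affine is only a reasonable approximation when the wall is close to facing the camera. `solve_affine`, `apply_affine`, and the sample coordinates are hypothetical.

```python
def solve_affine(src, dst):
    """Fit x' = a*x + b*y + c and y' = d*x + e*y + f to three point pairs.

    src and dst are lists of three (x, y) tuples. Returns (a, b, c, d, e, f).
    """
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if det == 0:
        raise ValueError("source points are collinear")

    def fit(v1, v2, v3):
        # Cramer's rule for one output coordinate.
        a = (v1 * (y2 - y3) - y1 * (v2 - v3) + (v2 * y3 - v3 * y2)) / det
        b = (x1 * (v2 - v3) - v1 * (x2 - x3) + (x2 * v3 - x3 * v2)) / det
        c = (x1 * (y2 * v3 - y3 * v2) - y1 * (x2 * v3 - x3 * v2)
             + v1 * (x2 * y3 - x3 * y2)) / det
        return a, b, c

    xs = fit(*(p[0] for p in dst))
    ys = fit(*(p[1] for p in dst))
    return xs + ys


def apply_affine(params, point):
    """Map a point from awning-image coordinates into photo coordinates."""
    a, b, c, d, e, f = params
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)


# Three corners of a unit awning texture mapped onto made-up photo coordinates.
params = solve_affine([(0, 0), (1, 0), (0, 1)], [(10, 20), (12, 20), (10, 23)])
print(apply_affine(params, (0.5, 0.5)))
```

Each awning pixel would then be mapped through `apply_affine` and alpha-blended into the photo; getting lighting and occlusion right is the part an inpainting model handles for free.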
I'm quite perplexed by this comment. If I'm understanding you correctly, sure, what you describe is possible through significantly more effort, orchestration, and source photos. Or we can grab one still image and throw an inpainting model at it.
I have no idea but I think you might be onto something.

So you're saying that, if I can calculate from the picture the position (height, inclination and such), and I can render the model (should be doable) for that height and angle, my best course of action could be to combine original + render and only at the end use a visual model? That could be interesting.

Flux Klein with a LoRA. GPT Image and Nano often produce high-frequency artifacts when editing.
Unrelated, but when I read inpainting and Moebius I was scared it was related to, and using the art of, the great Jean Giraud [0], a.k.a. Moebius.

https://characterdesignreferences.com/artist-of-the-week-3/m...

[0] https://en.wikipedia.org/wiki/Jean_Giraud

Scared why?
Scared for the same reason I found last year's 'Ghibli filter' craze upsetting. I would have personally hated to see this artist's legacy used for promoting AI image generation.
If that happened, the rest of the world would probably appreciate the art, and a subset of it would appreciate the artist (and even a small subset of the ~whole Internet-connected population is a lot of people). Some silver lining, perhaps.
> If that happened, the rest of the world would probably appreciate the art

What art?

We’re talking about generated pictures, aka slop, not art made by a real human.

And I don’t know if you’ve been paying attention but people seem to be pretty tired of the slop. I don’t think it would be appreciated nearly as much as you think.

It is possible to use generative AI in non-slop ways, btw.
This definition of "slop" doesn't quite cut reality at the joints.

People are tired of marketing. The AI-generated slop people are annoyed with is garbage produced for marketing reasons, and it's distinctly noticeable precisely because all the bottom-feeder marketing houses switched to using it. But it's not the AI itself that's the problem here. Slop was here before, but it was made with cheap protein-based image generators. Silicon-based generators are just cheaper.

Could this run locally on a smartphone?
A lot of the photo editors on mobile have this, maybe even some apps?
It sure has a thing for chins, jaws and removing weight. Looksmaxxing built in.
The gallery of their samples is pretty impressive!
1) What are RAM requirements?

2) If these are reasonable, a WebGPU demo would be great.

The total model size is about 1.2GB (UNet + SDXL VAE included), so probably about 3GB?
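For a sanity check on the weights alone, the arithmetic is parameter count times bytes per parameter. The 0.2B figure comes from the title; the precisions below are assumptions, and this ignores the VAE, activations, and framework overhead, so it is a lower bound rather than a check on the 3GB guess.

```python
def weights_gb(num_params, bytes_per_param):
    """Memory for model weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


params = 0.2e9  # 0.2B parameters, per the submission title

fp32_gb = weights_gb(params, 4)  # 4 bytes per parameter
fp16_gb = weights_gb(params, 2)  # 2 bytes per parameter

print(f"fp32: {fp32_gb:.1f} GB, fp16: {fp16_gb:.1f} GB")
```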