Hacker News
by fzzt 1364 days ago
The prospect of the images getting "structurally" garbled in unpredictable ways would probably limit real-world applications: https://miro.medium.com/max/4800/1*RCG7lcPNGAUnpkeSsYGGbg.pn...

There's something to be said about compression algorithms being predictable, deterministic, and only capable of introducing defects that stand out as compression artifacts.

Plus, decoding performance and power consumption matter, especially on mobile devices (which also happen to be the setting where bandwidth gains are most meaningful).

5 comments

While that is kind of true, it is also sort of the point.

The optimal lossy compression algorithm would be based on humans as a target. It would remove details that we wouldn't notice to reduce the target size. If you show me a photo of a face in front of some grass, the optimal solution would likely be to reproduce that face in high detail but replace the grass with "stock imagery".

I guess it comes down to what is important. In the past, algorithms were focused on visual perception, but maybe we are getting so good at convincingly removing unnecessary detail that we need to spend more time teaching the compressor what details are important. For example, if I know the person in the grass, preserving the face is important. If I don't know them, then it could be replaced by a stock face as well. Maybe the optimal compression of a crowd of people is the two faces of people I know preserved accurately and the rest replaced with "stock" faces.

Remember the Xerox scan-to-email scandal in which tiling compression was replacing numbers in structural drawings? We're talking about similar repercussions here.
This reminds me of a question I have about SD: why can’t it do a simple OCR to know those are characters, not random shapes? It’s baffling that neither SD nor DE2 has any understanding of the content it produces.
You could certainly apply a “duct tape” solution like that, but the issue is that neural networks were developed to replace what were previously entire solutions built on a “duct tape” collection of rule-based approaches (see the early attempts at image recognition). So it would be nice to solve the problem in a more general way.
> why can’t it do a simple OCR to know those are characters not random shapes?

It's pretty easy to add this if you wanted to.

But a better method would be to fine-tune on a bunch of machine-generated images of words if you want your model to be good at generating characters. You'll need to consider which of the many Unicode character sets you want your model to specialize in, though.

With compression you often make a prediction, then delta off of it. A structurally garbled prediction could be discarded or just result in a worse baseline for the delta.
Just a note that Stable Diffusion is/can be deterministic (if you set an RNG seed).
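As a rough illustration of what seeding buys you, here is a toy sketch in Python. `toy_sample` is a hypothetical stand-in for a sampler, not Stable Diffusion or any real library API. The only point is that when every random draw comes from one seeded generator, the same seed gives the same output on the same machine.

```python
import random

def toy_sample(seed, steps=4, size=3):
    """Hypothetical stand-in for a sampler: start from seeded noise and
    apply a fixed update. All randomness comes from the local `rng`."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for _ in range(steps):
        # Shrink the current state and mix in a little fresh seeded noise.
        x = [0.9 * v + 0.1 * rng.gauss(0.0, 1.0) for v in x]
    return x

same_a = toy_sample(seed=42)
same_b = toy_sample(seed=42)
other = toy_sample(seed=43)

print(same_a == same_b)  # same seed, same output
print(same_a == other)   # different seed, different output
```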
I was told (on the Unstable Diffusion Discord, so this info might not be reliable) that even when using the same seed, the results will differ if the model is running on a different GPU. This was also my experience when I couldn't reproduce the results generated by the Discord's SD txt2img bot.
I'm not sure about the different-GPU issue. But if that is an issue, the model can be made deterministic (probably at some cost to inference speed) by making sure the calculations are performed deterministically.
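One plausible source of such differences (an assumption here, not a confirmed diagnosis of the Discord bot) is that floating-point addition is not associative, so hardware that sums the same numbers in a different order can get slightly different results. A minimal Python sketch of the effect, and of an order-independent alternative that trades speed for determinism:

```python
import itertools
import math

values = [0.1, 0.2, 0.3]

# Same numbers, different grouping: the rounded results differ.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])
print(left_to_right == right_to_left)

# math.fsum tracks the lost low-order bits, so its result here does not
# depend on the order of the inputs (at the price of extra work).
fsum_results = {math.fsum(p) for p in itertools.permutations(values)}
print(len(fsum_results))
```

A real model performs a vast number of such reductions, so tiny ordering differences could compound. Pinning the order, or using exact accumulation as above, is the kind of fix that would likely cost inference speed.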
It absolutely should be reproducible, and in my experience it is.

I do tend to use the HuggingFace version though.