by simias 1046 days ago
Ten years or so ago I was working on a video chip that had an upscaler feature. While prototyping and simulating it, we first started by applying a mathematically-correct (i.e. information preserving) FIR filter to do the upscale. Then we compared the result with other solutions and found that ours looked worse. We asked our colleagues to blind-test it and they all picked third-party-scaled images over ours.
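For readers who want to see what an "information preserving" FIR upscale can look like, here is a minimal 1D sketch in Python. It is not the chip's actual filter: the Hann-windowed sinc, the 2x factor, and the names `windowed_sinc_taps` and `fir_upscale` are assumptions chosen for illustration. The property it demonstrates is that every source sample survives unchanged in the output.

```python
import math


def windowed_sinc_taps(factor, half_width):
    """Hann-windowed sinc low-pass taps for upsampling by `factor` (illustrative choice)."""
    n = factor * half_width
    taps = []
    for k in range(-n, n + 1):
        x = k / factor
        sinc = 1.0 if k == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.5 * (1.0 + math.cos(math.pi * k / (n + 1)))
        taps.append(sinc * window)
    return taps


def fir_upscale(signal, factor=2, half_width=4):
    """Upscale a 1D signal: insert zeros between samples, then low-pass filter."""
    taps = windowed_sinc_taps(factor, half_width)
    n = factor * half_width
    # Zero-stuffing: source samples land on every `factor`-th output position.
    stuffed = [0.0] * (len(signal) * factor)
    for i, s in enumerate(signal):
        stuffed[i * factor] = s
    out = []
    for i in range(len(stuffed)):
        acc = 0.0
        for k in range(-n, n + 1):
            j = i - k
            if 0 <= j < len(stuffed):  # samples outside the signal count as zero
                acc += taps[k + n] * stuffed[j]
        out.append(acc)
    return out


step = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
upscaled = fir_upscale(step)
# The sinc is zero at every non-zero multiple of `factor`, so the even
# positions of `upscaled` reproduce `step` up to rounding error.
```

On a hard step like this, the interpolated samples next to the edge tend to overshoot and ring, which is one plausible reason the "correct" output looked worse than edge-enhanced alternatives.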

At first we assumed that we must have had a bug somewhere because the Fourier transform told us that our approach was optimal, but after more testing everything matched the expected output. Yet it looked worse.

So we started reverse-engineering the other solutions and, long story short, what they did better is that they added some form of edge-enhancement to the upscaling. Information-theory-wise it actually degraded the image, but subjectively the sharper outlines were just so much nicer to look at and looked correct-er. You felt like you could more easily tell the details even though, again, in a mathematical sense you actually lost information that way.
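As a rough illustration of edge enhancement, here is a 1D unsharp mask in Python. The comment above does not say which technique the third-party scalers used, so unsharp masking is an assumed stand-in, and `box_blur`, `unsharp_mask`, `amount`, and `radius` are hypothetical names.

```python
def box_blur(signal, radius=1):
    """Average each sample with its neighbors, shrinking the window at the ends."""
    out = []
    for i in range(len(signal)):
        lo = max(0, i - radius)
        hi = min(len(signal), i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def unsharp_mask(signal, amount=1.0, radius=1):
    """Add back the difference between the signal and a blurred copy of it."""
    blurred = box_blur(signal, radius)
    return [s + amount * (s - b) for s, b in zip(signal, blurred)]


soft_edge = [0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0]
sharpened = unsharp_mask(soft_edge)
# Flat regions and the middle of the ramp are left alone; the two corners
# of the ramp are pushed below 0 and above 1.
```

The overshoot on either side of the edge is made of values that were never in the input, which matches the observation that the result is less faithful in a mathematical sense while looking crisper.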

I don't think it makes a lot of sense to reduce human vision to edge detection (we can still make sense of a blurry image like this one after all: https://static0.makeuseofimages.com/wordpress/wp-content/upl... ) but it's clear to me from empirical evidence that edge-detection is a core aspect of how we parse visual stimuli.

As such I'm a bit confused as to why the author seems to see this as a binary proposition. That being said, I could just be completely misunderstanding the point the author is trying to make.

10 comments

I don't think it's just subjective in this case. The theoretical signal processing approach assumes that the signal is band-limited, i.e. that it contains no frequency components with a period shorter than two pixels, and it's not. There are lots of sharp edges that have higher frequency components than that.

Another way of looking at it, more along the lines that you're talking about, is that it depends on your error model. The traditional way of measuring error is RMS pointwise in pixels. Doing some sort of interpolation on pixels gives a pretty good result for that. However, another way to look at it is that it may be better to have a positional error, i.e. a particular color or intensity level is in the wrong spot, than to have an intensity/color error, i.e. you have a pixel that has an intensity/color that's not present in the source signal.
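The two error models can be compared on a toy example. This is a sketch with made-up data: `rms_error` is the usual pointwise measure, while `count_novel_values` is a hypothetical metric invented here to count intensities that do not occur in the reference.

```python
import math


def rms_error(reference, candidate):
    """Pointwise root-mean-square error between two equal-length signals."""
    total = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    return math.sqrt(total / len(reference))


def count_novel_values(reference, candidate):
    """Count samples whose value never appears in the reference (hypothetical metric)."""
    allowed = set(reference)
    return sum(1 for c in candidate if c not in allowed)


reference = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
shifted = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]    # edge one pixel off
blurred = [0.0, 0.0, 0.0, 0.25, 0.75, 1.0, 1.0, 1.0]  # edge smeared

for name, candidate in (("shifted", shifted), ("blurred", blurred)):
    print(name, rms_error(reference, candidate), count_novel_values(reference, candidate))
```

RMS prefers the blurred edge, while the novel-value count prefers the shifted one, so which result counts as "better" depends entirely on the error model chosen.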

This same basic issue was the basis of a big divide in font rendering for many years, where the Mac would render fonts with the exact geometry of the letters, but then anti-aliased, while Windows would use font hinting to make the shape snap to the pixel grid. Personally I thought that the Windows approach was a lot easier to read on a screen, but the Mac approach had the advantage that the geometry of the text would be exactly the same in print as it was on the screen, back when print was something that was important, especially for Mac users.

Sounds like using the wrong metric: the upscaled image should be compared against the original full resolution one, not the downscaled one. Obviously you can't know what the full resolution one looks like when actually upscaling (vs testing), but you can make an educated guess.
You can remove the guesswork. You can start with a high resolution image (or even a raster export of a vector image, as an extreme example), downsample it (with various methods and downsampling factors, for completeness), then attempt to upscale it.
That's exactly what GP is talking about, the 'guesswork' comes in when you upscale as a function of only the downsampled version.
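The round-trip test described in this sub-thread can be sketched in a few lines of Python. Everything here is an illustrative assumption: the pair-averaging downsampler, the two 1D upscalers, and the helper name `round_trip_error`.

```python
import math


def rms_error(reference, candidate):
    """Pointwise root-mean-square error between two equal-length signals."""
    total = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    return math.sqrt(total / len(reference))


def downsample_by_2(signal):
    """Halve the resolution by averaging adjacent pairs of samples."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


def upscale_nearest(signal):
    """Double the resolution by repeating each sample."""
    out = []
    for s in signal:
        out.extend([s, s])
    return out


def upscale_linear(signal):
    """Double the resolution with linear interpolation, clamping at the end."""
    out = []
    for i, s in enumerate(signal):
        nxt = signal[min(i + 1, len(signal) - 1)]
        out.extend([s, (s + nxt) / 2])
    return out


def round_trip_error(original, upscaler):
    """Downsample the full-resolution original, upscale it back, and score the result."""
    return rms_error(original, upscaler(downsample_by_2(original)))


original = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(round_trip_error(original, upscale_nearest))
print(round_trip_error(original, upscale_linear))
```

On this particular input, a hard edge aligned with the downsampling grid, nearest neighbor reconstructs the original exactly and linear interpolation does not. That ranking depends on the content and on the downsampling filter, which is why trying several of each is worthwhile.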
The difference between the data in the image, and the information? If for instance you upscaled text so large that it became blurry and unrecognizable, you lost information.

Our cortex is all about interpreting what we see. Almost before our brain proper has the data, nerves have begun extracting information (edges etc). Probably because it was the difference between hitting and missing the animal with the spear. Or seeing or missing the tiger in the grass.

Precisely! I also find it interesting how, from an information theory standpoint, audio processing and image processing are effectively the same thing (audio resampling is fundamentally 1D image scaling for instance) but because humans process sounds very differently from images we end up doing things pretty differently.

For instance when we want to subjectively make images more attractive we tend to increase contrast and sharpness, whereas for sound we tend to compress it, effectively reducing "audio contrast".

The old habit of reaching to “increase contrast”[0] as a means of making an image more attractive exists in large part because 1) the dynamic range of modern display media is so tiny compared to the dynamic range of camera sensors and our eyes[1], and 2) the images most people typically work with are often recorded in that same tiny dynamic range.

If you work with raw photography, you will find that, as with audio, the dynamic range is substantially wider than the comfortable range of the available media: your job is, in fact, to compress that range into the tiny display space while strategically attenuating and accentuating various components—just like with raw audio, much more goes into it than merely compression, but fundamentally the approaches are much alike.

[0] Which actually does much more than that—the process is far from simply making the high values higher and low values lower.

[1] Though “dynamic range” is much less of a useful concept when applied to eyes—as with sound, we perceive light in temporal context.

> I'm a bit confused as to why the author seems to see this as a binary proposition

The author mentions this twice:

> This hypothesis is compatible with Lines-As-Edges, while answering many of these questions.

Surely if you're upscaling pixel art you're losing information when you create gradients between pixels. It doesn't seem to me that your metric of information loss was ideal.
Conservation is not just about preserving info, it's also about not adding information that's not there. If you upscale without those gradients (effectively sharpening to the max with nearest neighbor interpolation) you introduce high frequencies that could not exist in the original data. You've created new information out of nowhere.
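The claim about nearest neighbor introducing high frequencies can be checked numerically. The sketch below assumes a 2x upscale of a 1D signal treated as periodic, and uses a naive DFT; `high_band_energy` and `upscale_linear_circular` are names made up for this example.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each bin of a naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(signal)))
        for k in range(n)
    ]


def upscale_nearest(signal):
    """Double the resolution by repeating each sample."""
    return [s for s in signal for _ in range(2)]


def upscale_linear_circular(signal):
    """Double the resolution with linear interpolation, wrapping around at the end."""
    n = len(signal)
    out = []
    for i, s in enumerate(signal):
        out.append(s)
        out.append((s + signal[(i + 1) % n]) / 2)
    return out


def high_band_energy(upscaled):
    """Energy in the bins the source could not represent, for a 2x upscaled signal."""
    n = len(upscaled)
    mags = dft_magnitudes(upscaled)
    # The source Nyquist limit sits at bin n/4; the upscaled one at bin n/2.
    return sum(m * m for m in mags[n // 4 + 1 : n // 2 + 1])


edge = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(high_band_energy(upscale_nearest(edge)))
print(high_band_energy(upscale_linear_circular(edge)))
```

Both upscalers leave some energy above the source's Nyquist limit, but nearest neighbor leaves more of it, since linear interpolation applies a stronger low-pass to the same zero-stuffed signal.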

But of course you're correct that in this case it may be the desirable outcome. I still think that this idea of creating information using algorithms in order to get a subjectively more pleasant result is really one of the biggest issues of our time. Not a day passes where I don't see AI-colorized pictures, AI-extrapolated video footage, AI-cleaned family portraits, AI-improved smartphone footage etc...

It's both amazing and a bit scary, because in a certain way we rewrite history when we do this, and since the information is not present in the original it's very difficult to ascertain how close we truly are to reality. We're creating a parallel reality, one Instagram filter at a time. Maybe that's the true metaverse.

> in a certain way we rewrite history when we do this

History is sort of inherently rewritten. Memories are (very!) imperfect and even without realizing it we interpret events through our individual biases. Maybe the more precise concern is the increasing _willful_ departure from reality, but we do that naturally too, overlooking parts of reality that would be intolerable if they were always in our face.

Quite, no new information so no “loss” but not the information that needs to be there.

It’s putting an 8oz coffee brew in a 20oz cup & giving it to the customer as a large saying they had no coffee loss. While true, it’s not the same as delivering a 20oz coffee.

The upscaled image stores more information than the original image, so it must be possible to keep all the information while still doing edge enhancement!
Stores more data, but the same information, and if there is any interpolation then some of the data is modified, meaning that you lose a little data. In fact even without interpolation I think you change the data.

If you imagine a hard edge that aligns with a pixel [] boundary then you imagine upscaling in various ways, I think it's QED. You change data about the sharpness of the edge.

[] I use an Android phone, with Google's keyboard it genuinely rendered "pixel" with a capital letter. I've never written about Google's device of that name. Silly Google.

Well consider nearest neighbor upscaling. Since we are upscaling, every pixel in the source image will determine one or more pixels in the result image. Consider one of the source pixels that turns into multiple result pixels. If you manipulate one of those while leaving the other(s) intact you can still recover the source image (assuming you know which pixels are still good), meaning no data was lost.
Surely anything other than duplicating each pixel x times both horizontally and vertically (so 1 turns into 4, or 9, or 16, ...) adds information?

(This submission is going to have me reaching for my old textbooks.. about time really!)

No. Nearest neighbor, bilinear, bicubic, etc. are all just encoding the same information in different ways.

You could add noise, or generate new details with an AI upscaler. That would create new information.

Ah, right, anything that's a function purely of what's in the image - no randomness, no external context/'knowledge' to interpret it semantically, is as you say 'encoding the same information in different ways'?

If it can be computed (deterministically) from the image alone, then it was already there.
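The nearest-neighbor argument made earlier in this sub-thread is easy to verify. This is a small Python sketch with hypothetical helpers `upscale_nearest_2d` and `recover_source`, assuming an integer scale factor and that the top-left pixel of each block is known to be intact.

```python
def upscale_nearest_2d(image, factor):
    """Repeat every pixel `factor` times horizontally and vertically."""
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide))
    return out


def recover_source(upscaled, factor):
    """Read back the top-left pixel of every factor-by-factor block."""
    return [row[::factor] for row in upscaled[::factor]]


source = [[1, 2], [3, 4]]
upscaled = upscale_nearest_2d(source, 2)

# Overwrite one of the duplicates inside the block belonging to source[0][0].
upscaled[1][1] = 99

print(recover_source(upscaled, 2) == source)
```

Because the upscale is a deterministic function of the source, and an inverse exists, the larger image carries more data but no additional information. The spare duplicates are what leave room for edits such as edge enhancement without losing the original.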

Works in games too. Adaptive contrast will increase noise, but after a while games without it will look blurry and undefined.
It's like playing a familiar song on an old cassette tape on a modern system. The user will likely be tempted to crank up the treble, trying to recover content that isn't there. In the image-scaling example, the HF content wasn't there to begin with -- since you upsampled it, it couldn't have been! -- but there's a strong psychological expectation for it to be there.

With images, it's a bit easier to put the high frequencies back in via judicious use of edge enhancement, perhaps because you have two dimensions to work with rather than one dimension in the audio case.

A lot of smart TVs do that. Being able to spot it (played a lot with edge detection when discovering computer vision) is a curse.
I would put it like this:

Sharpening increases the high frequencies to cover the loss of even higher ones lost in downscaling.

Yes. Similar to pre-emphasis in baseband telecommunication standards in band-limited media.