

	by dudu24 17 days ago
	If you have a ruler and it goes to 12 inches, you should normalize by the length L and not by 13, the number of points on the ruler.

6 comments

Timwi 17 days ago

I'm confused by that analogy. Is the “ruler” a 255-inch ruler with 256 points labeled 0–255, or is it a 256-inch ruler with 256 1-inch segments, making L = 256×1?


zephen 17 days ago

The analogy is pretty straightforward.

When you have a 12 inch ruler, you effectively have 13 numbers on the ruler. The fact that zero isn't marked is neither here nor there -- the numeral one is not at the far end of the ruler.

So if you extend the ruler to be as long as you can hold in eight bits, it will range from 0 to 255, and the total length will be 255.

The ruler analogy may seem overly simplistic, but then the real world is likewise fairly simplistic.

At the end of the day, the numbers presumably come from a sensor, or go to a display, and, often, in either case, zero represents as dark as you can get and 255 represents as light as you can get, so the physics dictate that the intervals associated with the 0 and 255 are half the size of the rest of the intervals.
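To make the "points on a ruler" reading concrete, here is a minimal sketch in C. The function names are made up for illustration, and it assumes the common convention where code 0 means 0.0 and code 255 means exactly 1.0. Rounding to the nearest mark is what gives codes 0 and 255 half-width intervals.

```c
#include <assert.h>
#include <stdint.h>

/* Points convention: codes 0..255 are marks on a ruler of length 255,
   so code 0 maps to 0.0 and code 255 maps to exactly 1.0. */
static float unorm8_to_float(uint8_t code) {
    return (float)code / 255.0f;
}

/* Inverse: round to the nearest mark. Codes 1..254 each own an interval
   of width 1/255; codes 0 and 255 own only half of that. */
static uint8_t float_to_unorm8(float v) {
    assert(v >= 0.0f && v <= 1.0f);
    return (uint8_t)(v * 255.0f + 0.5f);
}

/* For comparison: dividing by 256 never reaches 1.0. */
static float div256_to_float(uint8_t code) {
    return (float)code / 256.0f;
}
```

With these definitions, `unorm8_to_float(255)` is exactly `1.0f`, while `div256_to_float(255)` stays below it. Code 0 only covers inputs up to about 0.00196 (half of 1/255).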

Audio is more interesting than video, because in audio, you care deeply about not having an offset, and about having a balanced signal, so the question of whether the midpoint is actually on a number or not is pertinent.

In audio, it is often useful to simply discard a code so that 0 is the midpoint (e.g. -32767 to +32767 for 16-bit samples, discarding 0x8000). But this still gives you smaller intervals at both ends.
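A rough sketch of the "discard a code" idea, assuming signed 16-bit two's-complement samples (the function names are hypothetical). The most negative code is treated as unused and clamped, so the usable range is symmetric around 0.

```c
#include <assert.h>
#include <stdint.h>

/* Map a 16-bit sample to [-1.0, +1.0] symmetrically. The most negative
   code (-32768, bit pattern 0x8000) is treated as unused and clamped
   to -32767, so 0 sits exactly at the midpoint. */
static float pcm16_to_float(int16_t s) {
    if (s == INT16_MIN) {
        s = -INT16_MAX;
    }
    return (float)s / 32767.0f;
}

/* Inverse: scale and round half away from zero (no math.h needed). */
static int16_t float_to_pcm16(float v) {
    assert(v >= -1.0f && v <= 1.0f);
    float scaled = v * 32767.0f;
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}
```

As with the 8-bit case, rounding to the nearest code means the two end codes (+32767 and -32767) each cover only half an interval.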


knappa 16 days ago

Fencepost errors aren't errors if you are actually trying to count fenceposts.


lacedeconstruct 17 days ago

yes but >> 8 is so much faster


xigoi 17 days ago

You don’t divide a float by 256 by shifting it right eight bits; that would yield complete garbage. You subtract 8 from the exponent, then check if you got an underflow.
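For illustration, a minimal sketch of that exponent subtraction in C, assuming `float` is IEEE 754 binary32. The name `div256_exponent` is invented here, and the edge cases simply fall back to an ordinary division instead of handling underflow by hand.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "assumes 32-bit float");

/* Divide by 256 by subtracting 8 from the biased exponent field. */
static float div256_exponent(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint32_t exponent = (bits >> 23) & 0xFFu;
    /* exponent 0: zero or subnormal; exponent 255: infinity or NaN;
       exponent 1..8: the result would leave the normal range.
       In all of these cases, fall back to a real division. */
    if (exponent <= 8u || exponent == 0xFFu) {
        return x / 256.0f;
    }
    bits -= 8u << 23; /* exponent field starts at bit 23 */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

The branch is the "check if you got an underflow" part. Whether this beats a plain multiply by `1.0f / 256.0f` on real hardware is not something this sketch demonstrates.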


dheera 17 days ago

Same point; divide by power of 2 is a fast subtraction operation in float world, while divide by 255 shits all over the whole float


yongjik 17 days ago

If your input is an arbitrary float, you need to check for denormals (and maybe NaNs). You can do a bitmasking trick to avoid conditional jumps, but I'm skeptical you can do it faster than a SIMD multiply instruction.


StilesCrisis 17 days ago

It's just multiplication. Floating multiply is extraordinarily fast.


lacedeconstruct 17 days ago

The difference between 20 cycles and 1 clock cycle in a hot loop is very noticeable


exyi 17 days ago

It's 3 cycles for float multiplication (and 1 for shift right):

https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on...

https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c...

In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.


account42 16 days ago

Shift right isn't even relevant here - if you shift before conversion to float all your values end up 0, and if you want to divide afterwards it's no longer a simple shift.


exyi 16 days ago

Exactly. Although if you do >> 8 while working with uint8, it will be the fastest :)


userbinator 17 days ago

> It's 3 cycles for float multiplication (and 1 for shift right):

3x faster

> In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.

50% faster


Tuna-Fish 17 days ago

FP Division by constant is optimized by a compiler into a multiply. Graphics processing typically happens on the GPU these days, and on all recent GPUs FPMUL belongs to the class of lowest-latency operations. That is, there are no other instructions that complete faster.


pixelesque 17 days ago

Only with things like -ffast-math enabled will compilers do the reciprocal. It can make a fair difference in some cases, but it's often better to selectively use it in code locations you know are acceptable by doing it manually in the code.


mgaunard 17 days ago

That's only valid to do if the reciprocal is representable exactly.


hansvm 17 days ago

That's not totally true. It's sufficient to be exactly representable, but you only need the reciprocal rounding error to be small enough to guarantee the multiplication rounding step fixes it across the entire range of numerators. For IEEE754 f16 values, there are 28 such extra values, the positive and negative sides of 1705/x where x is a power of 2 at least as great as 2048.
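One way to probe this on a given machine is to count the inputs where dividing and multiplying by the precomputed reciprocal round differently. This is only a sketch: `count_reciprocal_mismatches` is a made-up helper, it uses 32-bit floats rather than f16, and it assumes the compiler is not run with `-ffast-math` (which could rewrite the division itself).

```c
#include <assert.h>
#include <stdint.h>

/* Count integers i in 1..limit for which (float)i / divisor and
   (float)i * (1 / divisor) produce different float results. */
static uint32_t count_reciprocal_mismatches(float divisor, uint32_t limit) {
    assert(divisor != 0.0f);
    const float reciprocal = 1.0f / divisor;
    uint32_t mismatches = 0;
    for (uint32_t i = 1; i <= limit; ++i) {
        float x = (float)i;
        if (x / divisor != x * reciprocal) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

For a power-of-two divisor such as `256.0f` the reciprocal is exact, so the count should be 0. Calling it with `255.0f` shows how many inputs in the chosen range disagree; no particular count is claimed here.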


Sesse__ 17 days ago

Useful, then, that you can start several vectorized floating-point muls each cycle. (E.g., most modern x86 are 3/0.5 cycles for vmulps. No 20 cycles in sight.)


dist-epoch 17 days ago

Only in micro-benchmarks.

For real usage, today's CPUs are limited by memory bandwidth.


lacedeconstruct 17 days ago

What are you talking about? In a hot loop in my software renderer this is like 10x faster:

    // color4_t result = {
    //     .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
    //     .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
    //     .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
    //     .a = src.a + (dst.a * inv_alpha) * INV_255
    // };

    // 1/256 but much faster
    color4_t result = {
        .r = (src.r * src.a + dst.r * inv_alpha) >> 8,
        .g = (src.g * src.a + dst.g * inv_alpha) >> 8,
        .b = (src.b * src.a + dst.b * inv_alpha) >> 8,
        .a = src.a + ((dst.a * inv_alpha) >> 8)
    };
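For comparison, a correctly rounded divide by 255 can also be done with only adds and shifts, using a widely used trick for products of two 8-bit values. The sketch below is not the renderer's code: `div255_round` and `blend_channel` are hypothetical names, and it assumes 8-bit channels and alpha.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded x / 255 for x in 0..65025 (255 * 255), without a divide. */
static uint8_t div255_round(uint32_t x) {
    assert(x <= 255u * 255u);
    uint32_t t = x + 128u;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* One channel of a source-over blend with 8-bit values.
   The sum is at most 255 * 255, so it fits div255_round's range. */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha) {
    uint32_t inv_alpha = 255u - alpha;
    return div255_round((uint32_t)src * alpha + (uint32_t)dst * inv_alpha);
}
```

Blending white over white, `blend_channel(255, 255, 128)`, gives 255, whereas the plain `>> 8` version gives 254 for the same inputs, which is the off-by-one the 255 vs 256 argument is about.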


virtualritz 17 days ago

And both are wrong, since the values would have to be in a linear color space for the compositing math to make sense, but in some non-linear space (e.g. non-linear sRGB) to be useful when mapped to 0..255.

Which happens right after the Porter-Duff Over operator above -- a smoking gun. Which one is it gonna be?

I.e. the display transform is omitted from this and the math involved with the latter makes your whole argument moot.

It can't be expressed well enough with bitshifts to keep your purported 10x speedup anyway (and which I strongly doubt btw).

And lastly: in a software renderer that stuff is usually <0.01% of the compute in the absolute worst case.

P.S.: I'm speaking from 30 years of experience with software rendering in the context of VFX.


Tuna-Fish 17 days ago

If the latter is 10x faster, the issue is some kind of weird compilation failure for the above version. For one, it only cuts a third of the multiplies.


dist-epoch 17 days ago

Because you are working in the cache.

Also, you should use SIMD.


lacedeconstruct 17 days ago

> Also, you should use SIMD.

Ironically no, clang is better at auto-vectorizing.


spider-mario 17 days ago

Better than what? And do you use `-mavx2` or do you let it target baseline x86_64 and miss out on 8-float vectors? How do you make sure its autovectorisation is successful?


imtringued 17 days ago

How is this supposed to be 10x faster if all you did was drop one out of three multiplications?


layer8 17 days ago

But who says that the numbers are representing the points, rather than representing the intervals between the points?


wky 17 days ago

It doesn't even need to represent intervals. A 13 inch ruler with 13 markings at 0.5, 1.5, etc inches is still a valid ruler, albeit an odd construction.


groundzeros2015 17 days ago

I’m dumb. Doesn’t 0 start at the beginning?


dylan604 17 days ago

It's right up there with the confusion if 2000 was the new year of the 21st century or the last year of the 19th century.


simonask 17 days ago

For the record, the mathematically correct answer to this question is that the year 2000 was the last year of the 20th century.

The reason is that year 0 never existed. The year 1 BCE was followed by the year 1 CE.

Culturally, anthropologically, and psychologically it might be a different matter. But 2000 years had not passed before the end of that year.


tshaddox 17 days ago

What makes this argument less compelling is that “year 1 AD” also didn’t exist at the time, and this isn’t a great reason to abandon the arithmetically sane approach of zero-indexed year numbering.

The calendar was back-dated 500 or so years after Jesus, by a European guy before Europe had the concept of zero, leaving us with 1-indexed years. Then, 200 or so years after that, another guy (still lacking the concept of zero) made the even less venerable decision that the year right before 1 AD would be 1 BC.

We could just decide today that 0 came right before 1 AD and was the first year of the first century AD. Then we’d just have to shift all BC dates by 1 year in all our history books.

The upside would be that arithmetic on year labels starts working again. The downside is that there are way too many history books and no one will ever do this.

Of course, the easier way out is to just decide today that either 1) the first century began in 1 BC or 2) the first century had 1 fewer year than all the other centuries.


account42 16 days ago

We could also just define that 0 AD = 1 BC and not have to rewrite any BC dates.


tzot 17 days ago

The debate is if 2000 is the first year of the 21st century or the last year of the 20th century. (btw I agree with the latter)


dylan604 17 days ago

wow, yeah, that's quite the miss on my part.


m463 17 days ago

the correct way is to use a slide rule
