I'm confused by that analogy. Is the “ruler” a 255-inch ruler with 256 points labeled 0–255, or is it a 256-inch ruler with 256 1-inch segments, making L = 256×1?
When you have a 12 inch ruler, you effectively have 13 numbers on the ruler. The fact that zero isn't marked is neither here nor there -- the numeral one is not at the far end of the ruler.
So if you extend the ruler to be as long as you can hold in eight bits, it will range from 0 to 255, and the total length will be 255.
The ruler analogy may seem overly simplistic, but then the real world is likewise fairly simplistic.
At the end of the day, the numbers presumably come from a sensor, or go to a display, and, often, in either case, zero represents as dark as you can get and 255 represents as light as you can get, so the physics dictate that the intervals associated with the 0 and 255 are half the size of the rest of the intervals.
Audio is more interesting than video, because in audio, you care deeply about not having an offset, and about having a balanced signal, so the question of whether the midpoint is actually on a number or not is pertinent.
In audio, it is often useful to simply discard a code so that 0 is the midpoint (e.g -65535 to +65535, discarding 0xFFFF). But this still gives you smaller intervals at both ends.
You don’t divide a float by 256 by shifting it right eight bits; that would yield complete garbage. You subtract 8 from the exponent, then check if you got an underflow.
If your input is an arbitrary float, you need to check for denormals (and maybe NaNs). You can do bitmasking trick to avoid conditional jumps but I'm skeptical you can do it faster than SIMD multiply instruction.
Shift right isn't even relevant here - if you shift before conversion to float all your values end up 0 and if you want to divide afterwards its no longer a simple shift.
FP Division by constant is optimized by a compiler into a multiply. Graphics processing typically happens on the GPU these days, and on all recent GPUs FPMUL belongs to the class of lowest-latency operations. That is, there are no other instructions that complete faster.
Only with things like -ffast-math enabled will compilers do the reciprocal.
It can make a fair difference in some cases, but it's often better to selectively use it in code locations you know are acceptable by doing it manually in the code.
That's not totally true. It's sufficient to be exactly representable, but you only need the reciprocal rounding error to be small enough to guarantee the multiplication rounding step fixes it across the entire range of numerators. For IEEE754 f16 values, there are 28 such extra values, the positive and negative sides of 1705/x where x is a power of 2 at least as great as 2048.
Useful, then, that you can start several vectorized floating-point muls each cycle. (E.g., most modern x86 are 3/0.5 cycles for vmulps. No 20 cycles in sight.)
And both are wrong since the values would have to be in a linear color space for for the compositing math to make sense. But in some non-linear space to be useful when mapped to 0..255 (e.g non-linear sRGB).
Which happens right after the Porter-Duff Over operator above -- a smoking gun. Which one is it gonna be?
I.e. the display transform is omitted from this and the math involved with the latter makes your whole argument moot.
It can't be expressed well enough with bitshifts to keep your purported 10x speedup anyway (and which I strongly doubt btw).
And lastly: in a software renderer that stuff is usually <0.01% of the compute in the absolut worst case.
P.S.: I'm speaking from 30 years of experience with software rendering in the context of VFX.
If the latter is 10x faster, the issue is some kind of weird compilation failure for the above version. For one, it only cuts a third of the multiplies.
Better than what? And do you use `-mavx2` or do you let it target baseline x86_64 and miss out on 8-float vectors? How do you make sure its autovectorisation is successful?
It doesn't even need to represent intervals. A 13 inch ruler with 13 markings at 0.5, 1.5, etc inches is still a valid ruler, albeit an odd construction.
What makes this argument less compelling is that “year 1 AD” also didn’t exist at the time, and this isn’t a great reason to abandon the arithmetically sane approach of zero-indexed year numbering.
The calendar was back-dated 500 or so years after Jesus, by a European guy before Europe had the concept of zero, leaving us with 1-indexed years. Then, 200 or so years after that, another guy (still lacking the concept of zero) made the even less venerable decision that the year right before 1 AD would be 1 BC.
We could just decide today that 0 came right before 1 AD and was the first year of the first century AD. Then we’d just have to shift all BC dates by 1 year in all our history books.
The upside would be that arithmetic on year labels starts working again. The downside is that there are way too many history books and no one will ever do this.
Of course, the easier way out is to just decide today that either 1) the first century began in 1 BC or 2) the first century had 1 fewer year than all the other centuries.