by stathibus 1645 days ago
Who cares that it's not set up for simd?

Seriously, who?

This project is interesting because of how well it does compared to other systems of much higher complexity and without optimizing the implementation to high heaven. We can all learn something from that.

Good question. The answer is all the poor souls who, N years later, find themselves stuck with data in a legacy format that they have to struggle to decode faster.

Of all the artifacts in our industry, few things live longer than formats. E.g. we are still unpacking tar files (Tape ARchive), transmitted over IPv4, decoded by machines running x86 processors (and others, sure). None of these formats could possibly have anticipated the evolution that followed, nor predicted the explosive popularity they would have. And all of these (the latter two notably) have overheads that have real material costs. IPv6 fixed all the misaligned fields, but IPv4 is still dominant. Ironically, RISC-V didn't learn from x86 but added variable length instructions, making decoding harder to scale than necessary.

I'm not sure what positive lessons you think we should learn from QOI. It's not hard to come up with simple formats. It's much harder coming up with a format that learns from past failures and avoids future pitfalls.

QOI is designed with a very specific purpose in mind, which is fast decoding for games. This kind of image is very unlikely to be large enough to benefit from multithreading, and if you have a lot of them you can simply decode them in parallel. It’s not meant to be the “best” image format.
Unrelated to the rest of your comment, but risc-v does not have variable-length instructions. It has compressed instructions, but they're designed in such a way to be easily and efficiently integrated into the decoder for normal instructions, which are all 32 bits.
My day job for 6+ years has been implementing high-perf RISC-V cores, and my name is in many of the RISC-V specs.

Variable length ISAs are characterized by not being able to tell the beginning of an instruction without knowing the entrypoint. This applies to RISC-V with compressed instructions. Finding the boundaries is akin to a prefix scan and has a cost roughly linear in the scan length, but IMO the biggest loss is that you can’t begin predecode at I$ fill time.
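
To make the entry-point dependence concrete, here is a minimal sketch in Python. It assumes only the standard 16-bit and 32-bit RISC-V encodings (a 16-bit parcel whose two low bits are not 0b11 starts a compressed instruction, otherwise the instruction is 32 bits) and ignores the reserved longer formats. The function names and the parcel values are made up for illustration.

```python
def insn_length(parcel: int) -> int:
    """Length in bytes of the instruction whose first 16-bit parcel is `parcel`.

    Assumption: only 16- and 32-bit encodings exist (no reserved longer formats).
    """
    return 4 if (parcel & 0b11) == 0b11 else 2


def boundaries(parcels: list, start: int = 0) -> list:
    """Parcel indices at which instructions begin when decoding starts at `start`.

    Each step depends on the length found in the previous step, so this is
    a sequential scan from a known entry point.
    """
    starts = []
    i = start
    while i < len(parcels):
        starts.append(i)
        i += insn_length(parcels[i]) // 2
    return starts


# Hypothetical stream: three 32-bit instructions whose upper parcel also
# happens to end in 0b11, so it looks like the start of a 32-bit instruction.
stream = [0x0313, 0x0073] * 3

print(boundaries(stream, 0))  # [0, 2, 4]
print(boundaries(stream, 1))  # [1, 3, 5]
```

Both scans look like plausible instruction streams, which is why the boundaries cannot be recovered from the bytes alone without knowing where decoding started.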

It sounds like you regret the decision to make RISC-V variable length. Is that correct?
I fought against making the _current_ way to do compressed instructions a mandated part of the Unix profile, but RISC-V was (at least at the time) dominated by microcontroller people and there was a lack of appreciation of the damage it incurred. A lot of people far more senior than me couldn't believe what happened.

Interesting to contrast with Arm, which upon defining AArch64 did _away_ with variable length instructions and thus also page crossing ones. Maybe they knew something.

Can't you predecode speculatively, then redecode if you see a compressed instruction? Also I assume the bottleneck there is instruction cache, no?
> IMO the biggest loss is that you can’t begin predecode at I$ fill time.

That helps enough to overcome the increased code size?

I really wouldn't say they learned nothing from x86, though. You only have to look at 2 bits, and if you can get your users to put in the slightest effort then compilers can be told not to use C.

That's a strawman. There are infinitely many ways to achieve the same or better density without the drawback. Allowing instructions to span cache lines, or even pages, is a mistake that we'll pay for forever.

The simplest possible mitigation would have been to disallow an instruction from spanning a 64-byte boundary. It would have almost no impact on instruction density, but it would have saved a lot of headaches for implementations.
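
As a rough illustration of why such a rule would cost little, here is a sketch in Python. It assumes 2-byte instruction alignment and only 2- and 4-byte instructions; `crosses_boundary` is a hypothetical helper, not part of any spec or toolchain.

```python
LINE_BYTES = 64  # the boundary size named in the comment above


def crosses_boundary(addr: int, length: int, line: int = LINE_BYTES) -> bool:
    """True if an instruction occupying [addr, addr + length) spans a `line`-byte boundary."""
    return addr // line != (addr + length - 1) // line


# With 2-byte alignment there are 32 possible start offsets within a 64-byte block.
offsets = range(0, LINE_BYTES, 2)
bad_for_4_byte = [off for off in offsets if crosses_boundary(off, 4)]
bad_for_2_byte = [off for off in offsets if crosses_boundary(off, 2)]

print(bad_for_4_byte)  # [62]
print(bad_for_2_byte)  # []
```

Under these assumptions only a 4-byte instruction starting at offset 62 would be affected, so a toolchain would need padding at most once per 64-byte block.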

Strawman? I wasn't even trying to characterize anyone else's point, I was just trying to list some significant improvements over x86.

> The simplest possible mitigation would have been to disallow an instruction from spanning a 64-byte boundary.

Sure, that sounds good. But before this you hadn't even mentioned any problems with split instructions that need to be mitigated.

(You did mention decoding without a known entry point, but a rule like that doesn't guarantee you can find the start of an instruction. And if it would help to know that a block of 64 bytes probably starts with an aligned instruction, that seems like something you could work out with compiler writers even without a spec.)

Those poor souls N years later will either have to decode very few images, which is still fast enough, or decode a lot of images, which can be parallelized and run concurrently at a per-image level. In the very worst case, decoding a single extremely large image, you're a bit out of luck, but that case would be rare, and you're still pretty fast at decoding anyway.
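
A minimal sketch of the per-image approach, in Python. The `decode_image` function here is a toy run-length decoder standing in for a real decoder (it is NOT the QOI format), and `decode_all` is a hypothetical helper. The point is only that each image is an independent task even when the decode of a single image is strictly sequential.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_image(data: bytes) -> bytes:
    """Toy stand-in decoder: input is (count, value) byte pairs. Not QOI."""
    out = bytearray()
    for i in range(0, len(data) - 1, 2):
        out += bytes([data[i + 1]]) * data[i]
    return bytes(out)


def decode_all(images, workers: int = 4):
    """Decode every image as its own task; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_image, images))


print(decode_all([b"\x02a\x01b", b"\x03z"]))  # [b'aab', b'zzz']
```

With a pure-Python decoder like this toy one, threads will not give a real speedup because of the interpreter lock; the sketch assumes the real decoder would be native code, or that a process pool would be used instead.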

Creating formats and specs that are "future proof" is a noble goal. Criticizing QOI for not being able to be well parallelized inside the decode function, that seems more like a demand for a premature optimization to me...

> Criticizing QOI for not being able to be well parallelized inside the decode function, that seems more like a demand for a premature optimization to me

What? Faster encoding and decoding is one of the primary reasons for the format. Yet, QOI decoders are currently an order of magnitude slower than SSDs available today and even worse compared to DRAM! Now seems like the perfect time to look at possible optimizations to close that gap.

QOI is not an interchange file format like PNG or JPG; it's more akin to DDS or KTX (i.e. a specialized game asset pipeline file format which doesn't require a complex dependency for decoding).
Who struggles to decode images faster?