| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by FullyFunctional 1645 days ago

Since we are rehashing this for the 3rd (4th?) time, I'll repeat mine (and apparently many others) key critique: there is no thought at all to enabling parallel decoding, be it, thread-parallel or SIMD (or both). That makes it very much a past millennium style format that will age very poorly.

At the very least, break it into chunks and add an offset directory header. I'm sure one could do something much better, but it's a start.

EDIT: typo

3 comments

stathibus 1645 days ago

Who cares that it's not set up for simd?

Seriously, who?

This project is interesting because of how well it does compared to other systems of much higher complexity and without optimizing the implementation to high heaven. We can all learn something from that.

FullyFunctional 1645 days ago

Good question. The answer is all the poor souls that N years later find themselves stuck with a data in a legacy format that they have to struggle to decode faster.

Of all the artifacts in our industry, few things live longer than formats. Eg. we are still unpacking tar files (Tape ARchieve), transmitted over IPv4, decoded by machines running x86 processors (and others, sure). All of these formats couldn't possible anticipate the evolution that follow nor predicted the explosive popularity they would have. And all of these (the latter two notably) have overheads that have real material costs. IPv6 fixed all the misaligned fields, but IPv4 is still dominant. Ironically, RISC-V didn't learn from x86 but added variable length instructions making decoding harder to scale than necessary.

I'm not sure what positive lessons you think we should learn from QOI. It's not hard to come up with simple formats. It's much harder coming up with a format that learns from past failures and avoids future pitfalls.

ricardobeat 1645 days ago

QOI is designed with a very specific purpose in mind, which is fast decoding for games. This kind of image will be very unlikely be large enough to benefit from multi threading, and if you have a lot of them you can simply decode in parallel. It’s not meant to the the “best” image format.

nynx 1645 days ago

Unrelated to the rest of your comment, but risc-v does not have variable-length instructions. It has compressed instructions, but they're designed in such a way to be easily and efficiently integrated into the decoder for normal instructions, which are all 32 bits.

FullyFunctional 1645 days ago

My day job for 6+ years is implementing high perf RISC-V cores and my name is in many of the RISC-V specs.

Variable length ISAs are characterized by not being able to tell the beginning of an instruction without knowing the entrypoint. This applies to RISC-V with compressed instructions. Finding the boundaries is akin to a prefix scan and has a cost roughly linear in the scan length, but IMO the biggest loss is that you can’t begin predecode at I$ fill time.

gautamcgoel 1645 days ago

It sounds like you regret the decision to make RISC-V variable length. Is that correct?

FullyFunctional 1645 days ago

I fought against making the _current_ way to do compressed instructions a mandated part of the Unix profile, but RISC-V was (at least at the time) dominated by microcontroller people and there was a lack of appreciation of the damage it incurred. A lot of people far more senior than me couldn't believe what happened.

Interesting to contrast with Arm which upon defining Aarch64 did _away_ with variable length instructions and thus also page crossing ones. Maybe they knew something.

GoblinSlayer 1645 days ago

Can't you predecode speculatively, then redecode if you see a compressed instruction? Also I assume the bottleneck there is instruction cache, no?

Dylan16807 1645 days ago

> IMO the biggest loss is that you can’t begin predecode at I$ fill time.

That helps enough to overcome the increased code size?

I really wouldn't say they learned nothing from x86, though. You only have to look at 2 bits, and if you can get your users to put in the slightest effort then compilers can be told not to use C.

FullyFunctional 1645 days ago

That's a false strawman. There are infinitely many ways to achieve the same or better density without the drawback. Allowing instruction to span cache line, or even pages, is a mistake that we'll pay for forever.

The simplest possible mitigation would have been to disallow an instruction from spanning a 64-byte boundary. It would have almost no impact on instruction density, but it would have saved a lot of headaches for implementations.

rndgermandude 1645 days ago

Those poor souls N years later will either have to decode a very few images, which is still fast enough, or decode a lot of images, which can be parallelized and run concurrently on a per-image level. In the very worst case, decode an extremely large single image, you're a bit out of luck, but that case would be rare, and you're still pretty fast at decoding anyway.

Creating formats and specs that are "future proof" is a noble goal. Criticizing QOI for not being able to be well parallelized inside the decode function, that seems more like a demand for a premature optimization to me...

seoaeu 1645 days ago

> Criticizing QOI for not being able to be well parallelized inside the decode function, that seems more like a demand for a premature optimization to me

What? Faster encoding and decoding is one of the primary reasons for the format. Yet, QOI decoders are currently an order of magnitude slower than SSDs available today and even worse compared to DRAM! Now seems like the perfect time to look at possible optimizations to close that gap.

flohofwoe 1645 days ago

QOI is not an interchange file format like PNG or JPG, it's more akin to DDS or KTX (e.g. a specialized game asset pipeline file format which doesn't require a complex dependency for decoding).

GoblinSlayer 1645 days ago

Who struggles to decode images faster?

midjji 1645 days ago

A surprisingly good image format is to use a per line, or block, ar encoder, then compress the result with gzip on a low setting. It parallelizes very nicely, and beats png for encode, decode, and is trivial to implement.

ErikCorry 1645 days ago

You should be able to do the same technique, but in a PNG-compatible format.

midjji 1645 days ago

I'm really not a fan of using the generally unsupported aspects of the file formats. Like sure, png supports alternate compression schemes which in combination with the fully generic datablocks system which I think technically accidentally makes it Turing complete. But its not like any reader ever supports it fully, hell most readers dont support showing the other images in a png file. Its also quite common that palettes aren't properly supported, in particular for uncommon combination. Palettes themselves are also not truly sufficient for what they are supposed to do, as that would require generic scalar to scalar functions. Its like this with every damned old image format. People try to make the format to end all formats and end up accidentally inventing really shitty programming languages with terrible separation of concerns which are terribly supported but work as people only ever use them for single image data.

Another stranger mistake is the use of generic compression a part of the format, its much better to leave that up to the filesystem or stream. I dont really understand why this was ever a good idea, but it certainly hasn't been one for decades.

The more recent blob formats people have developed aren't much better, they still confuse the specification of a single blob with the specification of the container format, and try to be convenient and fully generic at once. If people actually wanted a fully generic dataformat and accepted that this requires a programming language, just let that be python, instead of inventing some shitty new one...

spider-mario 1645 days ago

> But its not like any reader ever supports it fully, hell most readers dont support showing the other images in a png file.

Do you mean APNG? In all fairness, that is not even in the specification although there is discussion to add it. https://github.com/w3c/PNG-spec/issues/26

The one that was specified was MNG but it was a different format and practically no one used it since it was not parsable as a PNG.

midjji 1644 days ago

No not animated, I mean that the .png format specifies how to specify an arbitrary number of datablocks, without specifying how they should be used aside from amounts to an recommendation of which one should be shown.

It used by some videogames as assets, for instance storing what amounts as a thumbnail as a shown image, but leaving meshes and textures in the other internal datablocks.

ErikCorry 1645 days ago

I think I don't understand what you mean by an AR encoder

midjji 1644 days ago

Its auto regression. Think of a row of pixels as a signal y[n]. Assume y[n] = a y[n-1] + b y[n-2]. estimate a,b. Then store y[0], y[1], a, b, and the residual, then encode the four values and the residual which will mostly be zeros. You can vary the length and number of coefficients very cheaply, but a fixed two beat png in my tests. There are fast standard algorithms to find a,b and which along with some simple tricks to correct for the precision ensure lossless encoding. Its the mathematicians version of the ad hoc thing png tries to do.

sushibowl 1645 days ago

isn't that very similar to what PNG does? I'm not sure what you mean by "ar encoder", but PNG uses a per-line filter and then adds DEFLATE on top of that.

meltedcapacitor 1645 days ago

A thread can scan the opcodes only to find cut-off points and distribute actual decoding to other cores. Surely you can do that with some simd magic, as well as the decoding threads, without needing to encode properties of today's simd in the encoding.

seoaeu 1645 days ago

No it can't. The encoder doesn't insert any "cut-off points". In fact, nearly every chunk encodes the current pixel value in terms of one of the previous pixel values, so without knowing those it is impossible for a second core to start up and initialize its decoder state enough to produce correct output.

meltedcapacitor 1645 days ago

Read again the proposal.

A top level thread scans the opcodes only to solve this, with no decoding and no writing, thus progressing faster in the stream than the child chunked decoding threads it progressively spawns.

Not as quick as a format with a chunk table, but faster than naive single core.

seoaeu 1645 days ago

I did read your comment. You don’t have any explanation of how the top level scan can maintain the color index array for QOI_OP_INDEX chunks without doing any decoding.

flohofwoe 1645 days ago

I bet you can split big images into smaller QOI encoded chunks and decode those in parallel.

QOI is simple enough to remix the file format as much as you want in your own encoder/decoder (that's actually the USP), it's not meant as a standardized image exchange format, just something that lives in your own asset pipeline.