It's simply better suited for some types of images than others (e.g. the resulting size is sometimes bigger than expected). The main advantage is the very simple encoder and decoder with a specification that fits on a single page (and which still yields surprisingly good results for many image types):
I agree that it is quite easy to grasp the format in terms of implementation.
It basically seems like writing an image VM that accepts byte code. I think that could really be a way to specify many file formats more concisely. For example, if you choose the right automaton/transducer class, you can easily specify some hedge-grammar-based XML file format and get a binary representation. Starting from a grammar as the spec, it is typically more difficult to derive an implementation.
However, reading the concrete spec, I wonder why you cannot change the alpha channel differentially, which leads me to the question of what happens if images have varying alpha levels.
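To make the "image VM" reading concrete, here is a minimal Python sketch of a decoder loop over the six chunk types as I read them in the spec. The names `run_chunks` and `qoi_hash` are my own, and the sketch skips the header and end marker and does no error checking. On the alpha question: in this reading, QOI_OP_DIFF and QOI_OP_LUMA leave alpha untouched, so a new alpha value has to arrive through a full QOI_OP_RGBA chunk (or an index hit on a previously seen pixel).

```python
def qoi_hash(px):
    # Index position as given in the spec: (r*3 + g*5 + b*7 + a*11) % 64
    r, g, b, a = px
    return (r * 3 + g * 5 + b * 7 + a * 11) % 64


def run_chunks(chunks, pixel_count):
    """Execute QOI chunks (the bytes between header and end marker).

    Hypothetical helper for illustration; returns a list of (r, g, b, a).
    """
    px = (0, 0, 0, 255)          # "previous pixel" register starts as opaque black
    index = [(0, 0, 0, 0)] * 64  # table of previously seen pixels, zero-initialized
    out = []
    p = 0
    while len(out) < pixel_count:
        b1 = chunks[p]
        p += 1
        r, g, b, a = px
        if b1 == 0xFE:           # QOI_OP_RGB: new r, g, b; alpha is kept
            px = (chunks[p], chunks[p + 1], chunks[p + 2], a)
            p += 3
        elif b1 == 0xFF:         # QOI_OP_RGBA: the only op that carries alpha
            px = tuple(chunks[p:p + 4])
            p += 4
        elif b1 >> 6 == 0:       # QOI_OP_INDEX: look up a seen pixel
            px = index[b1 & 0x3F]
        elif b1 >> 6 == 1:       # QOI_OP_DIFF: 2-bit deltas, bias 2, wrap around
            px = ((r + ((b1 >> 4) & 3) - 2) & 0xFF,
                  (g + ((b1 >> 2) & 3) - 2) & 0xFF,
                  (b + (b1 & 3) - 2) & 0xFF,
                  a)
        elif b1 >> 6 == 2:       # QOI_OP_LUMA: green delta plus r/b relative to it
            b2 = chunks[p]
            p += 1
            dg = (b1 & 0x3F) - 32
            px = ((r + dg + (b2 >> 4) - 8) & 0xFF,
                  (g + dg) & 0xFF,
                  (b + dg + (b2 & 0x0F) - 8) & 0xFF,
                  a)
        else:                    # QOI_OP_RUN: repeat previous pixel, length bias -1
            out.extend([px] * ((b1 & 0x3F) + 1))
            continue
        index[qoi_hash(px)] = px
        out.append(px)
    return out[:pixel_count]
```

So an image with many distinct alpha values still decodes fine; it just pays 5 bytes per alpha change unless the pixel happens to be in the index.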
"Everything" means two 32-bit integer values (width and height) in the header, that's hardly much of a downside ;)
Usually it's a good idea anyway to read file headers byte by byte instead of mapping a struct over them; that avoids alignment, padding and endianness issues.
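As a rough illustration of the byte-by-byte approach, here is a small Python sketch that reads the 14-byte QOI header (magic `qoif`, big-endian width and height, then one byte each for channels and colorspace, as laid out in the spec). The function name `parse_qoi_header` is made up for this example, and the validation is deliberately minimal.

```python
def parse_qoi_header(data):
    """Parse the 14-byte QOI header one byte at a time (hypothetical helper)."""
    if len(data) < 14:
        raise ValueError("header too short")
    if data[0:4] != b"qoif":
        raise ValueError("bad magic")
    # Assemble the big-endian 32-bit values explicitly, so the result does not
    # depend on host byte order or on any struct layout.
    width = (data[4] << 24) | (data[5] << 16) | (data[6] << 8) | data[7]
    height = (data[8] << 24) | (data[9] << 16) | (data[10] << 8) | data[11]
    channels = data[12]    # 3 = RGB, 4 = RGBA
    colorspace = data[13]  # 0 = sRGB with linear alpha, 1 = all channels linear
    return width, height, channels, colorspace
```

The same shift-and-or pattern carries over directly to C, where it sidesteps struct padding and unaligned reads as well.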
https://qoiformat.org/qoi-specification.pdf