But doesn't the size in benchmarks include the size of the binary decoder ? So the embedded trained data is accounted for (preventing a plain copy of wikipedia to be included in the decoder)
I don't think the compressed size statistics on this webpage include the size of the LLM needed to decode. Some of these inputs are only a few 100 kB -- LLMs absolutely dwarf that.