Hacker News new | ask | show | jobs
by chrismorgan 1825 days ago
So, it’s using a 45-character alphabet which matches the QR code alphanumeric values table, which lets the QR code encoder switch to a more efficient mode that takes less space.

I just tried rendering a QR code of the 692 characters of the introductory paragraph (with lines joined appropriately), and compared it with a QR code of the same text, uppercased and with out-of-range characters `,`, `[` and `]` changed to %. This reduced an 89×89 code down to 77×77, a 24% reduction in area. If this is roughly the ratio, then Base45-encoding binary data by QR code will yield roughly 17% area savings compared with Base64. (Base45 gets 50% bloat, then 24% shrink = multiple of 1.14; Base64 gets 33% bloat = multiple of 1.33; Base45 / Base64 = 1.14 / 1.33 = 0.83.)

[Edit: edflsafoiewq’s figures on 40-L codes at https://news.ycombinator.com/item?id=27627915 come to about 23% savings, markedly more than my 17%.]

I can’t help but wonder if any of the other modes could be more efficient still—numeric mode, kanji mode and byte mode.

Of course, the specs for all this are ISO specs, so I can’t read them without coughing up the moolah.

In case it’s not clear, I am utterly inexpert in this domain.

Further thoughts:

Base45 encoding two octets in three characters is pretty wasteful: 45³ ÷ 256² ≈ 1.39, which is 39% waste. (By contrast, Base64 is 100% efficient with its alphabet: 64⁴ = 256³.) This means that if you were willing to do more complex encoding and decoding, you could shrink your QR code by roughly 39% more—to about 52% of the size of the Base64, rather than 83%. Leaving such a huge gap on the table puzzles me—I’d have thought that either you’d want something simple (where Base64 is well-understood) or want to minimise your QR codes, and Base45 sits in an awkward place in the middle.

For UTF-8, base-128 will be the most efficient you get. That’ll be ~14% inflation (7 bytes in 8 characters). Which… huh, that looks to be within ε of Base45’s 50% bloat and 24% shrinkage. Not sure if that’s a coincidence or not because I don’t know how alphanumeric mode versus byte mode works in QR codes. But this suggests that alphanumeric mode and Base45-but-not-wasteful would be markedly more efficient than byte mode and Base-122. Still leaves numeric and kanji modes open as possibilities. Again, I’m inexpert and don’t know how the encodings are actually done, and that’ll matter.

On edflsafoiewq’s 40-L figures: Base64 gets 2214 bytes, Base45 gets 2864 bytes, optimally-efficient base-45 would get log₂₅₆ 45⁴²⁹⁶ = 2949 bytes, only around 3% more. I think I must have made a mistake somewhere with some of my numbers.

2 comments

The right way to measure the area ratio is using entropy. An optimal encoding would save at most 3% area over Base45:

Base64 in binary: log₂ 64 / 8 = 75.000% efficient

Base45 in alphanumeric: 4 log₂ 256 / 33 = 96.970% efficient

Optimal numeric: 3 log₂ 10 / 10 = 99.657% efficient

Optimal alphanumeric: 2 log₂ 45 / 11 = 99.851% efficient

Optimal binary (ISO 8859-1): log₂ 191 / 8 = 94.718% efficient

Optimal binary (UTF-8, single-byte subset): log₂ 128 / 8 = 87.500% efficient

Optimal binary (UTF-8, full): 1 = 128 / 2^(8α) + 1920 / 2^(16α) + 63488 / 2^(24α) + 1048576 / 2^(32α) ⇒ α = 89.706% efficient

Optimal kanji (JIS X 0208): log₂ 6879 / 13 = 98.061% efficient

The mistake in your 39% calculation is that you forgot to take logarithms before calculating the ratio.

Ah hah, yes, the 39% was linear but needed to be log. Thanks for that, and all the other figures too.
> I can’t help but wonder if any of the other modes could be more efficient still—numeric mode, kanji mode and byte mode.

Byte mode is obviously most efficient. Alphanumeric mode uses 5.5 bits per each character, so combined with base45 it uses 5.5 * 3 = 16.5 bits to pack two octets. Base45 is actually not that bad as it seems (3.1% overhead). [EDIT: I've since seen edflsafoiewq's mention that typical QR readers try to interpret binary data as UTF-8, so it is not pointless.]

[MORE EDIT: I've completely missed the last paragraph, so I was actually just confirming what chrismorgan said.]

> Maybe a compatibility issue?

From the introduction:

> Even in Byte mode a typical QR-code reader tries to interpret a byte sequence as an UTF-8 or ISO/IEC 8859-1 encoded text. Thus QR-codes cannot be used to encode arbitrary binary data directly.

Back to your comment:

> Base45 is actually not that bad as it seems (3.1% overhead)

That’s very close to my logarithm calculations from edflsafoiewq’s 40-L figures (the log₂₅₆ 45⁴²⁹⁶ bit of my comment, just added). Would you mind explaining to me what I did wrong with my 45³ ÷ 256² calculation that said 39%?

Oops, I've completely missed the last paragraph and thought that you were giving a wildly larger figure with a wrong assumption. Your actual assumption for optimally-efficient base-45 is correct and close enough to the actual QR code spec.
I’ve been adding bits to my comment as I went. It wasn’t there when you wrote your comment. I’m stopping now and going to go eat, confused about the 39% and 3% and what’s going on. I haven’t worked with QR codes enough!