As someone interested in the field, this does not look much different from state-of-the-art video codecs, e.g. variable-size blocks, wavelet, predicator, arithmetic coding. Only that the predicator is trained on real data. But symbol dictionaries are already used in modern compressors like brotli and zstd.
Most codecs have been tuned for a mean subjective score (MOS). MS-SSIM is not particularly good to fully rely on [1] [2]. In my experiments it performed poorly.
I think Google team effort will have a much bigger impact [3] just by combining all recent practical improvements in the image compression.
Meanwhile, they could have optimized images on the website a little better. Saved ~15% with my soon-to-be-obsolete tool [4].