by andai 132 days ago
If I'm reading this right, you're saying it's functionally equivalent to measuring the intersection of ngrams? That sounds very testable.

Mostly. There are also confounding effects from factors like the length of the texts. E.g. when compressing Zstd(A+B), encoding a backreference from B to some content in A gets more expensive the farther back that content is, so longer texts will appear less similar to each other than short texts.
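
A rough way to see the length effect, sketched in Python. Big assumptions: zlib from the standard library stands in for Zstd here (it is a different codec), and its 32 KiB window turns the gradual distance penalty into a hard cutoff, so this exaggerates what Zstd would show. The helper names (`csize`, `extra_cost_ratio`, `random_payload`) are made up for illustration.

```python
import random
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes. zlib is a stand-in for Zstd (assumption)."""
    return len(zlib.compress(data, 9))


def extra_cost_ratio(a: bytes, b: bytes) -> float:
    """Share of B's standalone compressed size still paid when B follows A.

    Near 0: B is almost free once A has been seen (looks very similar).
    Near 1: having seen A did not help at all (looks unrelated).
    """
    return (csize(a + b) - csize(a)) / csize(b)


def random_payload(n: int, seed: int) -> bytes:
    """Incompressible filler, so the only redundancy is the A/B overlap."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


# In both cases B is an exact copy of A, i.e. maximally similar content.
short = random_payload(2_000, seed=1)     # copy sits well inside the window
long_ = random_payload(200_000, seed=1)   # copy sits far outside the window

short_ratio = extra_cost_ratio(short, short)
long_ratio = extra_cost_ratio(long_, long_)

print(f"short pair: {short_ratio:.3f}")
print(f"long pair:  {long_ratio:.3f}")
```

The two pairs are equally "similar" (each is a text against an exact copy of itself), but the short pair scores much lower than the long pair, because the long copy is too far back for the compressor to reference. With Zstd and its larger window I'd expect a softer version of the same bias rather than a cliff, but I haven't measured that here.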