Hacker News new | ask | show | jobs
by gwern 4386 days ago
Compressing with something like gzip, xz, or zpaq gives you an upper bound on the complexity. This upper bound is pretty loose, but works surprisingly well in a surprising number of domains (not beating special-purpose algorithms usually, of course, but that something like gzip can be used to estimate eg phylogenetic trees at all is surprising).

See for example http://www.illc.uva.nl/Research/Publications/Dissertations/D... 'Statistical Inference Through Data Compression'.

1 comments

I am aware of the interesting duality between compression and learning. I have spent quite some time thinking about it.

However I am still not convinced that this upper bound is going to provide us with the information that we need in this case. What we are looking for is a relative measure of the complexity between each genome. The upper bound will not necessarily give us this relative measure because it may not be able to compress the genetic code of organisms by the same factor. The compressibility of a particular genome, by a specific algorithm will be dependent on the method of encoding of information used by the organism. For instance the organism may repeat codes for redundancy, but it may permute the letters of the copy in a predictable way, for it's own reasons. The compression algorithm used will not pick up on this.

It is useful to use compressiblity by a range of algorithms as the input to a machine learning algorithm, or as part of the model in AIXI, but it is not useful for estimating algorithmic complexity (as far as I am concerned, I am open minded, but not convinced yet).

> What we are looking for is a relative measure of the complexity between each genome. The upper bound will not necessarily give us this relative measure because it may not be able to compress the genetic code of organisms by the same factor. The compressibility of a particular genome, by a specific algorithm will be dependent on the method of encoding of information used by the organism. For instance the organism may repeat codes for redundancy, but it may permute the letters of the copy in a predictable way, for it's own reasons. The compression algorithm used will not pick up on this.

It may or may not. But an upper bound is still an upper bound, and turns out to be usable for many purposes.