Hacker News new | ask | show | jobs
by ssdfsdf 4396 days ago
Why would you want to be compressing the string? For this to be useful you would need to be searching for the algorithmic complexity of the string, not merely how compressible it happens to be with, say, huffman encoding. Finding the algorithmic complexity of the string will be intractable. I would consider reasoning in the following way:

There is selective pressure to reduce the length of the dna within an organism, since maintaining dna is costly for the organism. So one might suppose that on the average, over time that an organism will only contain the length of dna which is necessary, give or take.

Of course it might be that the encoding of the information changes over time, so that the organism is able to store more information in a the same length. Or possibly less information in the same length, for reasons of benefit to the organism.

This needs careful reasoning, I am not convinced either my approach, or yours is enough.

2 comments

The idea that DNA length is costly to the organism is a venerable, but increasingly controversial theory. The existence of disproportionally huge genomes belies the hypothesis. While it's possible that some organisms derive benefit from having larger genomes, it's also reasonable to infer that most eukaryotic cells are not limited by energy. So much of eukaryotic cellular activity seems energetically "wasteful," yet cells just don't seem to care.
Hmm, yes this may be true, even in the average.

I would still caution against using any old encoding technique on a string representation of the genome and using the compressed length as any sort of meaningful measure of the inherent information contained within it.

Yes, but some compression algorithms are still nice ways to approximate, for ordering or measurement purposes, the algorithmic complexity of a string.
I'm not sure that is true. Take for instance the first 100 prime numbers printed one after another in a string. The string is long and apparently random, yet contains little algorithmic complexity, since the machine which prints out the numbers is fairly simple. A standard compression algorithm will not be able to compress the string very effectively.

Therefore I am not sure that compressing the string is likely to give you a sense of the information contained within it, at least information in the sense which we are interested in.

Compressing with something like gzip, xz, or zpaq gives you an upper bound on the complexity. This upper bound is pretty loose, but works surprisingly well in a surprising number of domains (not beating special-purpose algorithms usually, of course, but that something like gzip can be used to estimate eg phylogenetic trees at all is surprising).

See for example http://www.illc.uva.nl/Research/Publications/Dissertations/D... 'Statistical Inference Through Data Compression'.

I am aware of the interesting duality between compression and learning. I have spent quite some time thinking about it.

However I am still not convinced that this upper bound is going to provide us with the information that we need in this case. What we are looking for is a relative measure of the complexity between each genome. The upper bound will not necessarily give us this relative measure because it may not be able to compress the genetic code of organisms by the same factor. The compressibility of a particular genome, by a specific algorithm will be dependent on the method of encoding of information used by the organism. For instance the organism may repeat codes for redundancy, but it may permute the letters of the copy in a predictable way, for it's own reasons. The compression algorithm used will not pick up on this.

It is useful to use compressiblity by a range of algorithms as the input to a machine learning algorithm, or as part of the model in AIXI, but it is not useful for estimating algorithmic complexity (as far as I am concerned, I am open minded, but not convinced yet).

>Finding the algorithmic complexity of the string will be intractable. I would consider reasoning in the following way: ...

So you think DNA tends to be algorithmically random?