by khazhoux 1031 days ago
Yes, and if you gzip it, it's even smaller. But the big takeaway is that the amount of info that fully defines a human is what we consider "not much data," even in its plainest encoding.

We don't know that it fully defines a human until we can create one without the starting condition of being inside another human. It's prototype-based inheritance.
Some of the research on getting simple animals to grow structures from other animals in their evolutionary “tree” by changing chemical signaling, among other wild things like finding that memories may be stored outside the brain (at least in some animals), makes me think you need more than just the “code” to get the animal that would have been produced if that “code” were in its full context (of a reproductive cell doing all sorts of other stuff). Even if the DNA contains the instructions for that reproductive cell, too, in some sense… which instructions do you “run”? There might be multiple possible variants, some of which don’t actually reproduce the animal you took the DNA from.
My favorite trivia here is that flamingos aren't actually "genetically" pink but "environmentally" pink because they pick up the color from eating algae.

Except of course "genetics" and "environment" aren't actually separate things; sure, people's skin color isn't usually affected by their food, but only because most people don't eat colloidal silver.

https://en.wikipedia.org/wiki/Paul_Karason

AFAIK most poisonous frogs also aren’t “naturally” poisonous—they get it from diet. Ones raised in captivity aren’t poisonous unless you go out of your way to feed them the things they need to become poisonous.
bzip2 is marginally better. After that came genome-specific compressors, and finally people started storing individual genomes as diffs from a single reference: https://en.wikipedia.org/wiki/CRAM_(file_format)
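
To make the "diffs from a reference" idea concrete, here is a toy sketch. It is not the CRAM format: it assumes the sample and reference are already aligned and the same length, and it only handles substitutions (no insertions or deletions). The function names are made up for illustration.

```python
def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (position, base) pairs where the sample differs from the reference.

    Toy assumption: both sequences are aligned and equal in length,
    so only substitutions can occur.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference: str, diffs: list[tuple[int, str]]) -> str:
    """Rebuild the sample from the reference plus the stored substitutions."""
    bases = list(reference)
    for position, base in diffs:
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
diffs = diff_against_reference(reference, sample)
restored = apply_diff(reference, diffs)
```

The stored diff is proportional to the number of differences, not to the genome length, which is why this approach wins when most of the sequence matches the reference.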

Since genome files contain more data than just ATGC (typically a comment line, then a DNA line, then a quality score line), and each of those draws from a different distribution, DEFLATE on a FASTQ file doesn't reach the full potential of the compressor: the Huffman table ends up having to cover all three distributions, and the dictionary back-references aren't as efficient either. It turns out you can split the file into multiple streams, one per line type, and compress those independently. That gets slightly better compression ratios, but it's still not great.
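
A rough sketch of the stream-splitting idea, using Python's `zlib` (which implements DEFLATE). It assumes the simple four-line FASTQ layout (header, sequence, a bare `+` separator, quality) and uses randomly generated reads, so the sizes it reports say nothing about real data. All function names are made up for illustration.

```python
import random
import zlib


def make_fastq(n_records: int, read_len: int = 50, seed: int = 0) -> str:
    """Generate synthetic four-line FASTQ records (illustrative data only)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_records):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FFFF:,#") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records)


def split_streams(fastq_text: str) -> dict[str, list[str]]:
    """Separate the file into one stream per line type, dropping the '+' lines."""
    lines = fastq_text.splitlines()
    return {
        "headers": lines[0::4],
        "sequences": lines[1::4],
        "qualities": lines[3::4],
    }


def join_streams(streams: dict[str, list[str]]) -> str:
    """Re-interleave the streams into the original four-line layout."""
    records = zip(streams["headers"], streams["sequences"], streams["qualities"])
    return "".join(f"{h}\n{s}\n+\n{q}\n" for h, s, q in records)


def deflate_size(text: str) -> int:
    """Size in bytes of the DEFLATE-compressed text at the highest level."""
    return len(zlib.compress(text.encode("ascii"), 9))


fastq = make_fastq(1000)
streams = split_streams(fastq)
whole_size = deflate_size(fastq)
split_size = sum(deflate_size("\n".join(s)) for s in streams.values())
print(f"single stream: {whole_size} bytes, split streams: {split_size} bytes")
```

Each stream gets its own Huffman tables and its own back-reference window, which is the whole point of the split. How much that helps depends heavily on the data.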