by Angostura 1031 days ago
I’m just pondering this, and it’s not clear to me that there is anything intrinsic in the genome itself that explicitly ‘says’ “this sequence of DNA bases encodes a protein” or even “these three base-pairs equate to this amino acid”.

I wonder if that information could ever really be untangled by a civilisation starting entirely from scratch, without access to a cell.

3 comments

If you knew what DNA was and had seen a protein you could easily figure out start/stop codons. If you had only seen something similar it would be harder. If you had nothing similar, I don't know.
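To make the "figure out start/stop codons" step concrete, here is a minimal sketch of an open-reading-frame scan. It assumes you already know the standard start codon (ATG) and stop codons (TAA, TAG, TGA), which is exactly the knowledge the hypothetical civilisation would have to work out. The name `find_orfs` and the `min_codons` cutoff are made up for illustration, and it only looks at the forward strand.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs for ORFs on the forward strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts
    the codons from the start codon up to (not including) the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):            # try all three reading frames
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] != START:
                i += 3
                continue
            # Found a start codon: walk codon by codon until a stop.
            j = i + 3
            while j + 3 <= len(seq):
                if seq[j:j + 3] in STOPS:
                    if (j - i) // 3 >= min_codons:
                        orfs.append((i, j + 3))
                    break
                j += 3
            i = j + 3                 # resume after the stop (or the end)
    return sorted(orfs)

print(find_orfs("CCATGAAACCCGGGTAACC"))  # [(2, 17)]
```

In random sequence a stop codon turns up in roughly 3 of every 64 codons, so long stop-free stretches are one of the cheap hints that a region might be coding.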

Coding DNA and non-coding DNA look very different. Proteins are full of short repetitive sequences that form structural elements like alpha helices: https://en.wikipedia.org/wiki/Alpha_helix

Once you've identified roughly where the protein-coding genes are, it would be trivial to identify 3'/5' as being common to all those regions. You could pretty easily imagine a much more complicated system with different transcription mechanisms and codon categories, but Earth genomes are super simple in that respect. Once you have those, you just have the (incredibly complex) problem of creating a polymerase and bam, you'll be able to print every single gene in the body.

Without the right balance of promoters/factors/polymerase you probably won't get anything close to a human cell, but you'd at least be able to work closer to what the natural balance should be, and once you get closer to building a correct ribosome, etc., the cell would start to self-correct.

It’s an interesting question. Naively, I would expect it to be about like reverse engineering a CPU from a binary program. Which sounds daunting but maybe not impossible if you understand the fundamentals of registers, memory, opcodes, etc.

But… doing so from first principles without a mental model of how all (human) CPUs work? I guess it comes down to whether the recipients had enough context to know what they’re looking at.

Yes, it's intrinsic in the genome, but implemented through such a complicated mechanism that attempting to understand these things from first principles is impractical, though not impossible.

In genomic science we nearly always use more cheaply available information rather than attempt to solve the hard problem directly. For example, for decades, a lot of sequencing focused only on the transcribed parts of the genome (which typically code for protein), letting biology do the work of determining which parts are protein.

If you look at the process biophysically, you will see there are actual proteins that bind to the regions just before a protein-coding gene, because the DNA sequences there match some pattern the protein recognizes. If you move that signal in front of a non-coding region, the apparatus will happily transcribe and even attempt to translate the non-coding region, making a garbage protein.
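A toy sketch of that last point, in Python. The codon table is the standard genetic code. Everything else is a simplifying assumption for illustration: the literal string `TATAAT` stands in for "some pattern the protein recognizes" (real recognition is much fuzzier), the function names `translate` and `express` are made up, and transcription and translation are collapsed into one step.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate codon by codon until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def express(dna, signal="TATAAT"):
    """Toy expression model (an assumption, not real biology):
    find the signal, then translate from the first ATG downstream of it.
    Returns None if the signal or a start codon is missing."""
    pos = dna.find(signal)
    if pos == -1:
        return None
    start = dna.find("ATG", pos + len(signal))
    if start == -1:
        return None
    return translate(dna[start:])

junk = "GGATGCATTTTCGTTGACC"      # arbitrary stretch, not a real gene
print(express(junk))               # None: no signal, nothing happens
print(express("TATAAT" + junk))    # MHFR: same junk, now "expressed"
```

The machinery never checks whether the downstream sequence means anything. The signal decides what gets read, which is why the coding/non-coding distinction is hard to see from the bases alone.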