by sankha93 2214 days ago
I am from a computer science background. I understand basic biology and genetics. I have been trying to understand the reasons why the idea that genes are code and that cells are biological computers is a poor assumption. So far I only have anecdotal evidence, like the SARS-CoV-2 example you mentioned, or things I hear from biologist friends, mostly along the lines of "it is not so simple". Are there good studies that shed light on what the missing pieces are and how we can simulate/model biological processes better?
6 comments

Okay, imagine Mel, of hacker lore [0] had several billion years to write the HumanOS program...

The point is that biology is ridiculously, ludicrously, compressed. Reading a basic biology book introduces you to all of these wonderful and seemingly complete abstractions: DNA blueprints, RNA messengers, information transfer into assembly units constructing little protein machines... at least that's how we wish it would look, and how we abstraction-craving mortals would like it to go.

But Melvolution is parsimonious - it sees a region of DNA and says "well sure that section encodes one gene, but if I bump the read head up by one and start halfway through I can magically read a whole other sequence for this entirely different task. Oh and that RNA you thought was for message transfer, well turns out that the right message can cause the thing to fold up and act sort of like a protein, so let's use that too. And sure this repeated section looks like uninitialized memory "junk" DNA, but it's too much work to take out, so let's arbitrarily read from addresses 12, 42, and 107, and stitch that information into a contiguous unit." Except that once in a million times the read head can slip and start reading from location 14 instead of 12... and that possibility is __important__ because if you take it out the whole system crashes.

Every possible quirk of chemistry and physics is ruthlessly exploited again and again and again in a million simultaneous ways. Talk about leaky abstractions.

(Not to mention that we still can't reproduce the algorithm reality uses to compute this stuff. It takes a supercomputer hundreds of hours to simulate a reasonably okay protein fold (which happens in a cell in a fraction of a fraction of a second) - and even then we get it wrong most of the time.)

[0] http://www.catb.org/~esr/jargon/html/story-of-mel.html

And the biggest magic that allows all that complexity to arise is: everything happens in parallel and at scales that are beyond any human intuition. We already know that even extremely simple rules can produce extremely complex-appearing artifacts: linear congruential generators, fractals, automata, Conway's Game of Life.
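
To make the "simple rules, complex artifacts" point concrete, here is a minimal sketch of Conway's Game of Life in Python. The function name `life_step` and the set-of-coordinates representation are my own choices for illustration, not anything from a particular library. The whole rule set fits in one expression, yet a five-cell "glider" ends up crawling across the grid.

```python
from collections import Counter

def life_step(live):
    """Advance Conway's Game of Life by one generation.

    `live` is a set of (x, y) coordinates of live cells on an unbounded grid.
    """
    # For every cell adjacent to a live cell, count its live neighbours.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next generation if it has exactly 3 live neighbours,
    # or if it is currently alive and has exactly 2.
    return {
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }

# The classic glider, with y growing downwards:
#   .X.
#   ..X
#   XXX
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}

state = glider
for _ in range(4):
    state = life_step(state)

# After four generations the same shape reappears one cell down and right.
shifted_glider = {(x + 1, y + 1) for (x, y) in glider}
print(state == shifted_glider)
```

Nothing in the rule mentions "movement"; the glider is purely an emergent aggregate, which is roughly the situation we are in when looking at biology from the outside.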

Now, everything is being generated all the time, everywhere, in immense numbers, over billions of years. The results are extremely complex, but the ones we are aware of are only those that survived all the competition, and even those we see only as aggregates.

Humans have trouble even imagining exponential growth, because even that is beyond our intuition. In nature, a lot of stuff grows exponentially as long as the resources aren't depleted. That's how the latest pandemic also started to grow, before we limited the possibilities of spreading by the physical separation of human carriers.
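
A quick way to feel how badly intuition handles this: count how many doublings it takes to get from a single unit to some large number. This is a toy sketch, assuming clean doubling per step with unlimited resources; it is not an epidemiological model, and `doublings_to_reach` is just a name I made up.

```python
def doublings_to_reach(target, start=1):
    """Return how many times `start` must double to reach at least `target`."""
    count = 0
    value = start
    while value < target:
        value *= 2
        count += 1
    return count

print(doublings_to_reach(1_000_000))      # a million
print(doublings_to_reach(1_000_000_000))  # a billion
```

Going from one to a million takes 20 doublings, and the next 10 doublings cover the rest of the way to a billion. Most of the growth happens in the last few steps, which is why it always seems to come out of nowhere.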

This is very nicely put. Another problem is that the simple genetics we learn at school is about things like alleles for eye colour which, un-nuanced, leave the mistaken impression that the whole genome implements a similarly simple mapping to phenotype.
"It is not so simple"

These are the biochemical pathways that we know within a cell which give a picture of cellular complexity. Zoom out to see the high-level block diagrams.

http://biochemical-pathways.com/#/map/1

http://biochemical-pathways.com/#/map/2

The background about these posters:

https://www.roche.com/sustainability/philanthropy/science_ed...

The book (2012):

https://www.amazon.com/Biochemical-Pathways-Biochemistry-Mol...

From the accompanying text of the paper posters (all emphasis mine), it can be seen that the paper posters showed just a convenient selection of all the relevant knowledge:

"In the wall chart “Biochemical Pathways” the following principles were applied:"

"The selection of reactions has to be made arbitrarily. Of course, no discussion is necessary about e. g. glycolysis, protein biosynthesis and other central reactions. Peripheral reaction pathways are preferably selected if they are of high interest in biochemical, medical or biological research (receptors, vitamins, antibiotics, compounds of importance in regulation etc.), if they are of interest in medicine (e.g. blood coagulation, complement system), if they lead to important end products (e.g. microbiological fermentations) or if they enable comparison of phylogenetic development (anaerobic/aerobic respiration or photosynthesis in various species).

Some indication on the degree of selection can be taken from the fact that in the present “Pathways” about 1000 enzymes are shown, while the 1984 “Enzyme Nomenclature” with its 2 supplements names 2859 enzymes. Estimations of the number of proteins (with and without enzymatic activity) in a single mammalian cell are in the order of magnitude of 30000."

"e) In general, we desisted from showing detailed reaction mechanisms. Only in cases where discrete steps (e. g. in multi-enzyme systems) are involved or well-characterized intermediates exist, single steps are given. The same holds true for receptor-activation steps etc.

f) The interrelationships of metabolic pathways cause the biggest technical problem of graphical representation. Since many compounds take part in various pathways, one would obtain a “spider web” of lines criss-crossing the whole chart. In order to avoid this, one has to “cut” connections. The respective compounds, which reoccur in other places of the chart, are written here in sharp-edged boxes."

Computer code is written by humans and based on abstractions that are designed to be comprehensible to the human mind. You can say that computer programs are memes that undergo two types of selection pressure: they need to be understandable to both humans and computers.

Genetic code only needs to work, therefore it’d be surprising if neatly grokable abstractions fall out of the system.

I disagree with the other commenters: code and computers are a great analogy for biology. The first poor sod we'd call an "engineer" started writing the code of life four-ish billion years ago and a trillion trillion trillion engineers followed, giving new meaning to "reverting to the mean." Lacking any kind of version control or even a method of communication with adequate error correction, they kept copying the software over and over again, each one modifying a few bits at random, until there were trillions of tiny variations all competing for attention. The only commonality between all the engineers was nucleotides and amino acids - hardly a universal language capable of supporting comments - and a few billion years later, these nerds discovered sex and a new level of technical debt was born.

Here we are, a few mass extinction events and genetic bottlenecks later, trying to decrypt code with no history because it has a half life of a few hundred years.

Oh and the worst part? The computer architecture can only be programmed using a bootstrapped compiler - and we've lost billions of years of releases. That's why every program basically looks like a chicken before linking.

Even if that narrative is easily relatable for software developers, the simplification is too great to allow a reasonable comprehension of the subject.

So I still suggest that anybody interested really try to learn more about the actual science, instead of comforting themselves by falsely believing they "understood" anything.

For a start, I would suggest these nicely produced courses:

https://www.thegreatcourses.com/courses/biology-the-science-...

https://www.thegreatcourses.com/courses/understanding-geneti...

Just to illustrate how non-intuitive our "common sense" is, the current estimate of the number of human cells in the human body is 30,000,000,000,000 (30e12). The current estimate of the number of bacteria in the human body (in the mouth and guts) is 3 times more.

The size of human genome (present in each human cell) is around 3 billion (3e9) base pairs (encoding information units of DNA https://en.wikipedia.org/wiki/Base_pair ). But not all the information is in the DNA alone.
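
For programmers, it may help to turn that into a file size. The sketch below assumes the naive encoding of 2 bits per base pair (four possible bases, so log2(4) = 2) with no compression; the function name is hypothetical. It only measures the raw sequence, which, as said above, is not where all the information lives.

```python
BITS_PER_BASE_PAIR = 2  # four possible bases (A, C, G, T) -> log2(4) = 2 bits

def genome_size_megabytes(base_pairs):
    """Raw, uncompressed size of a sequence at 2 bits per base pair, in MB (10^6 bytes)."""
    total_bits = base_pairs * BITS_PER_BASE_PAIR
    total_bytes = total_bits / 8
    return total_bytes / 1_000_000

human_genome_mb = genome_size_megabytes(3_000_000_000)
print(human_genome_mb)
```

That comes out to about 750 MB for the roughly 3 billion base pairs mentioned above, which is small for a codebase that builds and runs a human.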

The biochemical reactions happen in parallel even when a single cell is observed.

A reconstruction of gene transcription, also played at the speed at which it actually happens:

https://www.youtube.com/watch?v=7Hk9jct2ozY&t=248

I agree that the analogy is fruitful, but you have to look at a big ancient codebase full of spaghetti code, dead code, code that nobody knows about and so on. Parts are in COBOL and assembly, parts are in JavaScript. There are shims on top of shims and a lot of mutually interacting half-assed attempts to rewrite the codebase. With that kind of system you get a glimpse of what a biological system looks like.
It's more like zillions of lines of uncommented assembly written by exponentially more programmers who each try to execute the code against a particularly brutal test harness.
As a programmer with a side interest in molecular biology, my take is this:

Yes, protein synthesis via RNA translation, as explained in high-school biology, does look like reading Assembly opcodes with a 3-base-long window, with some opcodes being redundant (translating to the same amino acid). But then you learn (probably not in school) that they're not redundant at all; some organisms (like bacteria) actually translate RNA with different offsets of the read window, so the same RNA string will code for different proteins simultaneously. Sure, we did things like this back in the heyday of the industry, but that's just the tip of the iceberg.
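
Here is a rough sketch of that "read window with different offsets" idea in Python. The codon table is the standard genetic code; the RNA string is made up for illustration, and the `translate` helper is my own toy (it ignores start codons, does not stop at stop codons, and ignores everything else a real ribosome cares about).

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(rna, offset=0):
    """Read `rna` in 3-base steps starting at `offset`; drop any incomplete tail."""
    codons = (rna[i:i + 3] for i in range(offset, len(rna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)

rna = "AUGGCCAUUGUAAUGGGCCGC"  # made-up sequence, not from any real organism

for frame in range(3):
    print(frame, translate(rna, frame))
# 0 MAIVMGR
# 1 WPL*WA
# 2 GHCNGP
```

The same 21 bases give three unrelated amino-acid strings depending on where the read starts. The table also shows the "redundancy": `CODON_TABLE["GCC"]` and `CODON_TABLE["GCA"]` both come out as `A` (alanine).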

Then you discover things like epigenetics, or that protein functions are determined by how they fold (something we can't simulate just yet), or that horizontal gene transfer (direct exchange of genetic material between cells, instead of through reproduction) is not only a thing, it's a very important (if not the primary) way by which bacteria evolve. You zoom into how electron transport chains work in chloroplasts (i.e. how cells power themselves by light), and you see a series of complexes that are tuned in dimensions to enable quantum tunneling of the electrons. And now think of how all of this is like a bag of sand - everything is there next to each other, bumping against each other all the time, and not only does it work this way, a lot of things in the cell depend on that random walk to work.

In the end, I think our programming experience is useful for viewing some of the biology in terms of high-level systems and their interplay. On the mathy side, perhaps the fundamentals of information theory can inform some aspects of biological research. But beyond that, these things are just different. It's like you took a piece of code, ran it through an optimizing compiler, and then through a magical demoscene compressor that makes the code self-modifying, re-encodes opcodes on top of each other by exploiting misaligned reads, makes the binary use the PC register as arithmetic input everywhere, and ensures lock-free parallelism by abusing delays from cache misses as a synchronization mechanism. And then you give that to someone else to reverse-engineer. That's like 1% as difficult as the stuff biologists have to deal with.

A better analogy would perhaps be studying how a machine-learning model works, once it has been fully trained.

Code is written by humans for humans. The way models are trained has some similarities with the way evolution has created our biology.