by ReaLNero 2162 days ago
Dumb question: is there an x_chromosome.txt with the sequence in order? Why do geneticists not talk about it this way?
3 comments

There is! You can find the current "agreed upon" human genome reference segmented by chromosome here: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/. (It's not the assembly that's described in the article here.)

People do talk about the genome and its elements by chromosome number and coordinate range, much like you'd describe an index range in a string. A special notation has even been developed for this [1]. However, it depends on _how_ you're looking at biology.
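To make the "index in a string" idea concrete, here is a minimal sketch. It assumes region strings of the form chrX:5-8 with 1-based, inclusive coordinates (the way genome browsers commonly display them). The function names and the toy sequence are made up for illustration; they are not from any particular library.

```python
import re

def parse_region(region):
    """Parse 'chrX:1,001-1,010' into (chrom, start, end).

    Assumption: coordinates are 1-based and inclusive.
    """
    m = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", region)
    if m is None:
        raise ValueError(f"not a region string: {region!r}")
    chrom = m.group(1)
    start = int(m.group(2).replace(",", ""))
    end = int(m.group(3).replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"bad coordinates in {region!r}")
    return chrom, start, end

def fetch(sequences, region):
    """Slice a region out of a dict mapping chromosome name -> sequence string."""
    chrom, start, end = parse_region(region)
    # Convert 1-based inclusive coordinates to Python's 0-based half-open slice.
    return sequences[chrom][start - 1:end]

# Hypothetical toy "chromosome"; a real one is on the order of 10^8 letters.
toy = {"chrX": "ACGTACGTTTGACCA"}
print(fetch(toy, "chrX:5-8"))
```

The only subtle part is the off-by-one conversion between the 1-based inclusive coordinates assumed here and Python's 0-based slices.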

I think an analogy would be: you can describe all code as machine code, but when there are higher level abstractions you wouldn't choose to do so.

[1]: https://en.wikipedia.org/wiki/Locus_(genetics)

It’s a good question. The answer is no: before this study we didn’t have a gapless “x_chromosome.txt”. We did have 97% of it, but there were parts missing here and there. In fact, the answer being no (which admittedly probably seems wild) is exactly why this work is so important.

Now, there are much more sophisticated answers, and downstream points to be made about graph genomes instead of a single reference, etc. (which would also get to your point about why geneticists don’t talk about it this way). But that’s a broader topic.

At a certain level of abstraction, we can treat it that way and it is good enough for many use cases. In biological and physical reality, no.

Each human started with between 1 and 5 copies of the X chromosome. Those copies are different in various ways. Many of the differences are single nucleotide variants: a region that is otherwise identical but with a single letter changed. There are also tandem repeats, where there might be a CAG sequence that occurs one or dozens of times. (Counting the number of repeats like this is often used for DNA fingerprinting.) There is also ample larger-scale structural variation, which includes whole regions of the genome that are present or absent in one copy or another, or copied multiple times in a row, or moved in from another chromosome, or reversed.
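As a rough illustration of the tandem repeat point, here is a small sketch that counts the longest uninterrupted run of a repeat unit in a sequence. The two "copies" are invented toy strings, and real repeat genotyping has to deal with sequencing errors and interrupted repeats, which this ignores.

```python
import re

def longest_tandem_run(sequence, unit="CAG"):
    """Return the number of back-to-back copies in the longest run of `unit`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)

# Two hypothetical copies of the same region, differing only in repeat count.
copy_a = "TT" + "CAG" * 4 + "GA"
copy_b = "TT" + "CAG" * 9 + "GA"

print(longest_tandem_run(copy_a), longest_tandem_run(copy_b))
```

Two copies of the "same" region can therefore have different lengths, which is one reason a single flat text file can't describe both.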

Complicated enough? On top of that you have to add the fact that there are trillions of cells in each human and in those trillions of cells you will have many slightly different copies of the original 1 to 5 X chromosomes from when that human was a single-cell organism. You will definitely have changes at the ends of the chromosomes, the telomeres, as they are made up of variable tandem repeats. You'll also have single nucleotide mutations, and if you're unlucky, bigger changes. On some chromosomes (not chromosome X), there's also V(D)J recombination, where our immune "memory" is actually encoded in changes to genome sequence in particular cells. Cancer or a pre-cancerous syndrome will increase the frequency and severity of these changes.

If you want to sequence a whole chromosome you have to contend with the fact that the most accurate sequencing methods generally give you reads of 1000 nucleotides or fewer each, and you have to assemble them together. People liken the problem to putting together a jigsaw puzzle, but it's not like assembling a jigsaw puzzle from a single box. It's more like taking hundreds of boxes of supposedly the same jigsaw puzzle (but in reality with some small differences, so that pieces don't fit together quite right), dumping them all in a pile, randomly removing a bunch of the pieces, and then trying to figure out how everything fits together. Also, there are many parts of this puzzle with identical artwork that fit together identically! Good luck!
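For a feel of what "assemble them together" means, here is a toy greedy overlap merger. This is only a sketch of one textbook idea under very generous assumptions (error-free reads, a single copy of the sequence, no long repeats); real assemblers use far more sophisticated methods. The reads and function names are made up.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # nothing overlaps any more; leave the pieces as separate contigs
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads

# Hypothetical error-free reads drawn from the toy sequence ATGCGTACGTTAGC.
reads = ["ATGCGTA", "CGTACGTT", "CGTTAGC"]
print(greedy_assemble(reads))
```

This works on the toy input only because every overlap is unambiguous. Once a repeated stretch is longer than the reads, several merges look equally good, which is the "identical artwork" problem described above.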

Scientists have been applying a lot of ingenuity to this puzzle for decades and getting a whole chromosome assembly like this is a big milestone.