So what DID we achieve 20 years ago? Like, what happened then that led to the claim of "human genome has been sequenced" and what is the additional progress that was made now?
What was announced 20 years ago was an incomplete assembly that met certain metrics that made sense at the time.
https://www.nature.com/articles/35057062 is the paper. It describes the assembly as a "partial draft".
The section "Background to the Human Genome Project" gives some color on why they did what they did (TL;DR there was an ostensibly competitive race between the public project and a private one).
I ended up providing some useful resources for helping uncover just how bad genomic assemblies were (at the computational level): most genomic assemblies using whole genome shotgun sequencing used a number of heuristics which were believed to be correct, but I suspected that the heuristics failed to deal with repetitive regions and short sequences well. So I built a computing system with >1M Xeon cores (Google Exacycle) and we provided the system to Gene Myers (who did the original WGS assembly for Celera). He used the system to do an all-vs-all comparison of sequence pairs, which found numerous bugs and problems with the heuristics that were being used. It was a huge amount of compute, but the result was that Myers was able to use PacBio data to assemble a significantly better genome, faster, on a laptop: (https://www.yuzuki.org/favorite-talk-agbt-2014-gene-myers-ma...)
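For anyone wondering what an "all-vs-all comparison of sequence pairs" looks like in miniature, here is a toy sketch. It is not what Myers ran and not how any production assembler works: it assumes error-free reads, exact suffix/prefix matching, and brute force over every ordered pair, and the function names (`suffix_prefix_overlap`, `all_vs_all`) are made up for illustration. The quadratic pair loop is why the exhaustive version needs so much compute, and why real tools lean on heuristics to skip most pairs.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def all_vs_all(reads, min_len):
    """Compare every ordered pair of reads: O(n^2) pairs, no shortcuts."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        n = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if n:
            overlaps[(i, j)] = n
    return overlaps


if __name__ == "__main__":
    # Hypothetical reads: read 0 ends in "TTG", and two different reads start with it.
    reads = ["ACGTTG", "TTGCAA", "TTGGGC"]
    for (i, j), n in sorted(all_vs_all(reads, min_len=3).items()):
        print(f"read {i} -> read {j}: overlap of {n} bases")
```

In the toy input, read 0 overlaps both read 1 and read 2 equally well, which is the repeat problem in its smallest form: the overlap data alone cannot say which one actually follows.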
Ultimately, is there anything you could do with repetitive regions? Like, were the problems with the heuristics or just a mismatch between ~100 base reads and multi-kb repeats?
Most people have moved to techniques that produce longer reads with higher error rates. Good coverage plus longer reads overcomes the higher error rate. You can read the DALIGN paper by Myers to learn more. Or read https://dazzlerblog.wordpress.com/author/thegenemyers/
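Two back-of-the-envelope sketches of the points above, under loudly simplified assumptions. The first counts how many error-free reads of a given length are ambiguous (their sequence occurs more than once) in a made-up toy genome with one repeated 7-base segment. The second bounds the chance that a per-base majority vote across reads is wrong, assuming independent errors at a fixed rate; real long-read errors are mostly indels and are not independent, so treat it as intuition only. `TOY_GENOME`, `ambiguous_fraction`, and `majority_error_bound` are hypothetical names, not anything from DALIGN.

```python
from collections import Counter
from math import comb

# Made-up genome: the 7-base segment "GATTACA" appears twice, with distinct flanks.
TOY_GENOME = "CC" + "GATTACA" + "TG" + "GATTACA" + "GG"


def ambiguous_fraction(genome, read_len):
    """Fraction of error-free reads of length `read_len` whose sequence
    occurs at more than one position in `genome`."""
    windows = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(windows)
    ambiguous = sum(1 for w in windows if counts[w] > 1)
    return ambiguous / len(windows)


def majority_error_bound(error_rate, coverage):
    """Probability that at least half of `coverage` independent reads are
    wrong at one position (ties counted as wrong, so this is conservative)."""
    k_min = (coverage + 1) // 2  # ceil(coverage / 2)
    return sum(
        comb(coverage, k) * error_rate**k * (1 - error_rate) ** (coverage - k)
        for k in range(k_min, coverage + 1)
    )


if __name__ == "__main__":
    for read_len in (4, 9):
        frac = ambiguous_fraction(TOY_GENOME, read_len)
        print(f"read length {read_len}: {frac:.2f} of reads are ambiguous")
    for coverage in (1, 3, 31):
        bound = majority_error_bound(0.15, coverage)
        print(f"coverage {coverage}: majority-vote error bound {bound:.2e}")
```

Reads shorter than the repeat can land entirely inside it and look identical at both copies; reads long enough to reach unique sequence on either side cannot. The binomial bound shrinks quickly as coverage grows, which is the rough sense in which coverage buys back accuracy.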