by dekhn 2162 days ago
What was announced 20 years ago was an incomplete assembly that met certain metrics that made sense at the time. https://www.nature.com/articles/35057062 is the paper. It describes the assembly as a "partial draft".

The section "Background to the Human Genome Project" gives some color on why they did what they did (TL;DR there was an ostensibly competitive race between the public project and a private one).

I ended up providing some useful resources for helping uncover just how bad genomic assemblies were (at the computational level): most genomic assemblies using whole genome shotgun (WGS) sequencing used a number of heuristics which were believed to be correct, but I suspected that the heuristics failed to deal with repetitive regions and short sequences well. So I built a computing system with >1M Xeon cores (Google Exacycle) and we provided the system to Gene Myers (who did the original WGS assembly for Celera). He used the system to do an all-vs-all comparison of sequence pairs, which found numerous bugs and problems with the heuristics that were being used. It was a huge amount of compute, but the result was that Myers was able to use PacBio data to assemble a significantly better genome, faster, on a laptop: (https://www.yuzuki.org/favorite-talk-agbt-2014-gene-myers-ma...)
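To give a rough idea of what "all-vs-all comparison of sequence pairs" means, here is a toy sketch. It is not the code that ran on Exacycle: the function names and the three reads are made up for illustration, and it only looks for exact suffix-prefix matches, whereas real overlappers have to tolerate sequencing errors. The point is only that every ordered pair is checked, with no heuristic deciding which pairs to skip.

```python
from itertools import permutations


def longest_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def all_vs_all(reads, min_len: int = 3):
    """Compare every ordered pair of reads (quadratic in the number of reads)."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        n = longest_overlap(reads[i], reads[j], min_len)
        if n:
            overlaps[(i, j)] = n
    return overlaps


# Hypothetical reads, chosen so that read 0 -> read 1 -> read 2 chain together.
reads = ["ACGTACGA", "ACGATTCA", "TTCAGGGT"]
print(all_vs_all(reads))  # {(0, 1): 4, (1, 2): 4}
```

The pair loop is quadratic in the number of reads, which is why doing this exhaustively on real data needs a very large amount of compute.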

1 comments

Ultimately, is there anything you could do with repetitive regions? Like, were the problems with the heuristics or just a mismatch between ~100 base reads and multi-kb repeats?
Most people have moved to techniques that produce longer reads with higher error rates. Good coverage plus longer reads overcomes the higher error rate. You can read the DALIGN paper by Myers to learn more. Or read https://dazzlerblog.wordpress.com/author/thegenemyers/
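A toy sketch of both halves of that answer, under loud assumptions: the genome, the repeat, and the reads are invented, placement is by exact match, and the noisy reads are assumed to be already aligned column by column (real pipelines have to compute that alignment). It shows that a read shorter than the repeat has two possible placements while a read spanning the repeat plus flanking bases has one, and that a per-column majority vote over several noisy reads recovers the underlying sequence.

```python
from collections import Counter


def placements(genome: str, read: str):
    """All start positions where `read` matches `genome` exactly."""
    return [i for i in range(len(genome) - len(read) + 1)
            if genome.startswith(read, i)]


def consensus(aligned_reads):
    """Per-column majority vote over equal-length, pre-aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*aligned_reads))


# Hypothetical genome: the same 10-base repeat appears twice, with unique flanks.
repeat = "ACGTTGCAAG"
genome = "TTAGC" + repeat + "GGATA" + repeat + "CATGA"

short_read = genome[6:14]  # 8 bases, lies entirely inside the first repeat copy
long_read = genome[3:17]   # 14 bases, spans the repeat plus 2 flanking bases per side

print(placements(genome, short_read))  # [6, 21]  (ambiguous)
print(placements(genome, long_read))   # [3]      (unique)

# Five noisy copies of "ACGTAC", each with at most one substitution error.
noisy = ["ACGTAC", "ACCTAC", "ACGTAG", "TCGTAC", "ACGTAC"]
print(consensus(noisy))  # ACGTAC
```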