by lkjhdcba 2542 days ago

TL;DR: a very entry-level article about the field of popgen ("what is a genome?") in which the "breakthrough" is representing multiple sequence alignments as compact binary matrices. There's very little explanation and few actual examples beyond that, so unless you're a complete layman the article probably won't satisfy you.

Applying deep learning to genomic data is something of a fad these days - the bioinformatics world has caught up with the DL hype of the early 2010s and is trying to use DL on nearly anything that moves for easy papers.

The main issue with DL frameworks in the context of genomics is the format of the input data. You pretty much want all your data to be a matrix of fixed size (if you want to use CNNs at least, and that's what everyone is interested in anyway), but that's just not how genomics data works. Sequences vary in length (I see the problem of nucleotide gaps, let alone short indels, is left unanswered), alignments are not absolute (they are very much aligner-dependent, and secondary alignments are a thing), the alignments themselves may stem from different data sources (long reads cover longer stretches of DNA but are less reliable than short ones), there is no mention of how ploidy is handled (especially in plants!), and somehow you're supposed to transform all of that into a neat 48x48 array to feed to Keras. Wait, thousands of them. Did I mention that human and plant genomes are often billions of base pairs long? Waiting for bwa to finish mapping on your cluster is the bioinformatics equivalent of xkcd's "can't do work, compiling!"
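To make the fixed-size complaint concrete, here is a minimal sketch of what squeezing an alignment into a 0/1 matrix can look like. This is not the article's method: the function name `to_binary_matrix`, the "first row is the reference" convention, and the decision to map gaps and padding to 0 are all assumptions made for illustration.

```python
def to_binary_matrix(aligned, n_rows=48, n_cols=48, gap="-"):
    """Encode aligned sequences as a fixed-size 0/1 matrix (hypothetical scheme).

    A cell is 1 where a sequence differs from the reference, 0 otherwise.
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    reference = aligned[0]  # assumption: the first row acts as the reference
    matrix = []
    for seq in aligned[:n_rows]:  # samples beyond n_rows are silently dropped
        row = []
        for i in range(n_cols):
            if i >= len(seq) or i >= len(reference):
                row.append(0)  # padding looks the same as "matches reference"
            elif seq[i] == gap or reference[i] == gap:
                row.append(0)  # gaps and indels collapse to 0 as well
            else:
                row.append(int(seq[i] != reference[i]))
        matrix.append(row)
    while len(matrix) < n_rows:  # pad missing samples with all-zero rows
        matrix.append([0] * n_cols)
    return matrix


aligned = ["ACGTAC", "ACGAAC", "AC-TAC", "ACGT"]
m = to_binary_matrix(aligned, n_rows=4, n_cols=8)
for row in m:
    print(row)
```

In this toy input only the second sequence produces a 1 (the T to A substitution). The sequence with a gap and the truncated sequence both encode to the same all-zero row as the reference itself, which is exactly the information loss described above. Ploidy, aligner choice, and read type never enter the matrix at all.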

So yeah, sorry to put a damper on this, but I'm waiting for something practically workable (and believe me, bioinformaticians' standards for workable stuff are low) before getting hyped.