by dannykwells 2593 days ago
The hardest part of genomics for me has honestly been figuring out which poorly maintained open source tool I should use for a particular problem, which options should be run, and how the data need to be preprocessed beforehand.

I mean has anyone ever actually read the documentation of the GATK? It is famously dreadful. And that's professionally maintained.

Honestly a nice addition here would be a "so you want to" with snippets of raw FASTQ or VCF data and working code for various operations, maybe with an accompanying Docker container.
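
As a rough sketch of what one such "so you want to" entry might look like, here is "so you want to read a FASTQ file and get per-read mean quality" in plain Python. Everything here is an assumption for illustration: the two reads are made up, the names `parse_fastq` and `mean_phred` are hypothetical, and it assumes the simple four-line record layout with Phred+33 quality encoding. Real files can be gzipped or line-wrapped, which this does not handle.

```python
from io import StringIO
from typing import Iterator, NamedTuple, TextIO

# Two made-up reads for illustration only; not real sequencing data.
RAW_FASTQ = """\
@read1
GATTACA
+
IIIIIII
@read2
ACGT
+
!!!!
"""


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records, assuming exactly four lines per record (no wrapping)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            # End of input (or a blank line): stop parsing.
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score, assuming Phred+33 encoding by default."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


if __name__ == "__main__":
    for rec in parse_fastq(StringIO(RAW_FASTQ)):
        print(rec.name, len(rec.sequence), mean_phred(rec.quality))
```

Swapping `StringIO(RAW_FASTQ)` for an opened file handle should be the only change needed to try it on a small uncompressed file.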

4 comments

I feel like ADAM (https://github.com/bigdatagenomics/adam) is a huge step in the right direction. You convert from a standard genomics format to Parquet and then work with the resulting data in Spark with genomics-specific libraries.

In my experience, translating domain data into Spark gives a 100x improvement in data analysis.

> I mean has anyone ever actually read the documentation of the GATK? It is famously dreadful.

Reference for "famously"?

TRUWL [0] had a poster at the Biology of Genomes conference last week. Sounds like they're working on this problem. I hope they succeed, because it really needs to be solved.

[0] https://truwl.com/

Have you ever looked at the test suites for Picard? All regression tests, and the library is OO hell lols

I was taught a decade ago that rolling your own in genomics isn't as bad of a decision as it seems.

> I was taught a decade ago that rolling your own in genomics isn't as bad of a decision as it seems.

Famous last words.

This is sadly very true, especially if you have any real software training.