by asdff
1253 days ago
I would assume most datasets people are using aren't large enough to justify a database over parsing a flat file. You also have to realize this is scientific code, where it matters more to spend less time coding and more time interpreting results than to optimize the pipeline for minimal compute time, and you might be working on a university cluster where compute is powerful and quite cheap. People who work on clinical pipelines that will be rerun time and again over a vast, growing amount of patient data probably already put their data into databases. For your academic postdoc working on 2000 samples from an experiment for one paper before they find another job doing something else entirely in two years, a flat file is fine.
|
The next thing you know, you have a set of tools that recreates a small subset of SQL: to index the file, to add records in bulk, to edit the metadata...
The typical VCF holds enough data to justify a SQLite database, and nobody parses VCF directly anyway; they use tools.
The result is a sad number of bioscientists who cannot write the simplest SQL query but know vcftools, samtools, bedtools and others perfectly (or have them hardcoded in shell scripts). These formats start out so simple that you can "parse" them with grep, cut, wc and paste, but soon they need special tooling and suffer feature creep.
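
To make the comparison concrete, here is a rough sketch of what "VCF as SQLite" could look like using only Python's standard `sqlite3` module. The records, the `variants` table, its column names and the `load_vcf` helper are all made up for illustration. It also assumes every QUAL value is numeric (real files may use `.`) and ignores the per-sample genotype columns, so treat it as a toy and not a real VCF loader.

```python
import sqlite3

# Tiny made-up VCF excerpt (hypothetical records, for illustration only).
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t10177\trs1\tA\tAC\t50\tPASS\tDP=14
chr1\t10352\trs2\tT\tTA\t12\tLowQual\tDP=3
chr2\t20500\trs3\tG\tA\t99\tPASS\tDP=40
"""


def load_vcf(text, conn):
    """Load the eight fixed VCF columns into a hypothetical `variants` table."""
    conn.execute(
        "CREATE TABLE variants ("
        "chrom TEXT, pos INTEGER, variant_id TEXT, ref TEXT, alt TEXT, "
        "qual REAL, filter_status TEXT, info TEXT)"
    )
    rows = []
    for line in text.splitlines():
        # Skip blank lines, '##' meta lines and the '#CHROM' header line.
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        # Assumption: QUAL is always numeric in this toy input.
        rows.append((f[0], int(f[1]), f[2], f[3], f[4], float(f[5]), f[6], f[7]))
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    # The index plays the role that a separate indexing tool plays for flat files.
    conn.execute("CREATE INDEX idx_chrom_pos ON variants (chrom, pos)")
    return len(rows)


conn = sqlite3.connect(":memory:")
n_loaded = load_vcf(VCF_TEXT, conn)

# Count passing variants per chromosome: roughly a grep | cut | sort | uniq -c pipeline.
passing = conn.execute(
    "SELECT chrom, COUNT(*) FROM variants "
    "WHERE filter_status = 'PASS' GROUP BY chrom ORDER BY chrom"
).fetchall()
print(n_loaded, passing)
```

Whether this beats the flat file for a 2000-sample, one-paper project is exactly the trade-off discussed above; the sketch only shows that the query side is a few lines once the data is loaded.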