Hacker News
by asdff 1253 days ago
I would assume most datasets people are using aren't large enough to justify a database over parsing a flat file. You also have to realize this is scientific code, where it matters more to spend less time coding and more time interpreting results than to optimize the pipeline to minimize compute time, and you might be working on a university cluster where compute is powerful and quite cheap. People who work on clinical pipelines that will be rerun time and again over a rapidly growing amount of patient data have probably already put their data into databases. For your academic postdoc working on 2000 samples from an experiment for one paper before they find another job doing something else entirely in two years, a flat file is fine.
2 comments

I was thinking, for example, of VCF files. A metadata header, a main table with eight clear columns and a ninth column that works as a "put here whatever you need", and then the related data for each sample in extra columns.

Next thing you have is a set of tools to recreate a small subset of SQL, to index the file, to add in bulk, to edit the metadata...

The typical VCF has enough data to be a SQLite database, and nobody parses a VCF directly anyway; everyone goes through tools.

This ends in a sad number of bio-scientists who cannot write the simplest SQL query but know vcftools, samtools, bedtools and others inside out (or have them hardcoded in shell scripts). Those formats start so simple you can "parse" them with grep, cut, wc and paste, but soon they need special tooling and suffer feature creep.
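
To make the comparison concrete, here is a rough sketch of what "VCF as SQLite" could look like using only the Python standard library. The three variant rows are made up for illustration, the table and column names (`variants`, `variant_id`, ...) are my own choice rather than any standard schema, and only the eight fixed columns are loaded; the per-sample columns are ignored.

```python
import sqlite3

# Tiny made-up VCF text, for illustration only (not real data).
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1
chr1\t100\trs1\tA\tG\t50\tPASS\tDP=10\tGT\t0/1
chr1\t200\t.\tC\tT\t10\tq10\tDP=3\tGT\t0/0
chr2\t300\trs3\tG\tA\t99\tPASS\tDP=25\tGT\t1/1
"""


def load_vcf(conn, text):
    """Load the eight fixed VCF columns into a hypothetical 'variants' table."""
    conn.execute(
        "CREATE TABLE variants (chrom TEXT, pos INTEGER, variant_id TEXT, "
        "ref TEXT, alt TEXT, qual REAL, filter TEXT, info TEXT)"
    )
    rows = []
    for line in text.splitlines():
        # Skip blank lines, '##' metadata lines and the '#CHROM' header line.
        if not line or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt, qual, flt, info = line.split("\t")[:8]
        # '.' marks a missing value in VCF; store it as NULL for QUAL.
        qual_value = None if qual == "." else float(qual)
        rows.append((chrom, int(pos), vid, ref, alt, qual_value, flt, info))
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def passing_variants(conn, chrom):
    """Return (pos, ref, alt) for PASS variants on one chromosome."""
    cur = conn.execute(
        "SELECT pos, ref, alt FROM variants "
        "WHERE chrom = ? AND filter = 'PASS' ORDER BY pos",
        (chrom,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
load_vcf(conn, VCF_TEXT)
print(passing_variants(conn, "chr1"))
```

The query in `passing_variants` is roughly the kind of thing that otherwise ends up as a grep/cut/awk chain or a tool-specific flag. A real loader would also have to deal with multi-allelic ALT fields, the INFO key/value pairs and the per-sample genotypes, which is where most of the actual work is.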

And it’s much easier to teach a grad student who already knows some basic bash, R, or Python how to read a flat file into a data frame and make some plots, or grep a few lines, than to teach them to deal with databases. While bioinformatics tooling is outdated in many ways, modern software engineering could do with more config.txt and fewer hidden SQLite databases holding settings only accessible through an Electron GUI.
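
For a sense of how little is needed on the flat-file side, a minimal sketch with just the standard library (no data frame library assumed). The sample sheet and its column names (`sample`, `condition`, `reads`) are invented for the example.

```python
import csv
import io

# Made-up tab-separated sample sheet, standing in for a file on disk.
SAMPLES_TSV = (
    "sample\tcondition\treads\n"
    "S1\tcontrol\t1200\n"
    "S2\ttreated\t3400\n"
    "S3\ttreated\t2900\n"
)


def mean_reads_by_condition(handle):
    """Average the 'reads' column per 'condition' from a TSV file handle."""
    totals = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        total, count = totals.get(row["condition"], (0, 0))
        totals[row["condition"]] = (total + int(row["reads"]), count + 1)
    return {cond: total / count for cond, (total, count) in totals.items()}


# With a real file this would be: open("samples.tsv", newline="")
print(mean_reads_by_condition(io.StringIO(SAMPLES_TSV)))
```

No schema, no connection, no import step: that is most of the appeal for a one-off analysis.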