by CreRecombinase 2014 days ago
I often feel like bioinformatics is this steampunk alternate reality where, because HPC clusters don't generally do database administration, all the technology and the ecosystem that has developed around it has been built on flat files. Let me tell you, it's not great.
2 comments

It’s fine, you can just nest stuff by having a TSV in which one of the columns has an arbitrary number of values separated by commas. One of those values might even be some map structure, with each key and value joined by an equals sign and the pairs separated by pipes. It’ll all be just fine.
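
For anyone who has not met this format in the wild, here is a rough Python sketch of what unpacking it involves. The column names (`sample`, `attrs`) and the sample rows are made up for illustration; only the delimiter scheme (commas between values, pipes between pairs, equals between key and value) comes from the description above.

```python
import csv
import io

def parse_value(value):
    """Return a dict for a pipe-separated key=value map, else the plain string."""
    if "=" not in value:
        return value
    # Raises ValueError if a pair has no "=", which is probably what you want.
    return dict(pair.split("=", 1) for pair in value.split("|"))

def parse_nested_column(cell):
    """Split a cell on commas, then unpack any map-like values inside it."""
    return [parse_value(v) for v in cell.split(",")] if cell else []

# Hypothetical two-column TSV: a sample ID plus the overloaded column.
raw = "sample\tattrs\ns1\tliver,depth=30|strand=plus,replicate2\ns2\t\n"

rows = [
    {"sample": row["sample"], "attrs": parse_nested_column(row["attrs"])}
    for row in csv.DictReader(io.StringIO(raw), delimiter="\t")
]
```

Note that nothing here copes with a comma, pipe, or equals sign inside an actual value, which is exactly where this kind of format tends to fall over.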
We heard you liked delimiters, so we put delimiters in your delimiters, now you can delimit without limit.
Wow, this week at a new job I just found 50 database columns that each contain a serialized PHP array, so this makes me laugh.

Also, does anyone know how to cheaply migrate PHP arrays into actual relational tables? I'm, uh, asking for a friend.
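
One possible shape for such a migration, as a sketch only: parse each serialized blob and write one row per key into a child table. The parser below is hand-rolled and covers only a small subset of PHP's `serialize()` output (integers, booleans, null, strings, arrays); the table name `record_attrs`, its columns, and the sample blobs are all invented for the example. For real data it would be safer to run PHP's own `unserialize()` or use a maintained parser.

```python
import sqlite3

def php_unserialize(data: bytes):
    """Parse a small subset of PHP's serialize() format: i, b, N, s and a."""
    value, pos = _parse(data, 0)
    if pos != len(data):
        raise ValueError("trailing data after serialized value")
    return value

def _parse(data, pos):
    tag = data[pos:pos + 1]
    if tag == b"N":                       # N;
        return None, pos + 2
    if tag in (b"i", b"b"):               # i:42;  b:1;
        end = data.index(b";", pos)
        number = int(data[pos + 2:end])
        return (bool(number) if tag == b"b" else number), end + 1
    if tag == b"s":                       # s:3:"foo";  (length counts bytes)
        colon = data.index(b":", pos + 2)
        length = int(data[pos + 2:colon])
        start = colon + 2                 # skip the colon and opening quote
        return data[start:start + length].decode("utf-8"), start + length + 2
    if tag == b"a":                       # a:2:{key value key value}
        colon = data.index(b":", pos + 2)
        count = int(data[pos + 2:colon])
        pos = colon + 2                   # skip the colon and opening brace
        result = {}
        for _ in range(count):
            key, pos = _parse(data, pos)
            value, pos = _parse(data, pos)
            result[key] = value
        return result, pos + 1            # skip the closing brace
    raise ValueError(f"unsupported type tag {tag!r} at offset {pos}")

# Hypothetical legacy rows: (record id, serialized PHP array).
legacy_rows = [
    (1, b'a:2:{s:4:"gene";s:5:"BRCA1";s:5:"score";i:7;}'),
    (2, b'a:1:{s:4:"gene";s:4:"TP53";}'),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record_attrs (record_id INTEGER, key TEXT, value TEXT)")
for record_id, blob in legacy_rows:
    for key, value in php_unserialize(blob).items():
        conn.execute(
            "INSERT INTO record_attrs VALUES (?, ?, ?)",
            (record_id, str(key), None if value is None else str(value)),
        )
conn.commit()
migrated = conn.execute(
    "SELECT record_id, key, value FROM record_attrs ORDER BY record_id, key"
).fetchall()
```

A key/value child table is the laziest possible target. If the arrays turn out to share a fixed set of keys, proper typed columns would be the better destination.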

>delimit without the limit

Oh man, you just made my day.

You got half of your answer in your sentence.

It's not only bioinformatics: the entire HPC world tends to avoid databases.

They usually prefer HDF5 or similar, and there are reasons for that. It is much easier to scale one million nodes accessing a flat file over a DFS than accessing a database.

Also, in these fields with HDF5, you tend to write once, read often. Bioinformatics and other HPC-using researchers have totally different resource consumption than web services. 'Data' really means something completely different.
> Also, in these fields with HDF5, you tend to write once read often

Server-oriented DBMSs specialized in write-once-read-many workflows do exist.

However, you are right: research has a completely different data consumption model than web services.

And in HPC, it is:
- Much more efficient to do sub-millisecond, massively parallel data access over a parallel DFS one network switch away than to do it over a DBMS.
- Often much more convenient to move a flat file around to do analysis or model modifications of scientific results on your laptop.
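
To make the access pattern concrete, here is a toy sketch in Python, assuming a local file of fixed-width records and a handful of threads standing in for compute nodes on a parallel file system. The file name, record layout, and reader count are all made up. It does not measure anything; it only shows that once the file is written, each reader can seek to its own offset and read with no coordination at all.

```python
import os
import struct
import tempfile
from concurrent.futures import ThreadPoolExecutor

RECORD = struct.Struct("<d")   # one little-endian float64 per record
N_RECORDS = 10_000
N_READERS = 4

def write_once(path):
    """Write the whole dataset in a single pass; it is never modified afterwards."""
    with open(path, "wb") as f:
        for i in range(N_RECORDS):
            f.write(RECORD.pack(float(i)))

def read_slice(path, start, stop):
    """Open a private handle, seek to this reader's offset, and sum its records."""
    with open(path, "rb") as f:
        f.seek(start * RECORD.size)
        data = f.read((stop - start) * RECORD.size)
    return sum(value for (value,) in RECORD.iter_unpack(data))

path = os.path.join(tempfile.mkdtemp(), "results.bin")  # hypothetical file name
write_once(path)

# Give each reader a disjoint, contiguous range of records.
chunk = N_RECORDS // N_READERS
bounds = [(r * chunk, (r + 1) * chunk) for r in range(N_READERS)]
with ThreadPoolExecutor(max_workers=N_READERS) as pool:
    partial_sums = list(pool.map(lambda b: read_slice(path, *b), bounds))
total = sum(partial_sums)
```

Because nothing is ever written after `write_once`, there are no locks, transactions, or connection pools to manage, which is the property the write-once, read-many argument leans on.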

> It is much easier to scale one million nodes accessing a flat file over a DFS than accessing a database.

They are also much easier to distribute. I can just upload my arbitrarily large HDF5 file to your FTP server and you can just open it in MATLAB/Jupyter and start playing around with it. Doing the same with a database (other than SQLite) is really hard: it requires that our database versions align, and you'll probably need help from someone in your IT dept. to get the right version installed, and so on.
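
The SQLite exception is easy to show with nothing but Python's standard library. This is only a sketch: the file names, the `expression` table, and the values in it are invented, and a local file copy stands in for the FTP upload.

```python
import os
import shutil
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
mine = os.path.join(workdir, "results.sqlite")      # hypothetical file name
yours = os.path.join(workdir, "downloaded.sqlite")  # stands in for the FTP copy

# "My" side: build the database and close it so everything is flushed to disk.
conn = sqlite3.connect(mine)
conn.execute("CREATE TABLE expression (gene TEXT, tpm REAL)")
conn.executemany(
    "INSERT INTO expression VALUES (?, ?)",
    [("BRCA1", 12.5), ("TP53", 40.0)],  # illustrative values only
)
conn.commit()
conn.close()

# Distribution is a plain file copy, the same way you would ship an HDF5 file.
shutil.copyfile(mine, yours)

# "Your" side: open the copy directly, with no server to install or configure.
conn = sqlite3.connect(yours)
genes = conn.execute("SELECT gene, tpm FROM expression ORDER BY gene").fetchall()
conn.close()
```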