| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by CreRecombinase 2061 days ago
	I often feel like bioinformatics is like this steampunk alternate reality where, because HPC clusters don't generally do database administration, all the technology and the ecosystem has developed has been built on flat files. Let me tell you, it's not great.

2 comments

LifeIsBio 2061 days ago

It’s fine, you can just nest stuff by having a tsv in which one of the columns has an arbitrary number of values separated by commas. One of those values might even be some map structure with pairs indicated with equals and pairs separated by pipes. It’ll all be just fine.

link

nickpeterson 2061 days ago

We heard you liked delimiters, so we put delimiters in your delimiters, now you can delimit without limit.

link

JackMorgan 2061 days ago

Wow, this week at a new job I just found 50 database columns that contain a serialized php array, this makes me laugh.

Also does anyone know how to cheaply migrate PHP arrays into actual relational tables, I'm, uh, asking for a friend.

link

grenoire 2061 days ago

>delimit without the limit

Oh man, you just made my day.

link

adev_ 2061 days ago

You got half of your answer in your sentence.

That's not only bioinformatic but the entire HPC world tends to avoid database.

They usually prefer HDF5 or similar, and there is reasons to that. It is much easier to scale one million node accessing a flat file over a DFS than it is over to a database.

link

earthscienceman 2061 days ago

Also, in these fields with HDF5, you tend to write once read often. Bioinformatics and other HPC using researchers have totally different resource consumption than web services. 'Data' really means something completely different.

link

adev_ 2060 days ago

> Also, in these fields with HDF5, you tend to write once read often

Server oriented DBMS specialized in write-once-read-many workflow do exist.

However, you are right: research have completely different data consumption model than web service.

And in HPC, it is: - Much more efficient to do sub-milliseconds massive parallel data access over a parallel DFS, one network switch away than it is do it over a DBMS.

- Often much more convenient to move a flat file around to do analysis/model modifications of scientifics results on your laptop.

link

dagw 2061 days ago

It is much easier to scale one million node accessing a flat file over a DFS than it is over to a database.

They are also much easier to distribute. I can just upload my arbitrarily large hdf5 file to your ftp server and you can just open it in matlab/jupyter and start playing around with it. Doing the same with a database (other than sqlite) is really hard and requires that our database versions align and you'll probably need help from someone from your IT dept. to get the right version installed and so on.

link