| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by attractivechaos 356 days ago

> human-readable files are ridiculously inefficient on every axis you can think of (space, parsing, searching, processing, etc.).

In bioinformatics, most large text files are gzip'd. Decompression is a few times slower than proper file parsing in C/C++/Rust. Some pure python parsers can be "ridiculously inefficient" but that is not the fault of human-readability. Binary files are compressed with existing libraries. Compressed binary files are not noticeably faster to parse than compressed text files. Binary formats can be indeed smaller but space-efficienct formats take years to develop and tend to have more compatibility issues. You can't skip the text format phase.

> And at that scale, "readable" has no value, since it would take you longer to read the file than 10 lifetimes.

You can't read the whole file by eye, but you can (and should often) eyeball small sections in a huge file. For that, you need a human-readable file format. A problem with this field IMHO is that not many people are literally looking at the data by eye.

1 comments

kaathewise 356 days ago

One of the problems is that a lot of bioinformatics formats nowadays have to hold so much data that most text editors stop working properly. For example, FASTA splits DNA data into lines of 50-80 characters for readability. But in FASTQ, where the '>' and '+' characters collide with the quality scores, as far as I know, DNA and the quality data are always put into one line each. Trying to find a location in a 10k long line gets very awkward. And I'm sure some people can eyeball Phred scores from ASCII, but I think they are a minority, even among researchers.

Similarly, NEXUS files are also human-readable, but it'd be tough to discern the shape of inlined 200 node Newick trees.

When I was asking people who did actual bioinformatics (well, genomics) what some of their annoyances when working with the bioinf software were, having to do a bunch of busywork on files in-between pipeline steps (compressing/uncompressing, indexing) was one of the complaints mentioned.

I think there's a place in bioinformatics for a unified binary format which can take care of compression, indexing, and metadata. But with that list of requirements it'd have to be binary. Data analysis moved from CSVs and Excel files to Parquet, and I think there's a similar transition waiting to happen here

link

jltsiren 356 days ago

My hypothesis is that bioinformatics favors text files, because open source tools usually start as research code.

That means two things. First, the initial developers are rarely software engineers, and they have limited experience developing software. They use text files, because they are not familiar with the alternatives.

Second, the tools are usually intended to solve research problems. The developers rarely have a good idea what the tools eventually end up doing and what data the files need to store. Text-based formats are a convenient choice, as it's easy extend and change them. By the time anyone understands the problem well enough to write a useful specification, the existing file format may already be popular, and it's difficult to convince people to switch to a new format.

link

mfld 354 days ago

Yes, most bioinformatics tools are the result of research projects.

However, the most common bioinformatics file formats have actually been devised by excellent software engineers (e.g. SAM/BAM, VCF, BED).

I think it is just very convenient to have text-based formats as you don't need any special libraries to read/modify the files and can reach for basic Unix text-processing tools instead. Such modifications are often needed in a research context.

Also, space-efficient file formats (e.g. CRAM) are often within reach once disk space becomes a pressing issue. Now you only need to convince the team to use them. :)

link

kaathewise 356 days ago

Totally. A good chuck of the formats are just TSV files with some metadata in header. Setting aside the drawbacks, this approach is both straightforward and flexible.

I think we're seeing some change in that regard, though. VCF got BCF and SAM and got BAM

link