HN Mirror


by ashvardanian 275 days ago

Nice observation!

Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)

I’ve worked with large genomic datasets on my own dime, and the default formats show their limits quickly. With FASTA, the first step for me is usually conversion: split headers from sequences, store them in Arrow-like tapes for CPU/GPU processing, and persist as Parquet when needed. It’s straightforward, but surprisingly underused in bioinformatics. Most pipelines stick to plain text even when modern data tooling would make things much easier :(
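A minimal sketch of the conversion step described above, using only the Python standard library. The names `fasta_to_tapes` and `get` are hypothetical, and the "tape" here is an assumption about what is meant: one contiguous byte buffer plus an offsets array, which is the same shape as an Arrow variable-length binary column.

```python
from array import array
import io


def fasta_to_tapes(handle):
    """Split FASTA text into a header tape and a sequence tape.

    Each tape is (data, offsets): record i occupies
    data[offsets[i]:offsets[i + 1]]. Hypothetical helper, not a library API.
    """
    header_data, seq_data = bytearray(), bytearray()
    header_offsets, seq_offsets = array("q", [0]), array("q", [0])
    seen_record = False
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_record:
                # Close the previous record's sequence.
                seq_offsets.append(len(seq_data))
            header_data += line[1:].encode("ascii")
            header_offsets.append(len(header_data))
            seen_record = True
        elif seen_record:
            # Multi-line sequences are joined into one contiguous run.
            seq_data += line.encode("ascii")
    if seen_record:
        seq_offsets.append(len(seq_data))
    return (bytes(header_data), header_offsets), (bytes(seq_data), seq_offsets)


def get(tape, i):
    """Return record i of a tape as bytes."""
    data, offsets = tape
    return data[offsets[i]:offsets[i + 1]]


text = ">seq1 demo\nACGT\nAC\n>seq2\nGGG\n"
headers, seqs = fasta_to_tapes(io.StringIO(text))
```

Buffers laid out this way can be handed to a columnar library and written out as Parquet; that step is left out here because it depends on which library is used.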

2 comments

jltsiren 275 days ago

Basic text formats persist because everyone supports them. Many tools have better file formats for internal purposes, but they are rarely flexible enough and robust enough for wider use. There are occasional proposals for better general-purpose formats, but the people proposing them rarely agree which of the competing proposals should be adopted. And even if they manage to agree, they probably don't have the time and the money to make it actually happen.


vintermann 275 days ago

Also for historical reasons I think, since Perl used to be the big bioinformatics language, and it is surprisingly hard to compete with in string handling.


lazide 275 days ago

Perl+strings really is one of those ‘unreasonably effective’ combinations.

It feels like benzene in some ways. Use it correctly and gdamn. Just don’t huff it - I mean - use it for your enterprise backend - and it’s worth it.


bede 275 days ago

Yes, when doing anything intensive with lots of sequences it generally makes sense to liberate them from FASTA as early as possible and index them somehow. But as an interchange format FASTA seems quite sticky. I find the pervasiveness of fastq.gz particularly unfortunate with Gzip being as slow as it is.

> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)

I even confused myself about this while writing :-)


chrchang523 275 days ago

Note that BGZF solves gzip’s speed problem (libdeflate + parallel compression/decompression) without breaking compatibility, and usually the hit to compression ratio is tolerable.
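A rough illustration of why a blocked layout stays gzip-compatible: a concatenation of independent gzip members is itself a valid gzip stream, so the blocks can be compressed in parallel and still be read by an ordinary gzip decoder. This is only a sketch of the idea, not real BGZF. Real BGZF also records each block's size in a gzip extra field and ends with an empty EOF block, and this sketch uses Python's zlib-backed `gzip` module rather than libdeflate. `compress_blocks` and `BLOCK_SIZE` are hypothetical names.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

# Hypothetical block size for this sketch; real BGZF limits blocks to at most 64 KiB.
BLOCK_SIZE = 60 * 1024


def compress_blocks(data, workers=4):
    """Compress data as a series of independent gzip members.

    Not real BGZF: there is no per-block size field and no EOF marker block.
    """
    # max(..., 1) makes empty input produce one empty but valid gzip member.
    blocks = [data[i:i + BLOCK_SIZE]
              for i in range(0, max(len(data), 1), BLOCK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each block is compressed independently, so the work can be spread out.
        members = list(pool.map(gzip.compress, blocks))
    return b"".join(members)


payload = b"ACGT" * 100_000
blob = compress_blocks(payload)
# A standard gzip decoder reads the concatenated members as one stream.
restored = gzip.decompress(blob)
```

The cost mentioned above shows up here as well: each block starts with a fresh compression window, so the ratio is somewhat worse than compressing the whole input as a single stream.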
