
	by vintermann 275 days ago
	This is a dataset of bacterial DNA. Any two related bacteria will have long strings of the same letters. But it won't be neatly aligned, so the line breaks will mess up pattern matching.

2 comments

bede 275 days ago

Exactly. The line breaks break the runs of otherwise identical bits in identical sequences. Unless two identical subsequences are exactly in phase with respect to their line breaks, the hashes used for long range matching are different for otherwise identical subsequences.
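A rough way to see the phase effect, as a sketch only: the snippet below wraps two sequences at 60 characters, where one copy of a shared region is shifted by 7 bases, and counts how many fixed-size windows the two texts have in common. The 64-character window, the 60-column wrap, and the helper names `wrap` and `windows` are assumptions picked for illustration, not how Zstd's long-range matcher is actually implemented.

```python
import random

def wrap(seq, width):
    """Insert a newline after every `width` characters, FASTA style."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))

def windows(text, k):
    """Set of all length-k substrings, a stand-in for the hashed match windows."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

random.seed(0)
shared = "".join(random.choice("ACGT") for _ in range(600))
# Sequence a carries the same region as b, but 7 bases later,
# so its line breaks fall at different places inside that region.
a = "".join(random.choice("ACGT") for _ in range(7)) + shared
b = shared

k = 64  # hypothetical window size, longer than one wrapped line
wrapped_common = windows(wrap(a, 60), k) & windows(wrap(b, 60), k)
unwrapped_common = windows(a, k) & windows(b, k)

print("shared windows with line breaks:   ", len(wrapped_common))
print("shared windows without line breaks:", len(unwrapped_common))
```

Every wrapped window contains at least one newline, and the newlines sit at different offsets in the two copies, so the wrapped texts share far fewer windows than the unwrapped ones even though the underlying region is identical.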


amelius 275 days ago

And the compressor does not think: "how can I make these two sequences align better without wasting a lot of space?"


ebolyen 275 days ago

No, because alignment, in the general case, is O(n^2). Ironically, it is one of the more tractable and well-solved problems in bioinformatics.
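For reference, the quadratic cost comes from filling a table with one cell per pair of positions. Here is a minimal sketch of a global alignment score in the Needleman-Wunsch style; the function name `alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions chosen for illustration.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b via dynamic programming.

    Time is O(len(a) * len(b)); only two rows are kept, so memory is O(len(b)).
    """
    # Row 0: aligning an empty prefix of a against each prefix of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against empty b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

print(alignment_score("ACGT", "AGT"))
```

The nested loops visit every (i, j) pair once, which is far too slow to run inside a general-purpose compressor on inputs that are gigabytes long.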


tiagod 275 days ago

The compressor doesn't think about anything. Also, Zstd doesn't have the goal of reaching the highest possible compression ratio. It's geared more toward low-overhead, high-bandwidth compression and decompression.
