Hacker News new | ask | show | jobs
by burntsushi 1888 days ago
I think using an alphabet of length 4 is probably a specialized enough use case that it deserves its own analysis from whole cloth if your goal is maximal performance.

I work on text search where the alphabet size is 256. If someone told me to work on text search where the alphabet size is 4, a lot of what I'd do would change, starting with my benchmark inputs.

So I kind of think responding to general claims about substring performance with, "well in DNA searching..." is kind of reframing the discussion to a specialized use case with very different priors. That is, I'm not sure anyone ever said running BM on DNA was a good idea. But maybe it's convenient, so that's what folks do. :-)

1 comments

I don't think your comment your comment is a fitting response as I clearly stated this is about the general, theoretical stance, where you can't make assumptions about either the alphabet, nor the text.

As I recognized, introducing assumptions of course opens possibilities for doing things differently.

I was using the DNA example, because bioinformatics is where a lot of people learn about BMA, because the math is easy there and we are not tempted to make assumptions about the text, even tho it's not random. For most bioinfo applications the alphabet is not four letters, but e.g. the amino acid alphabet, or having additional sequence information included.

In bioinfo practice the naive algorithm often performs competitive against BMA if you optimize for hardware assumptions. Branch prediction is a bitch for BMA.