Hacker News new | ask | show | jobs
by lofi_lory 1888 days ago
(I am no expert at all!)

I mean yes of course in specific scenarios certain assumptions can be made. E.g. for UUIDs you probably better off exploiting the fixed length pattern too.

But for the general/theoretical case these intuitions do not pan out usually. I mean even "memory hacks" make assumptions about the data indirectly by common hardware architecture.

That is, most data isn't random strings of course XD

However when BM is used in e.g. bioinformatics on DNA and you are still stuck with firstly finding data against evolutionary noise, when your algorithms must adapt to your hypotheses, theory becomes more relevant I assume.

I think DNA focused "string search problems" are really inspiring and have a lot of potential for mingling with philosophical fundamentals of informatics. There is something about the evolutionary emergence of "data" AFAIK no other field offers. E.g. the overlapping, extending, or contextualizing information meta-layers upon meta-layers in DNA translation and structure "specified" to no more than merely exist. Throwing Boyer-Moore at ASCII encoded sequences in FASTA files almost feels blasphemous, or the arrogantly fallacious human essence.

1 comments

I think using an alphabet of length 4 is probably a specialized enough use case that it deserves its own analysis from whole cloth if your goal is maximal performance.

I work on text search where the alphabet size is 256. If someone told me to work on text search where the alphabet size is 4, a lot of what I'd do would change, starting with my benchmark inputs.

So I kind of think responding to general claims about substring performance with, "well in DNA searching..." is kind of reframing the discussion to a specialized use case with very different priors. That is, I'm not sure anyone ever said running BM on DNA was a good idea. But maybe it's convenient, so that's what folks do. :-)

I don't think your comment your comment is a fitting response as I clearly stated this is about the general, theoretical stance, where you can't make assumptions about either the alphabet, nor the text.

As I recognized, introducing assumptions of course opens possibilities for doing things differently.

I was using the DNA example, because bioinformatics is where a lot of people learn about BMA, because the math is easy there and we are not tempted to make assumptions about the text, even tho it's not random. For most bioinfo applications the alphabet is not four letters, but e.g. the amino acid alphabet, or having additional sequence information included.

In bioinfo practice the naive algorithm often performs competitive against BMA if you optimize for hardware assumptions. Branch prediction is a bitch for BMA.