|
|
|
|
|
by lofi_lory
1888 days ago
|
|
(I am no expert at all!) I mean yes of course in specific scenarios certain assumptions can be made. E.g. for UUIDs you probably better off exploiting the fixed length pattern too. But for the general/theoretical case these intuitions do not pan out usually. I mean even "memory hacks" make assumptions about the data indirectly by common hardware architecture. That is, most data isn't random strings of course XD However when BM is used in e.g. bioinformatics on DNA and you are still stuck with firstly finding data against evolutionary noise, when your algorithms must adapt to your hypotheses, theory becomes more relevant I assume. I think DNA focused "string search problems" are really inspiring and have a lot of potential for mingling with philosophical fundamentals of informatics. There is something about the evolutionary emergence of "data" AFAIK no other field offers. E.g. the overlapping, extending, or contextualizing information meta-layers upon meta-layers in DNA translation and structure "specified" to no more than merely exist. Throwing Boyer-Moore at ASCII encoded sequences in FASTA files almost feels blasphemous, or the arrogantly fallacious human essence. |
|
I work on text search where the alphabet size is 256. If someone told me to work on text search where the alphabet size is 4, a lot of what I'd do would change, starting with my benchmark inputs.
So I kind of think responding to general claims about substring performance with, "well in DNA searching..." is kind of reframing the discussion to a specialized use case with very different priors. That is, I'm not sure anyone ever said running BM on DNA was a good idea. But maybe it's convenient, so that's what folks do. :-)