|
|
|
|
|
by dasht
5605 days ago
|
|
He talks a bit about how to pick the right number of successive letters to use as hash keys - which is where you can get a handle on alphabet sizes. I would guess (maybe it actually says) that he Wikipedia dump in the benchmark was UTF-8 or ASCII and, either way, treated as an alphabet of 8-bit characters. The DNA case is kind of interesting (2 bits min but more likely 3 or 4 in a typical genome record). |
|
http://i.imgur.com/KGOZW.jpg
In the experiment he used patterns of different length on the same text collection. As you can see in the graph, different algorithms perform best for a certain alphabet size.
He describes the text collection as "text corpus taken from wikipedia text dump" so I'm guessing the alphabet size is around 90?
It's also probably not a good thing that all the strings he is searching for are prefixes of the same pattern.
References:
Shift-Or: http://www-igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION...
BNDM: http://www-igm.univ-mlv.fr/~lecroq/string/bndm.html#SECTION0...
BOM: http://www-igm.univ-mlv.fr/~lecroq/string/bom.html#SECTION00...