|
Not sure how big the word dict is in your latest version, but you can do much better simply by reordering how you create your index. With alphabet in order, assembling letters ABCDE: 17345.00 bytes
With alphabet in order, assembling letters EDCBA: 16949.00 bytes
With alphabet order tweaked, assembling letters EDCBA: 16309.00 bytes Where tweaked means you build your offset as if each position was ordered like this ([::-1] means reverse if you're unfamiliar with Python). ```
alpha1 = "abcdestfghijklmnopqruvwxyz"[::-1]
alpha2 = "eaioustrbcdfghjklmnpqvwxyz"
alpha3 = "aeioustrbcdfghjklmnpqvwxyz"
alpha4 = "eaiousthrbcdfgjklmnpqvwxyz"
alpha5 = "aeioustryhkbcdfgjlmnpqvwxz"
``` You can also use a prefix rather than variable length encoding, this means you can use 2 bits to represent a number bigger than 2^14, rather than 3. This might hurt your ability to decode though, as you'll have bits that cross byte boundaries. breaksv = [2**7, 2**14, 2**21]
prefixesv = [[0], [1, 0], [1, 1]]
You can get much smaller using length 3 varints rather than 7 (13,110 bytes), but I presume that would perform worse on GB hardware than staying byte aligned. |