Hacker News new | ask | show | jobs
by quicktwo 1573 days ago
Not sure how big the word dict is in your latest version, but you can do much better simply by reordering how you create your index.

With alphabet in order, assembling letters ABCDE: 17345.00 bytes With alphabet in order, assembling letters EDCBA: 16949.00 bytes With alphabet order tweaked, assembling letters EDCBA: 16309.00 bytes

Where tweaked means you build your offset as if each position was ordered like this ([::-1] means reverse if you're unfamiliar with Python).

``` alpha1 = "abcdestfghijklmnopqruvwxyz"[::-1] alpha2 = "eaioustrbcdfghjklmnpqvwxyz" alpha3 = "aeioustrbcdfghjklmnpqvwxyz" alpha4 = "eaiousthrbcdfgjklmnpqvwxyz" alpha5 = "aeioustryhkbcdfgjlmnpqvwxz" ```

You can also use a prefix rather than variable length encoding, this means you can use 2 bits to represent a number bigger than 2^14, rather than 3. This might hurt your ability to decode though, as you'll have bits that cross byte boundaries.

  breaksv = [2**7, 2**14, 2**21]
  prefixesv = [[0], [1, 0], [1, 1]]
You can get much smaller using length 3 varints rather than 7 (13,110 bytes), but I presume that would perform worse on GB hardware than staying byte aligned.