Hacker News new | ask | show | jobs
by quicktwo 1583 days ago
I like your trick of subtracting the prior node size if it's too small, it gets a number of offsets into the lower bucket to save some bits.

I took this technique and made a few changes.

Firstly, I effectively did variable length integer encoding in chunks of 3, this mildly outperformed your hand crafted prefixes.

    self.breaksv = [2**3, 2**6, 2**9, 2**12, 2**15, 2**18, 2**21]
    self.prefixesv = [
       ['0'],
       ['1', '0'], 
       ['1', '1', '0'], 
       ['1', '1', '1', '0'], 
       ['1', '1', '1', '1', '0'], 
       ['1', '1', '1', '1', '1', '0'], 
       ['1', '1', '1', '1', '1', '1', '0']]

Secondly, I made my offsets relative to the overall solution space of 26*5, ordering the words sorting from their ends, and with a little bit of twiddling the order of the alphabet to put common letters near the start:

  alpha1 = "aeioustrbcdfghjklmnpqvwxyz"
  alpha2 = "aeioustrbcdfghjklmnpqvwxyz"
  alpha3 = "aeioustrbcdfghjklmnpqvwxyz"
  alpha4 = "aeioustrbcdfghjklmnpqvwxyz"
  alpha5 = "aeioustrybcdfghjklmnpqvwxz"

  bitmap = []
  for ia, a in enumerate(alpha1):
    for ib, b in enumerate(alpha2):
      for ic, c in enumerate(alpha3):
        for id, d in enumerate(alpha4):
          for ie, e in enumerate(alpha5):
            bitmap.append(e+d+c+b+a in words)

Doing this, and then doing the variable length encoding I got the file down to 13,181 bytes (it was 13,180 bytes, but I needed to add a 7 bit termination string so that you can properly decode the file after you write it to disk, otherwise when the file rounds to the nearest byte you have random 0s that get decoded).

I'm sure with some twiddling of the alphabets some more you could save a few more bytes, but this does better than both Brotli on a ASCII trie and the Huffman Trie by almost 1KB (https://github.com/adamcw/wordle-trie-packing#all-words), so I'm very happy.

2 comments

Can't you just pack the last byte with 1s, rather than having a whole extra byte?
A good idea, saves a couple bits depending on alignment.
Nice!